Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Volker Roth*

The problem of detecting "atypical objects" or "outliers" is one of the classical topics in (robust) statistics. Recently, it has been proposed to address this problem by means of one-class SVM classifiers. The main conceptual shortcoming of most one-class approaches, however, is that in a strict sense they are unable to detect outliers, since the expected fraction of outliers has to be specified in advance. The method presented in this paper overcomes this problem by relating kernelized one-class classifica- tion to Gaussian density estimation in the induced feature space. Having established this relation, it is possible to identify "atypical objects" by quantifying their deviations from the Gaussian model. For RBF kernels it is shown that the Gaussian model is "rich enough" in the sense that it asymptotically provides an unbiased estimator for the true density. In or- der to overcome the inherent model selection problem, a cross-validated likelihood criterion for selecting all free model parameters is applied.

1 Introduction

A one-class-classifier attempts to find a separating boundary between a data set and the rest of the feature space. A natural application of such a classifier is estimating a contour line of the underlying data density for a certain quantile value. Such contour lines may be used to separate "typical" objects from "atypical" ones. Objects that look "sufficiently atypical" are often considered to be outliers, for which one rejects the hypothesis that they come from the same distribution as the majority of the objects. Thus, a useful application scenario would be to find a boundary which separates the jointly distributed objects from the outliers. Finding such a boundary defines a classification problem in which, however, usually only sufficiently many labeled samples from one class are available. Usually no labeled samples from the outlier class are available at all, and it is even unknown if there are any outliers present.

It is interesting to notice that the approach of directly estimating a boundary, as opposed to first estimating the whole density, follows one of the main ideas in learning theory which states that one should avoid solving a too hard intermediate problem. While this line of rea- soning seems to be appealing from a theoretical point of view, it leads to a severe problem in practical applications: when it comes to detecting outliers, the restriction to estimating only a boundary makes it impossible to derive a formal characterization of outliers with- out prior assumptions on the expected fraction of outliers or even on their distribution. In practice, however, any such prior assumptions can hardly be justified. The fundamental

problem of the one-class approach lies in the fact that outlier detection is a (partially) unsu- pervised task which has been "squeezed" into a classification framework. The missing part of information has been shifted to prior assumptions which can probably only be justified, if the solution of the original problem was known in advance.

This paper aims at overcoming this problem by linking kernel-based one-class classifiers to Gaussian density estimation in the induced feature space. Objects which have an "unex- pected" high Mahalanobis distance to the sample mean are considered as "atypical objects" or outliers. A particular Mahalanobis distance is considered to be unexpected, if it is very unlikely to observe an object that far away from the mean vector in a random sample of a certain size. We will formalize this concept in section 3 by way of fitting linear models in quantile-quantile plots. The main technical ingredient of our method is the one-class kernel Fisher discriminant classifier (OC-KFD), for which the relation to Gaussian density estimation is shown. From the classification side, the OC-KFD-based model inherits the simple complexity control mechanism by using regularization techniques. The explicit re- lation to Gaussian density estimation, on the other hand, makes it possible to formalize the notion of atypical objects by observing deviations from the Gaussian model. It is clear that these deviations will heavily depend on the chosen model parameters. In order to derive an objective characterization of atypical objects it is, thus, necessary to select a suitable model in advance. This model-selection problem is overcome by using a likelihood-based cross-validation framework for inferring the free parameters.

2 Gaussian density estimation and one-class LDA

Let X denote the n d data matrix which contains the n input vectors xi Rd as rows. It has been proposed to estimate a one-class decision boundary by separating the dataset from the origin [12], which effectively coincides with replicating all xi with opposite sign and separating X and -X. Typically, a -SVM classifier with RBF kernel function is used. The parameter bounds the expected number of outliers and must be selected a priori. The method proposed here follows the same idea of separating the data from their negatively replicated counterparts. Instead of a SVM, however, a Kernel Fisher Discriminant (KFD) classifier is used [7, 10]. The latter has the advantage that is is closely related to Gaussian density estimation in the induced feature space. By making this relation explicit, outliers can be identified without specifying the expected fraction of outliers in advance. We start with a linear discriminant analysis (LDA) model, and then kernels will be introduced.

Let X = (X, -X) denote the augmented (2n d) data matrix which also contains the negative samples -xi. Without loss of generality we assume that the sample mean + x i i > 0, so that the sample means of the positive data and the negative data differ: + = -. Let us now assume that our data are realizations of a normally distributed random variable in d dimensions: X Nd(, ). Denoting by Xc the centered data matrix, the estimator for takes the form ^ W = (1/n)Xc Xc.

The LDA solution B maximizes the between-class scatter with B = ++ + W = 1 - - under the constraint on the within-class scatter . Note that in our special case with X = (X, -X) the usual pooled within-class matrix W simply reduces to the above defined W = (1/n)Xc Xc. Denoting by y = (2, . . . , 2, -2, . . . , -2) a 2n-indicator vector for class membership in class "+" or "-", it is well-known (see e.g. [1]) that the LDA solution (up to a scaling factor) can be found by minimizing a least-squares functional: ^ = arg min y -X 2. In [3] a slightly more general form of the problem is described where the above functional is minimized under a constrained on , which in the simplest case amounts to adding a term to the functional. Such a ridge regression model assumes a penalized total covariance of the form T = (1/(2n)) X X + I = (1/n) X X + I. Defining an n-vector of ones y = (1, . . . , 1) , the solution vector ^

reads ^ = (X X + I)-1X y = (X X + I)-1X y. (1)

According to [3], an appropriate scaling factor is defied in terms of the quantity s2 = (1/n) y ^ y = (1/n) y X ^ , which leads us to the correctly scaled LDA vector = s-1(1 - s2)-1/2 ^ that fulfills the normalization condition W = 1 .

One further derives from [3] that the mean vector of X, projected onto the 1-dimensional LDA-subspace has the coordinate value m+ = s(1 - s2)-1/2, and that the Mahalanobis distance from a vector x to the sample mean + is the sum of the squared Euclidean distance in the projected space and an orthogonal distance term:

```
D(x, +) = ( x - m x)2 + x T -1x. +)2 + D with D = -(1 - s2)( (2)
```

Note that it is the term D which makes the density estimation model essentially different from OC-classification: while the latter considers only distances in the direction of the projection vector , the true density model additionally takes into account the distances in the orthogonal subspace.

Since the assumption X Nd(, ) is very restrictive, we propose to relax it by assuming that we have found a suitable transformation of our input data : Rd Rp, x (x), such that the transformed data are Gaussian in p dimensions. If the transformation is carried out implicitly by introducing a Mercer kernel k(xi, xj), we arrive at an equivalent problem in terms of the kernel matrix K = and the expansion coefficients :

```
^ = (K + I)-1y. (3)
```

From [11] it follows that the mapped vectors can be represented in Rn as (x) = K-1/2k(x), where k(x) denotes the kernel vector k(x) = (k(x, x1), . . . , k(x, xn)) . Finally we derive the following form of the Mahalanobis distances which again consist of the Euclidean distance in the classification subspace plus an orthogonal term:

```
D(x, +) = ( k(x) - m k(x))2 + n(x), +)2 - (1 - s2)( (4)
```

where (x) = (x)( + I)-1(x) = k (x)(K + I)-1K-1k(x), m+ = s(1 - s2)-1/2, s2 = (1/n) y ^ y = (1/n) y K ^ , and = s-1(1 - s2)-1/2 ^ .

Equation (4) establishes the desired link between OC-KFD and Gaussian density estima- tion, since for our outlier detection mechanism only Mahalanobis distances are needed. While it seems to be rather complicated to estimate a density by the above procedure, the main benefit over directly estimating the mean and the covariance lies in the inherent com- plexity regulation properties of ridge regression. Such a complexity control mechanism is of particular importance in highly nonlinear kernel models. Moreover, for ridge-regression models it is possible to analytically calculate the effective degrees of freedom, a quantity that will be of particular interest when it comes to detecting outliers.

3 Detecting outliers

Let us assume that the model is completely specified, i.e. both the kernel function k(, ) and the regularization parameter are fixed. The central lemma that helps us to detect outliers can be found in most statistical textbooks:

Lemma 1. Let X be a Gaussian random variable X Nd(, ). Then (X - ) -1(X - ) follows a chi-square (2) distribution on d degrees of freedom.

For the penalized regression models, it might be more appropriate to use the effective de- grees of freedom df instead of d in the above lemma. In the case of one-class LDA with ridge penalties we can easily estimate it as df = trace(X(X X + I)-1X ), [8], which

for a kernel model translates into df = trace(K(K + I)-1). The intuitive interpretation of the quantity df is the following: denoting by V the matrix of eigenvectors of K and by {i}ni=1 the corresponding eigenvalues, the fitted values ^ y read

```
^ y = V diag {i = i/(i + )} V y. (5)
```

It follows that compared to the unpenalized case, where all eigenvectors vi are constantly weighted by 1, the contribution of the i-th eigenvector vi is down-weighted by a factor i/1 = i. If the ordered eigenvalues decrease rapidly, however, the values i are either close to zero or close to one, and df determines the number of terms that are "essentially different" from zero. The same is true for the orthogonal distance term in eq. (4): note that

(x) = k (x)(K + I)-1K-1k(x) = k V diag i = ((i + )i)-1 V k(x). (6)

Compared to the unpenalized case (the contribution of vi is weighted by -2 i ), the contri- bution of vi is down-weighted by the same factor i/-2 = i i .

From lemma 1 we conclude that if the data are well described by a Gaussian model in the kernel feature space, the observed Mahalanobis distances should look like a sample from a 2-distribution with df degrees of freedom. A graphical way to test this hypothesis is to plot the observed quantiles against the theoretical 2 quantiles, which in the ideal case gives a straight line. Such a quantile-quantile plot is constructed as follows: Let (i) denote the observed Mahalanobis distances ordered from lowest to highest, and pi the cumulative proportion before each (i) given by pi = (i - 1/2)/n. Let further zi = F -1pi denote the theoretical quantile at position pi, where F is the cumulative 2-distribution function. The quantile-quantile plot is then obtained by plotting (i) against zi. Deviations from linearity can be formalized by fitting a linear model on the observed quantiles and calculating confidence intervals around the fit. Observations falling outside the confidence interval are then treated as outliers. A potential problem of this approach is that the outliers themselves heavily influence the quantile-quantile fit. In order to overcome this problem, the use of robust fitting procedures has been proposed in the literature, see e.g. [4]. In the experiments below we use an M-estimator with Huber loss function. For estimating confidence intervals around the fit we use the standard formula (see [2, 5])

```
((i)) = b (2(zi))-1 (pi(1 - pi))/n, (7)
```

which can be intuitively understood as the product of the slope b and the standard error of the quantiles. A 100(1 - )% envelope around the fit is then defined as (i) z/2((i)) where z/2 is the 1 - (1 - )/2 quantile of the standard normal distribution.

The choice of the confidence level is somewhat arbitrary, and from a conceptual viewpoint one might even argue that the problem of specifying one free parameter (i.e. the expected fraction of outliers) has simply been transferred into the problem of specifying another one. In practice, however, selecting is a much more intuitive procedure than guessing the fraction of outliers. Whereas the latter requires problem-specific prior knowledge which is hardly available in practice, the former depends only on the variance of a linear model fit. Thus, can be specified in a problem independent way.

4 Model selection

In our model the data are first mapped into some feature space, in which then a Gaussian model is fitted. Mahalanobis distances to the mean of this Gaussian are computed by evaluating (4). The feature space mapping is implicitly defined by the kernel function, for which we assume that it is parametrized by a kernel parameter . For selecting all free parameters in (4), we are, thus, left with the problem of selecting = (, ) .

The idea is now to select by maximizing the cross-validated likelihood. From a theoret- ical viewpoint, the cross-validated (CV) likelihood framework is appealing, since in [13]

the CV likelihood selector has been shown to asymptotically perform as well as the opti- mal benchmark selector which characterizes the best possible model (in terms of Kullback- Leibler divergence to the true distribution) contained in the parametric family.

For kernels that map into a space with dimension p > n, however, two problems arise: (i) the subspace spanned by the mapped samples varies with different sample sizes; (ii) not the whole feature space is accessible for vectors in the input space. As a consequence, it is difficult to find a "proper" normalization of the Gaussian density in the induced feature space. We propose to avoid this problem by considering the likelihood in the input space rather than in the feature space, i.e. we are looking for a properly normalized density model p(x|) in Rd such that p(x|) has the same contour lines as the Gaussian model in the feature space: p(xi|) = p(xi|) p((xi)|) = p((xj)|). Denoting by Xn = {xi}ni=1 a sample from p(x) from which the kernel matrix K is built, a natural input space model is

```
pn(x|Xn, ) = Z-1 exp{- 1 D(x; X p 2 n, )}, with Z = Rd n(x|Xn, ) dx, (8) where D(x; Xn, ) denotes the (parametrized) Mahalanobis distances (4) of a Gaussian model in the feature space. Note that this density model in the input space has the same form as our Gaussian model in the feature space, except for the different normalization constant Z. Computing this constant Z requires us to solve a normalization integral over the whole d-dimensional input space. Since in general this integral is not analytically tractable for nonlinear kernel models, we propose to approximate Z by a Monte Carlo sampling method. In our experiments, for instance, the VEGAS algorithm [6], which implements a mixed importance-stratified sampling approach, showed to be a reasonable method for up to 10 input dimensions.
```

By using the CV likelihood framework we are guaranteed to (asymptotically) perform as well as the best model in the parametrized family. Thus, the question arises whether the family of densities defined by a Gaussian model in a kernel-induced feature space is "rich enough" such that no systematic errors occur. For RBF kernels, the following lemma pro- vides a positive answer to this question. Lemma 2. Let k(x 2 i, xj ) = exp(- xi - xj /). As 0 , pn(x|Xn, ) converges to a Parzen window with vanishing kernel width: p n n(x|Xn, ) 1 (x - x n i=1 i).

A formal proof is omitted due to space limitations. The basic ingredients of the proof are: (i) In the limit 0 the expansion coefficients approach ^ 1/(1 + )1. Thus, ^ y = K ^ 1/(1 + )1 and s2 1/(1 + ). (ii) D(x; , ) C(x) < , if x {x n i}n i=1, and D(x; , ) , else. Finally pn(x|Xn, , ) 1 (x - x n i=1 i).

Note that in the limit 0 a Parzen window becomes an unbiased estimator for any continuous density, which provides an asymptotic justification for our approach: the cross- validated likelihood framework guarantees us to convergence to a model that performs as well as the best model in our model class as n . The latter, however, is "rich enough" in the sense that it contains models which in the limit 0 converge to an unbiased estimator for every continuous p(x). Since contour lines of pn(x) are contour lines of a Gaussian model in the feature space, the Mahalanobis distances are expected to follow a 2 distribution, and atypical objects can be detected by observing the distribution of the empirical Mahalanobis distances as described in the last section.

It remains to show that describing the data as a Gaussian in a kernel-induced feature space is a statistically sound model. This is actually the case, since there exist decay rates for the kernel width such that n grows at a higher rate as the effective degrees of freedom df : Lemma 3. Let k(x 2 i, xj ) = exp(- xi - xj /) and pn(x|Xn, , ) defined by (8). If 1 decays like O(n-1/2), and for fixed 1, the ratio df /n 0 as n .

A formal proof is omitted due to space limitations. The basic ingredients of the proof are: (i) the eigenvalues i of (1/n)K converge to i as n , (ii) the eigenvalue spectrum of

a Gaussian RBF kernel decays at an exponential-quadratic rate: i exp(-i2), (iii) for n sufficiently large, it holds that n 1/[1 + (/n) exp(n-1/2i2)] n1/2-1 log(n/) i=1 (proof by induction, using the fact that ln(n + 1) - ln(n) 1/(n2 + n) which follows from a Taylor expansion of the logarithm) df (n)/n 0.

Do not remove: This comment is monitored to verify that the site is working properly