{"title": "Asymptotic slowing down of the nearest-neighbor classifier", "book": "Advances in Neural Information Processing Systems", "page_first": 932, "page_last": 938, "abstract": null, "full_text": "Asymptotic Slowing Down of the Nearest-Neighbor Classifier \n\nRobert R. Snapp \nCS/EE Department \nUniversity of Vermont \nBurlington, VT 05405 \n\nDemetri Psaltis \nElectrical Engineering \nCaltech 116-81 \nPasadena, CA 91125 \n\nSantosh S. Venkatesh \nElectrical Engineering \nUniversity of Pennsylvania \nPhiladelphia, PA 19104 \n\nAbstract \n\nIf patterns are drawn from an n-dimensional feature space according to a probability distribution that obeys a weak smoothness criterion, we show that the probability that a random input pattern is misclassified by a nearest-neighbor classifier using M random reference patterns asymptotically satisfies \n\nPM(error) ≈ P∞(error) + a/M^(2/n) \n\nfor sufficiently large values of M. Here, P∞(error) denotes the probability of error in the infinite sample limit, and is at most twice the error of a Bayes classifier. Although the value of the coefficient a depends upon the underlying probability distributions, the exponent of M is largely distribution free. We thus obtain a concise relation between a classifier's ability to generalize from a finite reference sample and the dimensionality of the feature space, as well as an analytic validation of Bellman's well-known \"curse of dimensionality.\" \n\n1 INTRODUCTION \n\nOne of the primary tasks assigned to neural networks is pattern classification. Common applications include recognition problems dealing with speech, handwritten characters, DNA sequences, military targets, and (in this conference) sexual identity. Two fundamental concepts associated with pattern classification are generalization (how well does a classifier respond to input data it has never encountered before?) 
and scalability (how are a classifier's processing and training requirements affected by increasing the number of features that describe the input patterns?). \n\nDespite recent progress, our present understanding of these concepts in the context of neural networks is obstructed by complexities in the functional form of the network and in the classification problems themselves. \n\nIn this correspondence we present analytic results on these issues for the nearest-neighbor classifier. Noted for its algorithmic simplicity and nearly optimal performance in the infinite sample limit, this pattern classifier plays a central role in the field of pattern recognition. Furthermore, because it uses proximity in feature space as a measure of class similarity, its performance on a given classification problem should yield qualitative cues to the performance of a neural network. Indeed, a nearest-neighbor classifier can be readily implemented as a \"winner-take-all\" neural network. \n\n2 THE TASK OF PATTERN CLASSIFICATION \n\nWe begin with a formulation of the two-class problem (Duda and Hart, 1973): \n\nLet the labels ω1 and ω2 denote two states of nature, or pattern classes. A pattern belonging to one of these two classes is selected, and a vector of n features, x, that describe the selected pattern is presented to a pattern classifier. The classifier then attempts to guess the selected pattern's class by assigning x to either ω1 or ω2. \n\nAs an example, the two class labels might represent the states benign and malignant as they pertain to the diagnosis of cancer tumors; the feature vector could then be a 1024 x 1024 pixel, real-valued representation of an electron-microscope image. 
A pattern classifier can thus be viewed as a mapping from an n-dimensional feature space to the discrete set {ω1, ω2}, and can be specified by demarcating the regions in the n-dimensional feature space that correspond to ω1 and ω2. We define the decision region R1 as the set of feature vectors that the pattern classifier assigns to ω1, with an analogous definition for R2. A useful figure of merit is the probability that the feature vector of a randomly selected pattern is assigned to the correct class. \n\n2.1 THE BAYES CLASSIFIER \n\nIf sufficient information is available, it is possible to construct an optimal pattern classifier. Let P(ω1) and P(ω2) denote the prior probabilities of the two states of nature. (For our cancer diagnosis problem, the prior probabilities can be estimated by the relative frequency of each type of tumor in a large statistical sample.) Further, let p(x | ω1) and p(x | ω2) denote the class-conditional probability densities of the feature vector for the two-class problem. The total probability density is now defined by p(x) = p(x | ω1)P(ω1) + p(x | ω2)P(ω2), and gives the unconditional distribution of the feature vector. Where p(x) ≠ 0 we can now use Bayes' rule to compute the posterior probabilities: \n\nP(ω1 | x) = p(x | ω1)P(ω1)/p(x)   and   P(ω2 | x) = p(x | ω2)P(ω2)/p(x). \n\nThe Bayes classifier assigns an unclassified feature vector x to the class label having the greatest posterior probability. (If the posterior probabilities happen to be equal, then the class assignment is arbitrary.) 
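As a concrete illustration, the posterior comparison just described can be sketched in a few lines of Python. This fragment is ours, not the paper's; the one-dimensional Gaussian class-conditional densities and equal priors are hypothetical stand-ins chosen only to make the rule executable.

```python
import math

# Minimal sketch of the two-class Bayes rule described above. The specific
# Gaussian class-conditional densities and priors are hypothetical
# illustrations, not distributions taken from the paper.

def gaussian_pdf(x, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def bayes_classify(x, prior1, prior2, pdf1, pdf2):
    """Assign x to the class with the larger posterior P(w_i | x)."""
    p_x = pdf1(x) * prior1 + pdf2(x) * prior2    # total density p(x)
    posterior1 = pdf1(x) * prior1 / p_x          # Bayes' rule for P(w1 | x)
    posterior2 = pdf2(x) * prior2 / p_x          # Bayes' rule for P(w2 | x)
    return 1 if posterior1 >= posterior2 else 2  # ties resolved arbitrarily

# Two classes centered at -1 and +1 with unit variance and equal priors:
pdf1 = lambda x: gaussian_pdf(x, -1.0, 1.0)
pdf2 = lambda x: gaussian_pdf(x, +1.0, 1.0)
print(bayes_classify(-0.5, 0.5, 0.5, pdf1, pdf2))  # → 1 (closer to the w1 mean)
```

For these symmetric densities the induced decision boundary is the midpoint x = 0, so points left of it go to ω1 and points right of it to ω2.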
With R1 and R2 denoting the two decision regions induced by this strategy, the probability of error of the Bayes classifier, PB, is just the probability that x is drawn from class ω1 but lies in the Bayes decision region R2, or conversely, that x is drawn from class ω2 but lies in the Bayes decision region R1: \n\nPB = ∫_R2 p(x | ω1)P(ω1) d^n x + ∫_R1 p(x | ω2)P(ω2) d^n x. \n\nThe reader may verify that the Bayes classifier minimizes the probability of error. Unfortunately, it is usually impossible to obtain expressions for the class-conditional densities and prior probabilities in practice. Typically, the available information resides in a set of correctly labeled patterns, which we collectively term a training or reference sample. Over the last few decades, numerous pattern classification strategies have been developed that attempt to learn the structure of a classification problem from a finite training sample. (The backpropagation algorithm is a recent example.) The underlying hope is that the classifier's performance can be made acceptable with a sufficiently large reference sample. In order to understand how large a sample may be needed, we turn to what is perhaps the simplest learning algorithm of this class. \n\n3 THE NEAREST-NEIGHBOR CLASSIFIER \n\nLet X_M = {(x^(1), θ^(1)), (x^(2), θ^(2)), ..., (x^(M), θ^(M))} denote a finite reference sample of M feature vectors, x^(i) ∈ R^n, with corresponding known class assignments, θ^(i) ∈ {ω1, ω2}. The nearest-neighbor rule assigns each feature vector x to class ω1 or ω2 as a function of the reference M-sample as follows: \n\n• Identify (x', θ') ∈ X_M such that ||x − x'|| ≤ ||x − x^(i)|| for i ranging from 1 through M; \n• Assign x to class θ'. \n\nHere, ||x − y|| = (Σ_{j=1}^n (x_j − y_j)²)^(1/2) denotes the Euclidean metric in R^n.¹ The nearest-neighbor rule hence classifies each feature vector x according to the label, θ', of the closest point, x', in the reference sample. 
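The two-step rule above translates directly into code. The following Python sketch (function and variable names are ours, not the paper's) implements it with the Euclidean metric:

```python
import math

# Sketch of the nearest-neighbor rule stated above: classify a feature
# vector x by the label theta' of the closest reference pattern x' under
# the Euclidean metric. Function and variable names are ours.

def euclidean(x, y):
    """||x - y|| = sqrt(sum_j (x_j - y_j)^2)."""
    return math.sqrt(sum((xj - yj) ** 2 for xj, yj in zip(x, y)))

def nearest_neighbor_classify(x, reference_sample):
    """reference_sample: list of (feature_vector, label) pairs (x^(i), theta^(i))."""
    x_prime, theta_prime = min(reference_sample, key=lambda pair: euclidean(x, pair[0]))
    return theta_prime

# A seven-element reference set in the plane, echoing the setting of Fig. 1
# (the specific coordinates here are invented for illustration):
sample = [((0.0, 0.0), "w1"), ((1.0, 0.2), "w1"), ((0.3, 1.1), "w1"),
          ((2.0, 2.0), "w2"), ((2.5, 1.0), "w2"), ((1.8, 0.9), "w2"),
          ((0.5, 2.2), "w2")]
print(nearest_neighbor_classify((0.2, 0.1), sample))  # → w1
```

Replacing `euclidean` with another metric, such as a Minkowski-r distance, changes only the `key` function, which is why the rule generalizes so easily to other metrics.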
As an example, we sketch the nearest-neighbor decision regions for a two-dimensional classification problem in Fig. 1. \n\nFigure 1: The decision regions induced by a nearest-neighbor classifier with a seven-element reference set in the plane. \n\n¹Other metrics, such as the more general Minkowski-r metric, are also possible. \n\nIt is interesting to consider how the performance of this classifier compares with that of a Bayes classifier. To facilitate this analysis, we assume that the reference patterns are selected from the total probability density p(x) in a statistically independent manner (i.e., the choice of (x^(j), θ^(j)) does not in any way bias the selection of x^(j+1) and θ^(j+1)). Furthermore, let PM(error) denote the probability of error of a nearest-neighbor classifier working with the reference sample X_M, and let P∞(error) denote this probability in the infinite sample limit (M → ∞). We will also let S denote the volume in feature space over which p(x) is nonzero. The following well-known theorem shows that the nearest-neighbor classifier, with an infinite reference sample, is nearly optimal (Cover and Hart, 1967).² \n\nTheorem 1 For the two-class problem in the infinite sample limit, the probability of error of a nearest-neighbor classifier tends toward the value \n\nP∞(error) = 2 ∫_S P(ω1 | x)P(ω2 | x)p(x) d^n x, \n\nwhich is furthermore bounded by the two inequalities \n\nPB ≤ P∞(error) ≤ 2PB(1 − PB), \n\nwhere PB is the probability of error of a Bayes classifier. \n\nThis encouraging result is not so surprising if one considers that, with probability one, around every feature vector x is centered a ball of radius ε that contains an infinite number of reference feature vectors, for every ε > 0. 
The annoying factor of two accounts for the event that the nearest neighbor to x belongs to the class with the smaller posterior probability. \n\n3.1 THE ASYMPTOTIC CONVERGENCE RATE \n\nIn order to satisfactorily address the issues of generalization and scalability for the nearest-neighbor classifier, we need to consider the rate at which the performance of the classifier approaches its infinite sample limit. The following theorem, applicable to nearest-neighbor classification in one-dimensional feature spaces, was shown by Cover (1968). \n\nTheorem 2 Let p(x | ω1) and p(x | ω2) have uniformly bounded third derivatives and let p(x) be bounded away from zero on S. Then for sufficiently large M, \n\nPM(error) = P∞(error) + O(1/M²). \n\nNote that this result is also very encouraging in that an order of magnitude increase in the sample size decreases the excess error PM(error) − P∞(error) by two orders of magnitude. \n\nThe following theorem is our main result, which extends Cover's theorem to n-dimensional feature spaces: \n\n²Originally, this theorem was stated for multiclass decision problems; it is presented here for the two-class problem only for simplicity. \n\nTheorem 3 Let p(x | ω1), p(x | ω2), and p(x) satisfy the same conditions as in Theorem 2. Then there exists a scalar a (depending on n) such that \n\nPM(error) ≈ P∞(error) + a/M^(2/n), \n\nwhere the right-hand side describes the first two terms of an asymptotic expansion in reciprocal powers of M^(2/n). Explicitly, \n\na = [Γ(1 + 2/n)(Γ(n/2 + 1))^(2/n)/(nπ)] Σ_{i=1}^n ∫_S (β_i(x)p_i(x)/p(x) + γ_ii(x)/2)(p(x))^(1−2/n) d^n x, \n\nwhere \n\np_i(x) = ∂p(x)/∂x_i, \nβ_i(x) = P(ω1 | x) ∂P(ω2 | x)/∂x_i + ∂P(ω1 | x)/∂x_i P(ω2 | x), \nγ_ii(x) = P(ω1 | x) ∂²P(ω2 | x)/∂x_i² + ∂²P(ω1 | x)/∂x_i² P(ω2 | x). \n\nFor n = 1 this result agrees with Cover's theorem. 
With increasing n, however, the convergence rate slows down significantly. Note that the constant a depends on the way in which the class-conditional densities overlap. If a is bounded away from zero, then for sufficiently small δ > 0, the criterion PM(error) − P∞(error) < δ is satisfied only if M > (a/δ)^(n/2), so that the sample size required to achieve a given performance criterion is exponential in the dimensionality of the feature space. The above provides a sufficient condition for Bellman's well-known \"curse of dimensionality\" in this context. \n\nIt is also interesting to note that one can easily construct classification problems for which a vanishes. (Consider, for example, p(x | ω1) = p(x | ω2) for all x.) In these cases the higher-order terms in the asymptotic expansion are important. \n\n4 A NUMERICAL EXPERIMENT \n\nA conspicuous weakness in the above theorem is the requirement that p(x) be bounded away from zero over S. In exchange for a uniformly convergent asymptotic expansion, we have omitted many important probability distributions, including normal distributions. Therefore we numerically estimate the asymptotic behavior of PM(error) for a problem consisting of two normally distributed classes in R^n: \n\np(x | ω1) = (2πσ²)^(−n/2) exp[−(1/2σ²)((x_1 − μ)² + Σ_{j=2}^n x_j²)], \np(x | ω2) = (2πσ²)^(−n/2) exp[−(1/2σ²)((x_1 + μ)² + Σ_{j=2}^n x_j²)]. \n\nAssuming that P(ω1) = P(ω2) = 1/2, we find \n\nP∞(error) = (1/(σ√(2π))) e^(−μ²/2σ²) ∫_0^∞ e^(−x²/2σ²) sech(μx/σ²) dx. \n\nFigure 2: Numerical validation of the nearest-neighbor scaling hypothesis for two normally distributed classes in R^n. (The plot shows the logarithm of the excess error against log10(M), with separate markers for n = 1 through n = 5.) 
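The Bernoulli-trial estimate of PM(error) behind this experiment can be sketched as follows. This Python fragment is our reconstruction under the Gaussian model just stated; the trial count and random seed are our choices, far smaller than the estimates reported in the paper.

```python
import random

# Monte Carlo sketch of the numerical experiment above: two Gaussian
# classes in R^n with means +/-mu along the first coordinate and equal
# priors. Each Bernoulli trial draws M reference patterns and then tries
# to classify one fresh random pattern with the nearest-neighbor rule.
# The trial count and seed are ours, much smaller than the paper's.

def draw_pattern(n, mu=1.0, sigma=1.0):
    label = random.choice((1, 2))
    center = mu if label == 1 else -mu
    x = [random.gauss(center, sigma)] + [random.gauss(0.0, sigma) for _ in range(n - 1)]
    return x, label

def nn_trial(n, M):
    """One Bernoulli trial: True if the nearest-neighbor guess is wrong."""
    reference = [draw_pattern(n) for _ in range(M)]
    x, true_label = draw_pattern(n)
    dist2 = lambda y: sum((a - b) ** 2 for a, b in zip(x, y))
    _, predicted = min(reference, key=lambda pair: dist2(pair[0]))
    return predicted != true_label

def estimate_pm_error(n, M, trials=2000):
    """Fraction of failed trials: an estimate of PM(error)."""
    return sum(nn_trial(n, M) for _ in range(trials)) / trials

random.seed(0)
# For n = 1 and large M this estimate should approach P_inf(error) = 0.2248.
print(estimate_pm_error(n=1, M=100))
```

Repeating the estimate over a grid of n and M and fitting a line to log(PM − P∞) versus log M recovers the slope −2/n scaling that Figure 2 depicts.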
\n\nFor J1. = (1 = 1, Poo(error) is numerically found to be 0.22480, which is consistent \nwith the Bayes probability of error, PB = (1/2)erfc(I/V2) = 0.15865. (Note that \nthe expression for a given in Theorem 3 is undefined for these distributions.) For \nn ranging from 1 to 5, and M ranging from 1 to 200, three estimates of PM (error) \nwere obtained, each as the fraction of \"failures\" in 160,000 or more Bernoulli trials. \nEach trial consists of constructing a pseudo-random sample of M reference patterns, \nfollowed by a single attempt to correctly classify a random input pattern. These \nestimates of PM are represented in Figure 2 by circular markers for n = 1, crosses \nfor n = 2, etc. The lines in Figure 2 depict the power law \nPM(error) = Poo(error) + bM- 2/ n , \n\nwhere, for each n, b is chosen to obtain an appealing fit. The agreement between \nthese lines and data points suggests that the asymptotic scaling hypothesis of The(cid:173)\norem 3 can be extended to a wider class of distributions. \n\n\f938 \n\nSnapp, Psaltis, and Venkatesh \n\n5 DISCUSSION \n\nThe preceding analysis indicates that the convergence rate of the nearest-neighbor \nclassifier slows down dramatically as the dimensionality of the feature space in(cid:173)\ncreases. This rate reduction suggests that proximity in feature space is a less effec(cid:173)\ntive measure of class identity in higher dimensional feature spaces. It is also clear \nthat some degree of smoothness in the class-conditional densities is necessary, as \nwell as sufficient, for the asymptotic behavior described by our analysis to occur: \nin the absence of smoothness conditions, one can construct classification problems \nfor which the nearest-neighbor convergence rate is arbitrarily slow, even in one di(cid:173)\nmension (Cover, 1968). Fortunately, the most pressing classification problems are \ntypically smooth in that they are constrained by regularities implicit in the laws of \nnature (Marr, 1982). 
With additional prior information, the convergence rate may be enhanced by selecting a smaller number of descriptive features. \n\nBecause of their smooth input-output response, neural networks appear to use proximity in feature space as a basis for classification. One might, therefore, expect the required sample size to scale exponentially with the dimensionality of the feature space. Recent results from computational learning theory, however, imply that with a sample size proportional to the capacity (a combinatorial quantity which is characteristic of the network architecture and which typically grows polynomially in the dimensionality of the feature space) one can in principle identify network parameters (weights) which give (close to) the smallest classification error for the given architecture (Baum and Haussler, 1989). There are two caveats, however. First, the information-theoretic sample complexities predicted by learning theory give no clue as to whether, given a sample of the requisite size, there exist any algorithms that can specify the appropriate parameters in a reasonable time frame. Second, and more fundamental, one cannot in general determine whether a particular architecture is intrinsically well suited to a given classification problem. The best performance achievable may be substantially poorer than that of a Bayes classifier. Thus, without sufficient prior information, one must search through the space of all possible network architectures for one that does fit the problem well. This situation now effectively resembles a non-parametric classifier, and the analytic results for the sample complexities of the nearest-neighbor classifier should provide at least qualitative indications of the corresponding case for neural networks. \n\nReferences \n\nBaum, E. B. and Haussler, D. (1989), \"What size net gives valid generalization?\" Neural Computation, 1, pp. 151-160. 
\n\nCover, T. M. (1968), \"Rates of convergence of nearest neighbor decision procedures,\" Proc. First Annual Hawaii Conference on Systems Theory, pp. 413-415. \n\nCover, T. M. and P. E. Hart (1967), \"Nearest neighbor pattern classification,\" IEEE Trans. Info. Theory, vol. IT-13, pp. 21-27. \n\nDuda, R. O. and P. E. Hart (1973), Pattern Classification and Scene Analysis. New York: John Wiley & Sons. \n\nMarr, D. (1982), Vision. San Francisco: W. H. Freeman. ", "award": [], "sourceid": 332, "authors": [{"given_name": "Robert", "family_name": "Snapp", "institution": null}, {"given_name": "Demetri", "family_name": "Psaltis", "institution": null}, {"given_name": "Santosh", "family_name": "Venkatesh", "institution": null}]}