Part of Advances in Neural Information Processing Systems 11 (NIPS 1998)

*Daphna Weinshall, David Jacobs, Yoram Gdalyahu*

A key question in vision is how to represent our knowledge of previously encountered objects to classify new ones. The answer depends on how we determine the similarity of two objects. Similarity tells us how relevant each previously seen object is in determining the category to which a new object belongs. Here a dichotomy emerges. Complex notions of similar(cid:173) ity appear necessary for cognitive models and applications, while simple notions of similarity form a tractable basis for current computational ap(cid:173) proaches to classification. We explore the nature of this dichotomy and why it calls for new approaches to well-studied problems in learning. We begin this process by demonstrating new computational methods for supervised learning that can handle complex notions of similarity. (1) We discuss how to implement parametric met.hods that represent a class by its mean when using non-metric similarity functions; and (2) We review non-parametric methods that we have developed using near(cid:173) est neighbor classification in non-metric spaces. Point (2) , and some of the background of our work have been described in more detail in [8].

1 Supervised Learning and Non-Metric Distances

How can one represent one 's knowledge of previously encountered objects in order to classify new objects? We study this question within the framework of supel vised learning: it is assumed that one is given a number of training objects, each labeled as belonging to a category; one wishes to use this experience to label new test instances of objects. This problem emerges both in the modeling of cognitive processes and in many practical applications. For example, one might want to identify risky applicants for credit based on past experience with clients who have proven to be good or bad credit risks. Our work is motivated by computer vision applications.

Most current computational approaches to supervised learning suppose that objects can be thought of as vectors of numbers, or equivalently as points lying in an n(cid:173) dimensional space. They further suppose that the similarity between objects can be determined from the Euclidean distance between these vectors, or from some other simple metric. This classic notion of similarity as Euclidean or metric distance leads

Classification in Non-Metric Spaces

839

to considerable mathematical and computational simplification .

However, work in cognitive psychology has challenged such simple notions of sim(cid:173) ilarity as models of human judgment, while applications frequently employ non(cid:173) Euclidean distances to measure object similarity. We consider the need for similar(cid:173) ity measures that are not only non-Euclidean , but that are non-metric. We focus on proposed similarities that violate one requirement of a metric distance, the triangle inequality. This states that if we denote the distance between objects A and B by d(A , B) , then : VA , B , C : d(A, B) + d(B, C) ~ d(A , C) . Distances violating the triangle inequality must also be non-Euclidean.

Data from cognitive psychology has demonstrated that similarity judgments may not be well modeled by Euclidean distances. Tversky [12] has demonstrated in(cid:173) stances in which similarity judgments may violate the triangle inequality. For ex(cid:173) ample, close similarity between Jamaica and Cuba and between Cuba and Russia does not imply close similarity between Jamaica and Russia (see also [10]) . Non(cid:173) metric similarity measures are frequently employed for practical reasons, too (cf. [5]) . In part, work in robust statistics [7] has shown that methods that will survive the presence of outliers, which are extraneous pieces of information or information containing extreme errors, must employ non-Euclidean distances that in fact violate the triangle inequality ; related insights have spurred the widespread use of robust methods in computer vision (reviewed in [5] and [9]).

We are interested in handling a wide range of non-metric distance functions, includ(cid:173) ing those that are so complex that they must be treated as a black box . However, to be concrete, we will focus here on two simple examples of such distances:

median distance: This distance assumes that objects are representable as a set of features whose individual differences can be measured, so that the difference between two objects is representable as a vector: J = (d1 , d2 , .. . dn ). The median distance between the two objects is just the median value in this vector. Similarly, one can define a k-median distance by choosing the k'th lowest element in this list. k(cid:173) median distances are often used in applications (cf. [9]) , because they are unaffected by the exact values of the most extreme differences between the objects . Only these features that are most similar determine its value. The k-median distance can violate the triangle inequality to an arbitrary degree (i.e. , there are no constraints on the pairwise distances between three points) . robust non-metric LP distances: Given a difference vector J, an LP distance has the form:

(1)

and is non-metric for p < 1. Figure 1 illustrates why these distances present significant new challenges in su(cid:173) pervised learning. Suppose that given some datapoints (two in Fig. 1) , we wish to classify each new point as coming from the same category as its nearest neighbor. Then we need to determine the Voronoi diagram generated by our data: a division of the plane into regions in which the points all have the same nearest neighbor. Fig. 1 shows how the Voronoi diagram changes with the function used to compute the distance between datapoints; the non-metric diagrams (rightmost three pictures in Fig. 1) are more complex and more likely to make non-intuitive predictions. In fact , very little is known about the computation of non-metric Voronoi diagrams.

We now describe new parametric methods for supervised learning with non-metric

840

D. Weins hall, D. W Jacobs and Y. Gdalyahu

Figure 1: The Voronoi diagram for two points using, from left to right, p-distances with p = 2 (Euclidean), p = 1 ( Manhattan, which is still metric), the non-metric distances arising from p = 0.5, p = 0.2, and the min (I-median) distance. The min distance in 2-D illustrates the behavior of the other median distances in higher dimensions. The region of the plane closer to one point is shown in black, and closer to the other in white.

distances, and review non-parametric methods that we described in [8].

2 Parametric methods: what should replace the mean

Parametric methods typically represent objects as vectors in a high-dimensional space, and represent classes and the boundaries between them in this space us(cid:173) ing geometric constructions or probability distributions with a limited number of parameters. One can attempt to extend these techniques to specific non-metric distances, such as the median distance , or non-metric LP distances. We discuss the example of the mean of a class below. One can also redefine geometric ob(cid:173) jects such as linear separators, for specific non-metric distances. However, existing algorithms for finding such objects in Euclidean spaces will no longer be directly suitable, nor will theoretical results about such representations hold. Many prob(cid:173) lems are therefore open in determining how to best apply parametric supervised learning techniques to specific non-metric distances.

1

We analyze k-means clustering where each class is represented by its average mem(cid:173) ber; new elements are then classified according to which of these prototypical exam(cid:173) ples is nearest . In Euclidean space, the mean is the point q whose sum of squared distances to all the class members {qdr=l - (2:~1 d(ij, qi)2)2 - is minimized. Suppose now that our data come from a vector space where the correct distance is the LP distance from (1). Using the natural extension of the above definition, we should represent each class by the point ij whose sum of distances to all the class members - (2:~=1 d(ij, qi)P) p - is minimal. It is now possible to show (proof is omitted) that for p < 1 (the non-metric cases), the exact value of every feature of the representative point ij must have already appeared in at least one element in the class. Moreover, the value of these features can be determined separately with complexity O(n 2 ), and total complexity of O(dn 2 ) given d features . ij is therefore determined by a mixture of up to d exemplars, where d is the dimension of the vector space. Thus there are efficient algorithms for finding the "mean" element of a class, even using certain non-metric distances.

1

We will illustrate these results with a concrete example using the corel database, a commercial database of images pre-labeled by categories (such as "lions"), where non-metric distance functions have proven effective in determining the similarity of images [1] . The corel database is very large, making the use of prototypes desirable.

We represent each image using a vector of 11 numbers describing general image properties, such as color histograms, as described in [1] . We consider the Euclidean

Classification in Non-Metric Spaces

841

and L0 5 distances, and their corresponding prototypes: the mean and the LO.5_ prototype computed according to the result above. Given the first 45 classes, each containing 100 images , we found their corresponding prototypes; we then computed the percentage of images in each class that are closest to their own prototype, using either the Euclidean or the L 0.5 distance and one of the two prototypes. The results are the following:

mean d existing features

Do not remove: This comment is monitored to verify that the site is working properly