Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Pierre Moreels, Pietro Perona*

A generative probabilistic model for objects in images is presented. An object consists of a constellation of features. Feature appearance and pose are modeled probabilistically. Scene images are generated by draw- ing a set of objects from a given database, with random clutter sprinkled on the remaining image surface. Occlusion is allowed. We study the case where features from the same object share a common reference frame. Moreover, parameters for shape and appearance den- sities are shared across features. This is to be contrasted with previous work on probabilistic `constellation' models where features depend on each other, and each feature and model have different pose and appear- ance statistics [1, 2]. These two differences allow us to build models containing hundreds of features, as well as to train each model from a single example. Our model may also be thought of as a probabilistic revisitation of Lowe's model [3, 4]. We propose an efficient entropy-minimization inference algorithm that constructs the best interpretation of a scene as a collection of objects and clutter. We test our ideas with experiments on two image databases. We compare with Lowe's algorithm and demonstrate better performance, in particular in presence of large amounts of background clutter.

1 Introduction

There is broad agreement in the machine vision literature that objects and object categories should be represented as collections of features or parts with distinctive appearance and mutual position [1, 2, 4, 5, 6, 7, 8, 9]. A number of ideas for efficient detection algorithms (find instances of a given object category, e.g. faces) have been proposed by virtually all the cited authors, far fewer for recognition (list all objects and their pose in a given image) where matching would ideally take a logarithmic time with respect to the number of avail- able models [3, 4]. Learning of parameters characterizing features shape or appearance is still a difficult area, with most authors opting for heavy human intervention (typically segmentation and alignment of the training examples, although [1, 2, 3] train without su- pervision) and very large training sets for object categories (typically in the order of 10 3 - 104, although [10] recently demonstrated learning categories from 1-10 examples).

This work is based on two complementary efforts: the deterministic recognition system proposed by Lowe [3, 4], and the probabilistic constellation models by Perona and col- laborators [1, 2]. The first line of work has three attractive characteristics: objects are represented with hundreds of features, thus increasing robustness; models are learned from a single training example; last but not least, recognition is efficient with databases of hun- dreds of objects. The drawback of Lowe's approach is that both modeling decisions and algorithms rely on heuristics, whose design and performance may be far from optimal in

Figure 1: Diagram of our recognition model showing database, test image and two competing hy- potheses. To avoid a cluttered diagram, only one partial hypothesis is displayed for each hypothesis. The predicted position of models according to the hypotheses are overlaid on the test image.

some circumstances. Conversely, the second line of work is based on principled proba- bilistic object models which yield principled and, in some respects, optimal algorithms for learning and recognition/detection. Unfortunately, the large number of parameters em- ployed in each model limit in practice the number of features being used and require many training examples. By recasting Lowe's model and algorithms in probabilistic terms, we hope to combine the advantages of both methods. Besides, in this paper we choose to focus on individual objects as in [3, 4] rather than on categories as in [1, 2].

In [11] we presented a model aimed at the same problem of individual object recogni- tion. A major difference with the work described here lies in the probabilistic treatment of hypotheses, which allows us here to use directly hypothesis likelihood as a guide for the search, instead of the arbitrary admissible heuristic required by A*.

2 Probabilistic framework and notations

Each model object is represented as a collection of features. Features are informative parts extracted from images by an interest point operator. Each model is the set of features extracted from one training image of a given object - although this could be generalized to features from many images of the same object. Models are indexed by k and denoted by mk, while indices i and j are used respectively for features extracted from the test image and from model images: fi denotes the i - th test feature, while f k j denotes the j - th feature from the k - th model. The features extracted from model images (training set) form the database. A feature detected in a test image can be a consequence of the presence of a model object in the image, in which case it should be associated to a feature from the database. In the alternative, this feature is attributed to a clutter - or background - detection.

The geometric information associated to each feature contains position information (x and y coordinates, denoted by the vector x), orientation (denoted by ) and scale (denoted by ). It is denoted by Xi = (x, i, i) for test feature fi and X k j = (xk j k j , k j ) for model feature f k j . This geometric information is measured relatively to the standard reference frame of the image in which the feature has been detected. All features extracted from the same image share the same reference frame.

The appearance information associated to a feature is a descriptor characterizing the local image appearance near this feature. The measured appearance information is denoted by

Ai for test feature fi and Akj for model feature f kj. In our experiments, features are detected at multiple scales at the extrema of difference-of-gaussians filtered versions of the image [4, 12]. The SIFT descriptor [4] is then used to characterize the local texture about keypoints.

A partial hypothesis h explains the observations made in a fraction of the test image. It combines a model image mh and a corresponding set of pose parameters Xh. Xh encodes position, rotation, scale (this can easily be extended to affine transformations). We assume independence between partial hypotheses. This requires in particular independence be- tween models. Although reasonable, this approximation is not always true (e.g. a keyboard is likely to be detected close to a computer screen). This allows us to search in parallel for multiple objects in a test image.

A hypothesis H is the combination of several partial hypotheses, such that it explains com- pletely the observations made in the test image. A special notation H 0 or h0 denotes any (partial) hypothesis that states that no model object is present in a given fraction of the test image, and that features that could have been detected there are due to clutter.

Our objective is to find which model objects are present in the test scene, given the ob- servations made in the test scene and the information that is present in the database. In probabilistic terms, we look for hypotheses H for which the likelihood ration LR(H) = P (H|{fi},{fkj}) P (H0|{fi},{fkj}) > 1. This ratio characterizes how well models and poses specified by H explain the observations, as opposed to them being generated by clutter. Using Bayes rules and after simplifications, P (H|{fi}, {f k j }) P ({fi}|{f k j }, H ) P (H ) LR(H) = = (1) P (H0|{fi}, {f kj}) P ({fi}|{f k j }, H0) P (H0) where we used P ({f k j }|H ) = P ({f k j }) since the database observations do not depend on the current hypothesis.

A key assumption of this work is that once the pose parameters of the objects (and thus their reference frames) are known, the geometric configuration and appearance of the test features are independent from each other. We also assume independence between features associated to models and features associated to clutter detections, as well as independence between separate clutter detections. Therefore, P ({fi}|{f k j }, H ) = i P (fi|{f k j }, H ). These assumptions of independence are also made in [13], and undelying in [4].

Assignment vectors v represent matches between features from the test scene, and model features or clutter. The dimension of each assignment vector is the number of test features ntest. Its i - th component v(i) = (k, j) denotes that the test feature fi is matched to fv(i) = f kj, j - th feature from model mk. v(i) = (0, 0) denotes the case where fi is attributed to clutter. The set VH of assignment vectors compatible with a hypothesis H are those that assign test features only to models present in H (and to clutter). In particular, the only assignment vector compatible with h0 is v0 such that i, v0(i) = (0, 0). We obtain P (fi|fv(i), mh, Xh) LR(H) = P (H) P (v|{fk j }, mh, Xh) (2) P (H0) P (f vV i|h0) H hH i|fih

P (H) is a prior on hypotheses, we assume it is constant. The term P (v|{f k j }, mh, Xh) is discussed in 3.1, we now explore the other terms.

P (fi|fv(i), mh, Xh) : fi and fv(i) are believed to be one and the same feature. Differences measured between them are noise due to the imaging system as well as distortions caused by viewpoint or lighting conditions changes. This noise probability p n encodes differences in appearance of the descriptors, but also in geometry, i.e. position, scale, orientation Assuming independence between appearance information and geometry information,

```
pn(fi|f k j , mh, Xh) = pn,A(Ai|Av(i), mh, Xh) pn,X (Xi|Xv(i), mh, Xh) (3)
```

Figure 2: Snapshots from the iterative matching process. Two competing hypotheses are displayed (top and bottom row) a) Each assignment vector contains one assignment, suggesting a transformation (red box) b) End of iterative process. The correct hypothesis is supported by numerous matches and high belief, while the wrong hypothesis has only a weak support from few matches and low belief.

The error in geometry is measured by comparing the values observed in the test image, with the predicted values that would be observed if the model features were to be trans- formed according to the parameters Xh. Let's denote by Xh(xv(i)),Xh(v(i)),Xh(v(i)) those predicted values, the geometry part of the noise probability can be decomposed into

```
pn,X (Xi|Xv(i), h) = pn,x(xi, Xh(xv(i))) pn,(i, Xh(v(i))) pn,(i, Xh(v(i))) (4)
```

P (fi|h0) is a density on appearance and position of clutter detections, denoted by p bg(fi). We can decompose this density as well into an appearance term and a geometry term:

```
pbg(fi) = pbg,A(Ai) pbg,X (Xi) = pbg,A(Ai) pbg,(x)(xi) pbg,(i) pbg,(i) (5)
```

pbg,A, pbg,(x)(xi) pbg,(i), pbg,(i) are densities that characterize, for clutter detections, appearance, position, scale and rotation respectively.

Out of lack of space, and since it is not the main focus of this paper, we will not go into the details of how the "foreground density" p n and the "background density" pbg are learned. The main assumption is that those densities are shared across features, instead of having one set of parameters for each feature as in [1, 2]. This results in an important decrease of the number of parameters to be learned, at a slight cost in the model expressiveness.

3 Search for the best interpretation of the test image

The building block of the recognition process is a question, comparing a feature from a database model with a feature of the test image. A question selects a feature from the database, and tries to identify if and where this feature appears in the test image.

3.1 Assignment vectors compatible with hypotheses

For a given hypothesis H, the set of possible assignment vectors V H is too large for explicit exploration. Indeed, each potential match can either be accepted or rejected, which creates a combinatorial explosion. Hence, we approximate the summation in (2) by its largest term. In particular, each assignment vector v and each model referenced in v implies a set of pose parameters Xv (extracted e.g. with least-squares fitting). Therefore, the term P (v|{f k j }, mh, Xh) from (2) will be significant only when Xv Xh, i.e. when the pose implied by the assignment vector agrees with the pose specified by the partial hypothesis. We consider only the assignment vectors v for which Xv Xh. P (vH|{f k j }, h) is assumed to be close to 1. Eq.(2) becomes

```
P (fi|fv LR(H) P (H) h(i), mh, Xh) P (H0) P (f hH i|h i|f 0) (6) ih
```

Our recognition system proceeds by asking questions sequentially and adding matches to assignment vectors. It is therefore natural to define, for a given hypothesis H and the corresponding assignment vector vH and t ntest, the belief in vH by pn(ft|fv(t), mh , Xh ) B t t 0(vH ) = 1, Bt(vH ) = Bt-1(vH) (7) pbg(ft|h0) The geometric part of the belief (cf.(3)-(5) characterizes how close the pose X v implied by the assignments is to the pose Xh specified by the hypothesis. The geometric component of the belief characterizes the quality of the appearance match for the pairs (f i, fv(i)).

3.2 Entropy-based optimization

Our goal is finding quickly the hypothesis that best explains the observations, i.e. the hy- pothesis (models+poses) that has the highest likelihood ratio. We compute such hypothesis incrementally by asking questions sequentially. Each time a question is asked we update the beliefs. We stop the process and declare a detection (i.e. a given model is present in the image) as soon as the belief of a corresponding hypothesis exceeds a given confidence threshold. The speed with which we reach such a conclusion depends on choosing cleverly the next question. A greedy strategy says that the best next question is the one that takes us closest to a detection decision. We do so by considering the entropy of the vector of beliefs (the vector may be normalized to 1 so that each belief is in fact a probability): the lower the entropy the closer we are to a detection. Therefore we study the following heuristic: The most informative next question is the one that minimizes the expectation of the entropy of our beliefs. We call this strategy `minimum expected entropy' (MEE). This idea is due to Geman et al. [14].

Calculating the MEE question is, unfortunately, a complex and expensive calculation in itself. In Monte-Carlo simulations of a simplified version of our problem we notice that the MEE strategy tends to ask questions that relate to the maximum-belief hypothesis. Therefore we approximate the MEE strategy with a simple heuristic: The next question consists of attempting to match one feature of the highest-belief model; specifically, the feature with best appearance match to a feature in the test image.

3.3 Search for the best hypotheses

In an initialization step, a geometric hash table [3, 6, 7] is created by discretizing the space of possible transformations Note that we add only partial hypotheses in a hypothesis one at a time, which allows us to discretize only the space of partial hypotheses (models + poses), instead of discretizing the space of combinations of partial hypotheses.

Questions to be examined are created by pairing database features to the test features clos- est in terms of appearance. Note that since features encode location, orientation and scale, any single assignment between a test feature and a model feature contains enough infor- mation to characterize a similarity transformation. It is therefore natural to restrict the set of possible transformations to similarities, and to insert each candidate assignment in the corresponding geometric hash table entry. This forms a pool of candidate assignments. The set of hypotheses is initialized to the center of the hash table entries, and their belief is set to 1. The motivation for this initialization step is to examine, for each partial hypothesis, only a small number of candidate matches. A partial hypothesis corresponds to a hash table entry, we consider only the candidate assignments that fall into this same entry.

Each iteration proceeds as follows. The hypothesis H that currently has the highest likeli- hood ratio is selected. If the geometric hash table entry corresponding to the current partial hypothesis h, contains candidate assignments that have not been examined yet, one of them, ( m fi, f h j ) is picked - currently, the best appearance match - and the probabilities p bg(fi) m and pn(fi|f h j , mh, Xh) are computed. As mentioned in 3.1, only the best assignment

Figure 3: Results from our algorithm in various situations (viewpoint change can be seen in Fig.6). Each row shows the best hypothesis in terms of belief. a) Occlusion b) Change of scale.

Figure 4: ROC curves for both experiments. The performance improvement from our probabilistic formulation is particularly significant when a low false alarm rate is desired. The threshold used is the repeatability rate defined in [15]

```
m vector is explored: if pn(fi|f h j , mh, Xh) > pbg(fi) the match is accepted and inserted in m the hypothesis. In the alternative, fi is considered a clutter detection and f h j is a missed detection. The belief B(vH ) and the likelihood ratio LR(H) are updated using (7).
```

After adding an assignment to a hypothesis, frame parameters X h are recomputed using least-squares optimization, based on all assignments currently associated to this hypothe- sis. This parameter estimation step provides a progressive refinement of the model pose parameters as assignments are added. Fig.2 illustrates this process.

The exploration of a partial hypothesis ends when no more candidate match is available in the hash table entry. We proceed with the next best partial hypothesis. The search ends when all test scene features have been matched or assigned to clutter.

4 Experimental results

4.1 Experimental setting

We tested our algorithm on two sets of images, containing respectively 49 and 161 model images, and 101 and 51 test images (sets P M - gadgets - 03 and JP - 3Dobjects - 04 available from http : //www.vision.caltech.edu/html - f iles/arc hive.html). Each model image contained a single object. Test images contained from zero (negative exam- ples) to five objects, for a total of 178 objects in the first set, and 79 objects in the second set. A large fraction of each test image consists of background. The images were taken with no precautions relatively to lighting conditions or viewing angle.

The first set contains common kitchen items and objects of everyday use. The second set (Ponce Lab, UIUC) includes office pictures. The objects were always moved between model images and test images. The images of model objects used in the learning stage were downsampled to fit in a 500 500 pixels box, the test images were downsampled to 800 800 pixels. With these settings, the number of features generated by the features detector was of the order of 1000 per training image and 2000-4000 per test image.

Figure 5: Behavior induced by clutter detections. A ground truth model was created by cutting a rectangle from the test image and adding noise. The recognition process is therefore expected to find a perfect match. The two rows show the best and second best model found by each algorithm (estimated frame position shown by the red box, features that found a match are shown in yellow).

Do not remove: This comment is monitored to verify that the site is working properly