Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Antonio Torralba, Kevin P. Murphy, William Freeman*

We seek to both detect and segment objects in images. To exploit both lo- cal image data as well as contextual information, we introduce Boosted Random Fields (BRFs), which uses Boosting to learn the graph struc- ture and local evidence of a conditional random field (CRF). The graph structure is learned by assembling graph fragments in an additive model. The connections between individual pixels are not very informative, but by using dense graphs, we can pool information from large regions of the image; dense models also support efficient inference. We show how contextual information from other objects can improve detection perfor- mance, both in terms of accuracy and speed, by using a computational cascade. We apply our system to detect stuff and things in office and street scenes. 1 Introduction

Our long-term goal is to build a vision system that can examine an image and describe what objects are in it, and where. In many images, such as Fig. 5(a), objects of interest, such as the keyboard or mouse, are so small that they are impossible to detect just by using local features. Seeing a blob next to a keyboard, humans can infer it is likely to be a mouse; we want to give a computer the same abilities.

There are several pieces of related work. Murphy et al [9] used global scene context to help object recognition, but did not model relationships between objects. Fink and Perona [4] exploited local dependencies in a boosting framework, but did not allow for multiple rounds of communication between correlated objects. He et al [6] do not model connections between objects directly, but rather they induce such correlations indirectly, via a bank of hidden variables, using a "restricted Boltzmann machine" architecture.

In this paper, we exploit contextual correlations between the object classes by introducing Boosted Random Fields (BRFs). Boosted random fields build on both boosting [5, 10] and conditional random fields (CRFs) [8, 7, 6]. Boosting is a simple way of sequentially constructing "strong" classifiers from "weak" components, and has been used for single- class object detection with great success [12]. Dietterich et al [3] combine boosting and 1D CRFs, but they only consider the problem of learning the local evidence potentials; we consider the much harder problem of learning the structure of a 2D CRF.

Standard applications of MRFs/ CRFs to images [7] assume a 4-nearest neighbor grid structure. While successful in low-level vision, this structure will fail in capturing im- portant long distance dependencies between whole regions and across classes. We propose a method for learning densely connected random fields with long range connections. The

topology of these connections is chosen by a weak learner which has access to a library of graph fragments, derived from patches of labeled training images, which reflect typical spatial arrangments of objects (similar to the segmentation fragments in [2]). At each round of the learning algorithm, we add more connections from other locations in the image and from other classes (detectors). The connections are assumed to be spatially invariant, which means this update can be performed using convolution followed by a sigmoid nonlinearity. The resulting architecture is similar to a convolutional neural network, although we used a stagewise training procedure, which is much faster than back propagation.

In addition to recognizing things, such as cars and people, we are also interested in recog- nizing spatially extended "stuff" [1], such as roads and buildings. The traditional sliding window approach to object detection does not work well for detecting "stuff". Instead, we combine object detection and image segmentation (c.f., [2]) by labeling every pixel in the image. We do not rely on a bottom-up image segmentation algorithm, which can be fragile without top-down guidance.

2 Learning potentials and graph structure

A conditional random field (CRF) is a distribution of the form 1 P (S|x) = Z i(Si) i,j (Si, Sj ) i jNi

where x is the input (e.g., image), Ni are the neighbors of node i, and Si are labels. We have assumed pairwise potentials for notational simplicity. Our goal is to learn the local evidence potentials, i, the compatibility potentials , and the set of neighbors Ni.

We propose the following simple approximation: use belief propagation (BP) to estimate the marginals, P (Si|x), and then use boosting to maximize the likelihood of each node's training data with respect to i and .

In more detail, the algorithm is as follows. At iteration t, the goal is to minimize the negative log-likelihood of the training data. As in [11], we consider the per-label loss (i.e., we use marginal probabilities), as opposed to requiring that the joint labeling be correct (as in Viterbi decoding). Hence the cost function to be minimized is

```
Jt = Jti = - bti,m(Si,m) = - bti,m(+1)Si,mbti,m(-1)1-Si,m (1) i m i m i
```

where Si,m {-1, +1} is the true label for pixel i in training case m, Si,m = (Si,m + 1)/2 {0, 1} is just a relabeling, and bti,m = [P (Si = -1|xm, t), P (Si = 1|xm, t)] is the belief state at node i given input image xm after t iterations of the algorithm.

The belief at node i is given by the following (dropping the dependence on case m) bti(1) ti(1) Mti(1) where Mti is the product of all the messages coming into i from all its neighbors at time t and where the message that k sends to i is given by bt (s M t+1(1) = t+1 (1) t+1 (1) = k k) i (2) ki ki k,i(sk, 1) t (sk) kN ik i sk{-1,+1}

where k,i is the compatility between nodes k and i. If we assume that the local potentials have the form t /2 /2 i(si) = [eF t i ; e-F ti ], where F ti is some function of the input data, then: bti(+1) = (F ti + Gti), Gti = log Mti(+1) - log Mti(-1) (3) where (u) = 1/(1 + e-u) is the sigmoid function. Hence each term in Eq. 1 simplifies to a cost function similar to that used in boosting:

```
log Jt +Gt ) i,m i = log 1 + e-Si,m(F ti,m . (4) m
1. Input: a set of labeled pairs {xi,m; Si,m}, bound T Output: Local evidence functions f ti(x) and message update functions gti(bN ). i
2. Initialize: bt=0 i,m = 0; F t=0 i,m = 0; Gt=0 i,m = 0
3. For t=1..T.
(a) Fit local potential fi(xi,m) by weighted LS to Y t +Gt ) i,m i,m = Si,m(1 + e-Si,m(F t i )
(b) .Fit compatibilities gti(bt-1 ) to Y t N i,m by weighted LS. i ,m
(c) Compute local potential F t i,m = F t-1 + f t i,m i (xi,m) (d) Compute compatibilities Gti,m = t gn ) n=1 i (bt-1 Ni,m
(e) Update the beliefs bti,m = (F ti,m + Gti,m)
(f) Update weights wt+1 = bt i,m i,m(-1) bt i,m(+1)
Figure 1: BRF training algorithm.
```

We assume that the graph is very densely connected so that the information that one single node sends to another is so small that we can make the approximation t+1 (+1)/ t+1 (-1) 1. (This is a reasonable approximation in the case of images, ki ki where each node represents a single pixel; only when the influence of many pixels is taken into account will the messages become informative.) Hence

```
bt (s k,m k ) M t+1(+1) s k,i(sk, +1) t (s Gt+1 = log i = log k [-1,+1] i k ) k i (5) M t+1(-1) bt (sk) i k,m k s k,i(sk, -1) k [-1,+1] t (s i k ) k
k,i(sk, +1) bt (s k,m k) log sk[-1,+1] (6) k,i(sk, -1) bt (sk) k sk[-1,+1] k,m
```

With this simplification, Gt+1 (bt i is now a non-linear function of the beliefs Gt+1 i m) at iteration t. Therefore, We can write the beliefs at iteration t as a function of the local evidences and the beliefs at time t - 1: bti(+1) = (F ti(xi,m) + Gti(bt-1 m )). The key idea behind BRFs is to use boosting to learn the G functions, which approximately implement message passing in densely connected graphs. We explain this in more detail below.

2.1 Learning local evidence potentials

Defining F ti(xi,m) = F t-1(x i i,m) + f t i (xi,m) as an additive model, where xi,m are the features of training sample m at node i, we can learn this function in a stagewise fashion by optimizing the second order Taylor expansion of Eq. 4 wrt f ti, as in logitBoost [5]:

```
arg min log Jti arg min wti,m(Y ti,m - fti(xi,m))2 (7) f t f t i i m
```

where Y t +Gt ) i,m i,m = Si,m(1+e-Si,m(F t i ). In the case that the weak learner is a "regression stump", fi(x) = ah(x)+b, we can find the optimal a, b by solving a weighted least squares problem, with weights wti,m = bti(-1) bti(+1); we can find the best basis function h(x) by searching over all elements of a dictionary.

2.2 Learning compatibility potentials and graph structure

In this section, we discuss how to learn the compatibility functions ij, and hence the structure of the graph. Instead of learning the compatibility functions ij, we propose to

```
1. Input: a set of inputs {xi,m} and functions f ti, gti Output: Set of beliefs bi,m and MAP estimates Si,m.
2. Initialize: bt=0 i,m = 0; F t=0 i,m = 0; Gt=0 i,m = 0
3. From t = 1 to T , repeat
(a) Update local evidences F t i,m = F t-1 + f t i,m i (xi,m) (b) Update compatibilities Gti,m = t gn ) n=1 i (bt-1 Ni,m
(c) Compute current beliefs bti,m = (F ti,m + Gti,m)
4. Output classification is Si,m = bti,m > 0.5
Figure 2: BRF run-time inference algorithm.
```

learn directly the function Gt+1 i . We propose to use an additive model for Gt+1 i as we did for learning F : Gt+1 = t gn i,m n=1 i (btm), where btm is a vector with the beliefs of all nodes in the graph at iteration t for the training sample m. The weak learners gn i (btm) can be regression stumps with the form gn i (btm) = a(w btm > ) + b, where a, b, are the parameters of the regression stump, and wi is a set of weights selected from a dictionary. In the case of a graph with weak and almost symmetrical connections (which holds if (s1, s2) 1, for all (s1, s2), which implies the messages are not very informative) we can further simplify the function Gt+1 i by approximating it as a linear function of the beliefs:

```
Gt+1 = i,m k,i btk,m(+1) + k,i (8) kNi
```

This step reduces the computational cost. The weak learners gn i (btm) will also be linear functions. Hence the belief update simplifies to bt+1(+1) = ( i,m i btm + i + F t i,m), which is similar to the mean-field update equations. The neighborhood Ni over which we sum incoming messages is determined by the graph structure, which is encoded in the non-zero values of i. Each weak learner gn i will compute a weighted combination of the beliefs of the some subset of the nodes; this subset may change from iteration to iteration, and can be quite large. At iteration t, we choose the weak learner gti so as to minimize

```
t-1 log Jt +gt(bt-1)+ gn(bt-1)) i m i m i (bt-1) = - log 1 + e-Si,m(F ti,m n=1 m
```

which reduces to a weighted least squares problem similar to Eq. 7. See Fig. 1 for the pseudo-code for the complete learning algorithm, and Fig. 2 for the pseudo-code for run- time inference.

3 BRFs for multiclass object detection and segmentation

With the BRF training algorithm in hand, we describe our approach for multiclass object detection and region-labeling using densely connected BRFs.

3.1 Weak learners for detecting stuff and things

The square sliding window approach does not provide a natural way of working with irreg- ular objects. Using region labeling as an image representation allows dealing with irregular and extended objects (buildings, bookshelf, road, ...). Extended stuff [1] may be a very important source of contextual information for other objects.

```
(a) Examples from the dictionary of about 2000 patches and masks, Ux,y, Vx,y.
(b) Examples from the dictionary of 30 graphs, Wx,y,c. f t=0 f t=1 f t=2 F S + + ... = put thu utO Tr (c) Example feedforward segmentation for screens.
```

Figure 3: Examples of patches from the dictionary and an example of the segmentation obtained using boosting trained with patches from (a).

The weak learners we use for the local evidence potentials are based on the segmentation fragments proposed in [2]. Specifically, we create a dictionary of about 2000 image patches U , chosen at random (but overlapping each object), plus a corresponding set of binary (in- class/ out-of-class) image masks, V : see Fig. 3(a). At each round t, for each class c, and for each dictionary entry, we construct the following weak learner, whose output is a binary matrix of the same size as the image I: v(I) = ((I U ) > ) V > 0 (9) where represents normalized cross-correlation and represents convolution. The in- tuition behind this is that I U will produce peaks at image locations that contain this patch/template, and then convolving with V will superimpose the segmentation mask on top of the peaks. As a function of the threshold , the feature will behave more as a template detector ( 1) or as a texture descriptor ( << 1).

To be able to detect objects at multiple scales, we first downsample the image to scale , compute v(I ), and then upsample the result. The final weak learner does this for multiple scales, ORs all the results together, and then takes a linear transformation. f (I) = ([v(I ) ]) + (10) Fig. 3(c) shows an example of segmentation obtained by using boosting without context. The weak learners we use for the compatibility functions have a similar form: C gc(b) = bc Wc + (11) c=1

where bc is the image formed by the beliefs at all pixels for class c. This convolution corresponds to eq. 8 in which the node i is one pixel x, y of class c. The binary kernels (graph fragments) W define, for each node x, y of object class c, all the nodes from which it will receive messages. These kernels are chosen by sampling patches of various sizes from the labeling of images from the training set. This allows generating complicated patterns of connectivity that reflect the statistics of object co-occurrences in the training set. The overall incoming message is given by adding the kernels obtained at each boosting round. (This is the key difference from mutual boosting [4], where the incoming message is just the output of a single weak learner; thus, in mutual boosting, previously learned inter-class connections are only used once.) Although it would seem to take O(t) time to compute Gt, we can precompute a single equivalent kernel W , so at runtime the overall complexity is still linear in the number of boosting rounds, O(T ). C t C Gtx,y,c = bc nW n c + ndef = b + c W c c=1 n=1 n c=1

```
car car building car road car Road F b=(F+G)
Car car building building building road building
Building x G car road building road road road
y c) A car out of context a) Incoming messages (outside 3rd floor windows) to a car node. b) Compatibilities (W'). is less of a car.
t=1 t=2 t=4 t=20 t=40 Final labeling b(car)
S(all)
```

d) Evolution of the beliefs for the car nodes (b) and labeling (S) for road, building, car.

```
Figure 4: Street scene. The BRF is trained to detect cars, buildings and the road.
```

In Fig. 4(a-b), we show the structures of the graph and the weights W defined by GT for a BRF trained to detect cars, buildings and roads in street scenes.

3.2 Learning and inference

For training we used a labeled dataset of office and street scenes with about 100 images in each set. During the training, in the first 5 rounds we only update the local potentials, to allow local evidence to accrue. After the 5th iteration we start updating also the compatibil- ity functions. At each round, we update only the local potential and compatibility function associated with a single object class that reduces the most the multiclass cost. This allows objects that need many features to have more complicated local potentials.

The algorithm learns to first detect easy (and large) objects, since these reduce the error of all classes the fastest. The easy-to-detect objects can then pass information to the harder ones. For instance, in office scenes, the system first detects screens, then keyboards, and finally computer mice. Fig. 5 illustrates this behavior on the test set. A similar behavior is obtained for the car detector (Fig. 4(d)). The detection of building and road provides strong constraints for the locations of the car.

3.3 Cascade of classifiers with BRFs

The BRF can be turned into a cascade [12] by thresholding the beliefs. Computations can then be reduced by doing the convolutions (required for computing f and g) only in pixels that are still candidates for the presence of the target. At each round we update a binary rejection mask for each object class, Rtx,y,c, by thresholding the beliefs at round t: Rtx,y,c = Rt-1 x,y,c (btx,y,c > tc). A pixel in the rejection mask is set to zero when we can decide that the object is not present (when btx,y,c is below the threshold tc 0), and it is set to 1 when more processing is required. The threshold tc is chosen so that the percentage of missed detections is below a predefined level (we use 1%). Similarity we can define a detection mask that will indicate pixels in which we decide the object is present. The mask is then used for computing the features v(I) and messages G by applying the convolutions only on the pixels not yet classified. We can denote those operators as R and R. This

```
Input image screen mouse Ground truth Output labeling
keyboard
t=5 t=10 t=15 t=25 t=50 b (screen) b (screen) b (screen) b (screen) b (screen) F
G
b (keyboard) b (keyboard) b (keyboard) b (keyboard) b (keyboard) F
G
b (mouse) b (mouse) b (mouse) b (mouse) b (mouse) F
G
1 ROC Screen
Boosting BRF Mouse a under Keyboard re Iteration (t) A 0.5 t=0 t=20 t=50
```

Figure 5: Top. In this desk scene, it is easy to identify objects like the screen, keyboard and mouse, even though the local information is sometimes insufficient. Middle: the evolution of the beliefs (b and F and G) during detection for a test image. Bottom. The graph bellow shows the average evolution of the area under the ROC for the three objects on 120 test images.

results in a more efficient classifier with only a slight decrease of performance. In Fig. 6 we compare the reduction of the search space when implementing a cascade using independent boosting (which reduces to Viola and Jones [12]), and when using BRF's. We see that for objects for which context is the main source of information, like the mouse, the reduction in search space is much more dramatic using BRFs than using boosting alone.

4 Conclusion

The proposed BRF algorithm combines boosting and CRF's, providing an algorithm that is easy for both training and inference. We have demonstrated object detection in cluttered scenes by exploiting contextual relationships between objects. The BRF algorithm is com- putationally efficient and provides a natural extension of the cascade of classifiers by inte- grating evidence from other objects in order to quickly reject certain image regions. The BRF's densely connected graphs, which efficiently collect information over large image regions, provide an alternative framework to nearest-neighbor grids for vision problems.

Do not remove: This comment is monitored to verify that the site is working properly