Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Erik Sudderth, Michael Mandel, William Freeman, Alan Willsky*

We describe a three-dimensional geometric hand model suitable for visual tracking applications. The kinematic constraints implied by the model's joints have a probabilistic structure which is well described by a graphical model. Inference in this model is complicated by the hand's many degrees of freedom, as well as multimodal likelihoods caused by ambiguous image measurements. We use nonparametric belief propagation (NBP) to develop a tracking algorithm which exploits the graph's structure to control complexity, while avoiding costly discretization. While kinematic constraints naturally have a local structure, self-occlusions created by the imaging process lead to complex interdependencies in color and edge-based likelihood functions. However, we show that local structure may be recovered by introducing binary hidden variables describing the occlusion state of each pixel. We augment the NBP algorithm to infer these occlusion variables in a distributed fashion, and then analytically marginalize over them to produce hand position estimates which properly account for occlusion events. We provide simulations showing that NBP may be used to refine inaccurate model initializations, as well as track hand motion through extended image sequences.

1 Introduction

Accurate visual detection and tracking of three-dimensional articulated objects is a challenging problem with applications in human-computer interfaces, motion capture, and scene understanding [1]. In this paper, we develop a probabilistic method for tracking a geometric hand model from monocular image sequences. Because articulated hand models have many (roughly 26) degrees of freedom, exact representation of the posterior distribution over model configurations is intractable. Trackers based on extended and unscented Kalman filters [2, 3] have difficulties with the multimodal uncertainties produced by ambiguous image evidence. This has motivated many researchers to consider nonparametric representations, including particle filters [4, 5] and deterministic multiscale discretizations [6]. However, the hand's high dimensionality can cause these trackers to suffer catastrophic failures, requiring the use of models which limit the hand's motion [4] or sophisticated prior models of hand configurations and dynamics [5, 6].

Figure 1: Projected edges (left block) and silhouettes (right block) for a configuration of the 3D structural hand model matching the given image. To aid visualization, the model is also projected following rotations by 35° (center) and 70° (right) about the vertical axis.

An alternative way to address the high dimensionality of articulated tracking problems is to describe the posterior distribution's statistical structure using a graphical model. Graphical models have been used to track view-based human body representations [7], contour models of restricted hand configurations [8], view-based 2.5D "cardboard" models of hands and people [9], and a full 3D kinematic human body model [10]. Because the variables in these graphical models are continuous, and discretization is intractable for three-dimensional models, most traditional graphical inference algorithms are inapplicable. Instead, these trackers are based on recently proposed extensions of particle filters to general graphs: mean field Monte Carlo in [9], and nonparametric belief propagation (NBP) [11, 12] in [10].

In this paper, we show that NBP may be used to track a three-dimensional geometric model of the hand. To derive a graphical model for the tracking problem, we consider a redundant local representation in which each hand component is described by its own three-dimensional position and orientation. We show that the model's kinematic constraints, including self-intersection constraints not captured by joint angle representations, take a simple form in this local representation. We also provide a local decomposition of the likelihood function which properly handles occlusion in a distributed fashion, a significant improvement over our earlier tracking results [13]. We conclude with simulations demonstrating our algorithm's robustness to occlusions.

2 Geometric Hand Modeling

Structurally, the hand is composed of sixteen approximately rigid components: three phalanges or links for each finger and thumb, as well as the palm [1]. As proposed by [2, 3], we model each rigid body by one or more truncated quadrics (ellipsoids, cones, and cylinders) of fixed size. These geometric primitives are well matched to the true geometry of the hand, allow tracking from arbitrary orientations (in contrast to 2.5D "cardboard" models [5, 9]), and permit efficient computation of projected boundaries and silhouettes [3]. Figure 1 shows the edges and silhouettes corresponding to a sample hand model configuration. Note that only a coarse model of the hand's geometry is necessary for tracking.

2.1 Kinematic Representation and Constraints

The kinematic constraints between different hand model components are well described by revolute joints [1]. Figure 2(a) shows a graph describing this kinematic structure, in which nodes correspond to rigid bodies and edges to joints. The two joints connecting the phalanges of each finger and thumb have a single rotational degree of freedom, while the joints connecting the base of each finger to the palm have two degrees of freedom (corresponding to grasping and spreading motions). These twenty angles, combined with the palm's global position and orientation, provide 26 degrees of freedom. Forward kinematic transformations may be used to determine the finger positions corresponding to a given set of joint angles. While most model-based hand trackers use this joint angle parameterization, we instead explore a redundant representation in which the $i^{th}$ rigid body is described by its position $q_i$ and orientation $r_i$ (a unit quaternion). Let $x_i = (q_i, r_i)$ denote this local description of each component, and $x = \{x_1, \ldots, x_{16}\}$ the overall hand configuration.
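
As a concrete illustration, the redundant local state can be sketched as a list of (position, quaternion) pairs; the helper name and rest-pose values below are hypothetical, not from the paper.

```python
import numpy as np

# Hypothetical sketch of the redundant local representation: each of the 16
# rigid bodies is described by a 3D position q_i and a unit quaternion r_i,
# rather than by joint angles.

def make_component(position, quaternion):
    """Return x_i = (q_i, r_i) with the quaternion normalized to unit length."""
    q = np.asarray(position, dtype=float)
    r = np.asarray(quaternion, dtype=float)
    r = r / np.linalg.norm(r)          # orientations live on the unit sphere in R^4
    return (q, r)

# The overall configuration x = {x_1, ..., x_16}: here, 16 components at rest.
identity_rot = [1.0, 0.0, 0.0, 0.0]
x = [make_component([0.0, 0.0, float(i)], identity_rot) for i in range(16)]
```

This redundancy is what later lets each component exchange local messages with its kinematic neighbors.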

Figure 2: Graphs describing the hand model's constraints. (a) Kinematic constraints ($E_K$) derived from revolute joints. (b) Structural constraints ($E_S$) preventing 3D component intersections. (c) Dynamics relating two consecutive time steps. (d) Occlusion consistency constraints ($E_O$).

Clearly, there are dependencies among the elements of $x$ implied by the kinematic constraints. Let $E_K$ be the set of all pairs of rigid bodies which are connected by joints, or equivalently the edges in the kinematic graph of Fig. 2(a). For each joint $(i, j) \in E_K$, define an indicator function $\psi^K_{i,j}(x_i, x_j)$ which is equal to one if the pair $(x_i, x_j)$ are valid rigid body configurations associated with some setting of the angles of joint $(i, j)$, and zero otherwise. Viewing the component configurations $x_i$ as random variables, the following prior explicitly enforces all constraints implied by the original joint angle representation:

$$p_K(x) \propto \prod_{(i,j) \in E_K} \psi^K_{i,j}(x_i, x_j) \tag{1}$$

Equation (1) shows that $p_K(x)$ is an undirected graphical model, whose Markov structure is described by the graph representing the hand's kinematic structure (Fig. 2(a)).

2.2 Structural and Temporal Constraints

In reality, the hand's joint angles are coupled because different fingers can never occupy the same physical volume. This constraint is complex in a joint angle parameterization, but simple in our local representation: the position and orientation of every pair of rigid bodies must be such that their component quadric surfaces do not intersect.

We approximate this ideal constraint in two ways. First, we only explicitly constrain those pairs of rigid bodies which are most likely to intersect, corresponding to the edges $E_S$ of the graph in Fig. 2(b). Furthermore, because the relative orientations of each finger's quadrics are implicitly constrained by the kinematic prior $p_K(x)$, we may detect most intersections based on the distance between object centroids. The structural prior is then given by

$$p_S(x) \propto \prod_{(i,j) \in E_S} \psi^S_{i,j}(x_i, x_j) \qquad \psi^S_{i,j}(x_i, x_j) = \begin{cases} 1 & \|q_i - q_j\| > \delta_{i,j} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

where $\delta_{i,j}$ is determined from the quadrics composing rigid bodies $i$ and $j$. Empirically, we find that this constraint helps prevent different fingers from tracking the same image data.
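
A minimal sketch of this centroid-distance test (eq. (2)); the function name and threshold value are illustrative assumptions.

```python
import numpy as np

# Sketch of the structural indicator psi^S_{i,j}: two components are
# compatible when their centroids are farther apart than a threshold
# delta_ij derived from their quadric sizes (value assumed here).

def structural_indicator(q_i, q_j, delta_ij):
    """1 if ||q_i - q_j|| > delta_ij, else 0 (components would intersect)."""
    dist = np.linalg.norm(np.asarray(q_i, dtype=float) - np.asarray(q_j, dtype=float))
    return 1.0 if dist > delta_ij else 0.0

ok = structural_indicator([0.0, 0.0, 0.0], [3.0, 0.0, 0.0], delta_ij=1.5)
bad = structural_indicator([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], delta_ij=1.5)
```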

In order to track hand motion, we must model the hand's dynamics. Let $x_i^t$ denote the position and orientation of the $i^{th}$ hand component at time $t$, and $x^t = \{x_1^t, \ldots, x_{16}^t\}$. For each component at time $t$, our dynamical model adds a Gaussian potential connecting it to the corresponding component at the previous time step (see Fig. 2(c)):

$$p_T\left(x^t \mid x^{t-1}\right) = \prod_{i=1}^{16} \mathcal{N}\left(x_i^t - x_i^{t-1};\, 0, \Lambda_i\right) \tag{3}$$

Although this temporal model is factorized, the kinematic constraints at the following time step implicitly couple the corresponding random walks. These dynamics can be justified as the maximum entropy model given observations of the nodes' marginal variances $\Lambda_i$.
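
The factorized Gaussian random walk can be sketched as follows; the isotropic scalar variance stands in for the covariance $\Lambda_i$ and is an assumption made for illustration.

```python
import numpy as np

# Sketch of the factorized temporal prior: each component follows an
# independent Gaussian random walk. A shared scalar variance per dimension
# replaces the per-component covariances Lambda_i for simplicity.

def temporal_log_prob(x_t, x_prev, var):
    """Log of prod_i N(x_i^t - x_i^{t-1}; 0, var * I) over all components."""
    diff = np.asarray(x_t) - np.asarray(x_prev)
    d = diff.size
    return -0.5 * np.sum(diff ** 2) / var - 0.5 * d * np.log(2 * np.pi * var)

x_prev = np.zeros((16, 3))            # 16 component positions at time t-1
x_same = np.zeros((16, 3))            # no motion
x_far = np.ones((16, 3))              # large jump
lp_same = temporal_log_prob(x_same, x_prev, var=0.1)
lp_far = temporal_log_prob(x_far, x_prev, var=0.1)
```

As expected, a configuration that stays near the previous time step receives higher log probability than a large jump.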

3 Observation Model

Skin-colored pixels have predictable statistics, which we model using a histogram distribution $p_{\text{skin}}$ estimated from training patches [14]. Images without people were used to create a histogram model $p_{\text{bkgd}}$ of non-skin pixels. Let $\Omega(x)$ denote the silhouette of projected hand configuration $x$. Then, assuming pixels are independent, an image $y$ has likelihood

$$p_C(y \mid x) = \prod_{u \in \Omega(x)} p_{\text{skin}}(u) \prod_{v \notin \Omega(x)} p_{\text{bkgd}}(v) \propto \prod_{u \in \Omega(x)} \frac{p_{\text{skin}}(u)}{p_{\text{bkgd}}(u)} \tag{4}$$

The final expression neglects the proportionality constant $\prod_v p_{\text{bkgd}}(v)$, which is independent of $x$, and thereby limits computation to the silhouette region [8].
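
In log form, the silhouette-restricted likelihood ratio reduces to a masked sum; the stand-in histogram values below are illustrative, not trained models.

```python
import numpy as np

# Sketch of the color likelihood ratio of eq. (4): after dropping the
# constant background term, only silhouette pixels contribute, each
# through the ratio p_skin(u) / p_bkgd(u).

def color_log_likelihood(skin_prob, bkgd_prob, silhouette_mask):
    """Sum of log(p_skin/p_bkgd) over pixels inside the projected silhouette."""
    ratio = np.log(skin_prob[silhouette_mask]) - np.log(bkgd_prob[silhouette_mask])
    return float(np.sum(ratio))

skin = np.full((4, 4), 0.8)           # pretend every pixel looks skin-colored
bkgd = np.full((4, 4), 0.2)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                 # a 4-pixel silhouette
ll = color_log_likelihood(skin, bkgd, mask)
```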

3.1 Distributed Occlusion Reasoning

In configurations where there is no self-occlusion, $p_C(y \mid x)$ decomposes as a product of local likelihood terms involving the projections $\Omega(x_i)$ of individual hand components [13]. To allow a similar decomposition (and hence distributed inference) when there is occlusion, we augment the configuration $x_i$ of each node with a set of binary hidden variables $z_i = \{z_i(u)\}_u$. Letting $z_i(u) = 0$ if pixel $u$ in the projection of rigid body $i$ is occluded by any other body, and $1$ otherwise, the color likelihood (eq. (4)) may be rewritten as

$$p_C(y \mid x, z) = \prod_{i=1}^{16} \prod_{u \in \Omega(x_i)} \left[\frac{p_{\text{skin}}(u)}{p_{\text{bkgd}}(u)}\right]^{z_i(u)} = \prod_{i=1}^{16} p_C(y \mid x_i, z_i) \tag{5}$$

Assuming they are set consistently with the hand configuration $x$, the hidden occlusion variables $z$ ensure that the likelihood of each pixel in $\Omega(x)$ is counted exactly once.

We may enforce consistency of the occlusion variables using the following indicator function:

$$\eta(x_j, z_i(u);\, x_i) = \begin{cases} 0 & \text{if } x_j \text{ occludes } x_i,\ u \in \Omega(x_j),\ \text{and } z_i(u) = 1 \\ 1 & \text{otherwise} \end{cases} \tag{6}$$

Note that because our rigid bodies are convex and non-intersecting, they can never take mutually occluding configurations. The constraint $\eta(x_j, z_i(u);\, x_i)$ is zero precisely when pixel $u$ in the projection of $x_i$ should be occluded by $x_j$, but $z_i(u)$ is in the unoccluded state. The following potential encodes all of the occlusion relationships between nodes $i$ and $j$:
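
The consistency indicator of eq. (6) can be sketched directly; the boolean inputs standing in for "x_j occludes x_i" and "u lies in the projection of x_j" are assumptions made for illustration.

```python
# Sketch of the occlusion consistency indicator eta(x_j, z_i(u); x_i):
# the only forbidden case is a pixel that should be covered by x_j while
# z_i(u) still claims it is visible.

def occlusion_indicator(z_i_u, j_occludes_i, u_in_proj_j):
    """0 iff x_j is in front, covers pixel u, yet z_i(u) = 1; else 1."""
    if j_occludes_i and u_in_proj_j and z_i_u == 1:
        return 0
    return 1

inconsistent = occlusion_indicator(z_i_u=1, j_occludes_i=True, u_in_proj_j=True)
consistent = occlusion_indicator(z_i_u=0, j_occludes_i=True, u_in_proj_j=True)
```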

$$\psi^O_{i,j}(x_i, z_i, x_j, z_j) = \prod_u \eta(x_j, z_i(u);\, x_i)\, \eta(x_i, z_j(u);\, x_j) \tag{7}$$

These occlusion constraints exist between all pairs of nodes. As with the structural prior, we enforce only those pairs $E_O$ (see Fig. 2(d)) most prone to occlusion:

$$p_O(x, z) \propto \prod_{(i,j) \in E_O} \psi^O_{i,j}(x_i, z_i, x_j, z_j) \tag{8}$$

Figure 3 shows a factor graph for the occlusion relationships between $x_i$ and its neighbors, as well as the observation potential $p_C(y \mid x_i, z_i)$. The occlusion potential $\eta(x_j, z_i(u);\, x_i)$ has a very weak dependence on $x_i$, depending only on whether $x_i$ is behind $x_j$ relative to the camera.

Figure 3: Factor graph showing $p(y \mid x_i, z_i)$ and the occlusion constraints placed on $x_i$ by $x_j$, $x_k$. Dashed lines denote weak dependencies. The plate is replicated once per pixel.

3.2 Modeling Edge Filter Responses

Edges provide another important hand tracking cue. Using boundaries labeled in training images, we estimated a histogram $p_{\text{on}}$ of the response of a derivative of Gaussian filter steered to the edge's orientation [8, 10]. A similar histogram $p_{\text{off}}$ was estimated for filter outputs at randomly chosen locations. Let $\Pi(x)$ denote the oriented edges in the projection of model configuration $x$. Then, again assuming pixel independence, image $y$ has edge likelihood

$$p_E(y \mid x, z) = \prod_{i=1}^{16} \prod_{u \in \Pi(x_i)} \left[\frac{p_{\text{on}}(u)}{p_{\text{off}}(u)}\right]^{z_i(u)} = \prod_{i=1}^{16} p_E(y \mid x_i, z_i) \tag{9}$$

where we have used the same occlusion variables z to allow a local decomposition.

4 Nonparametric Belief Propagation

Over the previous sections, we have shown that a redundant, local representation of the geometric hand model's configuration $x^t$ allows $p(x^t \mid y^t)$, the posterior distribution of the hand model at time $t$ given image observations $y^t$, to be written as

$$p\left(x^t \mid y^t\right) \propto \sum_{z^t} p_K(x^t)\, p_S(x^t)\, p_O(x^t, z^t) \prod_{i=1}^{16} p_C(y^t \mid x_i^t, z_i^t)\, p_E(y^t \mid x_i^t, z_i^t) \tag{10}$$

The summation marginalizes over the hidden occlusion variables $z^t$, which were needed to locally decompose the edge and color likelihoods. When video frames are observed, the overall posterior distribution is given by

$$p(x \mid y) \propto \prod_{t} p\left(x^t \mid y^t\right) p_T\left(x^t \mid x^{t-1}\right) \tag{11}$$

Excluding the potentials involving occlusion variables, which we discuss in detail in Sec. 4.2, eq. (11) is an example of a pairwise Markov random field:

$$p(x \mid y) \propto \prod_{(i,j) \in E} \psi_{i,j}(x_i, x_j) \prod_{i \in V} \psi_i(x_i, y) \tag{12}$$

Hand tracking can thus be posed as inference in a graphical model, a problem we propose to solve using belief propagation (BP) [15]. At each BP iteration, some node $i \in V$ calculates a message $m_{ij}(x_j)$ to be sent to a neighbor $j \in \Gamma(i) \triangleq \{j \mid (i, j) \in E\}$:

$$m^n_{ij}(x_j) \propto \int_{x_i} \psi_{j,i}(x_j, x_i)\, \psi_i(x_i, y) \prod_{k \in \Gamma(i) \setminus j} m^{n-1}_{ki}(x_i)\, dx_i \tag{13}$$
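
For continuous $x_i$ the paper approximates this integral with NBP, but on a small discrete state space the update of eq. (13) is an exact matrix computation, which the sketch below illustrates; all potential values are invented for the example.

```python
import numpy as np

# Discrete-state sketch of the BP message update: psi_pair[a, b] plays
# psi_{j,i}(x_j, x_i), psi_obs plays psi_i(x_i, y), and each row of
# `incoming` is a message m_{ki}(x_i) from a neighbor k != j.

def bp_message(psi_pair, psi_obs, incoming):
    """m_ij(x_j) ∝ sum_{x_i} psi(x_j, x_i) psi_i(x_i, y) prod_k m_ki(x_i)."""
    product = psi_obs * np.prod(incoming, axis=0)   # pointwise over x_i states
    msg = psi_pair @ product                        # marginalize over x_i
    return msg / msg.sum()                          # normalize

psi_pair = np.array([[0.9, 0.1], [0.1, 0.9]])       # smoothness-style potential
psi_obs = np.array([0.7, 0.3])                      # local evidence for x_i
incoming = np.array([[0.6, 0.4]])                   # one message from k != j
m = bp_message(psi_pair, psi_obs, incoming)
```

Since both the evidence and the incoming message favor state 0, the outgoing message does as well.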

At any iteration, each node can produce an approximation $\hat p(x_i \mid y)$ to the marginal distribution $p(x_i \mid y)$ by combining the incoming messages with the local observation:

$$\hat p^n(x_i \mid y) \propto \psi_i(x_i, y) \prod_{j \in \Gamma(i)} m^n_{ji}(x_i) \tag{14}$$

For tree-structured graphs, the beliefs $\hat p^n(x_i \mid y)$ will converge to the true marginals $p(x_i \mid y)$. On graphs with cycles, BP is approximate but often highly accurate [15].

4.1 Nonparametric Representations

For the hand tracking problem, the rigid body configurations $x_i$ are six-dimensional continuous variables, making accurate discretization intractable. Instead, we employ nonparametric, particle-based approximations to these messages using the nonparametric belief propagation (NBP) algorithm [11, 12]. In NBP, each message is represented using either a sample-based density estimate (a mixture of Gaussians) or an analytic function. Both types of messages are needed for hand tracking, as we discuss below. Each NBP message update involves two stages: sampling from the estimated marginal, followed by Monte Carlo approximation of the outgoing message. For the general form of these updates, see [11]; the following sections focus on the details of the hand tracking implementation.

The hand tracking application is complicated by the fact that the orientation component $r_i$ of $x_i = (q_i, r_i)$ is an element of the rotation group SO(3). Following [10], we represent orientations as unit quaternions, and use a linearized approximation when constructing density estimates, projecting samples back to the unit sphere as necessary. This approximation is most appropriate for densities with tightly concentrated rotational components.
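
The linearize-then-project step can be sketched as follows; the perturbation scale is an arbitrary choice made for illustration.

```python
import numpy as np

# Sketch of the linearized quaternion handling: Gaussian kernel operations
# treat r_i as a point in R^4, and samples are projected back onto the unit
# sphere afterward.

def project_to_unit_sphere(samples):
    """Renormalize rows so each quaternion sample has unit length again."""
    norms = np.linalg.norm(samples, axis=1, keepdims=True)
    return samples / norms

rng = np.random.default_rng(0)
r = np.array([1.0, 0.0, 0.0, 0.0])                    # identity rotation
perturbed = r + 0.05 * rng.standard_normal((100, 4))  # linearized Gaussian step
projected = project_to_unit_sphere(perturbed)
```

The small perturbation scale mirrors the regime where this approximation is accurate: tightly concentrated rotational densities.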

4.2 Marginal Computation

BP's estimate of the belief $\hat p(x_i \mid y)$ is equal to the product of the incoming messages from neighboring nodes with the local observation potential (see eq. (14)). NBP approximates this product using importance sampling, as detailed in [13] for cases where there is no self-occlusion. First, $M$ samples are drawn from the product of the incoming kinematic and temporal messages, which are Gaussian mixtures. We use a recently proposed multiscale Gibbs sampler [16] to efficiently draw accurate (albeit approximate) samples, while avoiding the exponential cost associated with direct sampling (a product of $d$ Gaussian mixtures, each containing $M$ Gaussians, contains $M^d$ Gaussians). Following normalization of the rotational component, each sample is assigned a weight equal to the product of the color and edge likelihoods with any structural messages. Finally, the computationally efficient "rule of thumb" heuristic [17] is used to set the bandwidth of Gaussian kernels placed around each sample.
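
The bandwidth step can be sketched with the standard multivariate Gaussian-kernel rule of thumb, $h_k = \sigma_k \left(4 / ((d+2)n)\right)^{1/(d+4)}$; this specific formula is the common textbook version and is an assumption here, not quoted from the paper.

```python
import numpy as np

# Sketch of the "rule of thumb" kernel bandwidth heuristic used to smooth
# the weighted samples into a Gaussian mixture. Per-dimension bandwidths
# scale with the sample standard deviation and shrink as n grows.

def rule_of_thumb_bandwidth(samples):
    """Per-dimension Gaussian kernel bandwidths for an (n, d) sample array."""
    n, d = samples.shape
    sigma = samples.std(axis=0, ddof=1)
    return sigma * (4.0 / ((d + 2.0) * n)) ** (1.0 / (d + 4.0))

rng = np.random.default_rng(1)
samples = rng.standard_normal((500, 6))   # 500 six-dimensional rigid-body samples
h = rule_of_thumb_bandwidth(samples)
```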

To derive BP updates for the occlusion masks $z_i$, we first cluster $(x_i, z_i)$ for each hand component so that $p(x^t, z^t \mid y^t)$ has a pairwise form (as in eq. (12)). In principle, NBP could manage occlusion constraints by sampling candidate occlusion masks $z_i$ along with rigid body configurations $x_i$. However, due to the exponentially large number of possible occlusion masks, we employ a more efficient analytic approximation.

Consider the BP message sent from $x_j$ to $(z_i, x_i)$, calculated by applying eq. (13) to the occlusion potential $\eta(x_j, z_i(u);\, x_i)$. We assume that $\hat p(x_j \mid y)$ is well separated from any candidate $x_i$, a situation typically ensured by the kinematic and structural constraints. The occlusion constraint's weak dependence on $x_i$ (see Fig. 3) then separates the message computation into two cases. If $x_i$ lies in front of typical $x_j$ configurations, the BP message $m_{j,i(u)}(z_i(u))$ is uninformative. If $x_i$ is occluded, the message approximately equals

$$m_{j,i(u)}(z_i(u) = 0) = 1 \qquad m_{j,i(u)}(z_i(u) = 1) = 1 - \Pr\left[u \in \Omega(x_j)\right] \tag{15}$$

where we have neglected correlations among pixel occlusion states, and where the probability is computed with respect to $\hat p(x_j \mid y)$. By taking the product of these messages $m_{k,i(u)}(z_i(u))$ from all potential occluders $x_k$ and normalizing, we may determine an approximation $\nu_i(u) \triangleq \Pr[z_i(u) = 0]$ to the marginal occlusion probability.

Because the color likelihood $p_C(y \mid x_i, z_i)$ factorizes across pixels $u$, the BP approximation to $p_C(y \mid x_i)$ may be written in terms of these marginal occlusion probabilities:

$$p_C(y \mid x_i) \propto \prod_{u \in \Omega(x_i)} \left[\nu_i(u) + (1 - \nu_i(u))\, \frac{p_{\text{skin}}(u)}{p_{\text{bkgd}}(u)}\right] \tag{16}$$

Intuitively, this equation downweights the color evidence at pixel $u$ as the probability of that pixel's occlusion increases. The edge likelihood $p_E(y \mid x_i)$ averages over $z_i$ similarly. The NBP estimate of $\hat p(x_i \mid y)$ is determined by sampling configurations of $x_i$ as before, and reweighting them using these occlusion-sensitive likelihood functions.
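
This downweighting behavior can be sketched directly from eq. (16); all pixel values are illustrative.

```python
import numpy as np

# Sketch of the occlusion-sensitive color likelihood of eq. (16): each
# silhouette pixel's skin/background ratio is blended toward 1 as its
# occlusion probability nu_i(u) grows, so occluded pixels carry no evidence.

def occlusion_sensitive_likelihood(skin, bkgd, nu):
    """prod_u [ nu(u) + (1 - nu(u)) * p_skin(u)/p_bkgd(u) ] over silhouette pixels."""
    ratio = skin / bkgd
    return float(np.prod(nu + (1.0 - nu) * ratio))

skin = np.array([0.8, 0.8])
bkgd = np.array([0.2, 0.2])
visible = occlusion_sensitive_likelihood(skin, bkgd, nu=np.array([0.0, 0.0]))
occluded = occlusion_sensitive_likelihood(skin, bkgd, nu=np.array([1.0, 1.0]))
```

With both pixels certainly visible the likelihood is the full ratio product; with both certainly occluded it collapses to 1, contributing no evidence either way.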

4.3 Message Propagation

To derive the propagation rule for non-occlusion edges, as suggested by [18] we rewrite the message update equation (13) in terms of the marginal distribution $\hat p(x_i \mid y)$:

$$m^n_{ij}(x_j) \propto \int_{x_i} \psi_{j,i}(x_j, x_i)\, \frac{\hat p^{n-1}(x_i \mid y)}{m^{n-1}_{ji}(x_i)}\, dx_i \tag{17}$$

Our explicit use of the current marginal estimate $\hat p^{n-1}(x_i \mid y)$ helps focus the Monte Carlo approximation on the most important regions of the state space. Note that messages sent along kinematic, structural, and temporal edges depend only on the belief $\hat p(x_i \mid y)$ following marginalization over occlusion variables $z_i$.

Figure 4: Refinement of a coarse initialization following one and two NBP iterations, both without (left) and with (right) occlusion reasoning. Each plot shows the projection of the five most significant modes of the estimated marginal distributions. Note the difference in middle finger estimates.

Details and pseudocode for the message propagation step are provided in [13]. For kinematic constraints, we sample uniformly among permissible joint angles, and then use forward kinematics to propagate samples from $\hat p^{n-1}(x_i \mid y) / m^{n-1}_{ji}(x_i)$ to hypothesized configurations of $x_j$. Following [12], temporal messages are determined by adjusting the bandwidths of the current marginal estimate $\hat p(x_i \mid y)$ to match the temporal covariance $\Lambda_i$. Because structural potentials (eq. (2)) equal one for all state configurations outside some ball, the ideal structural messages are not finitely integrable. We therefore approximate the structural message $m_{ij}(x_j)$ as an analytic function equal to the weights of all kernels in $\hat p(x_i \mid y)$ outside a ball centered at $q_j$, the position of $x_j$.
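
The temporal message construction described above can be sketched as a simple bandwidth adjustment; the scalar per-dimension covariance is an assumption made here for simplicity.

```python
import numpy as np

# Sketch of the temporal message: reuse the Gaussian kernels of the current
# marginal estimate, widening each kernel's variance by the temporal
# covariance Lambda_i (here a scalar), i.e. convolving the marginal with
# the random-walk dynamics.

def temporal_message_kernels(centers, bandwidth, temporal_var):
    """Return (centers, variances) of the outgoing temporal message's kernels."""
    variances = bandwidth ** 2 + temporal_var     # convolve kernel with dynamics
    return centers, np.full(len(centers), variances)

centers = np.array([0.0, 1.0, 2.0])               # kernel centers of p_hat(x_i|y)
_, msg_var = temporal_message_kernels(centers, bandwidth=0.1, temporal_var=0.25)
```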
