Part of Advances in Neural Information Processing Systems 18 (NIPS 2005)

*Cristian Sminchisescu, Atul Kanujia, Zhiguo Li, Dimitris Metaxas*

We present a conditional temporal probabilistic framework for reconstructing 3D human motion in monocular video based on descriptors encoding image silhouette observations. For computational efficiency we restrict visual inference to low-dimensional kernel induced non-linear state spaces. Our methodology (kBME) combines kernel PCA-based non-linear dimensionality reduction (kPCA) and Conditional Bayesian Mixture of Experts (BME) in order to learn complex multivalued predictors between observations and model hidden states. This is necessary for accurate, inverse, visual perception inferences, where several probable, distant 3D solutions exist due to noise or the uncertainty of monocular perspective projection. Low-dimensional models are appropriate because many visual processes exhibit strong non-linear correlations in both the image observations and the target, hidden state variables. The learned predictors are temporally combined within a conditional graphical model in order to allow a principled propagation of uncertainty. We study several predictors and empirically show that the proposed algorithm compares favorably with techniques based on regression, Kernel Dependency Estimation (KDE) or PCA alone, and gives results competitive with those of high-dimensional mixture predictors at a fraction of their computational cost. We show that the method successfully reconstructs the complex 3D motion of humans in real monocular video sequences.

1 Introduction and Related Work

We consider the problem of inferring 3D articulated human motion from monocular video. This research topic has applications for scene understanding including human-computer interfaces, markerless human motion capture, entertainment and surveillance. A monocular approach is relevant because in real-world settings the human body parts are rarely completely observed even when using multiple cameras. This is due to occlusions from other people or objects in the scene. A robust system necessarily has to deal with incomplete, ambiguous and uncertain measurements. Methods for 3D human motion reconstruction can be classified as generative and discriminative. They both require a state representation, namely a 3D human model with kinematics (joint angles) or shape (surfaces or joint positions), and they both use a set of image features as observations for state inference. The computational goal in both cases is the conditional distribution for the model state given image observations.

Generative model-based approaches [6, 16, 14, 13] have been demonstrated to flexibly reconstruct complex unknown human motions and to naturally handle problem constraints. However, it is difficult to construct reliable observation likelihoods due to the complexity of modeling human appearance. Appearance varies widely due to different clothing and deformation, body proportions or lighting conditions. Besides being somewhat indirect, the generative approach further imposes strict conditional independence assumptions on the temporal observations given the states in order to ensure computational tractability. Due to these factors inference is expensive and produces highly multimodal state distributions [6, 16, 13]. Generative inference algorithms require complex annealing schedules [6, 13] or systematic non-linear search for local optima [16] in order to sustain tracking.

These difficulties motivate the advent of a complementary class of discriminative algorithms [10, 12, 18, 2] that approximate the state conditional directly, in order to simplify inference. However, inverse, observation-to-state multivalued mappings are difficult to learn (see e.g. fig. 1a) and a probabilistic temporal setting is necessary. In an earlier paper [15] we introduced a probabilistic discriminative framework for human motion reconstruction. Because that method operates in the originally selected state and observation spaces, which can be task generic and therefore redundant and often high-dimensional, inference is more expensive and can be less robust. To summarize, reconstructing 3D human motion in a conditional temporal framework poses the following difficulties: (i) The mapping between temporal observations and states is multivalued (i.e. the local conditional distributions to be learned are multimodal), therefore it cannot be accurately represented using global function approximations. (ii) Human models have multivariate, high-dimensional continuous states of 50 or more human joint angles. The temporal state conditionals are multimodal, which makes efficient Kalman filtering algorithms inapplicable. General inference methods (particle filters, mixtures) have to be used instead, but these are expensive for high-dimensional models (e.g. when reconstructing the motion of several people that operate in a joint state space). (iii) The components of the human state and of the silhouette observation vector exhibit strong correlations, because many repetitive human activities like walking or running have low intrinsic dimensionality. It appears wasteful to work with high-dimensional states of 50+ joint angles. Even if the space were truly high-dimensional, predicting correlated state dimensions independently may still be suboptimal.

Figure 1: (a, Left) Example of 180° ambiguity in predicting 3D human poses from silhouette image features (center). It is essential that multiple plausible solutions (e.g. F1 and F2) are correctly represented and tracked over time. A single state predictor will either average the distant solutions or zig-zag between them, see also tables 1 and 2. (b, Right) A conditional chain model. The local distributions $p(y_t \mid y_{t-1}, z_t)$ or $p(y_t \mid z_t)$ are learned as in fig. 2. For inference, the predicted local state conditional is recursively combined with the filtered prior, cf. (1).
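Difficulty (i), and the averaging behaviour of a single state predictor noted in the Figure 1 caption, can be illustrated on a toy problem. The sketch below is only an illustration under stated assumptions: the one-dimensional data, the two symmetric branches and the use of given branch labels are all hypothetical and are not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy inverse problem: every observation z is consistent with
# two distant states, y = +sqrt(z) and y = -sqrt(z), picked at random.
z = rng.uniform(0.5, 1.0, size=2000)
branch = rng.choice([-1.0, 1.0], size=z.size)
y = branch * np.sqrt(z) + 0.01 * rng.standard_normal(z.size)

# Single global regressor: least-squares fit of y on the features [1, z].
design = np.column_stack([np.ones_like(z), z])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
y_single = design @ coef

# Two-expert predictor: one least-squares fit per branch. The branch labels
# are given here; a mixture of experts would have to learn this assignment.
y_experts = np.empty_like(y)
for b in (-1.0, 1.0):
    mask = branch == b
    c, *_ = np.linalg.lstsq(design[mask], y[mask], rcond=None)
    y_experts[mask] = design[mask] @ c

rmse_single = np.sqrt(np.mean((y_single - y) ** 2))
rmse_experts = np.sqrt(np.mean((y_experts - y) ** 2))
print(f"single regressor RMSE: {rmse_single:.3f}")
print(f"two experts RMSE:      {rmse_experts:.3f}")
```

The single regressor predicts values near zero, between the two branches, so its error stays close to the magnitude of the states themselves, while the per-branch experts track each solution closely.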

In this paper we present a conditional temporal estimation algorithm that restricts visual inference to low-dimensional, kernel induced state spaces. To exploit correlations among observations and among state variables, we model the local, temporal conditional distributions using ideas from Kernel PCA [11, 19] and conditional mixture modeling [7, 5], here adapted to produce multiple probabilistic predictions. The corresponding predictor is referred to as a Conditional Bayesian Mixture of Low-dimensional Kernel-Induced Experts (kBME). By integrating it within a conditional graphical model framework (fig. 1b), we can exploit temporal constraints probabilistically. We demonstrate that this methodology is effective for reconstructing the 3D motion of multiple people in monocular video. Our contribution w.r.t. [15] is a probabilistic conditional inference framework that operates over non-linear, kernel-induced low-dimensional state spaces, and a set of experiments (on both real and artificial image sequences) that show how the proposed framework compares favorably with powerful predictors based on KDE, PCA, or with the high-dimensional models of [15] at a fraction of their cost.

2 Probabilistic Inference in a Kernel Induced State Space

We work with conditional graphical models with a chain structure [9], as shown in fig. 1b. These have continuous temporal states $y_t$, $t = 1 \ldots T$, and observations $z_t$. For compactness, we denote joint states $Y_t = (y_1, y_2, \ldots, y_t)$ or joint observations $Z_t = (z_1, \ldots, z_t)$. Learning and inference are based on local conditionals: $p(y_t \mid z_t)$ and $p(y_t \mid y_{t-1}, z_t)$, with $y_t$ and $z_t$ being low-dimensional, kernel induced representations of some initial model having state $x_t$ and observation $r_t$. We obtain $z_t$, $y_t$ from $r_t$, $x_t$ using kernel PCA [11, 19]. Inference is performed in a low-dimensional, non-linear, kernel induced latent state space (see fig. 1b, fig. 2 and (1)). For display or error reporting, we compute the original conditional $p(x \mid r)$, or a temporally filtered version $p(x_t \mid R_t)$, $R_t = (r_1, r_2, \ldots, r_t)$, using a learned pre-image state map [3].

2.1 Density Propagation for Continuous Conditional Chains

For online filtering, we compute the optimal distribution $p(y_t \mid Z_t)$ for the state $y_t$, conditioned on observations $Z_t$ up to time $t$. The filtered density can be recursively derived as:

$$p(y_t \mid Z_t) = \int_{y_{t-1}} p(y_t \mid y_{t-1}, z_t)\, p(y_{t-1} \mid Z_{t-1}) \qquad (1)$$

We compute (1) using a conditional mixture for $p(y_t \mid y_{t-1}, z_t)$ (a Bayesian mixture of experts, cf. §2.2) and the prior $p(y_{t-1} \mid Z_{t-1})$, each having, say, $M$ components. We integrate the $M^2$ pairwise products of Gaussians analytically. The means of the expanded posterior are clustered and the centers are used to initialize a reduced $M$-component Kullback-Leibler approximation that is refined using gradient descent [15]. The propagation rule (1) is similar to the one used for discrete state labels [9], but here we work with multivariate continuous state spaces and represent the local multimodal state conditionals using kBME (fig. 2), and not log-linear models [9] (these would require intractable normalization). This complex continuous model rules out inference based on Kalman filtering or dynamic programming [9].
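The analytic integration step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that each expert is a linear-Gaussian conditional $N(y_t; A_j y_{t-1} + b_j, R_j)$, where $b_j$ already absorbs the dependence on $z_t$, and that the gate weights have been evaluated beforehand and are treated as constants. Under these assumptions each prior component $N(y_{t-1}; m_i, S_i)$ paired with expert $j$ integrates to $N(y_t; A_j m_i + b_j, A_j S_i A_j^\top + R_j)$. The function name `propagate` and all numerical values are hypothetical, and the clustering and Kullback-Leibler reduction back to $M$ components is omitted.

```python
import numpy as np

def propagate(prior_w, prior_mu, prior_cov, gate_w, A, b, expert_cov):
    """One step of (1) for a Gaussian-mixture prior and linear-Gaussian experts.

    prior:   sum_i prior_w[i] * N(y_prev; prior_mu[i], prior_cov[i])
    experts: sum_j gate_w[j]  * N(y; A[j] @ y_prev + b[j], expert_cov[j])
    Assumption: gate_w are constants and b[j] already includes the z_t term.
    Returns weights, means and covariances of the expanded M*M mixture.
    """
    w, mu, cov = [], [], []
    for pi_i, m_i, S_i in zip(prior_w, prior_mu, prior_cov):
        for g_j, A_j, b_j, R_j in zip(gate_w, A, b, expert_cov):
            w.append(pi_i * g_j)                 # weight of the pairwise product
            mu.append(A_j @ m_i + b_j)           # propagated mean
            cov.append(A_j @ S_i @ A_j.T + R_j)  # propagated covariance
    w = np.array(w)
    return w / w.sum(), np.array(mu), np.array(cov)

# Hypothetical example with M = 2 components in a 2-D latent state space.
eye = np.eye(2)
prior_w = np.array([0.6, 0.4])
prior_mu = np.array([[0.0, 0.0], [1.0, 1.0]])
prior_cov = np.array([0.1 * eye, 0.2 * eye])
gate_w = np.array([0.7, 0.3])
A = np.array([eye, 0.5 * eye])
b = np.array([[0.0, 0.0], [1.0, 0.0]])
expert_cov = np.array([0.05 * eye, 0.05 * eye])

w, mu, cov = propagate(prior_w, prior_mu, prior_cov, gate_w, A, b, expert_cov)
print(w.shape, mu.shape, cov.shape)
```

The expanded mixture has $M^2 = 4$ components here. Repeating the step without a reduction would grow the mixture geometrically, which is why the text reduces it back to $M$ components after each update.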

2.2 Learning Bayesian Mixtures over Kernel Induced State Spaces (kBME)

In order to model conditional mappings between low-dimensional non-linear spaces we rely on kernel dimensionality reduction and conditional mixture predictors. The authors of KDE [19] propose a powerful structured unimodal predictor. This works by decorrelating the output using kernel PCA and learning a ridge regressor between the input and each decorrelated output dimension.
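The unimodal KDE-style predictor described above can be sketched in a few lines. The sketch makes several assumptions that are not in the text: a Gaussian RBF kernel for both inputs and outputs, kernel ridge regression as the ridge regressor, and synthetic data on a closed curve. All function names and parameter values are hypothetical, and the pre-image step that maps predictions back to the original output space is omitted.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Gaussian RBF kernel matrix between the rows of a and the rows of b."""
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kpca_coordinates(x, n_components, gamma):
    """Principal coordinates of the training outputs x under kernel PCA."""
    n = x.shape[0]
    k = rbf_kernel(x, x, gamma)
    h = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    evals, evecs = np.linalg.eigh(h @ k @ h)   # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:n_components]
    evals, evecs = evals[idx], evecs[:, idx]
    # The projection of training point i on component k is sqrt(lambda_k) * v_ik.
    return evecs * np.sqrt(np.maximum(evals, 0.0))

def fit_kernel_ridge(r, y, gamma, lam):
    """Kernel ridge weights from inputs r to each column of y independently."""
    k = rbf_kernel(r, r, gamma)
    return np.linalg.solve(k + lam * np.eye(len(r)), y)

def predict_kernel_ridge(r_train, alpha, r_new, gamma):
    """Predicted decorrelated output coordinates for new inputs."""
    return rbf_kernel(r_new, r_train, gamma) @ alpha

# Hypothetical data: 2-D inputs r and 3-D outputs x driven by one phase t.
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0 * np.pi, size=200)
r = np.column_stack([np.cos(t), np.sin(t)])
x = np.column_stack([np.cos(t), np.sin(t), np.cos(2.0 * t)])

y = kpca_coordinates(x, n_components=2, gamma=1.0)   # decorrelated outputs
alpha = fit_kernel_ridge(r, y, gamma=1.0, lam=1e-3)  # one regressor per column
y_hat = predict_kernel_ridge(r, alpha, r, gamma=1.0)
rmse = np.sqrt(np.mean((y_hat - y) ** 2))
print(f"training RMSE in kPCA coordinates: {rmse:.4f}")
```

Because every output dimension gets a single regressor, the predictor returns one value per input. That is adequate for the single-valued mapping in this toy data, but not for the multivalued mappings the paper targets.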

Our procedure is also based on kernel PCA but takes into account the structure of the studied visual problem where both inputs and outputs are likely to be low-dimensional and the mapping between them multivalued. The output variables $x_i$ are projected onto the column vectors of the principal space in order to obtain their principal coordinates $y_i$. A
