{"title": "Managing Uncertainty in Cue Combination", "book": "Advances in Neural Information Processing Systems", "page_first": 869, "page_last": 878, "abstract": null, "full_text": "Managing Uncertainty in Cue Combination \n\nZhiyong Yang \nDepartment of Neurobiology, Box 3209 \nDuke University Medical Center \nDurham, NC 27710 \nzhyyang@duke.edu \n\nRichard S. Zemel \nDepartment of Psychology \nUniversity of Arizona \nTucson, AZ 85721 \nzemel@u.arizona.edu \n\nAbstract \n\nWe develop a hierarchical generative model to study cue combination. The model maps a global shape parameter to local cue-specific parameters, which in turn generate an intensity image. Inferring shape from images is achieved by inverting this model. Inference produces a probability distribution at each level; using distributions rather than a single value of underlying variables at each stage preserves information about the validity of each local cue for the given image. This allows the model, unlike standard combination models, to adaptively weight each cue based on general cue reliability and specific image context. We describe the results of a cue combination psychophysics experiment we conducted that allows a direct comparison with the model. The model provides a good fit to our data and a natural account for some interesting aspects of cue combination. \n\nUnderstanding cue combination is a fundamental step in developing computational models of visual perception, because many aspects of perception naturally involve multiple cues, such as binocular stereo, motion, texture, and shading. It is often formulated as a problem of inferring or estimating some relevant parameter, e.g., depth, shape, position, by combining estimates from individual cues. 
\n\nAn important finding of psychophysical studies of cue combination is that cues vary in the degree to which they are used in different visual environments. Weights assigned to estimates derived from a particular cue seem to reflect its estimated reliability in the current scene and viewing conditions. For example, motion and stereo are weighted approximately equally at near distances, but motion is weighted more at far distances, presumably due to distance limits on binocular disparity [3]. Experiments have also found these weightings sensitive to image manipulations; if a cue is weakened, such as by adding noise, then the uncontaminated cue is utilized more in making depth judgments [9]. A recent study [2] has shown that observers can adjust the weighting they assign to a cue based on its relative utility for a particular task. From these and other experiments, we can identify two types of information that determine relative cue weightings: (1) cue reliability: its relative utility in the context of the task and general viewing conditions; and (2) region informativeness: cue information available locally in a given image. \n\nA central question in computational models of cue combination then concerns how these forms of uncertainty can be combined. We propose a hierarchical generative model. Generative models have a rich history in cue combination, as they underlie models of Bayesian perception that have been developed in this area [10]. The novelty in the generative model proposed here lies in its hierarchical nature and use of distributions throughout, which allows for both context-dependent and image-specific uncertainty to be combined in a principled manner. 
\n\nOur aims in this paper are twofold: to develop a combination model that incorporates cue reliability and region informativeness (estimated across and within images), and to use this model to account for data and provide predictions for psychophysical experiments. Another motivation for the approach here stems from our recent probabilistic framework [11], which posits that every step of processing entails the representation of an entire probability distribution, rather than just a single value of the relevant underlying variable(s). Here we use separate local probability distributions for each cue estimated directly from an image. Combination then entails transforming representations and integrating distributions across both space and cues, taking across- and within-image uncertainty into account. \n\n1  IMAGE GENERATION \n\nIn this paper we study the case of combining shading and texture. Standard shape-from-shading models exclude texture [1, 8], while standard shape-from-texture models exclude shading [7]. Experimental results and computational arguments have supported a strong interaction between these cues [10], but no model accounting for this interaction has yet been worked out. \n\nThe shape used in our experiments is a simple surface: \n\nZ = B(1 - x^2), |x| <= 1, |y| <= 1    (1) \n\nwhere Z is the height from the xy plane. B is the only shape parameter. \n\nOur image formation model is a hierarchical generative model (see Figure 1). The top layer contains the global parameter B. The second layer contains local shading and texture parameters S, T = {S_i, T_i}, where i indexes image regions. The generation of local cues from a global parameter is intended to allow local uncertainties to be introduced separately into the cues.  
This models specific conditions in realistic images, such as shading uncertainty due to shadows or specularities, and texture uncertainty when prior assumptions such as isotropy are violated [4]. Here we introduce uncertainty by adding independent local noise to the underlying shape parameter; this manipulation is less realistic but easier to control. \n\n[Figure 1, left panel: Global Shape (B) generates Local Shading ({S}) and Local Texture ({T}), which together generate the Image (I).] \n\nFigure 1: Left: The generative model of image formation. Right: Two sample images generated by the image formation procedure. B = 1.4 in both. Left: σ_s = 0.05, σ_t = 0. Right: σ_s = 0, σ_t = 0.05. \n\nThe local cues are sampled from Gaussian distributions: p(S_i|B) = N(f(B); σ_s); p(T_i|B) = N(g(B); σ_t). f(B), g(B) describe how the local cue parameters depend on the shape parameter B, while σ_s and σ_t represent the degree of noise in each cue. In this paper, to simplify the generation process we set f(B) = g(B) = B. From {S_i} and {T_i}, two surfaces are generated; these are essentially two separate noisy local versions of B. The intensity image combines these surfaces. A set of same-intensity texels sampled from a uniform distribution is mapped onto the texture surface, and then projected onto the image plane under orthogonal projection. The intensity of surface pixels not contained within these texels is determined from the shading surface using Lambertian shading. Each image is composed of 10 x 10 non-overlapping regions, and contains 400 x 400 pixels. Figure 1 shows two images generated by this procedure. \n\n2  COMBINATION MODEL \n\nWe create a combination, or recognition model by inverting the generative model of Figure 1 to infer the shape parameter B from the image. 
An important aspect of the combination model is the use of distributions to represent parameter estimates at each stage. This preserves uncertainty information at each level, and allows it to play a role in subsequent inference. \n\nThe overall goal of combination is to infer an estimate of B given some image I. We derive our main inference equation using a Bayesian integration over distributions: \n\nP(B|I) = ∫ P(B|S, T) P(S, T|I) dS dT    (2) \n\nP(S, T|I) ∝ ∏_i P(S_i|I) P(T_i|I)    (3) \n\nP(B|S, T) = P(B) P(S, T|B) / ∫ P(B) P(S, T|B) dB ∝ ∏_i P(S_i|B) P(T_i|B)    (4) \n\nTo simplify the two components we have assumed that the prior over B is uniform, and that the S, T are conditionally independent given B, and given the image. This third assumption is dubious but is not essential in the model, as discussed below. We now consider these two components in turn. \n\n2.1  Obtaining local cue-specific representations from an image \n\nOne component in the inference equation, P(S, T|I), describes local cue-dependent information in the particular image I. We first define intermediate representations S, T that are dependent on shading and texture cues, respectively. The shading representation is the curvature of a horizontal section: S = f(B) = 2B(1 + 4x^2 B^2)^(-3/2). The texture representation is the cosine of the surface slant: T = g(B) = (1 + 4x^2 B^2)^(-1/2). Note that these S, T variables do not match those used in the generative model; ideally we could have used these cue-dependent variables, but generating images from them proved difficult. \n\nSome image pre-processing must take place in order to estimate values and uncertainties for these particular local variables. The approach we adopt involves a simple statistical matching procedure, similar to k-nearest neighbors, applied to local image patches.  
After applying Gaussian smoothing and band-pass filtering to the image, two representations of each patch are obtained using separate shading and texture filters. For shading, image patches are represented by forming a histogram of ~1; for texture, the patch is represented by the mean and standard deviation of the amplitude of Gabor filter responses at 4 scales and orientations. This representation of a shading patch is then compared to a database of similar patch representations. Entries in the shading database are formed by first selecting a particular value of B and σ_s, generating an image patch, and applying the appropriate filters. Thus S = f(B) and the noise level σ_s are known for each entry, allowing an estimate of these variables for the new patch to be formed as a linear combination of the entries with similar representations. An analogous procedure, utilizing a separate database, allows T and an uncertainty estimate to be derived for texture. Both databases have 60 different B, σ pairs, and 10 samples of each pair. \n\nBased on this procedure we obtain for each image patch mean values M_i^s, M_i^t and uncertainty values V_i^s, V_i^t for S_i, T_i. These determine P(I|S), P(I|T), which are approximated as Gaussians. Taking into account the Gaussian priors for S_i, T_i, \n\nP(S_i|I) ∝ P(I|S_i) P(S_i) ∝ exp(-(S_i - M_i^s)^2 / V_i^s) exp(-(S_i - M_0^s)^2 / V_0^s)    (5) \n\nP(T_i|I) ∝ P(I|T_i) P(T_i) ∝ exp(-(T_i - M_i^t)^2 / V_i^t) exp(-(T_i - M_0^t)^2 / V_0^t)    (6) \n\nNote that the independence assumption of Equation 3 is not necessary, as the matching procedure could use a single database indexed by both the shading and texture representations of a patch. 
\n\n2.2  Transforming and combining cue-specific local representations \n\nThe other component of the inference equation describes the relationship between the intermediate, cue-specific representations S, T and the shape parameter B: \n\nP(S|B) ∝ exp(-(S - f(B))^2 / V_b^s) ;  P(T|B) ∝ exp(-(T - g(B))^2 / V_b^t)    (7) \n\nThe two parameters V_b^s, V_b^t in this equation describe the uncertainty in the relationship between the intermediate parameters S, T and B; they are invariant across space. These two, along with the parameters of the priors (M_0^s, M_0^t, V_0^s, V_0^t), are the free parameters of this model. Note that this combination model neatly accounts for both types of cue validity we identified: the variance in P(S|B) describes the general uncertainty of a given cue, while the local variance in P(S|I) describes the image-specific uncertainty of the cue. \n\nCombining Equations 3-7, and completing the integral in Equation 2, we have: \n\nP(B|I) ∝ exp[ -(1/2) Σ_i ( a_i f(B)^2 + b_i g(B)^2 - 2 c_i f(B) - 2 d_i g(B) ) ]    (8) \n\nwhere a_i, b_i, c_i, d_i stand for per-region coefficients that depend on the means and uncertainties in Equations 5-7. Thus our model infers from any image a mean U and variance E^2 for B as nonlinear combinations of the cue estimates, taking into account the various forms of uncertainty. \n\n3  A CUE COMBINATION PSYCHOPHYSICS EXPERIMENT \n\nWe have conducted psychophysical experiments using stimuli generated by the procedure described above. In each experimental trial, a stimulus image and four views of a mesh surface are displayed side-by-side on a computer screen. The subject's task is to manipulate the curvature of the mesh to match the stimulus. The final shape of the mesh surface describes the subject's estimate of the shape parameter B on that trial. The subject's variance is computed across repeated trials with an identical stimulus.  
In a given block of trials, the stimulus may contain only shading information (no texture elements), only texture information (uniform shading), or both. The local cue noise (σ_s, σ_t) is zero in some blocks, non-zero in others. The primary experimental findings (see Figure 2) are: \n\n\u2022  Shape from shading alone produces underestimates of B. Shape from texture alone also leads to underestimation, but to a lesser degree. \n\n\u2022  Shape from both cues leads to almost perfect estimation, with smaller variance than shape from either cue alone. Thus cue enhancement (more accurate and robust judgements for stimuli containing multiple cues than just individual cues) applies to this paradigm. \n\n\u2022  The variance of a subject's estimation increases with B. \n\n\u2022  Noise in either shading or texture systematically biases the estimation from the true values: the greater the noise level, the greater the bias. \n\n\u2022  Shape from both cues is more robust against noise than shape from either cue alone, providing evidence of another form of cue enhancement. \n\nFigure 2: Means and standard errors are shown for the shape matching experiment, for different values of B, under different stimulus conditions. Top: No noise in local shape parameters. Left: Shape from shading alone. Middle: Shape from texture alone. Right: Shape from shading and texture. Bottom: Shape from shading and texture. 
Left: σ_s = 0.05, σ_t = 0. Right: σ_s = 0, σ_t = 0.05. \n\n4  MODELING RESULTS \n\nThe model was trained using a subset of data from these experiments. The error criterion was mean relative error (MRE) between the model outputs (U, E) and experimental data (subject mean and variance on the same image). The six free parameters of the model were described as the sum of third order polynomials of local S, T and the noise levels. Gradient descent was used to train the model. \n\nThe model was trained and tested on three different subsets of the experimental data. When trained on data in which only B varied, the model output accurately predicts unseen experimental data of the same type. When the data varied in B and σ_s or σ_t, the model outputs agree very well with subject data (MRE ~ 5-8%). When trained on data where all three variables vary, the model fits the data reasonably well (MRE ~ 8-13%). For a model of the first type, Figure 3 compares model predictions to data from within the same set, while Table 1 shows model outputs and subject responses for test examples from outside the training class. \n\nB | σ_s | σ_t | data (U/E) | model (U/E) \n1.4 | 0.10 | 0 | 1.18/0.072 | 1.20/0.06 \n1.6 | 0.10 | 0 | 1.34/0.075 | 1.35/0.063 \n1.4 | 0.05 | 0 | 1.32/0.042 | 1.4/0.067 \n1.6 | 0.05 | 0 | 1.52/0.049 | 1.46/0.069 \n1.2 | 0 | 0.05 | 1.20/0.052 | 1.14/0.056 \n1.4 | 0 | 0.05 | 1.36/0.062 | 1.30/0.063 \n\nTable 1: Data versus model predictions on images outside the training class. The first column of means and variances is from the experimental data, the second column from the model. \n\nFigure 3: Model performance on data in which σ_s = 0, σ_t = 0.10. Upper line: perfect estimation. 
Lower line: experimental data. Dashed line: model prediction. \n\nThe model accounts for some important aspects of cue combination. Trained model parameters reveal that the texture prior is considerably weaker than the shading prior, and texture has a more reliable relationship with B. Consequently, at equal noise levels texture outweighs shading in the combination model. These factors account for the degree of underestimation found in each single-cue experiment, and the greater accuracy (i.e., enhancement) with combined cues. Our studies also reveal a novel form of cue interaction: for some image patches, especially at high curvature and noise levels, shading information becomes harmful, i.e., curvature estimation becomes less reliable when shading information is taken into account. Note that this differs from cue veto, in that texture does not veto shading. \n\nFinally, the primary contribution of our model lies in its ability to predict the effect of continuous within-image variation in cue reliability on combination. Figure 4 shows how the estimation becomes more accurate and less variable with increasing certainty in shading information. Standard cue combination models cannot produce similar behavior, as they do not estimate within-image cue reliabilities. \n\nFigure 4: Mean (left) and variance (right) of model output as a function of V_i^s, for different values of B (curves labeled B = 1.4, 1.6, 1.8). Here σ_s = 0.15, σ_t = 0, all model parameters held constant. 
\n\n5  CONCLUSION \n\nWe have proposed a hierarchical generative model to study cue combination. Inferring parameters from images is achieved by inverting this model. Inference produces probability distributions at each level: a set of local distributions, separately representing each cue, are combined to form a distribution over a relevant scene variable. The model naturally handles variations in cue reliability, which depend both on spatially local image context and general cue characteristics. This form of representation, incorporating image-specific cue utilities, makes this model more powerful than standard combination models. The model provides a good fit to our psychophysics results on shading and texture combination and an account for several aspects of cue combination; it also provides predictions for how varying noise levels, both within and across images, will affect combination. \n\nWe are extending this work in a number of directions. We are conducting experiments to obtain local shape estimates from subjects. We are considering better ways to extract local representations and distributions over them directly from an image, and methods of handling natural outliers such as shadows and occlusion. \n\nReferences \n\n[1]  Horn, B. K. P. (1977). Understanding image intensities. AI 8, 201-231. \n[2]  Jacobs, R. A. & Fine, I. (1999). Experience-dependent integration of texture and motion cues to depth. Vis. Res. 39, 4062-4075. \n[3]  Johnston, E. B., Cumming, B. G., & Landy, M. S. (1994). Integration of depth modules: Stereopsis and texture. Vis. Res. 34, 2259-2275. \n[4]  Knill, D. C. (1998). Surface orientation from texture: ideal observers, generic observers and the information content of texture cues. Vis. Res. 38, 1655-1682. \n[5]  Knill, D. C., Kersten, D., & Mamassian, P. (1996). 
Implications of a Bayesian formulation of visual information processing for psychophysics. In Perception as Bayesian Inference, D. C. Knill and W. Richards (Eds.), 239-286, Cambridge Univ Press. \n[6]  Landy, M. S., Maloney, L. T., Johnston, E. B., & Young, M. J. (1995). Measurement and modeling of depth cue combination: In defense of weak fusion. Vis. Res. 35, 389-412. \n[7]  Malik, J. & Rosenholtz, R. (1997). Computing local surface orientation and shape from texture for curved surfaces. IJCV 23, 149-168. \n[8]  Pentland, A. (1984). Local shading analysis. IEEE PAMI 6, 170-187. \n[9]  Young, M. J., Landy, M. S., & Maloney, L. T. (1993). A perturbation analysis of depth perception from combinations of texture and motion cues. Vis. Res. 33, 2685-2696. \n[10]  Yuille, A. & Bulthoff, H. H. (1996). Bayesian decision theory and psychophysics. In Perception as Bayesian Inference, D. C. Knill and W. Richards (Eds.), 123-161, Cambridge Univ Press. \n[11]  Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Computation, 403-430. \n", "award": [], "sourceid": 1742, "authors": [{"given_name": "Zhiyong", "family_name": "Yang", "institution": null}, {"given_name": "Richard", "family_name": "Zemel", "institution": null}]}