{"title": "Illumination and View Position in 3D Visual Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 404, "page_last": 411, "abstract": null, "full_text": "Illumination and View Position in 3D Visual Recognition \n\nAmnon Shashua \n\nM.I.T. Artificial Intelligence Lab., NE43-737 \n\nand Department of Brain and Cognitive Science \n\nCambridge, MA 02139 \n\nAbstract \n\nIt is shown that both changes in viewing position and illumination conditions \ncan be compensated for, prior to recognition, using combinations \nof images taken from different viewing positions and different illumination \nconditions. It is also shown that, in agreement with psychophysical \nfindings, the computation requires at least a sign-bit image as input; \ncontours alone are not sufficient. \n\n1 Introduction \n\nThe task of visual recognition is natural and effortless for biological systems, yet \nthe problem of recognition has proven to be very difficult to analyze from \na computational point of view. The fundamental reason is that novel images of \nfamiliar objects are often not sufficiently similar to previously seen images of that \nobject. Assuming a rigid and isolated object in the scene, there are two major \nsources of this variability: geometric and photometric. The geometric source of \nvariability comes from changes of view position. A 3D object can be viewed from a \nvariety of directions, each resulting in a different 2D projection. The difference is \nsignificant, even for modest changes in viewing positions, and can be demonstrated \nby superimposing those projections (see Fig. 4, first row second image). 
Much \nattention has been given to this problem in the visual recognition literature ([9], \nand references therein), and recent results show that one can compensate for changes \nin viewing position by generating novel views from a small number of model views \nof the object [10, 4, 8]. \n\nFigure 1: A 'Mooney' image. See text for details. \n\nThe photometric source of variability comes from changing illumination conditions \n(positions and distribution of light sources in the scene). This has the effect of \nchanging the brightness distribution in the image, and the location of shadows \nand specular reflections. The traditional approach to this problem is based on the \nnotion of edge detection. The idea is that discontinuities in image brightness remain \nstable under changes of illumination conditions. This invariance is not complete \nand, furthermore, it is an open question whether this kind of contour information is \nsufficient, or even relevant, for purposes of visual recognition. \n\nConsider the image in Fig. 1, adapted from Mooney's Closure Faces Test [6]. Most \nobservers show no difficulty in interpreting the shape of the object from the right-hand \nimage, but cannot identify the object when presented with only the contours. \nAlso, many of the contours are shadow contours and therefore critically rely on the \ndirection of the light source. In Fig. 2 four frontal images of a doll from four different \nillumination conditions are shown together with their intensity step edges. The \nchange in the contour image is significant and is not limited to shadow contours: \nsome object edges appear or disappear as a result of the change in brightness \ndistribution. Also shown in Fig. 
4 is a sign-bit image of the intensity image followed \nby a convolution with a Difference of Gaussians. As with the Mooney image, it is \nconsiderably more difficult to interpret the image of a complex object with only the \nzero-crossing (or level-crossing) contours than when the sign-bits are added. \n\nIt seems, therefore, that a successful recognition scheme should be able to cope \nwith changes in illumination conditions, as well as changes in viewing positions, by \nworking with a richer source of information than just contours (for a different point \nof view, see [1]). The minimal information that seems to be sufficient, at least for \ncoping with the photometric problem, is the sign-bit image. \n\nThe approach to visual recognition in this study is in line with the 'alignment' \napproach [9] and is also inspired by the work of Ullman and Basri [10] who show that \nthe geometric source of variability can be handled by matching the novel projection \nto a linear combination of a small number of previously seen projections of that \nobject. A recognition scheme that can handle both the geometric and photometric \nsources of variability is suggested by introducing three new results: (i) any image of a \nsurface with a linear reflectance function (including Lambertian and Phong's model \nwithout point specularities) can be expressed as a linear combination of a fixed \nset of three images of that surface taken under different illumination conditions, \n(ii) from a computational standpoint, the coefficients are better recovered using the \nsign-bit image rather than the contour image, and (iii) one can compensate for both \nchanges in viewing position and illumination conditions by using combinations of \nimages taken from different viewing positions and different illumination conditions. 
\n\n2 Linear Combination of Images \n\nWe start by assuming that view position is fixed and the only parameters that are \nallowed to change are the positions and distribution of light sources. The more general \nresult that includes changes in viewing positions will be discussed in section 4. \n\nProposition 1 All possible images of a surface, with a linear reflectance function, \ngenerated by all possible illumination conditions (positions and distribution of light \nsources) are spanned by a linear combination of images of the surface taken from \nindependent illumination conditions. \n\nProof: Follows directly from the general result that if $f_j(x)$, $x \\in R^k$, $j = 1, \\ldots, k$, \nare k linear functions, which are also linearly independent, then for any linear \nfunction $f(x)$, we have that $f(x) = \\sum_j a_j f_j(x)$, for some constants $a_j$. □ \nThe simplest case for which this result holds is the Lambertian reflectance model \nunder a point light source (observed independently by Yael Moses, personal \ncommunication). Let r be an object point projecting to p. Let $n_r$ represent the normal \nand albedo at r (direction and magnitude), and s represent the light source and \nits intensity. The brightness at p under the Lambertian model is $I(p) = n_r \\cdot s$, \nand because s is fixed for all points p and can be written as $s = a_1 s_1 + a_2 s_2 + a_3 s_3$, \nwe have $I(p) = a_1 I_1(p) + a_2 I_2(p) + a_3 I_3(p)$ \nwhere $I_j(p)$ is the brightness under light source $s_j$ and where $s_1, s_2, s_3$ are linearly \nindependent. This result generalizes, in a straightforward manner, to the case of \nmultiple light sources as well. \n\nThe Lambertian model is suitable for matte surfaces, i.e. surfaces that diffusely \nreflect incoming light rays. 
One can add a 'shininess' component to account for \nthe fact that for non-ideal Lambertian surfaces, more light is reflected in a direction \nmaking an equal angle of incidence with reflectance. In Phong's model of \nreflectance [7] this takes the form of $(n_r \\cdot h)^c$ where h is the bisector of s and \nthe viewer's direction v. The power constant c controls the degree of sharpness of \nthe point specularity, therefore outside that region one can use a linear version of \nPhong's model by replacing the power constant with a multiplicative constant, to \nget the following function: $I(p) = n_r \\cdot [s + \\rho(v + s)]$. As before, the bracketed vector \nis fixed for all image points and therefore the linear combination result holds. \n\nThe linear combination result suggests therefore that changes in illumination can \nbe compensated for, prior to recognition, by selecting three points (that are visible \nto $s, s_1, s_2, s_3$) to solve for $a_1, a_2, a_3$ and then match the novel image I with \n$I' = \\sum_j a_j I_j$. The two images should match along all points p whose object points r are \nvisible to $s_1, s_2, s_3$ (even if $n_r \\cdot s < 0$, i.e. p is attached-shadowed); approximately \nmatch along points for which $n_r \\cdot s_j < 0$, for some j ($I_j(p)$ is truncated to zero, \ngeometrically s is projected onto the subspace spanned by the remaining basis light \nsources); and not match along points that are cast-shadowed in I ($n_r \\cdot s > 0$ but \nr is not visible to s because of self occlusion). Coping with cast-shadows is an \nimportant task, but is not in the scope of this paper. \n\nFigure 2: Linear combination of model images taken from the same viewing position \nand under different illumination conditions. 
Row 1,2: Three model images taken under \na varying point light source, and the input image, and their brightness edges. Row 3: \nThe image generated by the linear combination of the model images, its edges, and the \ndifference edge image between the input and generated image. \n\nThe linear combination result also implies that, for the purposes of recognition, one \ndoes not need to recover shape or light source direction in order to compensate for \nchanges in brightness distribution and attached shadows. Experimental results, on \na non-ideal Lambertian surface, are shown in Fig. 2. \n\n3 Coefficients from Contours and Sign-bits \n\nMooney pictures, such as in Fig. 1, demonstrate that humans can cope well with \nsituations of varying illumination by using only limited information from the input \nimage, namely the sign-bits, yet are not able to do so from contours alone. This \nobservation can be predicted from a computational standpoint, as shown below. \n\nFigure 3: Compensating for both changes in view and illumination. Row 1: Three model \nimages, one of which is taken from a different viewing direction (23° apart), and the input \nimage from a novel viewing direction (in between the model images) and illumination \ncondition. Row 2: difference image between the edges of the input image (shown separately \nin Fig. 4) and the edges of the view transformed first model image (first row, lefthand), \nthe final generated image (linear combination of the three transformed model images), its \nedges, and the difference image between edges of input and generated image. \n\nProposition 2 The coefficients that span an image I by the basis of three other \nimages, as described in proposition 1, can be solved, up to a common scale factor, \nfrom just the contours of I (zero-crossings or level-crossings). \nProof: Let $a_j$ be the coefficients that span I by the basis images $I_j$, $j = 1, 2, 3$, i.e. 
\n$I = \\sum_j a_j I_j$. Let $\\hat{I}$, $\\hat{I}_j$ be the result of applying a Difference of Gaussians (DOG) \noperator, with the same scale, on images I, $I_j$, $j = 1, 2, 3$. Since DOG is a linear \noperator we have that $\\hat{I} = \\sum_j a_j \\hat{I}_j$. Since $\\hat{I}(p) = 0$ along zero-crossing points p of \nI, then by taking any three zero-crossing points, which are not on a cast-shadow \nborder, we get a homogeneous set of equations from which $a_j$ can be solved up to \na common scale factor. \n\nSimilarly, let k be an unknown threshold applied to I. Therefore, along level crossings \nof I we have $k = \\sum_j a_j I_j$, hence 4 level-crossing points, that are visible to all \nfour light sources, are sufficient to solve for $a_j$ and k. □ \nThis result is in accordance with what is known from image compression literature \nof reconstructing an image, up to a scale factor, from contours alone [2]. In both \ncases, here and in image compression, this result may be difficult to apply in practice \nbecause the contours are required to be given at sub-pixel accuracy. One can relax \nthe accuracy requirement by using the gradients along the contours, a technique \nthat works well in practice. Nevertheless, neither gradients nor contours at sub-pixel \naccuracy are provided by Mooney pictures, which leaves us with the sign-bits \nas the source of information for solving for the coefficients. \n\nFigure 4: Compensating for changes in viewing position and illumination from a single \nview (model images are all from a single viewing position). Model images are the same \nas in Fig. 2, input image the same as in Fig. 3. 
Row 1: edges of input image, overlay \nof input edge image and edges of first model image, overlay with edges of the 2D affine \ntransformed first model image, sign-bit input image with marked 'example' locations (16 \nof them). Row 2: linear combination image of the 2D affine transformed model images, \nthe final generated image, its edges, overlay with edges of the input image. \n\nProposition 3 Solving for the coefficients from the sign-bit image of I is equivalent \nto solving for a separating hyperplane in 3D in which image points serve as \n'examples'. \nProof: Let $z(p) = (I_1, I_2, I_3)^T$ be a vector function and $w = (a_1, a_2, a_3)^T$ be the \nunknown weight vector. Given the sign-bit image of I, we have that for every \npoint p, excluding zero-crossings, the scalar product $w^T z(p)$ is either positive or \nnegative. In this respect, one can consider points in the sign-bit image as 'examples' in 3D space \nand the coefficients $a_j$ as a vector normal to the separating hyperplane. □ \nA similar result can be obtained for the case of a thresholded image. The separating \nhyperplane in that case is defined in 4D, rather than 3D. Many schemes for finding a \nseparating hyperplane have been described in Neural Network literature (see [5] for \nreview) and in Discriminant Analysis literature ([3], for example). Experimental \nresults in the next section show that 10-20 points, distributed over the \nentire object, are sufficient to produce results that are indistinguishable from those \nobtained from an exact solution. \n\nBy using the sign-bits instead of the zero-crossing contours we are trading a unique \n(up to a scale factor), but unstable, solution for an approximate, but stable, one. 
\nAlso, by taking the sample points relatively far away from the contours (in order to \nminimize the chance of error) the scheme can tolerate a certain degree of misalignment \nbetween the basis images and the novel image. This property will be used \nin one of the schemes, described below, for combining changes of viewing positions \nand illumination conditions. \n\n4 Changing Illumination and Viewing Positions \n\nIn this section, the recognition scheme is generalized to cope with both changes in \nillumination and viewing positions. Namely, given a set of images of an object as \na model and an input image viewed from a novel viewing position and taken under \na novel illumination condition, we would like to generate an image, from the model, \nthat is similar to the input image. \n\nProposition 4 Any set of three images, satisfying conditions of proposition 1, of \nan object can be used to compensate for both changes in view and illumination. \n\nProof: Any change in viewing position will induce both a change in the location \nof points in the image, and a change in their brightness (because of change in \nviewing angle and change in angle between light source and surface normal). From \nproposition 1, the change in brightness can be compensated for provided all the \nimages are in alignment. What remains, therefore, is to bring the model images \nand the input image into alignment. \nCase 1: If each of the three model images is viewed from a different position, then \nthe remaining proof follows directly from the result of Ullman and Basri [10] who \nshow that any view of an object with smooth boundaries, undergoing any affine \ntransformation in space, is spanned by three views of the object. 
\n\nCase 2: If only two of the model images are viewed from different positions, then \ngiven full correspondence between all points in the two model views and 4 corresponding \npoints with the input image, we can transform all three model images \nto align with the input image in the following way. The 4 corresponding points \nbetween the input image and one of the model images define three corresponding \nvectors (taking one of the corresponding points, say o, as an origin) from which a 2D \naffine transformation, matrix A and vector w, can be recovered. The result, proved \nin [8], is that for every point p' in the input image that is in correspondence with p \nin the model image we have that $p' = [Ap + o' - Ao] + \\alpha_p w$. The parameter $\\alpha_p$ is \ninvariant to any affine transformation in space, therefore is also invariant to changes \nin viewing position. One can, therefore, recover $\\alpha_p$ from the known correspondence \nbetween two model images and use that to predict the location p'. It can be shown \nthat this scheme also provides a good approximation in the case of objects with \nsmooth boundaries (like an egg or a human head, for details see [8]). \n\nCase 3: All three model images are from the same viewing position. The model \nimages are first brought into 'rough alignment' (term adopted from [10]) with the \ninput image by applying the transformation $Ap + o' - Ao + w$ to all points p in each \nmodel image. The remaining displacement between the transformed model images \nand the input image is $(\\alpha_p - 1)w$ which can be shown to be bounded by the depth \nvariation of the surface [8]. (In case the object is not sufficiently flat, more than \n4 points may be used to define local transformations via a triangulation of those \npoints.) 
The linear combination coefficients are then recovered using the sign-bit \nscheme described in the previous section. The three transformed images are then \nlinearly combined to create a new image that is compensated for illumination but \nis still displaced from the input image. The displacement can be recovered by using \na brightness correlation scheme along the direction w to find $\\alpha_p - 1$ for each point \np (for details, see [8]). □ \nExperimental results of the last two schemes are shown in Figs. 3 and 4. The four \ncorresponding points, required for view compensation, were chosen manually along \nthe tip of eyes, eyebrow and mouth of the doll. The full correspondence that is \nrequired between the third model view and the other two in scheme 2 above was \nestablished by first taking two pictures of the third view, one from a novel illumination \ncondition and the other from a similar illumination condition to one of the other \nmodel images. Correspondence was then determined by using the scheme described \nin [8]. The extra picture was then discarded. The sample points for the linear \ncombination were chosen automatically by selecting 10 points in smooth brightness \nregions. The sample points using the sign-bit scheme were chosen manually. \n\n5 Summary \n\nIt has been shown that the effects of photometry and geometry in visual recognition \ncan be decoupled and compensated for prior to recognition. Three new results were \nshown: (i) photometric effects can be compensated for using a linear combination \nof images, (ii) from a computational standpoint, contours alone are not sufficient \nfor recognition, and (iii) geometrical effects can be compensated for from any set of \nthree images, from different illuminations, of the object. 
\n\nAcknowledgments \n\nI thank Shimon Ullman for his advice and support. Thanks to Ronen Basri, Tomaso \nPoggio, Whitman Richards and Daphna Weinshall for many discussions. A.S. is \nsupported by NSF grant IRI-8900267. \n\nReferences \n\n[1] Cavanagh, P. Proc. 19th ECVP, Andrei, G. (Ed.), 1990. \n[2] Curtis, S.R. and Oppenheim, A.V. in Whitman, R. and Ullman, S. (eds.) Image Understanding \n1989. pp. 92-110, Ablex, NJ 1990. \n[3] Duda, R.O. and Hart, P.E. Pattern Classification and Scene Analysis. NY, Wiley 1973. \n[4] Edelman, S. and Poggio, T. Massachusetts Institute of Technology, A.I. Memo 1181, \n1990. \n[5] Lippmann, R.P. IEEE ASSP Magazine, pp. 4-22, 1987. \n[6] Mooney, C.M. Can. J. Psychol. 11:219-226, 1957. \n[7] Phong, B.T. Comm. ACM, 18, 6:311-317, 1975. \n[8] Shashua, A. Massachusetts Institute of Technology, A.I. Memo 1927, 1991. \n[9] Ullman, S. Cognition, 32:193-254, 1989. \n[10] Ullman, S. and Basri, R. Massachusetts Institute of Technology, A.I. Memo 1052, 1989. \n", "award": [], "sourceid": 463, "authors": [{"given_name": "Amnon", "family_name": "Shashua", "institution": null}]}