{"title": "SEEMORE: A View-Based Approach to 3-D Object Recognition Using Multiple Visual Cues", "book": "Advances in Neural Information Processing Systems", "page_first": 865, "page_last": 871, "abstract": null, "full_text": "SEEMORE: A View-Based Approach to 3-D Object Recognition Using Multiple Visual Cues\n\nBartlett W. Mel\n\nDepartment of Biomedical Engineering\n\nUniversity of Southern California\n\nLos Angeles, CA 90089\n\nmel@quake.usc.edu\n\nAbstract\n\nA neurally-inspired visual object recognition system called SEEMORE is described, whose goal is to identify common objects from a large known set, independent of 3-D viewing angle, distance, and non-rigid distortion. SEEMORE's database consists of 100 objects that are rigid (shovel), non-rigid (telephone cord), articulated (book), statistical (shrubbery), and complex (photographs of scenes). Recognition results were obtained using a set of 102 color and shape feature channels within a simple feedforward network architecture. In response to a test set of 600 novel test views (6 of each object) presented individually in color video images, SEEMORE identified the object correctly 97% of the time (chance is 1%) using a nearest neighbor classifier. Similar levels of performance were obtained for the subset of 15 non-rigid objects. Generalization behavior reveals emergence of striking natural category structure not explicit in the input feature dimensions.\n\n1 INTRODUCTION\n\nIn natural contexts, visual object recognition in humans is remarkably fast, reliable, and viewpoint invariant. The present approach to object recognition is \"view-based\" (e.g. see [Edelman and Bulthoff, 1992]), and has been guided by three main dogmas. 
\n\nFirst, the \"natural\" object recognition problem faced by visual animals involves a large number of objects and scenes, extensive visual experience, and no artificial distinctions among object classes, such as rigid, non-rigid, articulated, etc.\n\nSecond, when an object is recognized in the brain, the \"heavy lifting\" is done by the first wave of action potentials coursing from the retina to the inferotemporal cortex (IT) over a period of 100 ms [Oram and Perrett, 1992]. The computations carried out during this time can be modeled as a shallow but very wide feedforward network of simple image filtering operations. Shallow means few processing levels; wide means a sparse, high-dimensional representation combining cues from multiple visual submodalities, such as color, texture, and contour [Tanaka et al., 1991].\n\nThird, more complicated processing mechanisms, such as those involving focal attention, segmentation, binding, normalization, mental rotation, dynamic links, parts recognition, etc., may exist and may enhance recognition performance but are not necessary to explain rapid, robust recognition with objects in normal visual situations.\n\nIn this vein, the main goal of this project has been to explore the limits of performance of a shallow, but very wide, feedforward network of simple filtering operations for viewpoint-invariant 3-D object recognition, where the filter \"channels\" themselves have been loosely modeled after the shape- and color-sensitive visual response properties seen in the higher levels of the primate visual system [Tanaka et al., 1991]. Architecturally similar approaches to vision have been most often applied in the domain of optical character recognition [Fukushima et al., 1983, Le Cun et al., 1990]. 
SEEMORE's architecture is also similar in spirit to the color histogramming approach of [Swain and Ballard, 1991], but includes spatially-structured features that also provide for shape-based generalization.\n\nFigure 1: The database includes 100 objects of many different types, including rigid (soup can), non-rigid (necktie), statistical (bunch of grapes), and photographs of complex indoor and outdoor scenes.\n\n2 SEEMORE'S VISUAL WORLD\n\nSEEMORE's database contains 100 common 3-D objects and photographs of scenes, each represented by a set of pre-segmented color video images (fig. 1). The training set consisted of 12-36 views of each object as follows. For rigid objects, 12 training views were chosen at roughly 60\u00b0 intervals in depth around the viewing sphere, and each view was then scaled to yield a total of three images at 67%, 100%, and 150% scale, giving 36 training images per rigid object. Image plane orientation was allowed to vary arbitrarily. For non-rigid objects, 12 training views were chosen in random poses.\n\nDuring a recognition trial, SEEMORE was required to identify novel test images of the database objects. For rigid objects, test images were drawn from the viewpoint interstices of the training set, excluding highly foreshortened views (e.g. bottom of can). Each test view could therefore be presumed to be correctly recognizable, but never closer than roughly 30\u00b0 in orientation in depth or 22% in scale to the nearest training view of the object, while position and orientation in the image plane could vary arbitrarily. For non-rigid objects, test images consisted of novel random poses. Each test view depicted the isolated object on a smooth background. 
\n\n2.1 FEATURE CHANNELS\n\nSEEMORE's internal representation of a view of an object is encoded by a set of feature channels. The ith channel is based on an elemental nonlinear filter $f_i(x, y, \\theta_1, \\theta_2, \\ldots)$, parameterized by position in the visual field and zero or more internal degrees of freedom. Each channel is by design relatively sensitive to changes in the image that are strongly related to object identity, such as the object's shape, color, or texture, while remaining relatively insensitive to changes in the image that are unrelated to object identity, such as are caused by changes in the object's pose. In practice, this invariance is achieved in a straightforward way for each channel by subsampling and summing the output of the elemental channel filter over the entire visual field and one or more of its internal degrees of freedom, giving a channel output $F_i = \\sum_{x, y, \\theta_1, \\ldots} f_i(\\cdot)$. For example, a particular shape-sensitive channel might \"look\" for the image-plane projections of right-angle corners, over the entire visual field, 360\u00b0 of rotation in the image plane, 30\u00b0 of rotation in depth, one octave in scale, and tolerating partial occlusion and/or slight misorientation of the elemental contours that define the right angle. In general, then, $F_i$ may be viewed as a \"cell\" with a large receptive field whose output is an estimate of the number of occurrences of distal feature i in the workspace over a large range of viewing parameters.\n\nSEEMORE's architecture consists of 102 feature channels, whose outputs form an input vector to a nearest-neighbor classifier. Following the design of the individual channels, the channel vector F = {F1, ... 
F102} is (1) insensitive to changes in image plane position and orientation of the object, (2) modestly sensitive to changes in object scale, orientation in depth, or non-rigid deformation, but (3) highly sensitive to object \"quality\" as pertains to object identity. Within this representation, total memory storage for all views of an object ranged from 1,224 to 3,672 integers (102 channel values for each of 12 to 36 training views).\n\nAs shown in fig. 2, SEEMORE's channels fall into five groups: (1) 23 color channels, each of which responds to a small blob of color parameterized by \"best\" hue and saturation, (2) 11 coarse-scale intensity corner channels parameterized by open angle, (3) 12 \"blob\" features, parameterized by the shape (round and elongated) and size (small, medium, and large) of bright and dark intensity blobs, (4) 24 contour shape features, including straight angles, curve segments of varying radius, and parallel and oblique line combinations, and (5) 16 shape/texture-related features based on the outputs of Gabor functions at 5 scales and 8 orientations. The implementations of the channel groups were crude, in the interests of achieving a working, multiple-cue system with minimal development time. Images were grabbed using an off-the-shelf Sony S-Video Camcorder and Sun Video digitizing board.\n\n[Figure 2 panels: Colors; Angles; Blobs; Contours; Gabor-Based Features (energy at scale i, energy variance at scale i)]\n\nFigure 2: SEEMORE's 102 channels fall into 5 groups, sensitive to (1) colors, (2) intensity corners, (3) circular and elongated intensity blobs, (4) contour shape features, and (5) 16 oriented-energy and relative-orientation features based on the outputs of Gabor functions at several scales and orientations.\n\n3 RECOGNITION\n\nSEEMORE's recognition performance was assessed quantitatively as follows. A test set consisting of 600 novel views (100 objects x 6 views) was culled from the database, and presented to SEEMORE for identification. It was noted empirically that a compressive transform on the feature dimensions (histogram values) led to improved classification performance; prior to all learning and recognition operations, therefore, each feature value was replaced by its natural logarithm (0 values were first replaced with a small positive constant to prevent the logarithm from blowing up). For each test view, the city-block distance was computed to every training view in the database and the nearest neighbor was chosen as the best match. The log transform of the feature dimensions thus tied this distance to the ratios of individual feature values in two images rather than their differences.\n\nFigure 3: Generalization using only shape-related channels. In each row, a novel test view is shown at far left. The sequence of best matching training views (one per object) is shown to the right, in order of decreasing similarity.\n\n4 RESULTS\n\nRecognition time on a Sparc-20 was 1-2 minutes per view; the bulk of the time was devoted to shape processing, with under 2 seconds required for matching. 
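\n\nThe effect of the log transform on the matching rule of Section 3 can be made explicit with a short worked equation. This is a sketch under assumptions rather than the exact implementation: the symbols $G^{(k)}$ (the stored channel vector of training view $k$), $\\epsilon$ (the small positive constant substituted for zero values), and $d$ are notation introduced here for illustration only. Let $\\tilde{F}_i = \\ln F_i$ for $F_i > 0$ and $\\tilde{F}_i = \\ln \\epsilon$ for $F_i = 0$, and likewise for $\\tilde{G}^{(k)}_i$. The city-block distance between a test view and training view $k$ is then\n\n$$d(F, G^{(k)}) = \\sum_{i=1}^{102} \\left| \\tilde{F}_i - \\tilde{G}^{(k)}_i \\right| = \\sum_{i=1}^{102} \\left| \\ln \\frac{F_i}{G^{(k)}_i} \\right| \\quad \\text{(second form valid when all values are nonzero)},$$\n\nand the best match is the training view $k^* = \\arg\\min_k d(F, G^{(k)})$. For example, channel values of 20 versus 10 and of 200 versus 100 each contribute $\\ln 2 \\approx 0.693$ to the distance, whereas their raw differences (10 and 100) would differ tenfold. 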
\n\nRecognition results are reported as the proportion of test views that were correctly classified. Performance using all 102 channels for the 600 novel object views in the intact test set was 96.7%; the chance rate of correct classification was 1%. Across recognition conditions, second-best matches usually accounted for approximately half the errors. Results were broken down in terms of the separate contributions to recognition performance of color-related vs. shape-related feature channels. Performance using only the 23 color-related channels was 87.3%, and using only the 79 shape-related channels was 79.7%. Remarkably, very similar performance figures were obtained for the subset of 90 test views of the non-rigid objects, which included several scarves, a bike chain, necklace, belt, sock, necktie, maple-leaf cluster, bunch of grapes, knit bag, and telephone cord. Thus, a novel random configuration of a telephone cord was as easily recognized as a novel view of a shovel.\n\n5 GENERALIZATION BEHAVIOR\n\nNumerical indices of recognition performance are useful, but do not explicitly convey the similarity structure of the underlying feature space. A more qualitative but extremely informative representation of system performance lies in the sequence of images in order of increasing distance from a test view. Records of this kind are shown in fig. 3 for trials in which only shape-related channels were used. In each, a test view is shown at the far left, and the ordered set of nearest neighbors is shown to the right. When a test view's nearest neighbor (second image from left) was not the correct match, the trial was classified as an error. 
\n\nAs shown in row (1), a view of a book is judged most similar to a series of other books (or the bottom of a rectangular cardboard box), each a view of a rectangular object with high-frequency surface markings. A similar sequence can be seen in subsequent rows for (2) a series of cans, each a right cylinder with detailed surface markings, (3) a series of smooth, not-quite-round objects, (4) a series of photographs of complex scenes, and (5) a series of dinosaurs (followed by a teddy bear). In certain cases, SEEMORE's shape-related similarity metric was more difficult to visually interpret or verbalize (last two rows), or was different from that of a human observer.\n\n6 DISCUSSION\n\nThe ecology of natural object vision gives rise to an apparent contradiction: (i) generalization in shape-space must in some cases permit an object whose global shape has been grossly perturbed to be matched to itself, such as the various tangled forms of a telephone cord, but (ii) quasi-rigid basic-level shape categories (e.g. chair, shoe, tree) must be preserved as well, and distinguished from each other.\n\nA partial solution to this conundrum lies in the observation that locally-computed shape statistics are in large part preserved under the global shape deformations that non-rigid common objects (e.g. scarf, bike-chain) typically undergo. A feature-space representation with an emphasis on locally-derived shape channels will therefore exhibit a significant degree of invariance to global nonrigid shape deformations. The definition of shape similarity embodied in the present approach is that two objects are similar if they contain similar profiles (histograms) of their shape measures, which emphasize locality. 
One way of understanding the emergence of global shape categories, then, such as \"book\", \"can\", \"dinosaur\", etc., is to view each as a set of instances of a single canonical object whose local shape statistics remain quasi-stable as it is warped into various global forms. In many cases, particularly within rigid object categories, exemplars may share longer-range shape statistics as well.\n\nIt is useful to consider one further aspect of SEEMORE's shape representation, pertaining to an apparent mismatch between the simplicity of the shape-related feature channels and the complexity of the shape categories that can emerge from them. Specifically, the order of binding of spatial relations within SEEMORE's shape channels is relatively low, i.e. consisting of single simple open or closed curves, or conjunctions of two oriented contours or Gabor patches. The fact that shape categories, such as \"photographs of rooms\", or \"smooth lumpy objects\", cluster together in a feature space of such low binding order would therefore at first seem surprising. This phenomenon relates closely to the notion of \"wickelfeatures\" (see [Rumelhart and McClelland, 1986], ch. 18), in which features (relating to phonemes) that bind spatial information only locally are nonetheless used to represent global patterns (words) with little or no residual ambiguity.\n\nThe pre-segmentation of objects is a simplifying assumption that is clearly invalid in the real world. The advantage of the assumption from a methodological perspective is that the object similarity structure induced by the feature dimensions can be studied independently from the problem of segmenting or indexing objects embedded in complex scenes. 
In continuing work, we are pursuing a leap to sparse very-high-dimensional space (e.g. 10,000 dimensions), whose advantages for classification in the presence of noise (or clutter) have been discussed elsewhere [Kanerva, 1988, Califano and Mohan, 1994].\n\nAcknowledgements\n\nThanks to J\u00f3zsef Fiser for useful discussions and for development of the Gabor-based channel set, to Dan Lipofsky and Scott Dewinter for helping in the construction of the image database, and to Christof Koch for providing support at Caltech where this work was initiated. This work was funded by the Office of Naval Research, and the McDonnell-Pew Foundation.\n\nReferences\n\n[Califano and Mohan, 1994] Califano, A. and Mohan, R. (1994). Multidimensional indexing for recognizing visual shapes. IEEE Trans. on PAMI, 16:373-392.\n\n[Edelman and Bulthoff, 1992] Edelman, S. and Bulthoff, H. (1992). Orientation dependence in the recognition of familiar and novel views of three-dimensional objects. Vision Res., 32:2385-2400.\n\n[Fukushima et al., 1983] Fukushima, K., Miyake, S., and Ito, T. (1983). Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Trans. Sys. Man & Cybernetics, SMC-13:826-834.\n\n[Kanerva, 1988] Kanerva, P. (1988). Sparse distributed memory. MIT Press, Cambridge, MA.\n\n[Le Cun et al., 1990] Le Cun, Y., Matan, O., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., Jackel, L., and Baird, H. (1990). Handwritten zip code recognition with multilayer networks. In Proc. of the 10th Int. Conf. on Patt. Rec. IEEE Computer Science Press.\n\n[Oram and Perrett, 1992] Oram, M. and Perrett, D. (1992). Time course of neural responses discriminating different views of the face and head. J. 
Neurophysiol., 68(1):70-84.\n\n[Rumelhart and McClelland, 1986] Rumelhart, D. and McClelland, J. (1986). Parallel distributed processing. MIT Press, Cambridge, Massachusetts.\n\n[Swain and Ballard, 1991] Swain, M. and Ballard, D. (1991). Color indexing. Int. J. Computer Vision, 7:11-32.\n\n[Tanaka et al., 1991] Tanaka, K., Saito, H., Fukada, Y., and Moriya, M. (1991). Coding visual images of objects in the inferotemporal cortex of the macaque monkey. J. Neurophysiol., 66:170-189.\n", "award": [], "sourceid": 1045, "authors": [{"given_name": "Bartlett", "family_name": "Mel", "institution": null}]}