{"title": "TRAFFIC: Recognizing Objects Using Hierarchical Reference Frame Transformations", "book": "Advances in Neural Information Processing Systems", "page_first": 266, "page_last": 273, "abstract": null, "full_text": "266\n\nZemel, Mozer and Hinton\n\nTRAFFIC: Recognizing Objects Using\nHierarchical Reference Frame Transformations\n\nRichard S. Zemel\nComputer Science Dept.\nUniversity of Toronto\nToronto, ONT M5S 1A4\n\nMichael C. Mozer\nComputer Science Dept.\nUniversity of Colorado\nBoulder, CO 80309-0430\n\nGeoffrey E. Hinton\nComputer Science Dept.\nUniversity of Toronto\nToronto, ONT M5S 1A4\n\nABSTRACT\n\nWe describe a model that can recognize two-dimensional shapes in\nan unsegmented image, independent of their orientation, position,\nand scale. The model, called TRAFFIC, efficiently represents the\nstructural relation between an object and each of its component\nfeatures by encoding the fixed viewpoint-invariant transformation\nfrom the feature's reference frame to the object's in the weights of a\nconnectionist network. Using a hierarchy of such transformations,\nwith increasing complexity of features at each successive layer, the\nnetwork can recognize multiple objects in parallel. An implementation\nof TRAFFIC is described, along with experimental results\ndemonstrating the network's ability to recognize constellations of\nstars in a viewpoint-invariant manner.\n\n1 INTRODUCTION\n\nA key goal of machine vision is to recognize familiar objects in an unsegmented\nimage, independent of their orientation, position, and scale. Massively parallel\nmodels have long been used for lower-level vision tasks, such as primitive feature\nextraction and stereo depth. 
Models addressing \"higher-level\" vision have generally\nbeen restricted to pattern matching types of problems, in which much of the inherent\ncomplexity of the domain has been eliminated or ignored.\n\nThe complexity of object recognition stems primarily from the difficult search required\nto find the correspondence between features of candidate objects and image\nfeatures. Images contain spurious features, which do not correspond to any object\nfeatures; objects in an image may have missing or occluded features; and noisy\nmeasurements make it impossible to align object features to image features exactly.\nThese problems are compounded in realistic domains, where images are not\nsegmented and normalized and the number of candidate objects is large.\n\nIn this paper, we present a structured, general model of object recognition - called\nTRAFFIC (a loose acronym for \"transforming feature instances\") - that addresses\nthese difficult problems through a combination of strategies. First, we build\nconstraints on the spatial relationships between features of an object directly into\nthe architecture of a connectionist network. We thereby limit the space of possible\nmatches by constructing only plausible assignments of image features to objects.\nSecond, we embed this construction into a hierarchical architecture, which allows\nthe network to handle unsegmented, non-normalized images, and also allows for a\nwide range of candidate objects. Third, we allow TRAFFIC to discover the critical\nspatial relationships among features through training on examples of the target\nobjects in various poses.\n\n2 MODEL HIGHLIGHTS\n\nThe following sections outline the three fundamental aspects of TRAFFIC. 
For a\nmore complete discussion of the details of TRAFFIC, see (Zemel, 1989).\n\n2.1 ENCODING STRUCTURAL RELATIONS\n\nThe first key aspect of TRAFFIC concerns its encoding and use of the fixed spatial\nrelations between a rigid object and each of its component features. If we assume\nthat each feature has an intrinsic reference frame, then for a rigid object and a\nparticular feature of that object, there is a fixed viewpoint-independent transformation\nfrom the feature's reference frame to the object's. This transformation can\nbe used to predict the object's reference frame from the feature's. To recognize\nobjects, TRAFFIC takes advantage of the fact that all features of the same object\nwill predict the identical reference frame for that object (the \"viewpoint consistency\nconstraint\" (Lowe, 1987)).\n\nEach reference frame transformation can be expressed as a matrix multiplication\nthat is efficiently implemented in a connectionist network. Consider a two-layer\nnetwork, with one layer containing units representing particular features, the other\ncontaining units representing objects. For two-dimensional shapes, each feature is\ndescribed by a set of four instantiation units. These real-valued units represent\nthe parameter values associated with the feature: (x,y)-position, orientation, and\nscale. The objects have a set of instantiation units as well. The units representing\nparticular features are connected to the units representing each object containing\nthat feature, thereby assigning each feature-object pair its own set of weighted\nconnections. 
The fixed matrix that describes the transformation from the feature's\nintrinsic reference frame to the object's can be directly implemented in the set of\nweights connecting the instantiation units of the feature and the object.\n\nWe can describe any instantiation, or any transformation between instantiations, as\na vector of four parameters. Let $P_{if} = (x_{if}, y_{if}, c_{if}, s_{if})$ specify the reference frame\nof the feature with respect to the image, where $x_{if}$ and $y_{if}$ represent the coordinates\nof the feature origin relative to the image frame, and $c_{if}$ and $s_{if}$ represent the scale and\nangle of the feature frame w.r.t. the image frame. Rather than encoding these\nvalues directly, $c_{if}$ represents the product of the scale and the cosine of the angle,\nwhile $s_{if}$ represents the product of the scale and the sine of the angle.[1] Let\n$P_{io} = (x_{io}, y_{io}, c_{io}, s_{io})$ specify the reference frame of the object with respect to the\nimage. Finally, let $P_{fo} = (x_{fo}, y_{fo}, c_{fo}, s_{fo})$ specify the transformation from the\nreference frame of the object to that of the feature.\n\nEach of these sets of parameters can be placed into a transformation matrix which\nconverts points in one reference frame to points in another. We can express $P_{if}$ as\nthe matrix $I_{if}$, a transformation from the feature frame to the image frame:\n\n$$I_{if} = \\begin{bmatrix} c_{if} & s_{if} & x_{if} \\\\ -s_{if} & c_{if} & y_{if} \\\\ 0 & 0 & 1 \\end{bmatrix}$$\n\nLikewise, we can express $P_{fo}$ as the matrix $T_{fo}$, a transformation from the object\nto feature frame, and $P_{io}$ as $I_{io}$, a transformation from the object to image frame.\nBecause $T_{fo}$ is fixed for a given feature-object pair and $I_{if}$ is derived from the image,\n$I_{io}$ can easily be computed by composing these two transforms: $I_{io} = I_{if} T_{fo}$. 
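\n\nAs a hedged illustration of this composition (the numbers are invented for this example, and the matrix layout is an assumption chosen to agree with the parameter equations for $P_{io}$), write each parameter vector $(x, y, c, s)$ as a homogeneous matrix with rows $(c, s, x)$, $(-s, c, y)$ and $(0, 0, 1)$. Take a feature seen at scale 2 and angle 90 degrees, so that $c_{if} = 0$ and $s_{if} = 2$, with origin $(x_{if}, y_{if}) = (3, 1)$, and a stored transformation with $c_{fo} = 1$, $s_{fo} = 0$ and $(x_{fo}, y_{fo}) = (1, 2)$:\n\n```latex\n\\[\nI_{io} = I_{if} T_{fo} =\n\\begin{bmatrix} 0 & 2 & 3 \\\\ -2 & 0 & 1 \\\\ 0 & 0 & 1 \\end{bmatrix}\n\\begin{bmatrix} 1 & 0 & 1 \\\\ 0 & 1 & 2 \\\\ 0 & 0 & 1 \\end{bmatrix}\n=\n\\begin{bmatrix} 0 & 2 & 7 \\\\ -2 & 0 & -1 \\\\ 0 & 0 & 1 \\end{bmatrix}\n\\]\n```\n\nReading the parameters off the product gives $P_{io} = (7, -1, 0, 2)$ in this example: the predicted object frame keeps the scale and orientation of the feature frame, since the assumed $T_{fo}$ contributes only a translation, and its origin is displaced by the rotated, scaled offset.\n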
\nThe four parameters underlying $I_{io}$ can then be extracted, which results in the\nfollowing four equations for $P_{io}$:\n\n$$\\begin{aligned}\nx_{io} &= c_{if} x_{fo} + s_{if} y_{fo} + x_{if} \\\\\ny_{io} &= -s_{if} x_{fo} + c_{if} y_{fo} + y_{if} \\\\\nc_{io} &= c_{if} c_{fo} - s_{if} s_{fo} \\\\\ns_{io} &= c_{if} s_{fo} + s_{if} c_{fo}\n\\end{aligned}$$\n\nThis transformation is easily implemented in a network by connecting the units\nrepresenting $P_{if}$ to the units representing $P_{io}$ with the appropriate weights (Figure\n1). In this manner, TRAFFIC directly encodes the reference frame transformation\nfrom a feature to an object in the connections from the set of units representing\nthe feature's reference frame to units representing the object's frame. The specification\nof an object's reference frame can therefore be derived directly from each\nof its component features on the basis of the structural relationship between the\nfeature and the object. Because each feature of an object should predict the same\nreference frame parameters for the object, we can determine whether the object is\nreally present in the image by checking to see if the various features make identical\npredictions. In Section 2.3 we discuss how the object instantiation is formed in\ncases where the object parameters predicted by the features do not agree perfectly.\n\n[1] We represent angles by their sines and cosines to avoid the discontinuities involved in representing\norientation by a single number and to eliminate the non-linear step of computing $\\sin \\theta_{if}$\nfrom $\\theta_{if}$. Note that we represent the four degrees of freedom in the instantiation parameters using\nfour units; a neurally plausible extension to this scheme which does not require single units with\narbitrary precision could allocate a pool of units to each of these parameters.\n\nFigure 1: The matrix $T_{fo}$ is a fixed coordinate transformation from the reference\nframe of feature f to the reference frame of object o. This figure shows how $T_{fo}$\ncan be built into the weights connecting the object-instantiation units and the\nfeature-instantiation units.\n\n2.2 FEATURE ABSTRACTION HIERARCHY\n\nTRAFFIC recursively extends the notion of reference frame transformations between\nfeatures and objects in a hierarchical architecture. It is impractical to hope\nthat any network will be able to directly map low-level input features to complex\nobjects. The input features must be simple enough to be easily extracted from\nimages without relying on sophisticated segmentation and interpretation. If they\nare simple, however, they will be unable to uniquely predict the object's reference\nframe, since a complex object may contain many copies of a single simple feature.\n\nTo address this problem, we adopt a hierarchical approach, introducing several\nlayers of intermediate features between the input and output layers. In each layer,\nseveral features are grouped together to form an 'object' in the layer above; this\n'object' then serves as a feature for 'objects' in the next layer. The lowest layer\ncontains simple features, such as edges and various corner types. The objects to be\nrecognized appear at the top of the hierarchy - the output layer of the network.\n\nThis composition hierarchy builds up a description of objects by selectively grouping\nsets of features, forming an increasingly abstract set of features. The power of this\nrepresentation comes in the sharing of a set of features in one layer by objects in\nthe layer above. 
\n\nTo represent multiple features of the same type simultaneously, we carve up the\nimage into spatially-contiguous regions, each allowing the representation of one\ninstance of each feature. The network can thus represent several instances of a\nfeature type simultaneously, provided they lie in different regions.\n\nWe tailor the regions to the abstraction hierarchy as follows. In the lowest layers,\nthe features are simple and numerous, so we need many regions, but with only a\nfew feature types per region. In upper layers of the hierarchy, the features become\nincreasingly complex and span a larger area of the image; the number of feature\ntypes increases and the regions become larger, while the instantiation units retain\naccurate viewpoint information. In the highest layer, there is a single region, and it\nspans the entire original image. At this level, the network can recognize and specify\nparameters for a single instance of each object it has been trained on.\n\n2.3 FORMING OBJECT HYPOTHESES\n\nThe third key aspect of TRAFFIC is its method of combining information from\nfeatures to determine both an object's reference frame and an overall estimate of\nthe likelihood that the object is actually present in the image. This likelihood,\ncalled the object's confidence, is represented by an additional unit associated with\neach object.\n\nEach feature individually predicts the object's reference frame, and TRAFFIC forms\na single vector of object instantiation parameters by averaging the predicted instantiations,\nweighted by the confidence of their corresponding features.[2] Every set of\nunits representing an object is sensitive to feature instances appearing in a fixed\narea of the image - the receptive field of the object. 
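\n\nOne way to write this confidence-weighted average, as a sketch only: the symbols $F_o$, $\\kappa_f$ and $\\hat{P}_{io}^{(f)}$ are notation introduced here for illustration rather than taken from the model description. Let $F_o$ be the set of feature instances in the receptive field of object $o$, let $\\kappa_f$ be the confidence of feature $f$, and let $\\hat{P}_{io}^{(f)}$ be the four-parameter object instantiation predicted by $f$:\n\n```latex\n\\[\nP_{io} = \\frac{\\sum_{f \\in F_o} \\kappa_f \\, \\hat{P}_{io}^{(f)}}{\\sum_{f \\in F_o} \\kappa_f}\n\\]\n```\n\nFor instance, two features with confidences 0.9 and 0.1 that predict $x_{io} = 7$ and $x_{io} = 9$ would yield $x_{io} = 0.9 \\cdot 7 + 0.1 \\cdot 9 = 7.2$, so the low-confidence prediction shifts the estimate only slightly.\n\n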
The confidence of the object\nis then a function of the confidence of the features lying in its receptive field, as\nwell as the variance of their predictions, because low variance indicates a highly\nself-consistent object instantiation.\n\nOnce the network has been defined - the regions, receptive fields, and feature\ntypes specified at each level, and the reference frame transformations encoded in\nthe weights - recognition occurs in a single bottom-up pass through the network.\nTRAFFIC accepts as input a set of simple features and a description of their pose\nin the image. At each layer in turn, the network forms many candidate object\ninstantiations from the set of feature instantiations in the layer below, and then\nsuppresses the object instantiations that are not consistently predicted by several\nof their component features. At the output level of the network, the confidence\nunit of each object describes the likelihood that that object is in the image, and its\ninstantiation units specify its pose.\n\n[2] This averaging technique contains an implicit assumption that the maximum expected deviation\nof a prediction from the actual value is a function of the number of features, and that there\nwill always be enough good values to smooth out any large deviations. We are currently exploring\nimproved methods of forming object hypotheses.\n\n3 IMPLEMENTING TRAFFIC\n\nThe domain we selected for study involves the recognition of constellations of stars.\nThis problem has several interesting properties: the image is by nature unsegmented;\nthere are many false partial matches; no bottom-up cues suggest a natural\nframe of reference; and it requires the ability to perform 2-D transformation-invariant\nrecognition. 
\n\nEach image contains the set of visible stars in a region of the sky. The input\nto TRAFFIC is a set of features that represent triples of stars in particular configurations.\nThis input is computed by first dividing the image into regions and\nextracting every combination of three stars within each region. The star triplets\n(more precisely, the inner angles of the triangles formed by the triplets) are fed\ninto an unsupervised competitive-learning network whose task is to categorize the\nconfiguration as one of a small number of types - the primitive feature types for\nthe input layer of TRAFFIC.\n\nThe architecture we implemented had an input layer, two intermediate layers, and\nan output layer.[3] Eight constellations were to be recognized, each represented by a\nsingle unit in the output layer. We used a simple unsupervised learning scheme to\ndetermine the feature types in the intermediate layers of the hierarchy, working up\nsequentially from the input layer. During an initial phase of training, the system\nsamples many regions of the sky at random, creating features at one layer corresponding\nto the frequently occurring combinations of features in the layer below.\nThis scheme forms flexible intermediate representations tailored to the domain, but\nnot hand-coded for the particular object set.\n\nThis sampling method determined the connection weights through the intermediate\nlayers of the network. Back propagation was then used to set the weights between\nthe penultimate layer and the output layer.[4] The entire network could have been\ntrained using back propagation, but the combined unsupervised-supervised learning\nmethod we used is much simpler and quicker, and worked well for this problem. 
\n\n[3] The details of the network, such as the number of regions and feature types per layer, the\nnumber of connections, etc., are discussed in (Zemel, 1989).\n\n[4] In this implementation, we used a less efficient method of encoding the transformations than\nthe method discussed in Section 2.1, but both versions perform the same transformations.\n\n4 EXPERIMENTAL RESULTS\n\nWe have run several experiments to test the main properties of the network, detailed\nfurther in (Zemel, 1989). Each image used in training and testing contained one of\nthe eight target constellations, along with other nearby stars.\n\nThe first experiment tested the basic recognition capability of the system, as well as\nits ability to learn useful connections between objects and features. The training set\nconsisted of a single view of each constellation. The second experiment examined\nthe network's ability to recognize a constellation independent of its position and\norientation in the image. We expanded the set of training images to include four\ndifferent views of each of the eight constellations, in various positions and orientations.\nThe test set contained two novel views of the eight constellations. In both\nexperiments, the network quickly (< 150 epochs) learned to identify the target object.\nLearning was slower in the second experiment, but the network performance\nwas identical for the training and testing images.\n\nThe third experiment tested the network's ability not only to recognize an instance\nof a constellation, but to correctly specify its reference frame. In most simulations,\nthe network produced a correct description of the target object instantiation across\nthe training and testing images. 
\n\nA final experiment confirmed that the network did not recognize an instance of an\nobject when the features of the object were present in the input but were not in the\ncorrect relation to one another. The confidence level of the target object decreased\nproportionately as random noise was added to the instantiation parameters of input\nfeatures. This shows that the upper layers of the network perform the important\nfunction of detecting the spatial relations of features from non-local areas of the\nimage.\n\n5 RELATED WORK\n\nTRAFFIC resembles systems based on the Hough transform (Ballard, 1981; Hinton,\n1981) in that evidence from various feature instances is combined using the\nviewpoint consistency constraint. However, while these Hough transform models\nneed a unit for every possible viewpoint of an object, TRAFFIC reduces hardware\nrequirements by using real-valued units to represent viewpoints.[5] TRAFFIC also\nresembles the approach of (Mjolsness, Gindi and Anandan, 1989), which relies on a\nlarge optimization search to simultaneously find the best set of object instantiations\nand viewpoint parameters to fit the image data. The TRAFFIC network carries\nout a similar type of search, but the limited connectivity and hierarchical architecture\nof the network constrain the search. The feature abstraction hierarchy used\nin TRAFFIC is common to many recognition systems. The pattern recognition\ntechnique known as hierarchical synthesis (Barrow, Ambler and Burstall, 1972)\nemploys a similar architecture, as do several connectionist models (Denker et al.,\n1989; Fukushima, 1980; Mozer, 1988). Each of these systems achieves position-\nand rotation-invariance by removing position information in the upper layers of the\nhierarchy. 
The TRAFFIC hierarchy, on the other hand, maintains and manipulates\naccurate viewpoint information throughout, allowing it to consider relations\nbetween features in non-local areas of the image.\n\n[5] Many other recognition systems, such as Lowe's SCERPO system (1985), represent object\nreference frame information as sets of explicit parameters.\n\n6 CONCLUSIONS AND FUTURE WORK\n\nThe experiments demonstrate that TRAFFIC is capable of recognizing a limited\nset of two-dimensional objects in a viewpoint-independent manner based on the\nstructural relations among components of the objects. We are currently testing\nthe network's ability to perform multiple-object recognition and its robustness with\nrespect to noise and occlusion. We are also currently developing a probabilistic\nframework for combining the various predictions to form the most likely object\ninstantiation hypothesis. This probabilistic framework may increase the robustness\nof the model and allow it to handle deviations from object rigidity.\n\nAnother extension to TRAFFIC we are currently exploring concerns the creation of\na pre-processing network to specify reference frame information for input features\ndirectly from a raw image. We train this network using an unsupervised learning\nmethod based on the mutual information between neighboring image patches\n(Becker and Hinton, 1989). Our aim is to apply this method to learn the mappings\nfrom features to objects throughout the network hierarchy.\n\nAcknowledgements\n\nThis research was supported by grants from the Ontario Information Technology Research Center,\ngrant 87-2-36 from the Alfred P. Sloan Foundation, and a grant from the James S. McDonnell\nFoundation to Michael Mozer.\n\nReferences\n\nBallard, D. H. (1981). 
Generalizing the Hough transform to detect arbitrary shapes. Pattern\nRecognition, 13(2):111-122.\n\nBarrow, H. G., Ambler, A. P., and Burstall, R. M. (1972). Some techniques for recognising\nstructures in pictures. In Frontiers of Pattern Recognition. Academic Press, New York, NY.\n\nBecker, S. and Hinton, G. E. (1989). Spatial coherence as an internal teacher for a neural network.\nTechnical Report CRG-TR-89-7, University of Toronto.\n\nBolles, R. C. and Cain, R. A. (1982). Recognizing and locating partially visible objects: The\nlocal-feature-focus method. International Journal of Robotics Research, 1(3):57-82.\n\nDenker, J. S., Gardner, W. L., Graf, H. P., Henderson, D., Howard, R. E., Hubbard, W.,\nJackel, L. D., Baird, H. S., and Guyon, I. (1989). Neural network recognizer for hand-written zip\ncode digits. In Touretzky, D. S., editor, Advances in neural information processing systems\nI, pages 323-331, San Mateo, CA. Morgan Kaufmann Publishers, Inc.\n\nFukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of\npattern recognition unaffected by shift in position. Biological Cybernetics, 36:193-202.\n\nHinton, G. E. (1981). A parallel computation that assigns canonical object-based frames of reference.\nIn Proceedings of the 7th International Joint Conference on Artificial Intelligence,\npages 683-685, Vancouver, BC, Canada.\n\nHuttenlocher, D. P. and Ullman, S. (1987). Object recognition using alignment. In First International\nConference on Computer Vision, pages 102-111, London, England.\n\nLowe, D. G. (1985). Perceptual Organization and Visual Recognition. Kluwer Academic Publishers,\nBoston.\n\nLowe, D. G. (1987). The viewpoint consistency constraint. International Journal of Computer\nVision, 1:57-72. 
\n\nMjolsness, E., Gindi, G., and Anandan, P. (1989). Optimization in model matching and perceptual\norganization. Neural Computation, 1:218-299.\n\nMozer, M. C. (1988). The perception of multiple objects: A parallel, distributed processing\napproach. Technical Report 8803, University of California, San Diego, Institute for Cognitive\nScience.\n\nZemel, R. S. (1989). TRAFFIC: A connectionist model of object recognition. Technical Report\nCRG-TR-89-2, University of Toronto.\n", "award": [], "sourceid": 241, "authors": [{"given_name": "Richard", "family_name": "Zemel", "institution": null}, {"given_name": "Michael", "family_name": "Mozer", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}