{"title": "Learning to Segment Images Using Dynamic Feature Binding", "book": "Advances in Neural Information Processing Systems", "page_first": 436, "page_last": 443, "abstract": null, "full_text": "Learning to Segment Images Using Dynamic Feature Binding \n\nMichael C. Mozer \nDept. of Comp. Science & \nInst. of Cognitive Science \nUniversity of Colorado \nBoulder, CO 80309-0430 \n\nRichard S. Zemel \nDept. of Comp. Science \nUniversity of Toronto \nToronto, Ontario \nCanada M5S 1A4 \n\nMarlene Behrmann \nDept. of Psychology & \nFaculty of Medicine \nUniversity of Toronto \nToronto, Ontario \nCanada M5S 1A1 \n\nAbstract \n\nDespite the fact that complex visual scenes contain multiple, overlapping objects, people perform object recognition with ease and accuracy. One operation that facilitates recognition is an early segmentation process in which features of objects are grouped and labeled according to the object to which they belong. Current computational systems that perform this operation are based on predefined grouping heuristics. We describe a system called MAGIC that learns how to group features based on a set of presegmented examples. In many cases, MAGIC discovers grouping heuristics similar to those previously proposed, but it also has the capability of finding nonintuitive structural regularities in images. Grouping is performed by a relaxation network that attempts to dynamically bind related features. Features transmit a complex-valued signal (amplitude and phase) to one another; binding can thus be represented by phase locking related features. MAGIC's training procedure is a generalization of recurrent back propagation to complex-valued units. 
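The phase-locking idea in the abstract can be illustrated with a minimal sketch. It assumes each feature response is stored as an ordinary complex number; the helper names unit and phase_coherence and the example phases are hypothetical and are not part of MAGIC itself.

```python
import cmath
import math

def unit(amplitude, phase_deg):
    # Build a complex-valued feature response from amplitude and phase (degrees).
    return cmath.rect(amplitude, math.radians(phase_deg))

def phase_coherence(responses):
    # Length of the amplitude-weighted mean phasor: 1.0 when all active
    # features share one phase, smaller when the phases disagree.
    total_amp = sum(abs(r) for r in responses)
    if total_amp == 0.0:
        return 0.0
    return abs(sum(responses)) / total_amp

# Hypothetical example: three features of one object and two of another.
object_a = [unit(1.0, 40.0), unit(1.0, 40.0), unit(1.0, 40.0)]
object_b = [unit(1.0, 220.0), unit(1.0, 220.0)]

coherence_a = phase_coherence(object_a)                 # one phase-locked group
coherence_mixed = phase_coherence(object_a + object_b)  # two groups, 180 degrees apart
```

Under these assumptions, a set of features bound to one object yields a coherence close to 1, while mixing features from two objects held 180 degrees apart yields a much lower value. This coherence score is only an illustration of how a shared phase can act as a label.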
\n\nWhen a visual image contains multiple, overlapping objects, recognition is difficult because features in the image are not grouped according to the object to which they belong. Without the capability to form such groupings, it would be necessary to undergo a massive search through all subsets of image features. For this reason, most machine vision recognition systems include a component that performs feature grouping or image segmentation (e.g., Guzman, 1968; Lowe, 1985; Marr, 1982). \n\nA multitude of heuristics have been proposed for segmenting images. Gestalt psychologists have explored how people group elements of a display and have suggested a range of grouping principles that govern human perception (Rock & Palmer, 1990). Computer vision researchers have studied the problem from a more computational perspective. They have investigated methods of grouping elements of an image based on nonaccidental regularities: feature combinations that are unlikely to occur by chance when several objects are juxtaposed, and are thus indicative of a single object (Kanade, 1981; Lowe & Binford, 1982). \n\nIn these earlier approaches, the researchers have hypothesized a set of grouping heuristics and then tested their psychological validity or computational utility. In our work, we have taken an adaptive approach to the problem of image segmentation in which a system learns how to group features based on a set of examples. We call the system MAGIC, an acronym for multiple-object adaptive grouping of image components. In many cases MAGIC discovers grouping heuristics similar to those proposed in earlier work, but it also has the capability of finding nonintuitive structural regularities in images. 
\n\nMAGIC is trained on a set of presegmented images containing multiple objects. By \"presegmented,\" we mean that each image feature is labeled with the object to which it belongs. MAGIC learns to detect configurations of the image features that have a consistent labeling in relation to one another across the training examples. Identifying these configurations allows MAGIC to then label features in novel, unsegmented images in a manner consistent with the training examples. \n\n1 REPRESENTING FEATURE LABELINGS \n\nBefore describing MAGIC, we must first discuss a representation that allows for the labeling of features. Von der Malsburg (1981), von der Malsburg & Schneider (1986), Gray et al. (1989), and Eckhorn et al. (1988), among others, have suggested a biologically plausible mechanism of labeling through temporal correlations among neural signals, either the relative timing of neuronal spikes or the synchronization of oscillatory activities in the nervous system. The key idea here is that each processing unit conveys not just an activation value (average firing frequency in neural terms) but also a second, independent value which represents the relative phase of firing. The dynamic grouping or binding of a set of features is accomplished by aligning the phases of the features. Recent work (Goebel, 1991; Hummel & Biederman, in press) has used this notion of dynamic binding for grouping image features, but has been based on relatively simple, predetermined grouping heuristics. \n\n2 THE DOMAIN \n\nOur initial work has been conducted in the domain of two-dimensional geometric contours, including rectangles, diamonds, crosses, triangles, hexagons, and octagons. 
The contours are constructed from four primitive feature types (oriented line segments at 0\u00b0, 45\u00b0, 90\u00b0, and 135\u00b0) and are laid out on a 15 x 20 grid. At each location on the grid are units, called feature units, that detect each of the four primitive feature types. In our present experiments, images contain two contours. Contours are not permitted to overlap in their activation of the same feature unit. \n\nFigure 1: The architecture of MAGIC. The lower layer contains the feature units; the upper layer contains the hidden units. Each layer is arranged in a spatiotopic array with a number of different feature types at each position in the array. Each plane in the feature layer corresponds to a different feature type. The grayed hidden units are reciprocally connected to all features in the corresponding grayed region of the feature layer. The lines between layers represent projections in both directions. \n\n3 THE ARCHITECTURE \n\nThe input to MAGIC is a pattern of activity over the feature units indicating which features are present in an image. The initial phases of the units are random. MAGIC's task is to assign appropriate phase values to the units. Thus, the network performs a type of pattern completion. \n\nThe network architecture consists of two layers of units, as shown in Figure 1. The lower (input) layer contains the feature units, arranged in spatiotopic arrays with one array per feature type. The upper layer contains hidden units that help to align the phases of the feature units; their response properties are determined by training. Each hidden unit is reciprocally connected to the units in a local spatial region of all feature arrays. We refer to this region as a patch; in our current simulations, the patch has dimensions 4 x 4. 
For each patch there is a corresponding fixed-size pool of hidden units. To achieve uniformity of response across the image, the pools are arranged in a spatiotopic array in which neighboring pools respond to neighboring patches and the weights of all pools are constrained to be the same. \n\nThe feature units activate the hidden units, which in turn feed back to the feature units. Through a relaxation process, the system settles on an assignment of phases to the features. \n\n4 NETWORK DYNAMICS \n\nFormally, the response of each feature unit $i$, $x_i$, is a complex value in polar form, $(a_i, p_i)$, where $a_i$ is the amplitude or activation and $p_i$ is the phase. Similarly, the response of each hidden unit $j$, $y_j$, has components $(b_j, q_j)$. The weight connecting unit $i$ to unit $j$, $w_{ji}$, is also complex valued, having components $(\\rho_{ji}, \\theta_{ji})$. The activation rule we propose is a generalization of the dot product to the complex domain: \n\n$$\\mathrm{net}_j = x \\cdot w_j = \\sum_i x_i w_{ji} = \\left( \\left[ \\left( \\sum_i a_i \\rho_{ji} \\cos(p_i - \\theta_{ji}) \\right)^2 + \\left( \\sum_i a_i \\rho_{ji} \\sin(p_i - \\theta_{ji}) \\right)^2 \\right]^{1/2}, \\; \\tan^{-1} \\left[ \\frac{\\sum_i a_i \\rho_{ji} \\sin(p_i - \\theta_{ji})}{\\sum_i a_i \\rho_{ji} \\cos(p_i - \\theta_{ji})} \\right] \\right)$$ \n\nwhere $\\mathrm{net}_j$ is the net input to hidden unit $j$. The net input is passed through a squashing nonlinearity that maps the amplitude of the response from the range $0 \\to \\infty$ to $0 \\to 1$ but leaves the phase unaffected: \n\n$$y_j = \\frac{\\mathrm{net}_j}{|\\mathrm{net}_j|} \\left( 1 - e^{-|\\mathrm{net}_j|^2} \\right).$$ \n\nThe flow of activation from the hidden layer to the feature layer follows the same dynamics, although in the current implementation the amplitudes of the features are clamped, hence the top-down flow affects only the phases. One could imagine a 
One could imagine a \nmore general  architecture  in  which  the relaxation  process  determined  not  only  the \nphase values,  but cleaned  up noise  in the feature  amplitudes as well. \n\nThe intuition  underlying  the activation  rule is  as follows.  The activity of a  hidden \nunit,  b;,  should  be monotonically related  to how  well  the feature  response  pattern \nmatches the hidden unit weight vector, just as in the standard real-valued activation \nrule.  Indeed,  one  can  readily  see  that if the feature  and  weight  phases  are  equal \n(Pi  = 8;i),  the  rule  for  bi  reduces  to  the  real-valued  case.  Even  if the  feature \nand  weight  phases  differ  by  a  constant  (Pi  = 8i i  + e),  b;  is  unaffected.  This  is \na  critical  property  of the  activation  rule:  Because  ab.olute  phase  values  have  no \ninhinsic meaning, the response of a unit should depend only on the relative phases. \nThe  activation  rule  achieves  this  by  essentially  ignoring  the  average  difference  in \nphase between the feature units and the weights.  The hidden phase, q;,  reHects  this \naverage difference. \n\n5  LEARNING ALGORITHM \n\nDuring  training,  we  would  like  the  hidden  units  to learn  to  detect  configurations \nof features  that reliably  indicate phase relationships  among the features.  We have \nexperimented  with  a  variety  of training  algorithms.  The one  with  which  we  have \nhad greatest  success  involves  running  the network for  a  fixed  number  of iterations \nand, after each iteration, attempting to adjust the weights so that the feature phase \npattern will  match a  target  phase pattern.  Each training  hial proceeds  as follows: \n\n\f440 \n\nMozer, Zemel,  and Behrmann \n\n1.  A  training  example is  generated  at  random.  This involves  selecting  two  con(cid:173)\n\ntours  and  instantiating  them in an image.  
The features of one contour have target phase 0\u00b0 and the features of the other contour have target phase 180\u00b0. \n\n2. The training example is presented to MAGIC by clamping the amplitude of a feature unit to 1.0 if its corresponding image feature is present, or 0.0 otherwise. The phases of the feature units are set to random values in the range 0\u00b0 to 360\u00b0. \n\n3. Activity is allowed to flow from the feature units to the hidden units and back to the feature units. Because the feature amplitudes are clamped, they are unaffected. \n\n4. The new phase pattern over the feature units is compared to the target phase pattern (see step 1), and an error measure is computed: \n\n$$E = -\\left( \\sum_i a_i \\cos(\\bar{p}_i - p_i) \\right)^2 - \\left( \\sum_i a_i \\sin(\\bar{p}_i - p_i) \\right)^2,$$ \n\nwhere $\\bar{p}$ is the target phase pattern. This error ignores the absolute difference between the target and actual phases. That is, $E$ is minimized when $\\bar{p}_i - p_i$ is a constant for all $i$, regardless of the value of $\\bar{p}_i - p_i$. \n\n5. Using a generalization of back propagation to complex-valued units, error gradients are computed for the feature-to-hidden and hidden-to-feature weights. \n\n6. Steps 3-5 are repeated for a maximum of 30 iterations. The trial is terminated if the error increases on five consecutive iterations. \n\n7. Weights are updated by an amount proportional to the average error gradient over iterations. \n\nLearning is more robust when the feature-to-hidden weights are constrained to be symmetric with the hidden-to-feature weights. For complex weights, symmetry means that the weight from feature unit $i$ to hidden unit $j$ is the complex conjugate of the weight from hidden unit $j$ to feature unit $i$. Weight symmetry ensures that MAGIC will converge to a fixed point. 
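The activation rule of Section 4 and the error measure of step 4 can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: it assumes responses and weights are stored as Python complex numbers, that the phase difference between a feature and a weight is obtained by multiplying the feature by the conjugate weight, and that phases are given in radians. The function names and the example values are hypothetical.

```python
import cmath
import math

def net_input(features, weights):
    # Complex dot product: each term has amplitude a_i * rho_ji and phase
    # p_i - theta_ji, obtained here as feature times conjugate weight.
    return sum(x * w.conjugate() for x, w in zip(features, weights))

def squash(net):
    # Map the amplitude from [0, inf) to [0, 1) and leave the phase untouched.
    mag = abs(net)
    if mag == 0.0:
        return 0j
    return (net / mag) * (1.0 - math.exp(-mag ** 2))

def phase_error(amplitudes, phases, targets):
    # Error measure from step 4; phases and targets are in radians.
    c = sum(a * math.cos(t - p) for a, p, t in zip(amplitudes, phases, targets))
    s = sum(a * math.sin(t - p) for a, p, t in zip(amplitudes, phases, targets))
    return -(c ** 2) - (s ** 2)

# Hypothetical 3-feature patch whose phases match the weights up to a constant shift.
weights = [cmath.rect(0.5, 0.1), cmath.rect(0.8, 1.2), cmath.rect(0.3, 2.0)]
shift = 0.7
aligned = [cmath.rect(1.0, cmath.phase(w)) for w in weights]
shifted = [cmath.rect(1.0, cmath.phase(w) + shift) for w in weights]

y_aligned = squash(net_input(aligned, weights))
y_shifted = squash(net_input(shifted, weights))

# Error for an exact match, a uniformly rotated match, and a wrong grouping.
amps = [1.0, 1.0, 1.0, 1.0]
targets = [0.0, 0.0, math.pi, math.pi]
e_exact = phase_error(amps, targets, targets)
e_rotated = phase_error(amps, [t + 1.0 for t in targets], targets)
e_wrong = phase_error(amps, [0.0, 0.0, 0.0, 0.0], targets)
```

In this sketch the hidden amplitude is the same for the aligned and shifted inputs, and only the hidden phase changes, by the amount of the shift. The error is likewise identical for the exact and the uniformly rotated phase patterns, and it is higher when all features are assigned one phase despite belonging to two objects.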
(The convergence proof is based on discrete-time update and a two-layer architecture with sequential layer updates and no intralayer connections.) \n\nSimulations reported below use a learning rate of 0.005 for the amplitudes and 0.02 for the phases. About 10,000 learning trials are required for stable performance, although MAGIC rapidly picks up on the most salient aspects of the domain. \n\n6 SIMULATION RESULTS \n\nWe trained a network with 20 hidden units per pool on images containing either two rectangles, two diamonds, or a rectangle and a diamond. The shapes were of varying size and appeared in various locations. A subset of the resulting weights is shown in Figure 2. Each hidden unit attempts to detect and reinstantiate activity patterns that match its weights. One clear and prevalent pattern in the weights is the collinear arrangement of segments of a given orientation, all having the same phase value. When a hidden unit having weights of this form responds to a patch of the feature array, it tries to align the phases of the patch with the phases of its weight vector. By synchronizing the phases of features, it acts to group the features. Thus, one can interpret the weight vectors as the rules by which features are grouped. \n\nFigure 2: Sample of feature-to-hidden weights learned by MAGIC. The area of a circle represents the amplitude of a weight; the orientation of the internal tick mark represents the phase angle. The weights are arranged such that the connections into each hidden unit are presented on a light gray background. 
Each hidden unit has a total of 64 incoming weights: 4 x 4 locations in its receptive field and four feature types at each location. The weights are further grouped by feature type, and for each feature type they are arranged in a 4 x 4 pattern homologous to the image patch itself. \n\nWhereas traditional grouping principles indicate the conditions under which features should be bound together as part of the same object, the grouping principles learned by MAGIC also indicate when features should be segregated into different objects. For example, the weights of the vertical and horizontal segments are generally 180\u00b0 out of phase with the diagonal segments. This allows MAGIC to segregate the vertical and horizontal features of a rectangle from the diagonal features of a diamond. We had anticipated that the weights to each hidden unit would contain two phase values at most because each image patch contains at most two objects. However, some units make use of three or more phases, suggesting that the hidden unit is performing several distinct functions. As is the usual case with hidden unit weights, these patterns are difficult to interpret. \n\nFigure 3 presents an example of the network segmenting an image. The image contains two diamonds. The top left panel shows the features of the diamonds and their initial random phases. The succeeding panels show the network's response during the relaxation process. The lower right panel shows the network response at equilibrium. Features of each object have been assigned a uniform phase, and the two objects are 180\u00b0 out of phase. The task here may appear simple, but it is quite challenging due to the illusory diamond generated by the overlapping diamonds. \n\nFigure 3: An example of MAGIC segmenting an image (panels: Iteration 0, 2, 4, 6, 10, and 25). The \"iteration\" refers to the number of times activity has flowed from the feature units to the hidden units and back. The phase value of a feature is represented by a gray level. The periodic phase continuum can only be approximated by the linear gray level continuum, but the basic information is conveyed nonetheless. \n\n7 CURRENT DIRECTIONS \n\nWe are currently extending MAGIC in several directions, which we outline here. \n\n\u2022 A natural principle for the hierarchical decomposition of objects emerges from the relative frequency of feature configurations during training. More frequent configurations result in a robust hidden representation, and hence the features forming these configurations will be tightly coupled. A coarse quantization of phases will lead to parses of the image in which only the highest frequency configurations are considered as \"objects.\" Finer quantizations will lead to a further decomposition of the image. Thus, the continuous phase representation allows for the construction of hierarchical descriptions of objects. \n\n\u2022 Spatially local grouping principles are unlikely to be sufficient for the image segmentation task. Indeed, we have encountered incorrect solutions produced by MAGIC that are locally consistent but globally inconsistent. To solve this problem, we are investigating an architecture in which the image is processed at several spatial scales simultaneously. 
\n\n\u2022 Simulations are also underway to examine MAGIC's performance on real-world images (overlapping handwritten letters and digits) where it is somewhat less clear to which types of patterns the hidden units should respond. \n\n\u2022 Zemel, Williams, and Mozer (to appear) have proposed a mathematical framework that, with slight modifications to the model, allows it to be interpreted as a mean-field approximation to a stochastic phase model. \n\n\u2022 Behrmann, Zemel, and Mozer (to appear) are conducting psychological experiments to examine whether limitations of the model match human limitations. \n\nAcknowledgements \n\nThis research was supported by NSF Presidential Young Investigator award IRI-9058450, grant 90-21 from the James S. McDonnell Foundation, and DEC external research grant 1250 to MM, and by a Natural Sciences and Engineering Research Council Postgraduate Scholarship to RZ. Our thanks to Paul Smolensky, Chris Williams, Geoffrey Hinton, and J\u00fcrgen Schmidhuber for helpful comments regarding this work. \n\nReferences \n\nEckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboek, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Biological Cybernetics, 60, 121-130. \n\nGoebel, R. (1991). An oscillatory neural network model of visual attention, pattern recognition, and response generation. Manuscript in preparation. \n\nGray, C. M., Koenig, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit intercolumnar synchronization which reflects global stimulus properties. Nature (London), 338, 334-337. \n\nGuzman, A. (1968). Decomposition of a visual scene into three-dimensional bodies. AFIPS Fall Joint Computer Conference, 33, 291-304. 
\n\nHummel, J. E., & Biederman, I. (1992). Dynamic binding in a neural network for shape recognition. Psychological Review. In press. \n\nKanade, T. (1981). Recovery of the three-dimensional shape of an object from a single view. Artificial Intelligence, 17, 409-460. \n\nLowe, D. G. (1985). Perceptual Organization and Visual Recognition. Boston: Kluwer Academic Publishers. \n\nLowe, D. G., & Binford, T. O. (1982). Segmentation and aggregation: An approach to figure-ground phenomena. In Proceedings of the DARPA IU Workshop (pp. 168-178). Palo Alto, CA. \n\nMarr, D. (1982). Vision. San Francisco: Freeman. \n\nRock, I., & Palmer, S. E. (1990). The legacy of Gestalt psychology. Scientific American, 263, 84-90. \n\nvon der Malsburg, C. (1981). The correlation theory of brain function (Internal Report 81-2). Goettingen: Department of Neurobiology, Max Planck Institute for Biophysical Chemistry. \n\nvon der Malsburg, C., & Schneider, W. (1986). A neural cocktail-party processor. Biological Cybernetics, 54, 29-40. ", "award": [], "sourceid": 540, "authors": [{"given_name": "Michael", "family_name": "Mozer", "institution": null}, {"given_name": "Richard", "family_name": "Zemel", "institution": null}, {"given_name": "Marlene", "family_name": "Behrmann", "institution": null}]}