{"title": "Recurrent Eye Tracking Network Using a Distributed Representation of Image Motion", "book": "Advances in Neural Information Processing Systems", "page_first": 380, "page_last": 387, "abstract": null, "full_text": "A Neural Network for Motion Detection of \n\nDrift-Balanced Stimuli \n\nHilary Tunley* \n\nSchool of Cognitive and Computer Sciences \n\nSussex University \nBrighton, England. \n\nAbstract \n\nThis paper briefly describes an artificial neural network for preattentive \nvisual processing. The network is capable of determining image motion in \na type of stimulus which defeats most popular methods of motion detection \n- a subset of second-order visual motion stimuli known as drift-balanced \nstimuli (DBS). The processing stages of the network described in this paper \ncan be integrated into a model capable of simultaneous motion extraction, \nedge detection, and the determination of occlusion. \n\n1 INTRODUCTION \n\nPrevious methods of motion detection have generally been based on one of \ntwo underlying approaches: correlation and gradient-filter. Probably the best \nknown example of the correlation approach is the Reichardt movement detector \n[Reichardt 1961]. The gradient-filter (GF) approach underlies the work of Adelson \nand Bergen [Adelson 1985], and Heeger [Heeger 1988], amongst others. \nThese motion-detecting methods cannot track DBS, because DBS lack essential \ncomponents of information needed by such methods. Both the correlation and \nGF approaches impose constraints on the input stimuli. Throughout the image \nsequence, correlation methods require information that is spatiotemporally \ncorrelatable; and GF motion detectors assume temporally constant spatial gradients. \n\n*Current address: Experimental Psychology, School of Biological Sciences, Sussex \nUniversity. 
\n\nThe network discussed here does not impose such constraints. Instead, it extracts \nmotion energy and exploits the spatial coherence of movement (defined more formally \nin the Gestalt theory of common fate [Koffka 1935]) to achieve tracking. \n\nThe remainder of this paper discusses DBS image sequences, then correlation methods, \nthen GF methods in more detail, followed by a qualitative description of this \nnetwork which can process DBS. \n\n2 SECOND-ORDER AND DRIFT-BALANCED STIMULI \n\nThere has been a lot of recent interest in second-order visual stimuli, and DBS in \nparticular ([Chubb 1989, Landy 1991]). DBS are stimuli which give a clear percept \nof directional motion, yet Fourier analysis reveals a lack of coherent motion energy, \nor energy present in a direction opposing that of the displacement (hence the term \n'drift-balanced'). Examples of DBS include image sequences in which the contrast \npolarity of edges present reverses between frames. \n\nA subset of DBS, which are also processed by the network, are known as \nmicrobalanced stimuli (MBS). MBS contain no correlatable features and are \ndrift-balanced at all scales. The MBS image sequences used for this work were created \nfrom a random-dot image in which an area is successively shifted by a constant \ndisplacement between each frame and simultaneously re-randomised. \n\n3 EXISTING METHODS OF MOTION DETECTION \n\n3.1 CORRELATION METHODS \n\nCorrelation methods perform a local cross-correlation in image space: the matching \nof features in local neighbourhoods (depending upon displacement/speed) between \nimage frames underlies the motion detection. Examples of this method include \n[Van Santen 1985]. 
Most correlation models suffer from noise degradation in that \nany noise features extracted by the edge detection are available for spurious \ncorrelation. \n\nThere has been much recent debate questioning the validity of correlation methods \nfor modelling human motion detection abilities. In addition to DBS, there is also \nincreasing psychophysical evidence ([Landy 1991, Mather 1991]) which correlation \nmethods cannot account for. \n\nThese factors suggest that correlation techniques are not suitable for low-level \nmotion processing where no information is available concerning what is moving (as \nwith MBS). However, correlation is a more plausible method when working with \nhigher level constructs such as tracking in model-based vision (e.g. [Bray 1990]). \n\n3.2 GRADIENT-FILTER (GF) METHODS \n\nGF methods use a combination of spatial filtering to determine edge positions and \ntemporal filtering to determine whether such edges are moving. A common assumption \nused by GF methods is that spatial gradients are constant. A recent method by \nVerri [Verri 1990], for example, argues that flow detection is based upon the notion \nof tracking spatial gradient magnitude and/or direction, and that any variation in \nthe spatial gradient is due to some form of motion deformation - i.e. rotation, \nexpansion or shear. Whilst for scenes containing smooth surfaces this is a valid \napproximation, it is not the case for second-order stimuli such as DBS. \n\nFigure 1: The Network (Schematic). R: Receptor Units - detect temporal \nchanges in image intensity (polarity-independent). M: Motion Units - detect \ndistribution of change information. O: Occlusion Units - detect changes in motion \ndistribution. E: Edge Units - detect edges directly from occlusion. \n\n4 THE NETWORK \n\nA simplified diagram illustrating the basic structure of the network (based upon \nearlier work [Tunley 1990, Tunley 1991a, Tunley 1991b]) is shown in Figure 1 \n(the edge detection stage is discussed elsewhere [Tunley 1990, Tunley 1991b, \nTunley 1992]). \n\n4.1 INPUT RECEPTOR UNITS \n\nThe units in the input layer respond to rectified local changes in image intensity \nover time. Each unit has a variable adaptation rate, resulting in temporal sensitivity \n- a fast adaptation rate gives a high temporal filtering rate. The main advantages of \nthis temporal averaging are: \n\n\u2022 Averaging removes the D.C. component of image intensity. This eliminates \nproblematic gain for motion in high brightness areas of the image \n[Heeger 1988]. \n\n\u2022 The random nature of DBS/MBS generation cannot guarantee that each pixel \nchange is due to local image motion. Local temporal averaging smooths the \nmoving regions, thus creating a more coherently structured input for the motion \nunits. \n\nThe input units have a pointwise rectifying response governed by an autoregressive \nfilter of the following form: \n\n(1) \n\nwhere \u03b1 \u2208 [0,1] is a variable which controls the degree of temporal filtering of the \nchange in input intensity, n and n - 1 are successive image frames, and R_n and I_n \nare the filter output and input, respectively. \n\nThe receptor unit responses for two different \u03b1 values are shown in Figure 2. \u03b1 can \nthus be used to alter the amount of motion blur produced for a particular frame \nrate, effectively producing a unit with differing velocity sensitivity. \n\nFigure 2: Receptor Unit Response: (a) \u03b1 = 0.3; (b) \u03b1 = 0.7. \n\n4.2 MOTION UNITS \n\nThese units determine the coherence of image changes indicated by corresponding \nreceptor units. First-order motion produces highly-tuned motion activity - i.e. a \nstrong response in a particular direction - whilst second-order motion results in less \ncoherent output. \n\nThe operation of a basic motion detector can be described by: \n\n(2) \n\nwhere M is the detector, (i', j') is a point in frame n at a distance d from (i, j), \na point in frame n - 1, in the direction k. Therefore, for coherent motion (i.e. \nfirst-order), in direction k at a speed of d units/frame, as n \u2192 \u221e: \n\n(3) \n\nThe convergence of motion activity can be seen using an example. The stimulus \nsequence used consists of a bar of re-randomising texture moving to the right in \nfront of a leftward moving background with the same texture (i.e. random dots). 
\nThe bar motion is second-order as it contains no correlatable features, whilst the \nbackground consists of a simple first-order shifting of dots between frames. Figures \n3, 4 and 5 show two-dimensional images of the leftward motion activity for the \nstimulus after 3, 4 and 6 frames respectively. The response to the background, which has coherent \nleftward movement (at speed d units/frame), gradually reduces to zero whilst \nthe microbalanced rightwards-moving bar remains active. Since, according to the \ndefinition of Chubb and Sperling [Chubb 1989], first-order detectors produce no response \nto MBS, the non-zero response obtained for second-order motion suggests that this \ndetector is second-order with regard to motion detection. \n\nFigure 3: Leftward Motion Response to Third Frame in Sequence. \n\nFigure 4: Leftward Motion Response to Fourth Frame. \n\nFigure 5: Leftward Motion Response to Sixth Frame. \n\nThe motion units in this model are arranged on a hexagonal grid. This grid is \nknown as a flow web as it allows information to flow both laterally between units \nof the same type and between the different units in the model (motion, occlusion \nor edge). Each flow web unit is represented by three variables - a position (a, b) \nand a direction k, which is evenly spaced between 0 and 360 degrees. In this model \neach k is an integer between 1 and k_max - the value of k_max can be varied to vary \nthe sensitivity of the units. \n\nA way of using first-order techniques to discriminate between first and second-order \nmotions is through the concept of coherence. 
At any point in the motion-processed \nimages in Figures 3-5, a measure of the overall variation in motion activity \ncan be used to distinguish between the motion of the microbalanced bar and its \nbackground. The motion energy for a detector with displacement d and orientation \nk, at position (a, b), can be represented by E_{abkd}. For each motion unit, responding \nover distance d, in each cluster the energy present can be defined as: \n\nE_{abkdn} = min_k(M_{abkd}) / M_{abkd}    (4) \n\nwhere min_k(x_k) is the minimum value of x found searching over k values. If motion \nis coherent, and of approximately the correct speed for the detector M, then as \nn \u2192 \u221e: \n\n(5) \n\nwhere k_m is the actual direction of the motion. In reality n need only approach \naround 5 for convergence to occur. Also, more importantly, under the same convergence \nconditions: \n\n(6) \n\nThis is due to the fact that the minimum activation value in a group of first-order \ndetectors at point (a, b) will be the same as the actual value in the direction k_m. \nBy similar reasoning, for non-coherent motion as n \u2192 \u221e: \n\nE_{abkdn} \u2192 1 \u2200k    (7) \n\nin other words there is no peak of activity in a given direction. The motion energy \nis ambiguous at a large number of points in most images, except at discontinuities \nand on well-textured surfaces. \n\nA measure of motion coherence used for the motion units can now be defined as: \n\nM_c(abkd) = E_{abkd} / \u03a3_{k=1}^{k_max} E_{abkd}    (8) \n\nFor coherent motion in direction k_m as n \u2192 \u221e: \n\n(9) \n\nWhilst for second-order motion, also as n \u2192 \u221e: \n\n(10) \n\nUsing this approach the total M_c activity at each position - regardless of coherence, \nor lack of it - is unity. 
Motion energy is the same in all moving regions; the difference \nis in the distribution, or tuning, of that energy. \n\nFigures 6, 7 and 8 show how motion coherence allows the flow web structure to \nreveal the presence of motion in microbalanced areas whilst not affecting the easily \ndetected background motion for the stimulus. \n\nFigure 6: Motion Coherence Response to Third Frame \n\nFigure 7: Motion Coherence Response to Fourth Frame \n\nFigure 8: Motion Coherence Response to Sixth Frame \n\n4.3 OCCLUSION UNITS \n\nThese units identify discontinuities in second-order motion which are vitally important \nwhen computing the direction of that motion. They determine spatial and \ntemporal changes in motion coherence and can process single or multiple motions at \neach image point. Established and newly-activated occlusion units work, through \na gating process, to enhance continuously-displacing surfaces, utilising the concept \nof visual inertia. \n\nThe implementation details of the occlusion stage of this model are discussed elsewhere \n[Tunley 1992], but some output from the occlusion units for the above second-order \nstimulus is shown in Figures 9 and 10. The figures show how the edges of \nthe bar can be determined. \n\nFigure 9: Occluding Motion Information: Occlusion activity produced by an increase \nin motion coherence activity. \n\nFigure 10: Occluding Motion Information: Occlusion activity produced by a decrease \nin motion activity at a point. Some spurious activity is produced due to the \nrandom nature of the second-order motion information. \n\nReferences \n\n[Adelson 1985] E.H. Adelson and J.R. Bergen. Spatiotemporal energy models for \nthe perception of motion. J. Opt. Soc. Am. 2, 1985. \n\n[Bray 1990] A.J. Bray. Tracking objects using image disparities. Image and \nVision Computing, 8, 1990. \n\n[Chubb 1989] C. Chubb and G. Sperling. Second-order motion perception: \nSpace/time separable mechanisms. In Proc. Workshop on Visual \nMotion, Irvine, CA, USA, 1989. \n\n[Heeger 1988] D.J. Heeger. Optical flow using spatiotemporal filters. Int. J. \nComp. Vision, 1, 1988. \n\n[Koffka 1935] K. Koffka. Principles of Gestalt Psychology. Harcourt Brace, \n1935. \n\n[Landy 1991] M.S. Landy, B.A. Dosher, G. Sperling and M.E. Perkins. The \nkinetic depth effect and optic flow II: First- and second-order \nmotion. Vis. Res. 31, 1991. \n\n[Mather 1991] G. Mather. Personal Communication. \n\n[Reichardt 1961] W. Reichardt. Autocorrelation, a principle for the evaluation of \nsensory information by the central nervous system. In W. Rosenblith, \neditor, Sensory Communications. Wiley NY, 1961. \n\n[Van Santen 1985] J.P.H. Van Santen and G. Sperling. Elaborated Reichardt detectors. \nJ. Opt. Soc. Am. 2, 1985. \n\n[Tunley 1990] H. Tunley. Segmenting Moving Images. In Proc. Int. Neural Network \nConf. (INNC90), Paris, France, 1990. \n\n[Tunley 1991a] H. Tunley. Distributed dynamic processing for edge detection. In \nProc. British Machine Vision Conf. (BMVC91), Glasgow, Scotland, \n1991. \n\n[Tunley 1991b] H. Tunley. Dynamic segmentation and optic flow extraction. In \nProc. Int. Joint Conf. Neural Networks (IJCNN91), Seattle, \nUSA, 1991. \n\n[Tunley 1992] H. Tunley. Second-order motion processing: A distributed approach. \nCSRP 211, School of Cognitive and Computing Sciences, \nUniversity of Sussex (forthcoming). \n\n[Verri 1990] A. Verri, F. Girosi and V. Torre. Differential techniques for optic \nflow. J. Opt. Soc. Am. 7, 1990. \n\nRecurrent Eye Tracking Network Using a \n\nDistributed Representation of Image Motion \n\nP. A. Viola \n\nArtificial Intelligence Laboratory \n\nMassachusetts Institute of Technology \n\nS. G. Lisberger \n\nDepartment of Physiology \n\nW.M. Keck Foundation Center for Integrative Neuroscience \n\nNeuroscience Graduate Program \n\nUniversity of California, San Francisco \n\nT. J. Sejnowski \n\nSalk Institute, Howard Hughes Medical Institute \n\nDepartment of Biology \n\nUniversity of California, San Diego \n\nAbstract \n\nWe have constructed a recurrent network that stabilizes images of a moving \nobject on the retina of a simulated eye. The structure of the network \nwas motivated by the organization of the primate visual target tracking \nsystem. The basic components of a complete target tracking system were \nsimulated, including visual processing, sensory-motor interface, and motor \ncontrol. Our model is simpler in structure, function and performance than \nthe primate system, but many of the complexities inherent in a complete \nsystem are present. \n\nFigure 1: The overall structure of the visual tracking model. \n\n1 Introduction \n\nThe fovea of the primate eye has a high density of photoreceptors. Images that fall \nwithin the fovea are perceived with high resolution. Perception of moving objects \nposes a particular problem for the visual system. 
If the eyes are fixed, a moving \nimage will be blurred. When the image moves out of the fovea, resolution \ndecreases. By moving their eyes to foveate and stabilize targets, primates ensure \nmaximum perceptual resolution. In addition, active target tracking simplifies other \ntasks, such as spatial localization and spatial coordinate transformations (Ballard, \n1991). \n\nVisual tracking is a feedback process, in which the eyes are moved to stabilize and \nfoveate the image of a target. Good visual tracking performance depends on accurate \nestimates of target velocity and a stable feedback controller. Although many \nvisual tracking systems have been designed by engineers, the primate visual tracking \nsystem has yet to be matched in its ability to perform in complicated environments, \nwith unrestricted targets, and over a wide variety of target trajectories. The study \nof the primate oculomotor system is an important step toward building a system \nthat can attain primate levels of performance. The model presented here can accurately \nand stably track a variety of targets over a wide range of trajectories and is \na first step toward achieving this goal. \n\nOur model has four primary components: a model eye, a visual processing network, \na motor interface network, and a motor control network (see Figure 1). The \nmodel eye receives a sequence of images from a changing visual world, synthetically \nrendered, and generates a time-varying output signal. The retinal signal is sent to \nthe visual processing network, which is similar in function to the motion processing \nareas of the visual cortex. The visual processing network constructs a distributed \nrepresentation of image velocity. This representation is then used to estimate the \nvelocity of the target on the retina. 
The retinal velocity of the target forms the input \nto the motor control network that drives the eye. The eye responds by rotating, \nwhich in turn affects incoming retinal signals. \n\nFigure 2: The structure of a motion energy unit. Each space-time separable unit \nhas a receptive field that covers 16 pixels in space and 16 steps in time (for a total \nof 256 inputs). The shaded triangles denote complete projections. \n\nIf these networks function perfectly, eye velocity will match target velocity. Our \nmodel generates smooth eye motions to stabilize smoothly moving targets. It makes \nno attempt to foveate the image of a target. In primates, eye motions that foveate \ntargets are called saccades. Saccadic mechanisms are largely separate from the \nsmooth eye motion system (Lisberger et al., 1987). We do not address them here. \n\nIn contrast with most engineered systems, our model is adaptive. The networks \nused in the model were trained using gradient descent^1. This training process \ncircumvented the need for a separate calibration of the visual tracking system. \n\n^1 Network simulations were carried out with the SN2 neural network simulator. \n\n2 Visual Processing \n\nThe middle temporal cortex (area MT) contains cells that are selective for the \ndirection of visual motion. The neurons in MT are organized into a retinotopic \nmap and small lesions in this area lead to selective impairment of visual tracking \nin the corresponding regions of the visual field (Newsome and Pare, 1988). 
The \nvisual processing networks in our model contain directionally-selective processing \nunits that are arranged in a retinotopic map. The spatio-temporal motion energy \nfilter of Adelson and Bergen (Adelson and Bergen, 1985) has many of the properties \nof directionally-selective cortical neurons; it is used as the basis for our visual \nprocessing network. We constructed a four layer time-delay neural network that \nimplements a motion energy calculation. \n\nA single motion-energy unit can be constructed from four intermediate units having \nseparable spatial and temporal filters. Adelson and Bergen demonstrate that \ntwo spatial filters (of even and odd symmetry) and two temporal filters (temporal \nderivatives for fast and slow speeds) are sufficient to detect motion. The filters \nare combined to construct 4 intermediate units which project to a single motion \nenergy unit. Because the spatial and temporal properties of the receptive field are \nseparable, they can be computed separately and convolved together to produce the \nfinal output. The temporal response is therefore the same throughout the extent of \nthe spatial receptive field. \n\nIn our model, motion energy units are implemented as backpropagation networks. \nThese units have a receptive field 16 pixels wide over a 16 time step window. Because \nthe input weights are shared, only 32 parameters (16 spatial plus 16 temporal) were \nneeded for each space-time separable unit. Four space-time separable units project \nthrough a 16 unit combination layer to the output unit (see Figure 2). The entire \nnetwork can be trained to approximate a variety of motion-energy filters. \n\nWe trained the motion energy network in two different ways: as a single multilayered \nnetwork and in stages. 
Staged training proceeded first by training intermediate units, \nthen, with the intermediate units fixed, by training the three layer network that \ncombines the intermediate units to produce a single motion energy output. The \noutput unit is active when a pattern in the appropriate range of spatial frequencies \nmoves through the receptive field with appropriate velocity. Many such units are \nrequired for a range of velocities, spatial frequencies, and spatial locations. We \nuse six different types of motion energy units - each tuned to a different temporal \nfrequency - at each of the central 48 positions of a 64 pixel linear retina. The 6 \npopulations form a distributed, velocity-tuned representation of image motion for \na total of 288 motion energy units. \n\nIn addition to the motion energy filters, static spatial frequency filters are also \ncomputed and used in the interface network, one for each band and each position \nfor a total of 288 units. \n\nWe chose an adaptive network rather than a direct motion energy calculation because \nit allows us to model the dynamic nature of the visual signal with greater \nflexibility. However, this raises complications regarding the set of training images. \nAssuming 5 bits of information at each retinal position, there are well over 10^100 \npossible input patterns. We explored sine waves, random spots and a \nvariety of spatial pre-filters, and found that low-pass filtered images of moving random \nspots worked best. Typically we began the training process from a plausible set of \nweights, rather than from random values, to prevent the network from settling into \nan initial local minimum. Training proceeded for days until good performance was \nobtained on a testing set. 
\nKrauzlis and Lisberger (1989) have predicted that the visual stimulus to the visual \ntracking system in the brain contains information about the acceleration and impulse \nof the target as well as the velocity. Our motion energy networks are sensitive \nto target acceleration, producing transients for accelerating stimuli. \n\n3 The Interface Network \n\nThe function of the interface is to take the distributed representation of the image \nmotion and extract a single velocity estimate for the moving object. We use a \nrelatively simple method that was adequate for tracking single objects without other \nmoving distractors. The activity level of a single motion energy unit is ambiguous. \nFirst, it is necessary for the object to have a feature that is matched to the spatial \nfrequency bandpass of the motion energy unit. Second, there is an array of units \nfor each spatial frequency and the object will stimulate only a few of these at any \ngiven time. For instance, a large white object will have no features in its interior; a \nunit with its receptive field located in the interior can detect no motion. Conversely, \ndetectors with receptive fields on the border between the object and the background \nwill be strongly stimulated. \n\nWe use two stages of processing to extract a velocity. In the first stage, the motion \nenergy in each spatial frequency band is estimated by summing the outputs of the \nmotion energy filters across the retina weighted by the spatial frequency filter at \neach location. The six populations of spatial frequency units each yield one value. \nNext, a 6-6-1 feedforward network, trained using backpropagation, predicts target \nvelocity from these values. \n\n4 The Motor Control Network \n\nIn comparison with the visual processing network, the motor control network is quite \nsmall (see Figure 3). 
The goal of the network is to move the eye to stabilize the \nimage of the object. The visual processing and interface networks convert images \nof the moving target into an estimate of the retinal velocity of the target. This \nretinal velocity can be considered a motor error. One approach to reducing this \nerror is a simple proportional feedback controller, which drives the eye at a velocity \nproportional to the error. There is a large, 50-100 ms delay that occurs during \nvisual processing in the primate visual system. In the presence of a large delay a \nproportional controller will either be inaccurate or unstable. For this reason simple \nproportional feedback is not sufficient to control tracking in the primate. Tracking \ncan be made stable and accurate by including an internal positive feedback pathway \nto prevent instability while preserving accuracy (Robinson, 1971). \n\nFigure 3: The structure of the recurrent network. Each circle is a unit. Units within \na box are not interconnected and all units between boxes were fully interconnected \nas indicated by the arrows. \n\nThe motor control network was based on a model of the primate visual tracking \nmotor control system by Lisberger and Sejnowski (1992). This recurrent artificial \nneural network includes both the smooth visual tracking system and the vestibulo-ocular \nsystem, which is important for compensating for head movements. We use a \nsimpler version of that model that does not have vestibular inputs. The network is \nconstructed from units with continuous smooth temporal responses. 
The state of a \nunit is  a function  of previous inputs and previous state: \n\nBj(t + ~t) = (1  - T~t)Bj(t) + IT~t \n\nwhere  Bj(t)  is  the  state  of  unit  j  at  time  t,  T  is  a  time  constant  and  I  is  the \nsigmoided  sum  of the  weighted  pre-synaptic  activities.  The  resulting  network  is \ncapable of smooth responses  to inputs. \n\nThe motor control network has 12  units,  each with a  time constant of 5 ms  (except \nfor  a  few  units  with  longer  delay).  There  is  a  time  delay  of 50  ms  between  the \ninterface  network  and  control network.  (see  Figure  3).  The input  to  the  network \nis  retinal  target  velocity,  the output is  eye  velocity.  The  motor control network is \ntrained  to track a  target in  the presence  of the visual  delay. \n\nThe motor  control  network contains a  positive feedback  loop  that is  necessary  to \nmaintain  accurate  tracking  even  when  the  error  signal  falls  to  zero.  The  overall \ncontrol network also contains a  negative feedback loop since  the output of the  net(cid:173)\nwork  affects subsequent inputs.  The gradient descent optimization  procedure  uses \nthe  relationship  between  the  output  and  the input during training-this relation(cid:173)\nship can be  considered  a  model of the plant.  It should  be possible to use the same \napproach with more complex plants. \n\nThe  control  network  was  trained  with  the  visual  processing  network  frozen.  A \ntraining  example  consists  of an  object  trajectory  and  the  goal  trajectory  for  the \neye.  A standard recurrent network training paradigm is  used  to adjust the weights \nto minimize the error between actual outputs and desired outputs for  step changes \nin target  velocity. \n\n\f386 \n\nViola,  Lisberger, and Sejnowski \n\n- --.....  - ~---- .... --. 
\n\n[Figure 4 plot not reproduced; horizontal axis: Seconds.]\n\nFigure 4: Response of the eye to a step in target velocity of 30 degrees per second.\nThe solid line is target velocity; the dashed line is eye velocity. This experiment\nwas performed with a target that did not appear in the training set.\n\n5  Performance\n\nAfter training the network on a set of trajectories for a single target, the tracking\nperformance was equally good on new targets. Tracking is accurate and stable,\nwith little tendency to ring (see Figure 4). This good performance is surprising in\nthe presence of a 50 millisecond delay in the visual feedback signal (see footnote 2).\nStable tracking is not possible without the positive internal feedback loop in the\nmodel (eye velocity signal to the flocculus in Figure 3).\n\n6  Limitations\n\nThe system that we have designed is a relatively small one, with a one-dimensional\nretina only 64 pixels wide. The eye and the target can move in only one dimension,\nalong the length of the retina. The visual analysis that is performed is not, however,\nlimited to one dimension. Motion energy filters are easily generalized to a\ntwo-dimensional retina, so our approach should be extensible to the two-dimensional\ntracking problem.\n\nThe backgrounds of the images that we used for tracking were featureless. The\ncurrent system cannot distinguish target features from background features. Also, the\ninterface network was designed to track a single object in the absence of moving\ndistractors. The next step is to expand this interface to model the attentional\nphenomena observed in primate tracking, especially the process of initial target\n\nFootnote 2: We selected time constants, delays, and sampling rates throughout the model to\nroughly approximate the time course of the primate visual tracking response.  
The model\nruns on a workstation, taking approximately thirty times real time to complete a processing\nstep.\n\nacquisition.\n\n7  Conclusion\n\nIn simulations, our eye tracking model performed well. Many additional difficulties\nmust be addressed, but we feel this system can perform well under real-world\nreal-time constraints. Previous work by Lisberger and Sejnowski (1992) demonstrates\nthat this visual tracking model can be integrated with inertial eye stabilization, the\nvestibulo-ocular reflex. Ultimately, it should be possible to build a physical system\nusing these design principles.\n\nEvery component of the system was designed using network learning techniques.\nThe visual processing, for example, had a variety of components that were trained\nseparately and in combinations. The architecture of the networks was based on\nthe anatomy and physiology of the visual and oculomotor systems. This approach\nto reverse engineering is based on the existing knowledge of the flow of information\nthrough the relevant brain pathways.\n\nIt should also be possible to use the model to develop and test theories about the\nnature of biological visual tracking. This is just a first step toward developing a\nrealistic model of the primate oculomotor system, but it has already provided useful\npredictions for the possible sites of plasticity during gain changes of the\nvestibulo-ocular reflex (Lisberger and Sejnowski, 1992).\n\nReferences\n\n[1] E. H. Adelson and J. R. Bergen. Spatiotemporal energy models for the perception\nof motion. Journal of the Optical Society of America, 2(2):284-299, 1985.\n\n[2] D. H. Ballard. Animate vision. Artificial Intelligence, 48:57-86, 1991.\n\n[3] R. J. Krauzlis and S. G. Lisberger.  
A control systems model of smooth pursuit\neye movements with realistic emergent properties. Neural Computation, 1:116-122,\n1992.\n\n[4] S. G. Lisberger, E. J. Morris, and L. Tychsen. Ann. Rev. Neurosci., 10:97-129,\n1987.\n\n[5] S. G. Lisberger and T. J. Sejnowski. Computational analysis suggests a new\nhypothesis for motor learning in the vestibulo-ocular reflex. Submitted for\npublication, 1992.\n\n[6] W. T. Newsome and E. B. Pare. A selective impairment of motion perception\nfollowing lesions of the middle temporal visual area (MT). J. Neuroscience,\n8:2201-2211, 1988.\n\n[7] D. A. Robinson. Models of oculomotor neural organization. In P. Bach y Rita\nand C. C. Collins, editors, The Control of Eye Movements, page 519. Academic,\nNew York, 1971.\n", "award": [], "sourceid": 562, "authors": [{"given_name": "Paul", "family_name": "Viola", "institution": null}, {"given_name": "Stephen", "family_name": "Lisberger", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}