{"title": "Active Gesture Recognition using Learned Visual Attention", "book": "Advances in Neural Information Processing Systems", "page_first": 858, "page_last": 864, "abstract": null, "full_text": "Active Gesture Recognition using Learned Visual Attention\n\nTrevor Darrell and Alex Pentland\n\nPerceptual Computing Group\n\nMIT Media Lab\n\n20 Ames Street, Cambridge MA, 02138\n\ntrevor,sandy@media.mit.edu\n\nAbstract\n\nWe have developed a foveated gesture recognition system that runs in an unconstrained office environment with an active camera. Using vision routines previously implemented for an interactive environment, we determine the spatial location of salient body parts of a user and guide an active camera to obtain images of gestures or expressions. A hidden-state reinforcement learning paradigm is used to implement visual attention. The attention module selects targets to foveate based on the goal of successful recognition, and uses a new multiple-model Q-learning formulation. Given a set of target and distractor gestures, our system can learn where to foveate to maximally discriminate a particular gesture.\n\n1 INTRODUCTION\n\nVision has numerous uses in the natural world. It is used by many organisms in navigation and object recognition tasks, for finding resources or avoiding predators. Often overlooked in computational models of vision, however, and particularly relevant for humans, is the use of vision for communication and interaction. In these domains visual perception is an important communication modality, either in addition to language or when language cannot be used. In general, people place considerable weight on visual signals from another individual, such as facial expression, hand gestures, and body language. We have been developing neurally-inspired methods which combine low-level vision and learning to model these visual abilities.\n\nPreviously, we presented a method for view-based recognition of spatio-temporal hand gestures [2] and a similar mechanism for the analysis/real-time tracking of facial expressions [4]. These methods offered real-time performance and a relatively high level of accuracy, but required foveated images of the object performing the gesture. There are many domains/tasks for which these are not unreasonable assumptions, such as interaction with a single user workstation or an automobile with a single driver. However, the method had limited usefulness in unconstrained domains, such as \"intelligent rooms\" or interactive virtual environments, when the identity and location of the user are unknown.\n\nIn this paper, we expand our gesture recognition method to include an active component, utilizing a foveated image sensor that can selectively track a person's hand or face as they walk through a room. The camera tracking and model selection routines are guided by an action-selection system that implements visual attention based on reinforcement learning. Using a simple reward schedule, this attention system learns the appropriate object (hand, head) to foveate in order to maximize recognition performance.\n\n2 FOVEATED GESTURE ANALYSIS\n\nOur system for foveated gesture recognition combines person tracking routines, an active, high-resolution camera, and view-based normalized correlation analysis. First we will briefly describe the person tracking module and view-based analysis, then discuss their use with an active camera. 
\n\nWe have implemented vision routines to track a user in an office setting as part of our ALIVE system, an Artificial Life Interactive Video Environment [3]. This system can track people and identify head/hand locations as they walk about a room, and provides the contextual environment within which view-based gesture analysis methods can be successfully applied. The ALIVE system assumed little prior knowledge of the user, and operated on coarse-scale images.¹ ALIVE allows a user to interact with virtual artificial life creatures, through the use of a \"magic-mirror\" metaphor in which the user sees him/herself presented in a video display along with virtual creatures. A wide field-of-view video camera acquires an image of the user, which is then combined with computer graphics imagery and projected on a large screen in front of the user. Vision routines in ALIVE compute figure/ground segmentation and analyze the user's silhouette to determine the location of head, hands, and other salient body features. We use only a single, calibrated, wide field-of-view camera to determine the 3-D position of these features.² For details of our person tracking method see [14].\n\nIn our approach to real-time expression matching/tracking, a set of view-based correlation models is used to represent spatio-temporal gesture patterns. We take a sequence of images representing the gesture to be trained, and build a set of view models that are sufficient to track the object as it performs the gesture. Our view models are normalized correlation templates, and can either be intensity-based or based on band-pass or wavelet-based signal representations.³ We applied our model to the problem of hand gesture recognition [2] as well as to tracking facial expressions [4]. 
For facial tracking, we implemented an interpolation paradigm to map view-based correlation scores to facial motor controls. We used the Radial Basis Function (RBF) method [7]; interpolation was performed using a set of exemplars consisting of pairs of real faces and model faces in different expressions, which were obtained by generating a 3-D model face and asking the user to match it. With this simple formalism, we were able to track expressions of a real user and interpolate equivalent 3-D model faces in real-time.\n\n¹ A simple mechanism for recognition of hand gestures was implemented in the original ALIVE system but made no use of high-resolution view models, and could only recognize pointing and waving motions defined by the motion of the centroid of the hand.\n\n² By assuming that the user is sitting or standing on the ground plane, we use the imaging and ground plane geometry to compute the location of the user in 3-D.\n\n³ The latter have the advantage of being less dependent on illumination direction.\n\n[Figure 1 diagram; recoverable labels: animation/rendering, video wall, view-based gesture analysis]\n\nFigure 1: Overview of system for person tracking and active gesture recognition. A static, wide-field-of-view camera tracks the user's head and hands, which drives gaze control of an active narrow-field-of-view camera. Foveated images are used for view-based gesture analysis and recognition. Graphical objects are rendered on a video wall and can react to the user's position, pose, and gestures.\n\nThis view-based analysis requires detailed imagery, which cannot be obtained from a single, fixed camera as the user walks about a room. 
To provide high resolution images for gesture recognition, we augment the wide field-of-view camera in our interactive environment with an active, narrow-field-of-view camera, as shown in Figure 1. Information about head/hand location from the existing ALIVE routines is used to drive the motor control parameters of the narrow field camera. Currently the camera can be directed to autonomously track head or hands. Using a highly simplified, two-expression model of facial expression (neutral and surprised), we have been able to track facial expressions as users move about the room and the narrow angle camera follows the face. For details on this foveated gesture recognition see [5].\n\n3 VISUAL ATTENTION FOR RECOGNITION\n\nThe visual routines in the ALIVE system can be used to track the head and hands of a user, and the active camera can provide foveated images for gesture recognition. If we know a priori which body part will produce the gesture of interest, or if we have a sufficient number of active cameras to track all body parts, then we have solved the problem. Of course, in practice there are more possible loci of gesture performance than there are active cameras, and we have to address the problem of action selection for visual routines, i.e., attention. In our active gesture recognition system, we have adopted an action selection model based on reinforcement learning.\n\n3.1 THE ACTIVE GESTURE RECOGNITION PROBLEM\n\nWe define an Active Gesture Recognition (AGR) task as follows. First, we assume primitive routines exist to provide the continuous valued control and tracking of the different body parts that perform gestures. 
Second, we assume that body pose and hand/face state is represented as a feature set, based on the representation produced by our body tracker and view-based recognition system, and we define a gesture to be a configuration of the user's body pose and hand/face expression. Third, we assume that, in addition to there being actions for foveating all the relevant body parts, there is also a special action labeled accept, and that the execution of this action by the AGR system signifies detection of the gesture. Finally, the goal of the AGR task is to execute the accept action whenever the user is in the target gesture state, and not to perform that action when the user is in any other (e.g. distractor) state. The AGR system should use the foveation actions to optimally discriminate the target pattern from distractor patterns, even when no single view of the user is sufficient to decide what gesture the user is performing.\n\nAn important problem in applying reinforcement learning to this task is that our perceptual observations may not provide a complete description of the user's state. Indeed, because we have a foveated image sensor we know that the user's true gestural state will be hidden whenever the user is performing a gesture and the camera is not foveated on the appropriate body part. By definition, a system for perceptual action selection must not assume a full observation of state is available, otherwise there would be no meaningful perception taking place.\n\nThe AGR task can be considered as a Partially Observable Markov Decision Process (POMDP), which is essentially a Markov Decision Process without direct access to state [11, 9]. 
Rather than attempt to solve POMDPs explicitly, we look to techniques for hidden state reinforcement learning to find a solution [10, 8, 6, 1]. A POMDP consists of a set of states in the world S, a set of observations O, a set of actions A, and a reward function R. After executing an action a, the likelihood of transitioning between two states s, s' is given by T(s, a, s'), and an observation o is generated with probability O(s, a, o). In practice, T and O are not easily obtainable, and we use reinforcement learning methods which do not require them a priori.\n\nOur state is defined by the user's pose, facial expression, and hand configurations, expressed in nine variables. Three are boolean and are provided directly by the person tracker: person-present, left-arm-extended, and right-arm-extended. Three more are provided by the foveated gesture recognition system (face, left-hand, right-hand), and take on an integer number of values according to the number of view-based expressions/hand-poses: in our first experiments face can be one of neutral, smile, or surprise, and the hands can each be one of neutral, point, or grab. In addition, three boolean features represent the internal state of the vision system: head-foveated, left-hand-foveated, right-hand-foveated. At each time step, the world is defined by a state s ∈ S, which is defined by these features. An observation, o ∈ O, consists of the same feature variables, except that those provided by the foveated gesture system (e.g., head and hands) are only observable when foveated. Thus the face variable is hidden unless the head-foveated variable is set, the left-hand variable is hidden unless the left-hand-foveated variable is set, and similarly with the right hand. Hidden variables are set to an undefined value. 
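
The mapping from state to observation described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the dictionary layout, the name observe, and the use of None as the undefined value are hypothetical choices and are not taken from the paper; only the nine variable names and their gating by the foveation bits come from the text.

```python
# Hypothetical sketch: names and data layout are illustrative, not from the paper.
UNDEFINED = None  # assumed stand-in for the undefined value of a hidden variable

# Each variable from the foveated recognition system is visible only
# when the corresponding foveation bit is set.
GATED_BY = {
    'face': 'head-foveated',
    'left-hand': 'left-hand-foveated',
    'right-hand': 'right-hand-foveated',
}

def observe(state):
    '''Return the observation for a nine-variable state dict.'''
    obs = dict(state)  # copy, so the true state is left untouched
    for variable, foveation_bit in GATED_BY.items():
        if not state[foveation_bit]:
            obs[variable] = UNDEFINED
    return obs

state = {
    'person-present': True,
    'left-arm-extended': False,
    'right-arm-extended': True,
    'face': 'smile',
    'left-hand': 'neutral',
    'right-hand': 'point',
    'head-foveated': True,
    'left-hand-foveated': False,
    'right-hand-foveated': False,
}
observation = observe(state)
print(observation['face'], observation['right-hand'])
```

In this example the camera is foveated on the head, so the face value is observable while the pointing right hand stays hidden; a learner that sees only such observations cannot tell a pointing gesture from a neutral hand without first choosing to foveate the hand.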
\n\nThe set of actions, A, available to the AGR system consists of four foveation commands, look-body, look-head, look-left-hand, and look-right-hand, plus the special accept action. Each foveation command causes the active camera to follow the respective body part, and sets the internal foveation feature bits accordingly.\n\nThe reward function provides a unit positive reward whenever the accept action is performed and the user is in the target state (as defined by an oracle, external to the AGR system), and a fixed negative reward of magnitude α when accept is performed and the user is in a distractor (non-target) state. Zero reward is given whenever a foveation action is performed.\n\n3.2 HIDDEN-STATE REINFORCEMENT LEARNING\n\nWe have implemented an instance-based method for hidden state reinforcement learning, based on earlier work by McCallum [10]. The instance-based approach to reinforcement learning replaces the absolute state with a distributed memory-based state representation. Given a history of action, reward, and observation tuples, (a[t], r[t], o[t]), 0 ≤ t ≤ T, a Q-value is also stored with each time step, q[t], and Q-learning [12, 13] is performed by evaluating the similarity of recently observed tuples with sequences farther back in the history chain. Q-values are computed, and the Q-learning update rule applied, maintaining this distributed, memory-based representation of Q-values.\n\nAs in traditional Q-learning, at each time step the utility of each action in the current state is evaluated. If full access to the state were available and a table were used to represent Q-values, this would simply be a table look-up operation, but in a POMDP we do not have full access to state. 
Using a variation on the instance-based approach employed by McCallum's Nearest Sequence Memory (NSM) algorithm, we instead find the K nearest neighbors in the history list relative to the current time point, and compute their average Q-value. For each element on the history list, we compute the sequence match criterion with the current time point, M(i, T), where\n\n$$M(i,j) = \\begin{cases} S(i,j) + M(i-1,j-1) & \\text{if } S(i,j) > 0 \\text{ and } i > 0 \\text{ and } j > 0 \\\\ 0 & \\text{otherwise.} \\end{cases}$$\n\nWe define S(i, j) to be 1 if o[i] = o[j] or a[i] = a[j], 2 if both are equal, and 0 otherwise. Using a superscript in parentheses to denote the action index of a Q-value, we then compute\n\n$$Q^{(a)}[T] = (1/K) \\sum_{i=0}^{T} v^{(a)}[i] \\, q[i] , \\qquad (1)$$\n\nwhere $v^{(a^*)}[i]$ indicates whether the history tuple at time step i votes when computing the Q-value of a new action $a^*$: $v^{(a^*)}[i]$ is set to 1 when $a[i] = a^*$ and $M(i-1, T)$ is among the K largest match values for all k which have $a[k] = a^*$, otherwise it is set to 0. Given Q-values for each action, the optimal policy is simply\n\n$$\\pi[T] = \\arg\\max_{a \\in A} Q^{(a)}[T] . \\qquad (2)$$\n\nThe new action a[T + 1] is chosen either according to this policy or based on an exploration strategy. In either case, the action is executed, yielding an observation and reward, and a new tuple is added to the history. The new Q-value is set to be the Q-value of the chosen action, $q[T + 1] = Q^{(a[T+1])}[T]$. The update step of Q-learning is then computed, evaluating\n\n$$U[T + 1] = \\max_{a \\in A} Q^{(a)}[T + 1] , \\qquad (3)$$\n\n$$q[i] \\leftarrow (1 - \\beta) q[i] + \\beta (r[i] + \\gamma U[T + 1]) , \\qquad (4)$$\n\nfor each i such that $v^{(a[T+1])}[i] = 1$.\n\n[Figure 2 panels: (a) diagram of multiple model Q-learning; (b) plot of % error versus K, the number of nearest neighbors (β = 0.5, γ = 0.5, α = 10, 2500 trials)]\n\nFigure 2: (a) Multiple model Q-learning: one Q-learning agent for each target gesture to be recognized, with coupled observation and action but separate reward and Q-value. (b) Results on recognition task with 8 gesture targets; graph shows error rate after convergence plotted as a function of number of nearest neighbors used in learning algorithm.\n\n4 MULTIPLE MODEL Q-LEARNING\n\nIn general, we have found the simple, instance-based hidden state reinforcement learning described above to be an effective way to perform action selection for foveation when the task is recognition of a single object from a set of distractors. However, we did not find that this type of system performed well when the AGR task was extended to include more than one target gesture. When multiple accept actions were added to enumerate the different targets, we were not able to find exploration strategies that would converge in reasonable time.\n\nThis is not unexpected, since the addition of multiple causes of positive reward makes the Q-value space considerably more complex. To remedy this problem, we propose a multiple model Q-learning system. In a multiple model approach to the AGR problem, separate learning agents model the task from each target's perspective. Conceptually, a separate Q-learning agent exists for each target, maintains its own Q-value and history structure, and is coupled to the other agents via shared observations. 
Since we can interpret the Q-value of an individual AGR agent as a confidence value that its target is present, we can mediate among the actions predicted by the different agents by selecting the action from the agent with the highest Q-value (Figure 2).\n\nFormally, in our multiple model Q-learning system all agents share the same observation and selected action, but have different reward and Q-values. Thus they can be considered a single Q-learning system, but with vector reward and Q-values. Our multiple model learning system is thus obtained by rewriting Eqs. (1)-(4) with vector q[t] and r[t]. Using a subscript j to indicate the target index, we have\n\n$$Q_j^{(a)}[T] = (1/K) \\sum_{i=0}^{T} v^{(a)}[i] \\, q_j[i] , \\qquad \\pi[T] = \\arg\\max_{a \\in A} \\left( \\max_j Q_j^{(a)}[T] \\right) . \\qquad (5)$$\n\nRewards are computed with: if a[T] = accept then $r_j[T] = R(j, T)$, else $r_j[T] = 0$; $R(j, T) = 1$ if gesture j was present at time T, else $R(j, T) = -\\alpha$. Further,\n\n$$U_j[T + 1] = \\max_{a \\in A} Q_j^{(a)}[T + 1] , \\qquad (6)$$\n\n$$q_j[i] \\leftarrow (1 - \\beta) q_j[i] + \\beta (r_j[i] + \\gamma U_j[T + 1]) \\quad \\forall i \\text{ s.t. } v^{(a[T+1])}[i] = 1 . \\qquad (7)$$\n\nNote that our sequence match criterion, unlike that in [10], does not depend on r[t]; this allows considerable computational savings in the multiple model system since $v^{(a)}$ need not depend on j.\n\nWe ran the multiple model learning system on the AGR task using 8 targets, with β = 0.5, γ = 0.5, α = 10. Results summed over 2500 trials are shown in Figure 2(b), with classification error plotted against the number of nearest neighbors used in the NSM algorithm. The error rate shown is after convergence; we ran the algorithm with a period of deterministic exploration before following the optimal policy. (The system deterministically explored each action/accept pair.) 
As can be seen from the graph, for any non-degenerate value of K reasonable performance was obtained; for K > 2, the system performed almost perfectly.\n\nReferences\n\n[1] A. Cassandra, L. P. Kaelbling, and M. Littman. Acting optimally in partially observable stochastic domains. In Proc. AAAI-94, pages 1023-1028. Morgan Kaufmann, 1994.\n\n[2] T. Darrell and A. P. Pentland. Classification of Hand Gestures using a View-Based Distributed Representation. In Advances in Neural Information Processing Systems 6, Morgan Kaufmann, 1994.\n\n[3] T. Darrell, P. Maes, B. Blumberg, and A. P. Pentland. A Novel Environment for Situated Vision and Behavior. Proc. IEEE Workshop for Visual Behaviors, IEEE Comp. Soc. Press, Los Alamitos, CA, 1994.\n\n[4] T. Darrell, I. Essa, and A. P. Pentland. Correlation and Interpolation Networks for Real-time Expression Analysis/Synthesis. In Advances in Neural Information Processing Systems 7, MIT Press, 1995.\n\n[5] T. Darrell and A. Pentland. Attention-driven Expression and Gesture Analysis in an Interactive Environment. In Proc. Intl. Workshop on Automatic Face and Gesture Recognition (IWAFGR '95), Zurich, Switzerland, 1995.\n\n[6] T. Jaakkola, S. Singh, and M. Jordan. Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems. In Advances in Neural Information Processing Systems 7, MIT Press, 1995.\n\n[7] T. Poggio and F. Girosi. A Theory of Networks for Approximation and Learning. MIT AI Lab TR-1140, 1989.\n\n[8] L. Lin and T. Mitchell. Reinforcement learning with hidden states. In Proc. AAAI-92. Morgan Kaufmann, 1992.\n\n[9] W. Lovejoy. A survey of algorithmic methods of partially observed Markov decision processes. Annals of Operations Research, 28:47-66, 1991.\n\n[10] R. A. McCallum. Instance-based State Identification for Reinforcement Learning. In Advances in Neural Information Processing Systems 7, MIT Press, 1995.\n\n[11] Edward J. Sondik. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2):282-304, 1978.\n\n[12] R. S. Sutton. Learning to predict by the method of temporal differences. Machine Learning, 3:9-44, 1988.\n\n[13] C. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279-292, 1992.\n\n[14] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-Time Tracking of the Human Body. Media Lab Per. Comp. TR-353, 1994.\n", "award": [], "sourceid": 1079, "authors": [{"given_name": "Trevor", "family_name": "Darrell", "institution": null}, {"given_name": "Alex", "family_name": "Pentland", "institution": null}]}