{"title": "Audio Vision: Using Audio-Visual Synchrony to Locate Sounds", "book": "Advances in Neural Information Processing Systems", "page_first": 813, "page_last": 819, "abstract": null, "full_text": "Audio-Vision: Using Audio-Visual Synchrony to Locate Sounds\n\nJohn Hershey*\njhershey@cogsci.ucsd.edu\nDepartment of Cognitive Science\nUniversity of California, San Diego\nLa Jolla, CA 92093-0515\n\nJavier Movellan\nmovellan@cogsci.ucsd.edu\nDepartment of Cognitive Science\nUniversity of California, San Diego\nLa Jolla, CA 92093-0515\n\nAbstract\n\nPsychophysical and physiological evidence shows that sound localization of acoustic signals is strongly influenced by their synchrony with visual signals. This effect, known as ventriloquism, is at work when sound coming from the side of a TV set feels as if it were coming from the mouth of the actors. The ventriloquism effect suggests that there is important information about sound location encoded in the synchrony between the audio and video signals. In spite of this evidence, audiovisual synchrony is rarely used as a source of information in computer vision tasks. In this paper we explore the use of audio-visual synchrony to locate sound sources. We developed a system that searches for regions of the visual landscape that correlate highly with the acoustic signals and tags them as likely to contain an acoustic source. We discuss our experience implementing the system, present results on a speaker localization task, and discuss potential applications of the approach.\n\nIntroduction\n\nWe present a method for locating sound sources by sampling regions of an image that correlate in time with the auditory signal. 
Our approach is inspired by psychophysical and physiological evidence suggesting that audio-visual contingencies play an important role in the localization of sound sources: sounds seem to emanate from visual stimuli that are synchronized with the sound. This effect becomes particularly noticeable when the perceived source of the sound is known to be false, as in the case of a ventriloquist's dummy or a television screen. This phenomenon is known in the psychophysical community as the ventriloquism effect, defined as a mislocation of sounds toward their apparent visual source. The effect is robust in a wide variety of conditions, and has been found to be strongly dependent on the degree of \"synchrony\" between the auditory and visual signals (Driver, 1996; Bertelson, Vroomen, Wiegeraad & de Gelder, 1994).\n\n*To whom correspondence should be addressed.\n\nThe ventriloquism effect is in fact less speech-specific than first thought. For example, the effect is not disrupted by an upside-down lip signal (Bertelson, Vroomen, Wiegeraad & de Gelder, 1994) and is just as strong when the lip signals are replaced by light flashes that are synchronized with amplitude peaks in the audio signal (Radeau & Bertelson, 1977). The crucial aspect here is correlation between visual and auditory intensity over time. When the light flashes are not synchronized the effect disappears.\n\nThe ventriloquism effect is strong enough to produce an enduring localization bias, known as the ventriloquism aftereffect. Over time, experience with spatially offset auditory-visual stimuli causes a persistent shift in subsequent auditory localization. 
Exposure to audio-visual stimuli offset from each other by only 8 degrees of azimuth for 20-30 minutes is sufficient to shift auditory localization by the same amount. A corresponding shift in neural processing has been detected in macaque monkeys as early as primary auditory cortex (Recanzone, 1998). In barn owls a misalignment of visual and auditory stimuli during development causes the realignment of the auditory and visual maps in the optic tectum (Zheng & Knudsen, 1999; Stryker, 1999; Feldman & Knudsen, 1997).\n\nThe strength of the psychophysical and physiological evidence suggests that audio-visual contingency may be used as an important source of information that is currently underutilized in computer vision tasks. Visual and auditory sensor systems carry information about the same events in the world, and this information must be combined correctly for the two modalities to interact usefully. Audiovisual contingency can be exploited to help determine which signals in different modalities share a common origin. The benefits are two-fold: the two signals can help localize each other, and once paired can help interpret each other. To this end we developed a system to localize speakers using input from a camera and a single microphone. The approach is based on searching for regions of the image which are \"synchronized\" with the acoustic signal.\n\nMeasuring Synchrony\n\nThe concept of audio-visual synchrony is not well formalized in the psychophysical literature, so for a working definition we interpret synchrony as the degree of mutual information between audio and spatially localized video signals. Ultimately it is a causal relationship that we are often interested in, but causes can only be inferred from effects such as synchrony. 
Let $a(t) \\in \\mathbb{R}^n$ be a vector describing the acoustic signal at time $t$. The components of $a(t)$ could be cepstral coefficients, pitch measurements, or the outputs of a filter bank. Let $v(x, y, t) \\in \\mathbb{R}^m$ be a vector describing the visual signal at time $t$, pixel $(x, y)$. The components of $v(x, y, t)$ could represent Gabor energy coefficients, RGB color values, etc.\n\nConsider now a set of $s$ audio and visual vectors $S = (a(t_l), v(x, y, t_l))_{l=k-s+1,\\ldots,k}$ sampled at times $t_{k-s+1}, \\ldots, t_k$ and at spatial coordinates $(x, y)$. Given this set of vectors our goal is to provide a number that describes the temporal contingency between audio and video at time $t_k$. The approach we take is to consider each vector in $S$ as an independent sample from a joint multivariate Gaussian process $(A(t_k), V(x, y, t_k))$ and define audio-visual synchrony at time $t_k$ as the estimate of the mutual information between the audio and visual components of the process.\n\nLet $A(t_k) \\sim N_n(\\mu_A(t_k), \\Sigma_A(t_k))$ and $V(x, y, t_k) \\sim N_m(\\mu_V(x, y, t_k), \\Sigma_V(x, y, t_k))$, where $\\mu$ represents means and $\\Sigma$ covariance matrices. 
Let $A(t_k)$ and $V(x, y, t_k)$ be jointly Gaussian, i.e., $(A(t_k), V(x, y, t_k)) \\sim N_{n+m}(\\mu_{A,V}(x, y, t_k), \\Sigma_{A,V}(x, y, t_k))$. The mutual information between $A(t_k)$ and $V(x, y, t_k)$ can be shown to be as follows\n\n$$I(A(t_k); V(x, y, t_k)) = H(A(t_k)) + H(V(x, y, t_k)) - H(A(t_k), V(x, y, t_k)) \\tag{1}$$\n\n$$= \\frac{1}{2} \\log\\left((2\\pi e)^n |\\Sigma_A(t_k)|\\right) + \\frac{1}{2} \\log\\left((2\\pi e)^m |\\Sigma_V(x, y, t_k)|\\right) - \\frac{1}{2} \\log\\left((2\\pi e)^{n+m} |\\Sigma_{A,V}(x, y, t_k)|\\right) \\tag{2}$$\n\n$$= \\frac{1}{2} \\log \\frac{|\\Sigma_A(t_k)| \\, |\\Sigma_V(x, y, t_k)|}{|\\Sigma_{A,V}(x, y, t_k)|}. \\tag{3}$$\n\nIn the special case that $n = m = 1$, then\n\n$$I(A(t_k); V(x, y, t_k)) = -\\frac{1}{2} \\log\\left(1 - \\rho^2(x, y, t_k)\\right), \\tag{4}$$\n\nwhere $\\rho(x, y, t_k)$ is the Pearson correlation coefficient between $A(t_k)$ and $V(x, y, t_k)$.\n\nFor each triple $(x, y, t_k)$ we estimate the mutual information between $A(t_k)$ and $V(x, y, t_k)$ by considering each element of $S$ as an independent sample from the random vector $(A(t_k), V(x, y, t_k))$. This amounts to computing estimates of the joint covariance matrix $\\Sigma_{A,V}(x, y, t_k)$. For example, the estimate of the covariance between the $i$th audio component and the $j$th video component would be as follows\n\n$$S_{A_i, V_j}(x, y, t_k) = \\frac{1}{s-1} \\sum_{l=0}^{s-1} \\left(a_i(t_{k-l}) - \\bar{a}_i(t_k)\\right) \\left(v_j(x, y, t_{k-l}) - \\bar{v}_j(x, y, t_k)\\right), \\tag{5}$$\n\nwhere\n\n$$\\bar{a}_i(t_k) = \\frac{1}{s} \\sum_{l=0}^{s-1} a_i(t_{k-l}), \\tag{6}$$\n\n$$\\bar{v}_j(x, y, t_k) = \\frac{1}{s} \\sum_{l=0}^{s-1} v_j(x, y, t_{k-l}). \\tag{7}$$\n\nThese simple covariance estimates can be computed recursively in constant time with respect to the number of timepoints. The independent treatment of pixels would lend itself well to a parallel implementation.\n\nTo measure performance, a secondary system produces a single estimate of the auditory location, for use with a database of labeled solitary audiovisual sources. Unfortunately there are many ways of producing such estimates, so it becomes difficult to separate performance of the measure from the underlying system. 
The model used here is a centroid computation on the mutual information estimates, with some enhancements to aid tracking and reduce background noise.\n\n[Figure 1 panels: (a) M is talking. (b) J is talking.]\n\nFigure 1: Normalized audio and visual intensity across sequences of frames in which a sequence of four numbers is spoken. The top trace is the contour of the acoustic energy from one of two speakers, M or J, and the bottom trace is the contour of intensity values for a single pixel, (147,100), near the mouth of J.\n\nImplementation Issues\n\nA real-time system was prototyped using a QuickCam on the Linux operating system and then ported to NT as a DirectShow filter. This platform provides input from real-time audio and video capture hardware as well as from static movie files. The video output could also be rendered live or compressed and saved in a movie file. The implementation was challenging in that it turns out to be rather difficult to process precisely time-synchronized audio and video on a serial machine in real time. Multiple threads are required to read from the peripheral audio and visual devices. By the time the audio and visual streams reach the AV filter module, they are quite separate and asynchronous. The separately threaded auditory and visual packet streams must be synchronized, buffered, and then matched and aligned by time-stamps before they can be processed. It is interesting that successful biological audiovisual systems employ a parallel architecture and thus avoid this problem. 
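The time-stamp matching described above can be illustrated with a minimal sketch in Python, which is not the language of the DirectShow filter described here. It assumes both streams have already been buffered into lists sorted by time-stamp, and it pairs each video frame with the mean audio energy over the preceding frame period. The function name align_audio_to_frames, its parameters, and the half-open window (t - dt, t] are illustrative assumptions, not details taken from the system itself.

```python
from bisect import bisect_right


def align_audio_to_frames(frame_times, audio_times, audio_energy, dt=1.0 / 30.0):
    '''Pair each video frame with the mean audio energy in (t - dt, t].

    frame_times:  sorted time-stamps (seconds) of the video frames.
    audio_times:  sorted time-stamps (seconds) of the audio packets.
    audio_energy: one energy value per audio packet.
    Returns a list of (frame_time, mean_energy) pairs; mean_energy is None
    when no audio packet falls inside the window of that frame.
    '''
    pairs = []
    for t in frame_times:
        lo = bisect_right(audio_times, t - dt)  # first packet strictly after t - dt
        hi = bisect_right(audio_times, t)       # one past the last packet at or before t
        window = audio_energy[lo:hi]
        mean = sum(window) / len(window) if window else None
        pairs.append((t, mean))
    return pairs


# Hypothetical data: three frames one second apart, five audio packets.
frames = [1.0, 2.0, 3.0]
packet_times = [0.25, 0.75, 1.5, 2.0, 3.5]
packet_energy = [2.0, 4.0, 6.0, 8.0, 10.0]
aligned = align_audio_to_frames(frames, packet_times, packet_energy, dt=1.0)
```

Frames with no audio packets in their window are returned with None, so that a caller can skip them rather than treat missing audio as silence.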
\n\nResults\n\nTo obtain a performance baseline we first tried the simplest possible approach: a single audio and visual feature per location, $n = m = 1$. Here $v(x, y, t) \\in \\mathbb{R}$ is the intensity of pixel $(x, y)$ at time $t$, and $a(t) \\in \\mathbb{R}$ is the average acoustic energy over the interval $[t - \\Delta t, t]$, where $\\Delta t = 1/30$ s is the sampling period for the NTSC video signal. Figure 1 illustrates the time course of these signals for a non-synchronous and a synchronous pair of acoustic energy and pixel intensity. Notice in particular that in the synchronous pair, 1(b), where the sound and pixel values come from the same speaker, the relationship between the signals changes over time. There are regions of positive and negative covariance strung together in succession. Clearly the relationship over the entire sequence is far from linear. However, over shorter time periods a linear relationship looks like a better approximation. Our window size of 16 samples (i.e., $s = 16$ in Equation 5) coincides approximately with this time-scale. Perhaps by averaging over many small windows we can capture on a larger scale what would be lost to the same method applied with a larger window. Of course there is a trade-off in the time-scale between sensitivity to spurious transients and the response time of the system.\n\nWe applied this mutual information measure to all the pixels in a movie, in the spirit of the perceptual maps of the brain. The result is a changing topographic map of audiovisual mutual information. Figure 2 illustrates two snapshots in which\n\n(a) Frame 206: M (at left) is talking.\n\n(b) Frame 104: J (at right) is talking. 
\n\nFigure 2: Estimated mutual information between pixel intensity and audio intensity (bright areas indicate greater mutual information) overlaid on stills from the video where one person is in mid-utterance.\n\ndifferent parts of the face are synchronous (possibly with different sign) with the sound they take part in producing. It is interesting that the synchrony is shared by some parts, such as the eyes, that do not directly contribute to the sound, but contribute to the communication nonetheless.\n\nTo estimate the position of the speaker we computed a centroid where each point was weighted by the estimated mutual information between the corresponding pixel and the audio signal. At each time step the mutual information was estimated using 16 past frames (i.e., $s = 16$). In order to reduce the intrusion of spurious correlations from competing targets, once a target has been found, we employ a Gaussian influence function (Goodall, 1983). The influence function reduces the weight given to mutual information from locations far from the current centroid when computing the next centroid. To allow for speedy disengagement from a dwindling source of mutual information we set a threshold on the mutual information. Measurements under the threshold are treated as zero. This threshold also reduces the effects of unwanted background noise, such as camera and microphone jitter.\n\n$$\\hat{S}_x(t) = \\frac{\\sum_x \\sum_y x \\, \\theta(\\log(1 - \\hat{\\rho}^2(x, y, t))) \\, \\psi(x, \\hat{S}_x(t-1))}{\\sum_x \\sum_y \\theta(\\log(1 - \\hat{\\rho}^2(x, y, t))) \\, \\psi(x, \\hat{S}_x(t-1))} \\tag{9}$$\n\nwhere $\\hat{S}_x(t)$ represents the estimate of the $x$ coordinate for the position of the speaker at time $t$, $\\theta(\\cdot)$ is the thresholding function, and $\\psi(x, \\hat{S}_x(t-1))$ is the influence function, which depends upon the position $x$ of the pixel being sampled and the prior estimate $\\hat{S}_x(t-1)$. 
$\\hat{\\rho}^2(x, y, t)$ is the estimate of the squared correlation between the intensity in pixel $(x, y)$ and the acoustic energy, when using the 16 past video frames. $-\\frac{1}{2} \\log(1 - \\hat{\\rho}^2(x, y, t))$ is the corresponding estimate of mutual information (the factor $-\\frac{1}{2}$ cancels out in the quotient after adjusting the threshold function accordingly).\n\nWe tried the approach on a movie of two people (M and J) taking turns while saying random digits. Figure 3 shows the estimated and actual positions of the speaker as a function of time. The estimates clearly provide information that could be used to localize the speaker, especially in combination with other approaches (e.g., flesh detection).\n\n[Figure 3 plot; horizontal axis: Frame Number.]\n\nFigure 3: Estimated and actual position of speaker at each frame for six hundred frames. The sources, M and J, took turns uttering a series of four digits, for three turns each. The actual positions and alternation times were measured by hand from the video recording.\n\nConclusions\n\nWe have presented exploratory work on a system for localizing sound sources on a video signal by tagging regions of the image that are correlated in time with the auditory signal. The approach was motivated by the wealth of evidence in the psychophysical and physiological literature showing that sound localization is strongly influenced by synchrony with the visual signal. We presented a measure of local synchrony based on modeling the audio-visual signal as a non-stationary Gaussian process. We developed a general software tool that accepts as inputs all major video and audio file formats as well as direct input from a video camera. 
We tested the tool on a speaker localization task with very encouraging results. The approach could have practical applications for localizing sound sources in situations where acoustic stereo cues are nonexistent or unreliable. For example, the approach could be used to help localize the actor talking in a video scene and put closed-captioned text near the audio source. The approach could also be used to guide a camera in teleconferencing applications.\n\nWhile the results reported here are very encouraging, more work needs to be done before practical applications are developed. For example, we need to investigate more sophisticated methods for processing the audio and video signals. At this point we use average energy to represent the audio, and thus changes in the fundamental frequency that do not affect the average energy would not be captured by our model. Similarly, local video decompositions, like spatio-temporal Gabor filtering, or approaches designed to enhance the lip regions may be helpful. The changing symmetry observed between audio and video signals might be addressed by rectifying or squaring the normalized signals and derivatives. Finally, relaxing the Gaussian constraints in our measure of audio-visual contingency may help improve performance. While the work shown here is exploratory at this point, the approach is very promising: it emphasizes the idea of machine perception as a multimodal process, it is backed by psychophysical evidence, and when combined with other approaches it may help improve robustness in tasks such as localization and separation of sound sources.\n\nReferences\n\nBertelson, P., Vroomen, J., Wiegeraad, G., & de Gelder, B. (1994). 
Exploring the relation between McGurk interference and ventriloquism. In Proceedings of the 1994 International Conference on Spoken Language Processing, volume 2, pages 559-562.\n\nDriver, J. (1996). Enhancement of selective listening by illusory mislocation of speech sounds due to lip-reading. Nature, 381, 66-68.\n\nFeldman, D. E. & Knudsen, E. I. (1997). An anatomical basis for visual calibration of the auditory space map in the barn owl's midbrain. The Journal of Neuroscience, 17(17), 6820-6837.\n\nGoodall, C. (1983). M-Estimators of Location: an outline of the theory. Wiley series in probability and mathematical statistics. Applied probability and statistics.\n\nRadeau, M. & Bertelson, P. (1977). Adaptation to auditory-visual discordance and ventriloquism in semi-realistic situations. Perception and Psychophysics, 22, 137-146.\n\nRecanzone, G. H. (1998). Rapidly induced auditory plasticity: The ventriloquism aftereffect. Proceedings of the National Academy of Sciences, USA, 95, 869-875.\n\nStryker, M. P. (1999). Sensory Maps on the Move. Science, 925-926.\n\nZheng, W. & Knudsen, E. I. (1999). Functional Selection of Adaptive Auditory Space Map by GABAA-Mediated Inhibition, 962-965.\n", "award": [], "sourceid": 1686, "authors": [{"given_name": "John", "family_name": "Hershey", "institution": null}, {"given_name": "Javier", "family_name": "Movellan", "institution": null}]}