{"title": "An Oscillatory Correlation Framework for Computational Auditory Scene Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 747, "page_last": 753, "abstract": null, "full_text": "An Oscillatory Correlation Framework for Computational Auditory Scene Analysis\n\nGuy J. Brown\nDepartment of Computer Science\nUniversity of Sheffield\nRegent Court, 211 Portobello Street,\nSheffield S1 4DP, UK\nEmail: g.brown@dcs.shef.ac.uk\n\nDeLiang L. Wang\nDepartment of Computer and Information Science and Centre for Cognitive Science\nThe Ohio State University\nColumbus, OH 43210-1277, USA\nEmail: dwang@cis.ohio-state.edu\n\nAbstract\n\nA neural model is described which uses oscillatory correlation to segregate speech from interfering sound sources. The core of the model is a two-layer neural oscillator network. A sound stream is represented by a synchronized population of oscillators, and different streams are represented by desynchronized oscillator populations. The model has been evaluated using a corpus of speech mixed with interfering sounds, and produces an improvement in signal-to-noise ratio for every mixture.\n\n1  Introduction\n\nSpeech is seldom heard in isolation: usually, it is mixed with other environmental sounds. Hence, the auditory system must parse the acoustic mixture reaching the ears in order to retrieve a description of each sound source, a process termed auditory scene analysis (ASA) [2]. Conceptually, ASA may be regarded as a two-stage process. The first stage (which we term 'segmentation') decomposes the acoustic stimulus into a collection of sensory elements. In the second stage ('grouping'), elements that are likely to have arisen from the same environmental event are combined into a perceptual structure called a stream. 
Streams may be further interpreted by higher-level cognitive processes.\n\nRecently, there has been a growing interest in the development of computational systems that mimic ASA [4], [1], [5]. Such computational auditory scene analysis (CASA) systems are inspired by auditory function but do not model it closely; rather, they employ symbolic search or high-level inference engines. Although the performance of these systems is encouraging, they are no match for the abilities of a human listener; also, they tend to be complex and computationally intensive. In short, CASA currently remains an unsolved problem for real-time applications such as automatic speech recognition.\n\nGiven that human listeners can segregate concurrent sounds with apparent ease, computational systems that are more closely modelled on the neurobiological mechanisms of hearing may offer a performance advantage over existing CASA systems. This observation, together with a desire to understand the neurobiological basis of ASA, has led some investigators to propose neural network models of ASA. Most recently, Brown and Wang [3] have given an account of concurrent vowel separation based on oscillatory correlation. In this framework, oscillators that represent a perceptual stream are synchronized (phase locked with zero phase lag), and are desynchronized from oscillators that represent different streams [8]. Evidence for the oscillatory correlation theory comes from neurobiological studies which report synchronised oscillations in the auditory, visual and olfactory cortices (see [10] for a review).\n\n\f748\n\nG. J. Brown and D. L. Wang\n\nIn this paper, we propose a neural network model that uses oscillatory correlation as the underlying neural mechanism for ASA; streams are formed by synchronizing oscillators in a two-dimensional time-frequency network. 
The model is evaluated on a task that involves the separation of two time-varying sounds. It therefore extends our previous study [3], which only considered the segregation of vowel sounds with static spectra.\n\n2  Model description\n\nThe input to the model consists of a mixture of speech and an interfering sound source, sampled at a rate of 16 kHz with 16 bit resolution. This input signal is processed in four stages described below (see [10] for a detailed account).\n\n2.1  Peripheral auditory processing\n\nPeripheral auditory frequency selectivity is modelled using a bank of 128 gammatone filters with center frequencies equally distributed on the equivalent rectangular bandwidth (ERB) scale between 80 Hz and 5 kHz [1]. Subsequently, the output of each filter is processed by a model of inner hair cell function. The output of the hair cell model is a probabilistic representation of auditory nerve firing activity.\n\n2.2  Mid-level auditory representations\n\nMechanisms similar to those underlying pitch perception can contribute to the perceptual separation of sounds that have different fundamental frequencies (F0s) [3]. Accordingly, the second stage of the model extracts periodicity information from the simulated auditory nerve firing patterns. This is achieved by computing a running autocorrelation of the auditory nerve activity in each channel, forming a representation known as a correlogram [1], [5]. At time step j, the autocorrelation A(i,j,τ) for channel i with time lag τ is given by:\n\n$$A(i,j,\\tau) = \\sum_{k=0}^{K-1} r(i,j-k)\\, r(i,j-k-\\tau)\\, w(k) \\qquad (1)$$\n\nHere, r is the output of the hair cell model and w is a rectangular window of width K time steps. We use K = 320, corresponding to a window width of 20 ms. 
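As an illustration of (1), the following is a minimal Python sketch, assuming NumPy and a hypothetical one-dimensional array r that holds the hair cell output for a single channel. The function name running_autocorrelation, the parameter names num_lags and window, and the treatment of samples before the start of the signal as zero are assumptions made for this example; they are not part of the model description.

```python
import numpy as np

def running_autocorrelation(r, j, num_lags=201, window=320):
    # Evaluate equation (1) for one channel at time step j.
    # r: 1-D array of simulated auditory nerve activity (hypothetical input).
    # Samples before the start of the signal are taken to be zero (assumption).
    r = np.asarray(r, dtype=float)
    pad = window + num_lags
    padded = np.concatenate([np.zeros(pad), r])
    jp = j + pad                      # position of time step j after padding
    k = np.arange(window)
    recent = padded[jp - k]           # r(i, j - k) for k = 0 .. K-1
    acf = np.empty(num_lags)
    for tau in range(num_lags):
        delayed = padded[jp - k - tau]        # r(i, j - k - tau)
        acf[tau] = np.dot(recent, delayed)    # rectangular window, w(k) = 1
    return acf
```

Under these assumptions, the entry at lag zero is the sum of squared activity inside the window, and a periodic input produces a peak at the lag equal to its period.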
The autocorrelation lag τ is computed in L steps of the sampling period between 0 and L-1; we use L = 201, corresponding to a maximum delay of 12.5 ms. Equation (1) is computed for M time frames, taken at 10 ms intervals (i.e., at intervals of 160 steps of the time index j).\n\nFor periodic sounds, a characteristic 'spine' appears in the correlogram which is centered on the lag corresponding to the stimulus period (Figure 1A). This pitch-related structure can be emphasized by forming a 'pooled' correlogram s(j,τ), which exhibits a prominent peak at the delay corresponding to perceived pitch:\n\n$$s(j,\\tau) = \\sum_{i=1}^{N} A(i,j,\\tau) \\qquad (2)$$\n\nIt is also possible to extract harmonics and formants from the correlogram, since frequency channels that are excited by the same acoustic component share a similar pattern of periodicity. Bands of coherent periodicity can be identified by cross-correlating adjacent correlogram channels; regions of high correlation indicate a harmonic or formant [1]. The cross-correlation C(i,j) between channels i and i+1 at time frame j is defined as:\n\n$$C(i,j) = \\frac{1}{L} \\sum_{\\tau=0}^{L-1} \\hat{A}(i,j,\\tau)\\, \\hat{A}(i+1,j,\\tau) \\qquad (1 \\le i \\le N-1) \\qquad (3)$$\n\nHere, Â(i,j,τ) is the autocorrelation function of (1) which has been normalized to have zero mean and unity variance. A typical cross-correlation function is shown in Figure 1A.\n\n\fOscillatory Correlation for CASA\n\n749\n\n2.3  Neural oscillator network: overview\n\nSegmentation and grouping take place within a two-layer oscillator network (Figure 1B). The basic unit of the network is a single oscillator, which is defined as a reciprocally connected excitatory variable x and inhibitory variable y [7]. 
Since each layer of the network takes the form of a time-frequency grid, we index each oscillator according to its frequency channel (i) and time frame (j):\n\n$$\\dot{x}_{ij} = 3x_{ij} - x_{ij}^3 + 2 - y_{ij} + I_{ij} + S_{ij} + \\rho \\qquad (4a)$$\n\n$$\\dot{y}_{ij} = \\varepsilon\\left(\\gamma\\left(1 + \\tanh(x_{ij}/\\beta)\\right) - y_{ij}\\right) \\qquad (4b)$$\n\nHere, I_ij represents external input to the oscillator, S_ij denotes the coupling from other oscillators in the network, ε, γ and β are parameters, and ρ is the amplitude of a Gaussian noise term. If coupling and noise are ignored and I_ij is held constant, (4) defines a relaxation oscillator with two time scales. The x-nullcline, i.e. dx_ij/dt = 0, is a cubic function and the y-nullcline is a sigmoid function. If I_ij > 0 and β is chosen small, the two nullclines intersect only at a point along the middle branch of the cubic. In this case, the oscillator exhibits a stable limit cycle for small values of ε, and is referred to as enabled. The limit cycle alternates between silent and active phases of near steady-state behaviour. Compared to motion within each phase, the alternation between phases takes place rapidly, and is referred to as jumping. If I_ij < 0, the two nullclines intersect at a stable fixed point. In this case, no oscillation occurs. Hence, oscillations in (4) are stimulus-dependent.\n\n2.4  Neural oscillator network: segment layer\n\nIn the first layer of the network, segments are formed: blocks of synchronised oscillators that trace the evolution of an acoustic component through time and frequency. The first layer is a two-dimensional time-frequency grid of oscillators with a global inhibitor (see Figure 1B). 
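Before coupling is considered, the stimulus-dependent behaviour of a single oscillator in (4) can be explored numerically. The following is a minimal Python sketch, assuming NumPy and simple forward-Euler integration with coupling and noise set to zero. The function name simulate_oscillator and the values eps = 0.02, gamma = 6.0, beta = 0.1, the step size and the number of steps are assumptions chosen for illustration; they are not taken from this paper.

```python
import numpy as np

def simulate_oscillator(I, eps=0.02, gamma=6.0, beta=0.1,
                        x0=0.0, y0=0.0, dt=0.01, steps=60000):
    # Forward-Euler integration of (4a) and (4b) for one oscillator,
    # with the coupling S and the noise term set to zero.
    # eps, gamma, beta, dt and steps are illustrative assumptions.
    x = np.empty(steps)
    y = np.empty(steps)
    x[0], y[0] = x0, y0
    for n in range(steps - 1):
        dx = 3.0 * x[n] - x[n] ** 3 + 2.0 - y[n] + I
        dy = eps * (gamma * (1.0 + np.tanh(x[n] / beta)) - y[n])
        x[n + 1] = x[n] + dt * dx
        y[n + 1] = y[n] + dt * dy
    return x, y

x_on, _ = simulate_oscillator(I=0.2)     # positive input: limit cycle expected
x_off, _ = simulate_oscillator(I=-5.0)   # negative input: fixed point expected
```

With these assumed parameter values, the trace for positive input alternates between a silent phase (x near -2) and an active phase (x near 2), whereas the trace for negative input settles on the left branch of the cubic and stays there.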
The coupling term S_ij in (4a) is defined as\n\n$$S_{ij} = \\sum_{kl \\in N(i,j)} W_{ij,kl}\\, H(x_{kl} - \\theta_x) - W_z H(z - \\theta_z) \\qquad (5)$$\n\nwhere H is the Heaviside function (i.e., H(x) = 1 for x ≥ 0, and zero otherwise), W_ij,kl is the connection weight from an oscillator (i,j) to an oscillator (k,l) and N(i,j) is the four nearest neighbors of (i,j). The threshold θ_x is chosen so that an oscillator has no influence on its neighbors unless it is in the active phase. The weight of neighboring connections along the time axis is uniformly set to 1. The connection weight between an oscillator (i,j) and its vertical neighbor (i+1,j) is set to 1 if C(i,j) exceeds a threshold θ_c; otherwise it is set to 0. W_z is the weight of inhibition from the global inhibitor z, defined as\n\n(6)\n\nwhere σ_∞ = 1 if x_ij ≥ θ_z for at least one oscillator (i,j), and σ_∞ = 0 otherwise. Hence θ_z is a threshold. If σ_∞ = 1, z → 1.\n\nFigure 1: A. Correlogram of a mixture of speech and trill telephone, taken 450 ms after the start of the stimulus. The pooled correlogram is shown in the bottom panel, and the cross-correlation function is shown on the right. B. Structure of the two-layer oscillator network.\n\nSmall segments may form which do not correspond to perceptually significant acoustic components. In order to remove these noisy fragments, we introduce a lateral potential p_ij for oscillator (i,j), defined as [11]:\n\n$$\\dot{p}_{ij} = (1 - p_{ij})\\, H\\Big[\\sum_{kl \\in N_p(i,j)} H(x_{kl} - \\theta_x) - \\theta_p\\Big] - \\varepsilon\\, p_{ij} \\qquad (7)$$\n\nHere, θ_p is a threshold. N_p(i,j) is called the potential neighborhood of (i,j), which is chosen to be (i,j-1) and (i,j+1). 
If both neighbors of (i,j) are active, p_ij approaches 1 on a fast time scale; otherwise, p_ij relaxes to 0 on a slow time scale determined by ε.\n\nThe lateral potential plays its role by gating the input to an oscillator. More specifically, we replace (4a) with\n\n$$\\dot{x}_{ij} = 3x_{ij} - x_{ij}^3 + 2 - y_{ij} + I_{ij} H(p_{ij} - \\theta) + S_{ij} + \\rho \\qquad (4a')$$\n\nWith p_ij initialized to 1, it follows that p_ij will drop below the threshold θ unless the oscillator (i,j) receives excitation from its entire potential neighborhood. Given our choice of potential neighborhood in (7), this implies that a segment must extend for at least three consecutive time frames. Oscillators that are stimulated but cannot maintain a high potential are relegated to a discontiguous 'background' of noisy activity.\n\nAn oscillator (i,j) is stimulated if its corresponding input I_ij > 0. Oscillators are stimulated only if the energy in their corresponding correlogram channel exceeds a threshold θ_a. It is evident from (1) that the energy in a correlogram channel i at time j corresponds to A(i,j,0); thus we set I_ij = 0.2 if A(i,j,0) > θ_a, and I_ij = -5 otherwise.\n\nFigure 2A shows the segmentation of a mixture of speech and trill telephone. The network was simulated by the LEGION algorithm [8], producing 94 segments (each represented by a distinct gray level) plus the background (shown in black). For convenience we show all segments together in Figure 2A, but each actually arises during a unique time interval.\n\nFigure 2: A. Segments formed by the first layer of the network for a mixture of speech and trill telephone. B. Categorization of segments according to F0. Gray pixels represent the set P, and white pixels represent regions that do not agree with the F0. 
\n\n2.5  Neural oscillator network: grouping layer\n\nThe second layer is a two-dimensional network of laterally coupled oscillators without global inhibition. Oscillators in this layer are stimulated if the corresponding oscillator in the first layer is stimulated and does not form part of the background. Initially, all oscillators have the same phase, implying that all segments from the first layer are allocated to the same stream. This initialization is consistent with psychophysical evidence suggesting that perceptual fusion is the default state of auditory organisation [2]. In the second layer, an oscillator has the same form as in (4), except that (4a) is changed to:\n\n$$\\dot{x}_{ij} = 3x_{ij} - x_{ij}^3 + 2 - y_{ij} + I_{ij}\\left[1 + \\mu H(p_{ij} - \\theta)\\right] + S_{ij} + \\rho \\qquad (4a'')$$\n\nHere, μ is a small positive parameter; this implies that an oscillator with a high lateral potential gets a slightly higher external input. We choose N_p(i,j) and θ_p so that oscillators which correspond to the longest segment from the first layer are the first to jump to the active phase. The longest segment is identified by using the mechanism described in [9]. The coupling term in (4a'') consists of two types of coupling:\n\n$$S_{ij} = S^{e}_{ij} + S^{v}_{ij} \\qquad (8)$$\n\nHere, S^e_ij represents mutual excitation between oscillators within each segment. We set S^e_ij = 4 if the active oscillators from the same segment occupy more than half of the length of the segment; otherwise S^e_ij = 0.1 if there is at least one active oscillator from the same segment.\n\nThe coupling term S^v_ij denotes vertical connections between oscillators corresponding to different frequency channels and different segments, but within the same time frame. 
At each time frame, an F0 is estimated from the pooled correlogram (2) and this is used to classify frequency channels into two categories: a set of channels, P, that are consistent with the F0, and a set of channels that are not (Figure 2B). Given the delay τ_m at which the largest peak occurs in the pooled correlogram, for each channel i at time frame j, i ∈ P if\n\n$$A(i,j,\\tau_m) / A(i,j,0) > \\theta_d \\qquad (9)$$\n\nSince A(i,j,0) is the energy in correlogram channel i at time j, (9) amounts to classification on the basis of an energy threshold. We use θ_d = 0.95. The delay τ_m can be found by using a winner-take-all network, although for simplicity we currently apply a maximum selector.\n\nFigure 3: A. Snapshot showing the activity of the second layer shortly after the start of simulation. Active oscillators (white pixels) correspond to the speech stream. B. Another snapshot, taken shortly after A. Active oscillators correspond to the telephone stream.\n\nThe F0 classification process operates on channels, rather than segments. As a result, channels within the same segment at a particular time frame may be allocated to different F0 categories. Since segments cannot be decomposed, we enforce a rule that all channels of the same frame within each segment must belong to the same F0 category as that of the majority of channels. 
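The classification in (9) and the majority rule can be sketched in Python as follows. This is a minimal sketch, assuming NumPy, a hypothetical array acf of shape (channels, lags) that holds the correlogram for one time frame, and a hypothetical integer array segment_ids that labels the segment of each channel (with -1 for channels outside any segment). The function names, the min_lag parameter used to skip the trivial maximum at lag zero, and the choice to resolve ties in favour of the set P are assumptions made for illustration.

```python
import numpy as np

def classify_channels(acf, theta_d=0.95, min_lag=32):
    # acf[i, tau] holds the autocorrelation of channel i at one time frame
    # (hypothetical layout).
    pooled = acf.sum(axis=0)                  # pooled correlogram, equation (2)
    # Skip short lags so that the maximum at lag zero is not selected
    # (min_lag is an assumption of this sketch).
    tau_m = min_lag + int(np.argmax(pooled[min_lag:]))
    energy = acf[:, 0]
    ratio = np.zeros(acf.shape[0])
    nonzero = energy > 0
    ratio[nonzero] = acf[nonzero, tau_m] / energy[nonzero]
    return ratio > theta_d, tau_m             # membership of P, equation (9)

def enforce_majority(in_p, segment_ids):
    # Give every channel of a segment the category held by most of its
    # channels; ties are resolved in favour of P (assumption).
    result = in_p.copy()
    for seg in np.unique(segment_ids):
        if seg < 0:                           # -1 marks channels in no segment
            continue
        members = segment_ids == seg
        result[members] = in_p[members].sum() * 2 >= members.sum()
    return result
```

Here the maximum selector mentioned in the text corresponds to the argmax over the pooled correlogram; a winner-take-all network would replace that single line.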
After this conformational step, vertical connections are formed such that, at each time frame, two oscillators of different segments have mutual excitatory links if the two corresponding channels belong to the same F0 category; otherwise they have mutual inhibitory links. S^v_ij is set to -0.5 if (i,j) receives an input from its inhibitory links; similarly, S^v_ij is set to 0.5 if (i,j) receives an input from its vertical excitatory links.\n\nAt present, our model has no mechanism for grouping segments that do not overlap in time. Accordingly, we limit operation of the second layer to the time span of the longest segment. After forming lateral connections and trimming by the longest segment, the network is numerically solved using the singular limit method [6].\n\nFigure 3 shows the response of the second layer to the mixture of speech and trill telephone. The figure shows two snapshots of the second layer, where a white pixel indicates an active oscillator and a black pixel indicates a silent oscillator. The network quickly forms two synchronous blocks, which desynchronize from each other. Figure 3A shows a snapshot taken when the oscillator block (stream) corresponding to the segregated speech is in the active phase; Figure 3B shows a subsequent snapshot when the oscillator block corresponding to the trill telephone is in the active phase. Hence, the activity in this layer of the network embodies the result of ASA; the components of an acoustic mixture have been separated using F0 information and represented by oscillatory correlation.\n\n2.6  Resynthesis\n\nThe last stage of the model is a resynthesis path. Phase-corrected output from the gammatone filterbank is divided into 20 ms sections, overlapping by 10 ms and windowed with a raised cosine. 
A weighting is then applied to each section, which is unity if the corresponding oscillator is in its active phase, and zero otherwise. The weighted filter outputs are summed across all channels to yield a resynthesized waveform.\n\nFigure 4: A. SNR before (black bar) and after (grey bar) separation by the model. Results are shown for voiced speech mixed with ten intrusions (N0 = 1 kHz tone; N1 = random noise; N2 = noise bursts; N3 = 'cocktail party' noise; N4 = rock music; N5 = siren; N6 = trill telephone; N7 = female speech; N8 = male speech; N9 = female speech). B. Percentage of speech energy recovered from each mixture after separation by the model.\n\n", "award": [], "sourceid": 1669, "authors": [{"given_name": "Guy", "family_name": "Brown", "institution": null}, {"given_name": "DeLiang", "family_name": "Wang", "institution": null}]}