{"title": "Onset-based Sound Segmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 729, "page_last": 735, "abstract": null, "full_text": "Onset-based Sound Segmentation\n\nLeslie S. Smith\n\nCCCN/Department of Computer Science\n\nUniversity of Stirling\n\nStirling FK9 4LA\n\nScotland\n\nAbstract\n\nA technique for segmenting sounds using processing based on mammalian early auditory processing is presented. The technique is based on features in sound which neuron spike recording suggests are detected in the cochlear nucleus. The sound signal is bandpassed and each resulting signal is processed to enhance onsets and offsets. The onset and offset signals are compressed, then clustered both in time and across frequency channels using a network of integrate-and-fire neurons. Onsets and offsets are signalled by spikes, and the timing of these spikes is used to segment the sound.\n\n1 Background\n\nTraditional speech interpretation techniques based on Fourier transforms, spectrum recoding, and a hidden Markov model or neural network interpretation stage have limitations both in continuous speech and in interpreting speech in the presence of noise, and this has led to interest in front ends modelling biological auditory systems for speech interpretation systems (Ainsworth and Meyer 92; Cosi 93; Cole et al 95).\n\nAuditory modelling systems use similar early auditory processing to that used in biological systems. Mammalian auditory processing uses two ears, and the incoming signal is filtered first by the pinna (external ear) and the auditory canal before it causes the tympanic membrane (eardrum) to vibrate. This vibration is then passed on through the bones of the middle ear to the oval window on the cochlea. 
Inside the cochlea, the pressure wave causes a pattern of vibration to occur on the basilar membrane. This appears to be an active process using both the inner and outer hair cells of the organ of Corti. The movement is detected by the inner hair cells and turned into neural impulses by the neurons of the spiral ganglion. These pass down the auditory nerve, and arrive at various parts of the cochlear nucleus. From there, nerve fibres innervate other areas: the lateral and medial nuclei of the superior olive, and the inferior colliculus, for example (see (Pickles 88)).\n\nVirtually all modern sound or speech interpretation systems use some form of bandpass filtering, following the biology as far as the cochlea. Most use Fourier transforms to perform a calculation of the energy in each band over some time period, usually between 25 and 75 ms. This is not what the cochlea does. Auditory modelling front ends differ in the extent to which they follow animal early auditory processing, but the term generally implies at least that wideband filters are used, and that high temporal resolution is maintained in the initial stages. This means the use of filtering techniques, rather than Fourier transforms, in the bandpass stage. Such filtering systems have been implemented by Patterson and Holdsworth (Patterson and Holdsworth 90; Slaney 93), and placed directly in silicon (Lazzaro and Mead 89; Lazzaro et al 93; Liu et al 93; Fragniere and van Schaik 94).\n\nSome auditory models have moved beyond cochlear filtering. The inner hair cell has been modelled either by simple rectification (Smith 94) or on the basis of the work of (Meddis 88), for example (Patterson and Holdsworth 90; Cosi 93; Brown 92). 
Lazzaro has experimented with a silicon version of Licklider's autocorrelation processing (Licklider 51; Lazzaro and Mead 89). Others, such as (Wu et al 1989; Blackwood et al 1990; Ainsworth and Meyer 92; Brown 92; Berthommier 93; Smith 94), have considered the early brainstem nuclei and their possible contribution, based on the neurophysiology of the different cell types (Pickles 88; Blackburn and Sachs 1989; Kim et al 90).\n\nAuditory model-based systems have yet to find their way into mainstream speech recognition systems (Cosi 93). The work presented here uses auditory modelling up to onset cells in the cochlear nucleus. It adds a temporal neural network to clean up the segmentation produced. This part has been filed as a patent (Smith 95). Though the system has some biological plausibility, the aim is an effective data-driven segmentation technique implementable in silicon.\n\n2 Techniques used\n\nDigitized sound was applied to an auditory front end (Patterson and Holdsworth 90), which bandpassed the sound into channels, each with bandwidth $24.7(4.37F_c + 1)$ Hz, where $F_c$ is the centre frequency (in kHz) of the band (Moore and Glasberg 83). These were rectified, modelling the effect of the inner hair cells. The signals produced bear some resemblance to those in the auditory nerve. The real system has far more channels and each nerve channel carries spike-coded information. The coding here models the signal in a population of neighboring auditory nerve fibres.\n\n2.1 The onset-offset filter\n\nThe signal present in the auditory nerve is stronger near the onset of a tone than later (Pickles 88). This effect is much more pronounced in certain cell types of the cochlear nucleus. These fire strongly just after the onset of a sound in the band to which they are sensitive, and are then silent. 
This emphasis on onsets was modelled by convolving the signal in each band with a filter which computes two averages, a more recent one and a less recent one, and subtracts the less recent one from the more recent one. One biologically possible justification for this is to consider that a neuron is receiving the same driving input twice, once excitatorily and once inhibitorily; the excitatory input has a shorter time-constant than the inhibitory input. Both exponentially weighted averages and averages formed using a Gaussian filter have been tried (Smith 94), but the former place too much emphasis on the most recent part of the signal, making the latter more effective.\n\nThe filter output for input signal $s(x)$ is\n\n$$O(t, k, r) = \\int_0^t \\left( f(t - x, k) - f(t - x, k/r) \\right) s(x) \\, dx \\qquad (1)$$\n\nwhere $f(x, y) = \\sqrt{y} \\exp(-y x^2)$. $k$ and $r$ determine the rise and fall times of the pulses of sound that the system is sensitive to. We used $k = 1000$, $r = 1.2$, so that the SDs of the Gaussians are 22.36ms (for $k$) and 24.49ms (for $k/r$). The convolving filter has a positive peak at 0, crosses 0 at 22.39ms, and is then negative. With these values, the system is sensitive to energy rises and falls which occur in the envelopes of everyday sounds. A positive onset-offset signal implies that the bandpassed signal is increasing in intensity, and a negative onset-offset signal implies that it is decreasing in intensity. The convolution used is a sound analog of the difference of Gaussians operator used to extract black/white and white/black edges in monochrome images (Marr and Hildreth 80). In (Smith 94) we performed sound segmentation directly on this signal.
\n\n2.2 Compressing the onset-offset signal\n\nThe onset-offset signal was divided into two positive-going signals: an onset signal consisting of the positive-going part, and an offset signal consisting of the inverted negative-going part. Both were compressed logarithmically (where $\\log(x)$ was taken as 0 for $0 \\le x \\le 1$). This increases the dynamic range of the system, and models compressive biological effects. The compressed onset signal models the output of a population of onset cells. This technique for producing an onset signal is related to that of (Wu et al 1989; Cosi 93).\n\n2.3 The integrate-and-fire neural network\n\nTo segment the sound using the onset and offset signals, they need to be integrated across frequency bands and across time. This temporal and tonotopic clustering was achieved using a network of integrate-and-fire units. An integrate-and-fire unit accumulates its weighted input over time. The activity of the unit, $A$, is initially 0, and alters according to\n\n$$\\frac{dA}{dt} = I(t) - \\gamma A \\qquad (2)$$\n\nwhere $I(t)$ is the input to the neuron and $\\gamma$, the dissipation, describes the leakiness of the integration. When $A$ reaches a threshold, the unit fires (i.e. emits a pulse), and $A$ is reset to 0. After firing, there is a period of insensitivity to input, called the refractory period. Such neurons are discussed in, e.g., (Mirollo and Strogatz 90).\n\nOne integrate-and-fire neuron was used per channel: this neuron received input either from a single channel, or from a set of adjacent channels, all with equal positive weighting. The output of each neuron was fed back to a set of adjacent neurons, again with a fixed positive weight, one time step (here 0.5ms) later. 
Because of the leaky nature of the accumulation of activity, excitatory input to the neuron arriving when its activation is near threshold has a larger effect on the next firing time than excitatory input arriving when activation is lower. Thus, if similar input is applied to a set of neurons in adjacent channels, the effect of the inter-neuron connections is that when the first one fires, its neighbors fire almost immediately. This allows a network of such neurons to cluster the onset or offset signals, producing a sharp burst of spikes across a number of channels, providing unambiguous onsets or offsets.\n\nThe external and internal weights of the network were adjusted so that onset or offset input alone allowed neurons to fire, while internal input alone was not enough to cause firing. The refractory period used was set to 50ms for the onset system, and 5ms for the offset system. For the onset system, the effect was to produce sharp onset firing responses across adjacent channels in response to a sudden increase in energy in some channels, thus grouping onsets both tonotopically and temporally. This is appropriate for onsets, as these are generally brief and clearly marked. The output of this stage we call the onset map. Offsets tend to be more gradual. This is due to physical effects: for example, a percussive sound will start suddenly, as the vibrating element starts to move, but die away slowly as the vibration ceases (see (Gaver 93) for a discussion). Even when the vibration does stop suddenly, the sound will die away more slowly due to echoes. Thus we cannot reliably mark the offset of a sound: instead, we reduce the refractory period of the offset neurons, and produce a train of pulses marking the duration of the offset in this channel. 
We call the output of this stage the offset map.\n\n3 Results\n\nAs the technique is entirely data-driven, it can be applied to sound from any source. It has been applied to both speech and musical sounds. Figure 1 shows the effect of applying the techniques discussed to a short piece of speech. Fig 1c shows that the neural network integrates the onset timings across the channels, allowing these onsets to be used for segmentation. The simplest technique is to divide up the continuous speech at each onset: however, to ensure that the occasional onset in a single channel does not confuse the system, and that onsets which occur near to each other do not result in very short segments, we demanded that a segmentation boundary have at least 6 onsets inside a period of 10ms, and the minimum segment length was set to 25ms.\n\nThe utterance Neural information processing systems has phonetic representation:\n\n/ njtlrl: anfarme Ian prosc:salJ ststalllS /\n\nand is segmented into the following 19 segments:\n/n/, /jtl/, /r/, /la/, /a/, /nf/, /arm/, /e/, /I/, /an/, /pro/, /os/, /c:s/, /aIJ/, /s/, /t/, /st/, /am/, /s/\nThe same text spoken more slowly (over 4.38s, rather than 2.31s) has phonetic representation:\n\n/ njural:anftrmeIanprosc:stIJ ststams /\n\nSegmenting using this technique gives the following 25 segments:\n/n/, /ju/, /u/, /r/, /a/, /al/, /1/, / /, /an/, /f/, /um/, /e/, /I/, /an/, /n:/, /pr/, /ro/, /os/, /c:s/, /tIJ/, /s/, /t/, /st/, /am/, /s/\nAlthough some phonemes are broken between segments, the system provides effective segmentation, and is relatively insensitive to speech rate. The system is also effective at finding speech inside certain types of noise (such as motor-bike noise), as can be seen in fig 1e and f.
\n\nThe system has been used to segment sound from single musical instruments. Where these have clear breaks between notes this is straightforward: in (Smith 94) correct segmentation was achieved directly from the onset-offset signal, but was not achieved for slurred sounds, in which the notes change smoothly. As is visible in figure 2c, the onsets here are clear using the network, and the segmentation produced is near-perfect. Best results were obtained here when the input to the network is not spread across channels.\n\nFigure 1: (a-d): Onset and offset maps from author saying Neural information processing systems rapidly. a: envelope of original sound. b: onset map, from 28 channels, from 100Hz-6kHz. Onset filter parameters as in text; one neuron per channel, with no interconnection. Neuron refractory period is 50ms. c: as b, but network has input applied to 6 adjacent channels, and internal feedback to 10 channels. d: offset map produced similarly, with refractory period 5ms. e: envelope of say, that's a nice bike with motorbike noise in background (lines mark utterance). f, g: onset, offset maps for e.\n\n4 Conclusions and further work\n\nAn effective data-driven segmentation technique based on onset feature detection and using integrate-and-fire neurons has been demonstrated. The system is relatively immune to broadband noise. 
Segmentation is not an end in itself: the effectiveness of any technique will depend on the eventual application.\n\nFigure 2: a: slurred flute sound, with vertical lines showing boundary between notes. b: onsets found using a single neuron per channel, and no interconnection. c: as b, but with internal feedback from each channel to 16 adjacent channels. d: offsets found with refractory period 5ms.\n\nThe segmentation is currently not using the information on which bands the onsets occur in. We propose to extend this work by combining the segmentation described here with work streaming bands sharing same-frequency amplitude modulation. The aim of this is to extract sound segments from some subset of the bands, allowing segmentation and streaming to run concurrently.\n\nAcknowledgements\n\nMany thanks are due to the members of the Centre for Cognitive and Computational Neuroscience at the University of Stirling.\n\nReferences\n\nAinsworth W., Meyer G. Speech analysis by means of a physiologically-based model of the cochlear nerve and cochlear nucleus, in Visual representations of speech signals, Cooke M., Beet S. eds, 1992.\n\nBerthommier F. Modelling neural responses of the intermediate auditory system, in Mathematics applied to biology and medicine, Demongeot J., Capasso V., Wuertz Publishing, Canada, 1993.\n\nBlackburn C.C., Sachs M.B. Classification of unit types in the anteroventral cochlear nucleus: PST histograms and regularity analysis, J. Neurophysiology, 62, 6, 1989.\n\nBlackwood N., Meyer G., Ainsworth W. A model of the processing of voiced plosives in the auditory nerve and cochlear nucleus, Proceedings Inst of Acoustics, 12, 10, 1990.\n\nBrown G. 
Computational Auditory Scene Analysis, TR CS-92-22, Department of Computing Science, University of Sheffield, England, 1992.\n\nCole R., et al. The challenge of spoken language systems: research directions of the 90's, IEEE Trans Speech and Audio Processing, 3, 1, 1995.\n\nCosi P. On the use of auditory models in speech technology, in Intelligent Perceptual Models, LNCS 745, Springer Verlag, 1993.\n\nFragniere E., van Schaik A. Linear predictive coding of the speech signal using an analog cochlear model, MANTRA Internal Report, 94/2, MANTRA Center for Neuro-mimetic systems, EPFL, Lausanne, Switzerland, 1994.\n\nGaver W.W. What in the world do we hear?: an ecological approach to auditory event perception, Ecological Psychology, 5(1), 1-29, 1993.\n\nKim D.O., Sirianni J.G., Chang S.O. Responses of DCN-PVCN neurons and auditory nerve fibres in unanesthetized decerebrate cats to AM and pure tones: analysis with autocorrelation/power-spectrum, Hearing Research, 45, 95-113, 1990.\n\nLazzaro J., Mead C. Silicon modelling of pitch perception, Proc Natl. Acad Sciences, USA, 86, 9597-9601, 1989.\n\nLazzaro J., Wawrzynek J., Mahowald M., Sivilotti M., Gillespie D. Silicon auditory processors as computer peripherals, IEEE Trans on Neural Networks, 4, 3, May 1993.\n\nLicklider J.C.R. A duplex theory of pitch perception, Experientia, 7, 128-133, 1951.\n\nLiu W., Andreou A.G., Goldstein M.H. Analog cochlear model for multiresolution speech analysis, Advances in Neural Information Processing Systems 5, Hanson S.J., Cowan J.D., Lee Giles C. (eds), Morgan Kaufmann, 1993.\n\nMarr D., Hildreth E. Theory of edge detection, Proc. Royal Society of London B, 207, 187-217, 1980.\n\nMeddis R. Simulation of auditory-neural transduction: further studies, J. Acoust Soc Am, 83, 
3, 1988.\n\nMoore B.C.J., Glasberg B.R. Suggested formulae for calculating auditory-filter bandwidths and excitation patterns, J Acoust Soc America, 74, 3, 1983.\n\nMirollo R.E., Strogatz S.H. Synchronization of pulse-coupled biological oscillators, SIAM J. Appl Math, 50, 6, 1990.\n\nPatterson R., Holdsworth J. An Introduction to Auditory Sensation Processing, in AAM HAP, Vol 1, No 1, 1990.\n\nPickles J.O. An Introduction to the Physiology of Hearing, 2nd Edition, Academic Press, 1988.\n\nSlaney M. An efficient implementation of the Patterson-Holdsworth auditory filter bank, Apple technical report No 35, Apple Computer Inc, 1993.\n\nSmith L.S. Sound segmentation using onsets and offsets, J of New Music Research, 23, 1, 1994.\n\nSmith L.S. Onset/offset coding for interpretation and segmentation of sound, UK patent no 9505956.4, March 1995.\n\nWu Z.L., Schwartz J.L., Escudier P. A theoretical study of neural mechanisms specialized in the detection of articulatory-acoustic events, Proc Eurospeech 89, ed Tubach J.P., Mariani J.J., Paris, 1989.\n", "award": [], "sourceid": 1024, "authors": [{"given_name": "Leslie", "family_name": "Smith", "institution": null}]}