{"title": "Neural System Model of Human Sound Localization", "book": "Advances in Neural Information Processing Systems", "page_first": 761, "page_last": 767, "abstract": null, "full_text": "Neural System Model of Human Sound Localization \n\nCraig T. Jin \n\nDepartment of Physiology and \nDepartment of Electrical Engineering, \nUniv. of Sydney, NSW 2006, Australia \n\nSimon Carlile \n\nDepartment of Physiology \nand Institute of Biomedical Research, \nUniv. of Sydney, NSW 2006, Australia \n\nAbstract \n\nThis paper examines the role of biological constraints in the human auditory \nlocalization process. A psychophysical and neural system modeling \napproach was undertaken in which performance comparisons between \ncompeting models and a human subject explore the relevant biologically \nplausible \"realism constraints\". The directional acoustical cues, \nupon which sound localization is based, were derived from the human \nsubject's head-related transfer functions (HRTFs). Sound stimuli were \ngenerated by convolving bandpass noise with the HRTFs and were presented \nto both the subject and the model. The input stimuli to the model \nwere processed using the Auditory Image Model of cochlear processing. \nThe cochlear data were then analyzed by a time-delay neural network \nwhich integrated temporal and spectral information to determine the spatial \nlocation of the sound source. The combined cochlear model and \nneural network provided a system model of the sound localization process. \nHuman-like localization performance was qualitatively achieved \nfor broadband and bandpass stimuli when the model architecture incorporated \nfrequency division (or tonotopicity), and was trained using variable \nbandwidth and center-frequency sounds. 
\n\n1  Introduction \n\nThe ability to accurately estimate the location of a sound source has obvious evolutionary \nadvantages in terms of avoiding predators and finding prey. Indeed, humans are very accurate \nin their ability to localize broadband sounds. There has been a considerable amount of \npsychoacoustical research into the auditory processes involved in human sound localization \n(recent review [1]). Furthermore, numerous models of the human and animal sound localization \nprocess have been proposed (recent reviews [2,3]). However, there still remains a \nlarge gap between the psychophysical and the model explanations. Principal congruence \nbetween the two approaches exists for localization performance under restricted conditions, \nsuch as for narrowband sounds where spectral integration is not required, or for restricted \nregions of space. Unfortunately, there is no existing computational model that accounts \nwell for human sound localization performance for a wide range of sounds (e.g., varying \nin bandwidth and center-frequency). Furthermore, the biological constraints pertinent \nto sound localization have generally not been explored by these models. These include \nthe spectral resolution of the auditory system in terms of the number and bandwidth of \nfrequency channels and the role of tonotopic processing. In addition, the performance requirements \nof such a system are substantial and involve, for example, the accommodation of \nspectrally complex sounds, the robustness to irregularity in the sound source spectrum, and \nthe channel based structure of spatial coding as evidenced by auditory spatial after-effects \n[4]. The crux of the matter is the notion that \"biologically-likely realism\", if built into a \nmodel, provides for a better understanding of the underlying processes. 
\n\nThis work attempts to bridge part of this gap between the modeling and psychophysics. It \ndescribes the development and use (for the first time, to the authors' knowledge) of a time-delay \nneural network model that integrates both spectral and temporal cues for auditory \nsound localization and compares the performance of such a model with the corresponding \nhuman psychophysical evidence. \n\n2  Sound Localization \n\nThe sound localization performance of a normal hearing human subject was tested using \nstimuli consisting of three different band-passed sounds: (1) a low-passed sound (300-2000 Hz), \n(2) a high-passed sound (2000-14000 Hz), and (3) a broadband sound (300-14000 Hz). \nThese frequency bands respectively cover conditions in which either temporal \ncues, spectral cues, or both dominate the localization process (see [1]). The subject performed \nfive localization trials for each sound condition, each with 76 test locations evenly \ndistributed about the subject's head. The detailed methods used in free-field sound localization \ncan be found in [5]. A short summary is presented below. \n\n2.1  Sound Localization Task \n\nHuman sound localization experiments were carried out in a darkened anechoic chamber. \nFree-field sound stimuli were presented from a loudspeaker carried on a semicircular \nrobotic arm. These stimuli consisted of \"fresh\" white Gaussian noise appropriately band-passed \nfor each trial. The robotic arm allowed for placement of the speaker at almost any \nlocation on the surface of an imaginary sphere, one meter in radius, centered on the subject's \nhead. The subject indicated the location of the sound source by pointing his nose in \nthe perceived direction of the sound. The subject's head orientation was monitored using \nan electromagnetic sensor system (Polhemus, Inc.). 
\n\n2.2  Measurement and Validation of Outer Ear Acoustical Filtering \n\nThe cues for sound localization depend not only upon the spectral and temporal properties \nof the sound stimulus, but also on the acoustical properties of the individual's outer \nears. It is generally accepted that the relevant acoustical cues (i.e., the interaural time difference, \nITD; interaural level difference, ILD; and spectral cues) to a sound's location in \nthe free-field are described by the head-related transfer function (HRTF) which is typically \nrepresented by a finite-length impulse response (FIR) filter [1]. Sounds filtered with the \nHRTF should be localizable when played over ear-phones which bypass the acoustical filtering \nof the outer ear. The illusion of free-field sounds using head-phones is known as \nvirtual auditory space (VAS). \n\nThus in order to incorporate outer ear filtering into the modeling process, measurements \nof the subject's HRTFs were carried out in the anechoic chamber. The measurements were \nmade for both ears simultaneously using a \"blocked ear\" technique [1]. Measurements \nwere made at 393 locations evenly distributed on the sphere. In order to establish that the \nHRTFs appropriately indicated the direction of a sound source, the subject repeated the \nlocalization task as above with the stimulus presented in VAS. \n\n2.3  Human Sound Localization Performance \n\nThe sound localization performance of the human subject in three different stimulus conditions \n(broadband, high-pass, low-pass) was examined in both the free-field and in virtual \nauditory space. Comparisons between the two (using correlational statistics, data not \nshown, but see [3]) across all sound conditions demonstrated their equivalence. Thus the \nmeasured HRTFs were highly effective. 
\n\nLocalization data across all three sound conditions (single trial VAS data shown in Fig. 1a) \nshow that the subject performed well in both the broadband and high-pass sound conditions \nand rather poorly in the low-pass condition, which is consistent with other studies [6]. \nThe data are illustrated using spherical localization plots which demonstrate well the global \ndistribution of localization responses. Given the large qualitative differences in the data \nsets presented below, this visual method of analysis was sufficient for evaluating the competing \nmodels. For each condition, the target and response locations are shown for both the \nleft (L) and right (R) hemispheres of space. It is clear that in the low-pass condition, the \nsubject demonstrated gross mislocalizations with the responses clustering toward the lower \nand frontal hemispheres. The gross mislocalizations correspond mainly to the traditional \ncone of confusion errors [6]. \n\n3  Localization Model \n\nThe sound localization model consisted of two basic system components: (1) a modified \nversion of the physiological Auditory Image Model [7] which simulates the spectro-temporal \ncharacteristics of peripheral auditory processing, and (2) the computational architecture \nof a time-delay neural network. The sounds presented to the model were filtered \nusing the subject's HRTFs in exactly the same manner as was used in producing VAS. \nTherefore, the modeling results can be compared with human localization performance on \nan individual basis. \n\nThe modeling process can be broken down into four stages. In the first stage a sound \nstimulus was generated with specific band-pass characteristics. In the second stage, the sound stimulus was \nfiltered with the subject's right and left ear HRTFs to render an auditory stimulus \noriginating from a particular location in space.  
In the third stage, the auditory stimulus was processed \nby the Auditory Image Model (AIM) to generate a neural activity profile that simulates \nthe output of the inner hair cells in the organ of Corti and indicates the spiking probability \nof auditory nerve fibers. Finally, in the fourth and last stage, a time-delay neural network \n(TDNN) computed the spatial direction of the sound input based on the distribution of \nneural activity calculated by AIM. \n\nA detailed presentation of the modeling process can be found in [3], although a brief summary \nis presented here. The distribution of cochlear filters across frequency in AIM was \nchosen such that the minimum center frequency was 300 Hz and the maximum center frequency \nwas 14 kHz with 31 filters essentially equally spaced on a logarithmic scale. In \norder to fully describe a computational layer of the TDNN, four characteristic numbers \nmust be specified: (1) the number of neurons; (2) the kernel length, a number which determines \nthe size of the current layer's time-window in terms of the number of time-steps \nof the previous layer; (3) the kernel width, a number which specifies how many neurons \nin the previous layer each neuron is actually connected to; and (4) the undersampling \nfactor, a number describing the multiplicative factor by which the current layer's time-step \ninterval is increased from the previous layer's. Using this nomenclature, the architecture \nof the different layers of one TDNN is summarized in Table 1, with the smallest time-step \nbeing 0.15 ms. The exact connection arrangement of the network is described in the next \nsection. \n\nTable 1: The Architecture of the TDNN. 
\n\nLayer     Neurons  Kernel Length  Kernel Width  Undersampling \nInput     62       -              -             - \nHidden 1  50       15             6             2 \nHidden 2  28       10             4,5,6         2 \nOutput    393      4              28            1 \n\nThe spatial location of a sound source was encoded by the network as a distributed response \nwith the peak occurring at the output neuron representing the target location of the input \nsound. The output response would then decay away in the form of a two-dimensional \nGaussian as one moves to neurons further away from the target location. This derives from \nthe well-established paradigm that the nervous system uses overlapping receptive fields to \nencode properties of the physical world. \n\n3.1  Networks with Frequency Division and Tonotopicity \n\nThe major auditory brainstem nuclei demonstrate substantial frequency division within \ntheir structure. The frequency-ordered arrangement of the primary auditory nerve fibers that innervate \nthe cochlea carries forward to the brainstem's auditory nuclei; this arrangement \nis described as a tonotopic organization. Despite this fact, and to our knowledge, no previous \nnetwork model for sound localization incorporates such frequency division within \nits architecture. Typically (e.g., [8]) all of the neurons in the first computational layer are \nfully connected to all of the input cochlear frequency channels. In this work, different architectures \nwere examined with varying amounts of frequency division imposed upon the \nnetwork structure. The network with the architecture described above had its connections \nconstrained by frequency in a tonotopic-like arrangement. The 31 input cochlear \nfrequency channels for each ear were split into ten overlapping groups consisting generally \nof six contiguous frequency channels. There were five neurons in the first hidden layer for \neach group of input channels. 
The kernel widths of these neurons were set, not to the total \nnumber of frequency channels in the input layer, but only to the six contiguous frequency \nchannels defining the group. Information across the different groups of frequency channels \nwas progressively integrated in the higher layers of the network. \n\n3.2  Network Training \n\nSounds with different center-frequency and bandwidth were used for training the networks. \nIn one particular training paradigm, the center-frequency and bandwidth of the noise were \nchosen randomly. The center-frequency was chosen using a uniform probability distribution \non a logarithmic scale that was similar to the physiological distribution of output \nfrequency channels from AIM. In this manner, each frequency region was trained equally \nbased on the density of neurons in that frequency region. During training, the error backpropagation \nalgorithm was used with a summed squared error measure. It is a natural \nfeature of the learning rule that a given neuron's weights are only updated when there is \nactivity in its respective cochlear channels. So, for example, a training sound containing \nonly low frequencies will not train the high-frequency neurons and vice versa. All modeling \nresults correspond with a single tonotopically organized TDNN trained using random \nsounds (unless explicitly stated otherwise). \n\n4  Localization Performance of a Tonotopic Network \n\nExperimentation with the different network architectures clearly demonstrated that \nfrequency division vastly improved the localization performance of the TDNNs \n(Figure 1). In this case, frequency division was essential to producing a reasonable neural \nsystem model that would localize similarly to the human subject across all of the different \nband-pass conditions.  
For any single band-pass condition, it was found that the TDNN \ndid not require frequency division within its architecture to produce quality solutions when \ntrained only on these band-passed sounds. \n\nAs mentioned above, it was observed that a tonotopic network, one that divides the input frequency \nchannels into different groups and then progressively interconnects the neurons in \nthe higher layers across frequency, was more robust in its localization performance across \nsounds with variable center-frequency and bandwidth than a simple fully connected network. \nThere are two likely explanations for this observation. One line of reasoning argues \nthat it was easier for the tonotopic network to prevent a narrow band of frequency channels \nfrom dominating the localization computation across the entire set of sound stimuli. \nOr expressed slightly differently, it may have been easier for it to incorporate the relevant \ninformation across the different frequency channels. A second line of reasoning argues that \nthe tonotopic network structure (along with the training with variable sounds) encouraged \nthe network to develop meaningful connections for all frequencies. \n\n[Figure 1 panels: (a) SUBJECT VAS, (b) TONOTOPIC NETWORK, (c) NETWORK without FREQUENCY DIVISION; left (L) and right (R) hemispheres shown for each panel.] \n\nFigure 1: Comparison of the subject's VAS localization performance and the model's localization \nperformance both with and without frequency division. The viewpoint is from \nan outside observer, with the target location shown by a cross and the response location \nshown by a black dot. 
\n\n5  Matched Filtering and Sound Localization \n\nA number of previous sound localization models have used a relatively straightforward \nmatched filter or template matching analysis [9]. In such cases, the ITD and spectrum \nof a given input sound are commonly cross-correlated with the ITD and spectrum of an \nentire database of sounds for which the location is known. The location with the highest \ncorrelation is then chosen as the optimal source location. \n\nMatched filtering analysis is compared with the localization performance of both the human \nsubject and the neural system model using a bandpass sound with restricted high-frequencies \n(Figure 2). The matched filtering localizes the sounds much better than the subject \nor the TDNN model. The matched filtering model used the same number of cochlear \nchannels as the TDNNs and therefore contained the same inherent spectral resolution. This \nspectral resolution (31 cochlear channels) is certainly less than the spectral resolution of the \nhuman cochlea. This shows that although there was sufficient information to localize the \nsounds from the point of view of matched filtering, neither the human nor the TDNN demonstrated \nsuch ability in their performance. In order for the TDNN to localize similarly to \nthe matched filtering model, the network weights corresponding to a given location need \nto assume the form of the filter template for that location. As all of the training sounds \nwere flat-spectrum, the TDNN encountered no ambiguity as far as the source spectrum was \nconcerned.  
Thus it is likely that the difference in the distribution of localization responses \nin Figure 2b, as compared with that in Figure 2c, has been encouraged by using training \nsounds with random center-frequency and bandwidth, providing a partial explanation as to \nwhy the human localization performance is not optimal from a matched filtering standpoint. \n\nFigure 2: Comparison of the localization performances of the subject, the TDNN model, \nand a matched filtering model. Details as in Fig. 1. \n\n6  Varying Sound Levels and the ILD Cue \n\nThe training of the TDNNs was performed in such a fashion that, for any particular location \nin space, the sound level (67 dB SPL) did not vary by more than 1 dB during repeated \npresentations of the sound. The localization performance of the neural system model was \nthen examined, using a broadband sound source, across a range of sound levels varying \nfrom 60 dB SPL to 80 dB SPL. The spherical correlation coefficient between the target \nand response locations ([10], values above 0.8 indicate \"high\" correlation) remained above \n0.8 between 60 and 75 dB SPL, demonstrating that there was a graceful degradation in \nlocalization performance over a range in sound level of 15 dB. \n\nThe network was also tested on broadband sounds, 10 dB louder in one ear than the other. \nThe results of these tests are shown in Figure 3 and clearly illustrate that the localization \nresponses were pulled toward the side with the louder sound. While the magnitude of this \neffect is certainly not human-like, such behaviour suggests that interaural level difference \ncues were a prominent and constant feature of the data that conferred a measure of robustness \nto sound level variations. \n\nFigure 3: Model's localization performance with a 10 dB increase in sound level: \n(a,b) monaurally, (c) binaurally. 
\n\n7  Conclusions \n\nA neural system model was developed in which physiological constraints were imposed \nupon the modeling process: (1) a TDNN model was used to incorporate the important \nrole of spectral-temporal processing in the auditory nervous system, (2) a tonotopic structure \nwas added to the network, and (3) the training sounds contained randomly varying center-frequencies \nand bandwidths. This biologically plausible model provided increased understanding \nof the role that these constraints play in determining localization performance. \n\nAcknowledgments \n\nThe authors thank Markus Schenkel and Andre van Schaik for valuable comments. This \nresearch was supported by the NHMRC, ARC, and a Dora Lush Scholarship to CJ. \n\nReferences \n[1] S. Carlile, Virtual auditory space: Generation and applications. New York: Chapman and Hall, 1996. \n\n[2] R. H. Gilkey and T. R. Anderson, Binaural and Spatial Hearing in real and virtual environments. Mahwah, New Jersey: Lawrence Erlbaum Associates, Publishers, 1997. \n\n[3] C. Jin, M. Schenkel, and S. Carlile, \"Neural system identification model of human sound localisation,\" (Submitted to J. Acoust. Soc. Am.), 1999. \n\n[4] S. Hyams and S. Carlile, \"After-effects in auditory localization: evidence for channel based processing,\" (Submitted to J. Acoust. Soc. Am.), 2000. \n\n[5] S. Carlile, P. Leong, and S. Hyams, \"The nature and distribution of errors in the localization of sounds by humans,\" Hearing Research, vol. 114, pp. 179-196, 1997. \n\n[6] S. Carlile, S. Delaney, and A. Corderoy, \"The localization of spectrally restricted sounds by human listeners,\" Hearing Research, vol. 128, pp. 175-189, 1999. \n\n[7] C. Giguere and P. C. Woodland, \"A computational model of the auditory periphery for speech and hearing research. I. Ascending path,\" J. Acoust. Soc. 
Am., vol. 95, pp. 331-342, 1994. \n\n[8] C. Neti, E. Young, and M. Schneider, \"Neural network models of sound localization based on directional filtering by the pinna,\" J. Acoust. Soc. Am., vol. 92, no. 6, pp. 3140-3156, 1992. \n\n[9] J. Middlebrooks, \"Narrow-band sound localization related to external ear acoustics,\" J. Acoust. Soc. Am., vol. 92, no. 5, pp. 2607-2624, 1992. \n\n[10] N. I. Fisher, T. Lewis, and B. J. J. Embleton, Statistical analysis of spherical data. Cambridge: Cambridge University Press, 1987. \n", "award": [], "sourceid": 1734, "authors": [{"given_name": "Craig", "family_name": "Jin", "institution": null}, {"given_name": "Simon", "family_name": "Carlile", "institution": null}]}