{"title": "A Segment-Based Automatic Language Identification System", "book": "Advances in Neural Information Processing Systems", "page_first": 241, "page_last": 248, "abstract": null, "full_text": "A Segment-based Automatic Language Identification System\n\nYeshwant K. Muthusamy & Ronald A. Cole\n\nCenter for Spoken Language Understanding\n\nOregon Graduate Institute of Science and Technology\n\nBeaverton OR 97006-1999\n\nAbstract\n\nWe have developed a four-language automatic language identification system for high-quality speech. The system uses a neural network-based segmentation algorithm to segment speech into seven broad phonetic categories. Phonetic and prosodic features computed on these categories are then input to a second network that performs the language classification. The system was trained and tested on separate sets of speakers of American English, Japanese, Mandarin Chinese and Tamil. It currently performs with an accuracy of 89.5% on the utterances of the test set.\n\n1  INTRODUCTION\n\nAutomatic language identification is the rapid automatic determination of the language being spoken, by any speaker, saying anything. Despite several important applications of automatic language identification, this area has suffered from a lack of basic research and the absence of a standardized, public-domain database of languages.\n\nIt is well known that languages have characteristic sound patterns. Languages have been described subjectively as \"singsong\", \"rhythmic\", \"guttural\", \"nasal\" etc. The key to solving the problem of automatic language identification is the detection and exploitation of such differences between languages. 
\n\nWe assume that each language in the world has a unique acoustic structure, and that this structure can be defined in terms of phonetic and prosodic features of speech. Phonetic, or segmental, features include the inventory of phonetic segments and their frequency of occurrence in speech. Prosodic information consists of the relative durations and amplitudes of sonorant (vowel-like) segments, their spacing in time, and patterns of pitch change within and across these segments.\n\nTo the extent that these assumptions are valid, languages can be identified automatically by segmenting speech into broad phonetic categories, computing segment-based features that capture the relevant phonetic and prosodic structure, and training a classifier to associate the feature measurements with the spoken language.\n\nWe have developed a language identification system that uses a neural network to segment speech into a sequence of seven broad phonetic categories. Information about these categories is then used to train a second neural network to discriminate among utterances spoken by native speakers of American English, Japanese, Mandarin Chinese and Tamil. When tested on utterances produced by six new speakers from each language, the system correctly identifies the language being spoken 89.5% of the time.\n\n2  SYSTEM OVERVIEW\n\nThe following steps transform an input utterance into a decision about what language was spoken.\n\nData Capture\nThe speech is recorded using a Sennheiser HMD 224 noise-canceling microphone, low-pass filtered at 7.6 kHz and sampled at 16 kHz.\n\nSignal Representations\nA number of waveform and spectral parameters are computed in preparation for further processing. 
The spectral parameters are generated from a 128-point discrete Fourier transform computed on a 10 ms Hanning window. All parameters are computed every 3 ms.\n\nThe waveform parameters consist of estimates of (i) zc8000: the zero-crossing rate of the waveform in a 10 ms window, (ii) ptp700 and ptp8000: the peak-to-peak amplitude of the waveform in a 10 ms window in two frequency bands (0-700 Hz and 0-8000 Hz), and (iii) pitch: the presence or absence of pitch in each 3 ms frame. The pitch estimate is derived from a neural network pitch tracker that locates pitch periods in the filtered (0-700 Hz) waveform [2]. The spectral parameters consist of (i) DFT coefficients, (ii) sda700 and sda8000: estimates of averaged spectral difference in two frequency bands, (iii) sdf: spectral difference in adjacent 9 ms intervals, and (iv) cm1000: the center-of-mass of the spectrum in the region of the first formant.\n\nBroad Category Segmentation\n\nSegmentation is performed by a fully-connected, feedforward, three-layer neural network that assigns 7 broad phonetic category scores to each 3 ms time frame of the utterance. The broad phonetic categories are: VOC (vowel), FRIC (fricative), STOP (pre-vocalic stop), PRVS (pre-vocalic sonorant), INVS (inter-vocalic sonorant), POVS (post-vocalic sonorant), and CLOS (silence or background noise). A Viterbi search, which incorporates duration and bigram probabilities, uses these frame-based output activations to find the best scoring sequence of broad phonetic category labels spanning the utterance. The segmentation algorithm is described in greater detail in [3]. 
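
The Viterbi step described above can be illustrated with a minimal sketch. It is not the segmenter of [3]: it uses only bigram (label-to-label) transition probabilities and leaves out the duration model, and the names viterbi and to_log, the argument layout, and the choice of Python are all assumptions made for this illustration.

```python
import math

def to_log(rows):
    # Hypothetical helper: convert rows of probabilities to log probabilities.
    return [[math.log(p) for p in row] for row in rows]

def viterbi(log_scores, log_trans, log_init):
    # log_scores[t][j]: log score of label j at frame t (e.g. from network outputs)
    # log_trans[i][j]:  log bigram probability of label i being followed by label j
    # log_init[j]:      log probability of starting in label j
    n_labels = len(log_init)
    delta = [log_init[j] + log_scores[0][j] for j in range(n_labels)]
    back = []  # back[k][j]: best predecessor of label j at frame k + 1
    for frame in log_scores[1:]:
        pointers, new_delta = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: delta[i] + log_trans[i][j])
            pointers.append(best_i)
            new_delta.append(delta[best_i] + log_trans[best_i][j] + frame[j])
        back.append(pointers)
        delta = new_delta
    # Trace back from the best final label to recover the label sequence.
    path = [max(range(n_labels), key=lambda j: delta[j])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

With transition probabilities that favor staying in the same label, a single frame whose score weakly prefers another label is absorbed into the surrounding segment, which is the smoothing effect a bigram model is meant to provide.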
\n\nLanguage Classification\n\nLanguage classification is performed by a second fully-connected feedforward network that uses phonetic and prosodic features derived from the time-aligned broad category sequence. These features, described below, are designed to capture the phonetic and prosodic differences between the four languages.\n\n3  FOUR-LANGUAGE HIGH-QUALITY SPEECH DATABASE\n\nThe data for this research consisted of natural continuous speech recorded in a laboratory by 20 native speakers (10 male and 10 female) of each of American English, Mandarin Chinese, Japanese and Tamil. The speakers were asked to speak a total of 20 utterances\u00b9: 15 conversational sentences of their choice, two questions of their choice, the days of the week, the months of the year and the numbers 0 through 10. The objective was to have a mix of unconstrained- and restricted-vocabulary speech. The segmentation algorithm was trained on just the conversational sentences, while the language classifier used all utterances from each speaker.\n\n4  NEURAL NETWORK SEGMENTATION\n\n4.1  SEGMENTER TRAINING\n\n4.1.1  Training and Test Sets\n\nFive utterances from each of 16 speakers per language were used to train and test the segmenter. The training set had 50 utterances from 10 speakers (5 male and 5 female) from each of the 4 languages, for a total of 200 utterances. The development test set had 10 utterances from a different set of 2 speakers (1 male and 1 female) from each language, for a total of 40 utterances. The final test set had 20 utterances from yet another set of 4 speakers (2 male and 2 female) from each language, for a total of 80 utterances. The average duration of the utterances in the training set was 4.7 secs and that of the test sets was 5.7 secs. 
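
The set sizes quoted above follow from the speaker counts, as the short check below shows. The constant and function names are hypothetical and exist only for this illustration; the numbers are those given in the text (5 utterances per speaker, 4 languages, and 10, 2 and 4 speakers per language in the three sets).

```python
LANGUAGES = ['English', 'Japanese', 'Mandarin Chinese', 'Tamil']
UTTS_PER_SPEAKER = 5
SPEAKERS_PER_SPLIT = {'train': 10, 'dev': 2, 'test': 4}  # speakers per language

def split_sizes():
    # Utterances per language in each set, then totals over the four languages.
    per_language = {name: n * UTTS_PER_SPEAKER
                    for name, n in SPEAKERS_PER_SPLIT.items()}
    total = {name: count * len(LANGUAGES)
             for name, count in per_language.items()}
    return per_language, total
```

The three speaker groups are disjoint, so 10 + 2 + 4 accounts for all 16 speakers per language.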
\n\n4.1.2  Network Architecture\n\nThe segmentation network was a fully-connected, feed-forward network with 304 input units, 18 hidden units and 7 output units. The number of hidden units was determined experimentally. Figure 1 shows the network configuration and the input features.\n\n\u00b9 Five speakers in Japanese and one in Tamil provided only 10 utterances each.\n\n[Figure 1: Segmentation Network. The diagram shows 304 input units (64 DFT coefficients and 30 samples each of the eight waveform and spectral parameters), 18 hidden units, and 7 output units (VOC, FRIC, CLOS, STOP, PRVS, INVS, POVS).]\n\n4.1.3  Feature Measurements\n\nThe feature measurements used to train the network include the 64 DFT coefficients at the frame to be classified and 30 samples each of zc8000, ptp700, ptp8000, sda700, sda8000, sdf, pitch and cm1000, for a total of 304 features. These samples were taken from a 330 ms window centered on the frame, with more samples being taken in the immediate vicinity of the frame than near the ends of the window.\n\n4.1.4  Hand-labeling\n\nBoth the training and test utterances were hand-labeled with 7 broad phonetic category labels and checked by a second labeler for correctness and consistency.\n\n4.1.5  Coarse Sampling of Frames\n\nAs it was not computationally feasible to train on every 3 ms frame in each utterance, only a few frames were chosen at random from each segment. To ensure an approximately equal number of frames from each category, fewer frames were sampled from the more frequent categories such as vowels and closures. 
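
One way to realize the category-balanced sampling described above is sketched below. The exact sampling rule is not given in the text, so the per-category budget, the function name sample_frames, and the (label, start, end) segment representation are assumptions made for this illustration.

```python
import random
from collections import defaultdict

def sample_frames(segments, per_category, seed=0):
    # segments: list of (label, start_frame, end_frame), with end_frame exclusive.
    # per_category: assumed target number of frames for each category.
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for label, start, end in segments:
        by_label[label].append((start, end))
    chosen = []
    for label, spans in by_label.items():
        # Categories with many segments get fewer frames per segment.
        per_segment = max(1, per_category // len(spans))
        for start, end in spans:
            k = min(per_segment, end - start)
            chosen.extend((label, f) for f in rng.sample(range(start, end), k))
    return chosen
```

Under this rule a category that occurs in four segments contributes one frame per segment, while a category that occurs once contributes four frames from its single segment, so both end up with the same count.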
\n\n4.1.6  Network Training\n\nThe networks were trained using backpropagation with conjugate gradient optimization [1]. Training was continued until the performance of the network on the development test set leveled off.\n\n4.2  SEGMENTER EVALUATION\n\nSegmentation performance was evaluated on the 80-utterance final test set. The segmenter output was compared to the hand-labels for each 3 ms time frame. First choice accuracy was 85.1% across the four languages. When scored on the middle 80% and middle 60% of each segment, the accuracy rose to 86.9% and 88.0% respectively, pointing to the presence of boundary errors.\n\n5  LANGUAGE IDENTIFICATION\n\n5.1  CLASSIFIER TRAINING\n\n5.1.1  Training and Test Sets\n\nThe training set contained 12 speakers from each language, with 10 or 20 utterances per speaker, for a total of 930 utterances. The development test set contained a different group of 2 speakers per language with 20 utterances from each speaker, for a total of 160 utterances. The final test set had 6 speakers per language, with 10 or 20 utterances per speaker, for a total of 440 utterances. The average duration of the utterances in the training set was 5.1 seconds and that of the test sets was 5.5 seconds.\n\n5.1.2  Feature Development\n\nSeveral passes were needed through the iterative process of feature development and network training before a satisfactory feature set was obtained. Much of the effort was concentrated on statistical and linguistic analysis of the languages with the objective of determining the distinguishing characteristics among them. 
For example, the knowledge that Mandarin Chinese was the only monosyllabic tonal language in the set (the other three being stress languages) led us to design features that attempted to capture the large variation in pitch within and across segments for Mandarin Chinese utterances. Similarly, the presence of sequences of equal-length broad category segments in Japanese utterances led us to design an \"inter-segment duration difference\" feature. The final set of 80 features is described below. All the features are computed over the entire length of an utterance and use the time-aligned broad category sequence provided by the segmentation algorithm. The numbers in parentheses refer to the number of values generated.\n\n\u2022 Intra-segment pitch variation: Average of the standard deviations of the pitch within all sonorant segments: VOC, PRVS, INVS, POVS (4 values)\n\n\u2022 Inter-segment pitch variation: Standard deviation of the average pitch in all sonorant segments (4 values)\n\n\u2022 Frequency of occurrence (number of occurrences per second of speech) of triples of segments. 
The following triples were chosen based on statistical analyses of the training data: VOC-INVS-VOC, CLOS-PRVS-VOC, VOC-POVS-CLOS, STOP-VOC-FRIC, STOP-VOC-CLOS, and FRIC-VOC-CLOS (6 values)\n\n\u2022 Frequency of occurrence of each of the seven broad phonetic labels (7 values)\n\n\u2022 Frequency of occurrence of all segments (number of segments per second) (1 value)\n\n\u2022 Frequency of occurrence of all consonants (STOPs and FRICs) (1 value)\n\n\u2022 Frequency of occurrence of all sonorants (4 values)\n\n\u2022 Ratio of number of sonorant segments to total number of segments (1 value)\n\n\u2022 Fraction of the total duration of the utterance devoted to each of the seven broad phonetic labels (7 values)\n\n\u2022 Fraction of the total duration of the utterance devoted to all sonorants (1 value)\n\n\u2022 Frequency of occurrence of voiced consonants (1 value)\n\n\u2022 Ratio of voiced consonants to total number of consonants (1 value)\n\n\u2022 Average duration of the seven broad phonetic labels (7 values)\n\n\u2022 Standard deviation of the duration of the seven broad phonetic labels (7 values)\n\n\u2022 Segment-pair ratios: conditional probability of occurrence of selected pairs of segments. The segment-pairs were selected based on histogram plots generated on the training set. Examples of selected pairs: POVS-FRIC, VOC-FRIC, INVS-VOC, etc. 
(27 values)\n\n\u2022 Inter-segment duration difference: Average absolute difference in durations between successive segments (1 value)\n\n\u2022 Standard deviation of the inter-segment duration differences (1 value)\n\n\u2022 Average distance between the centers of successive vowels (1 value)\n\n\u2022 Standard deviation of the distances between centers of successive vowels (1 value)\n\n5.2  LANGUAGE IDENTIFICATION PERFORMANCE\n\n5.2.1  Single Utterances\n\nDuring the feature development phase, the 2 speakers-per-language development test set was used. The classifier performed at an accuracy of 90.0% on this small test set. For final evaluation, the development test set was combined with the original training set to form a 14 speakers-per-language training set. The performance of the classifier on the 6 speakers-per-language final test set was 79.6%. The individual language performances were English 75.8%, Japanese 77.0%, Mandarin Chinese 78.3%, and Tamil 88.0%. This result was obtained with training and test set utterances that were approximately 5.4 seconds long on the average.\n\n5.2.2  Concatenated Utterances\n\nTo observe the effect of training and testing with longer durations of speech per utterance, a series of experiments was conducted in which pairs and triples of utterances from each speaker were concatenated end-to-end (with 350 ms of silence in between to simulate natural pauses) in both the training and test sets. It is to be noted that the total duration of speech used in training and testing remained unchanged for all these experiments. Table 1 summarizes the performance of the classifier when trained and tested on different durations of speech per utterance. The rows of the table show the effect of testing on progressively longer utterances for a given training set, while the columns of the table show the effect of training on progressively longer utterances for a given test set. Not surprisingly, the best performance is obtained when the classifier is trained and tested on three utterances concatenated together.\n\nTable 1: Percentage Accuracy on Varying Durations of Speech Per Utterance\n\nAvge. Duration of         Avge. Duration of Test Utts. (sec)\nTraining Utts. (sec)       5.7      11.4     17.1\n        5.3               79.6     73.6     71.2\n       10.6               71.8     86.8     85.0\n       15.2               67.9     85.5     89.5\n\n6  DISCUSSION\n\nThe results indicate that the system performs better on longer utterances. This is to be expected given the feature set, since the segment-based statistical features tend to be more reliable with a larger number of segments. Also, it is interesting to note that we have obtained an accuracy of 89.5% without using any spectral information in the classifier feature set. All of the features are based on the broad phonetic category segment sequences provided by the segmenter.\n\nIt should be noted that approximately 15% of the utterances in the training and test sets consisted of a fixed vocabulary: the days of the week, the months of the year and the numbers zero through ten. It is likely that the inclusion of these utterances inflated classification performance. Nevertheless, we are encouraged by the 10.5% error rate, given the small number of speakers and utterances used to train the system.\n\nAcknowledgements\n\nThis research was supported in part by NSF grant No. IRI-9003110, a grant from Apple Computer, Inc., and by a grant from DARPA to the Department of Computer Science & Engineering at the Oregon Graduate Institute. 
We thank Mark Fanty for his many useful comments.\n\nReferences\n\n[1] E. Barnard and R. A. Cole. A neural-net training program based on conjugate-gradient optimization. Technical Report CSE 89-014, Department of Computer Science, Oregon Graduate Institute of Science and Technology, 1989.\n\n[2] E. Barnard, R. A. Cole, M. P. Vea, and F. A. Alleva. Pitch detection with a neural-net classifier. IEEE Transactions on Signal Processing, 39(2):298-307, February 1991.\n\n[3] Y. K. Muthusamy, R. A. Cole, and M. Gopalakrishnan. A segment-based approach to automatic language identification. In Proceedings 1991 IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, May 1991.", "award": [], "sourceid": 517, "authors": [{"given_name": "Yeshwant", "family_name": "Muthusamy", "institution": null}, {"given_name": "Ronald", "family_name": "Cole", "institution": null}]}