{"title": "A Continuous Speech Recognition System Embedding MLP into HMM", "book": "Advances in Neural Information Processing Systems", "page_first": 186, "page_last": 193, "abstract": null, "full_text": "186 \n\nBourlard and Morgan \n\nA Continuous Speech Recognition System Embedding MLP into HMM \n\nHerve Bourlard \nPhilips Research Laboratory \nAv. van Becelaere 2, Box 8 \nB-1170 Brussels, Belgium \n\nNelson Morgan \nIntl. Comp. Sc. Institute \n1947 Center Street, Suite 600 \nBerkeley, CA 94704, USA \n\nABSTRACT \n\nWe are developing a phoneme-based, speaker-dependent continuous speech recognition system embedding a Multilayer Perceptron (MLP) (i.e., a feedforward Artificial Neural Network) into a Hidden Markov Model (HMM) approach. In [Bourlard & Wellekens], it was shown that MLPs approximate Maximum a Posteriori (MAP) probabilities and can thus be embedded as an emission probability estimator in HMMs. By using contextual information from a sliding window on the input frames, we have been able to improve frame or phoneme classification performance over the corresponding performance for simple Maximum Likelihood (ML) or even MAP probabilities that are estimated without the benefit of context. However, recognition of words in continuous speech was not so simply improved by the use of an MLP, and several modifications of the original scheme were necessary to obtain acceptable performance. It is shown here that word recognition performance for a simple discrete density HMM system appears to be somewhat better when MLP methods are used to estimate the emission probabilities. \n\n1  INTRODUCTION \nWe have performed a number of experiments with a 1000-word vocabulary continuous speech recognition task. 
Our frame classification results [Bourlard et al., 1989] are consistent with other research showing the capabilities of MLPs trained with backpropagation-style learning schemes for the recognition of voiced-unvoiced speech segments [Gevins & Morgan, 1984], isolated phonemes [Watrous & Shastri, 1987; Waibel et al., 1988; Makino et al., 1983], or isolated words [Peeling & Moore, 1988]. These results indicate that \"neural network\" approaches can, for some problems, perform pattern classification at least as well as traditional HMM approaches. However, this is not particularly mysterious. When traditional statistical assumptions (distribution, independence of multiple features, etc.) are not valid, systems which do not rely on these assumptions can work better (as discussed in [Niles et al., 1989]). Furthermore, networks provide an easy way to incorporate multiple sources of evidence (multiple features, contextual windows, etc.) without restrictive assumptions. \n\nHowever, it is not so easy to improve the recognition of words in continuous speech by the use of an MLP. For instance, while it has been shown that the outputs of a feedforward network can be used as emission probabilities in an HMM [Bourlard et al., 1989], the corresponding word recognition performance can be very poor. This is true even when the same network demonstrates extremely good performance at the frame or phoneme levels. We have developed a hybrid MLP-HMM algorithm which (for a preliminary experiment) appears to exceed the performance of the same HMM system using standard statistical approaches to estimate the emission probabilities. 
This was only possible after the original algorithm was modified in ways that did not necessarily maximize the frame recognition performance for the training set. We will describe these modifications below, along with experimental results. \n\n2  METHODS \nAs shown by both theoretical [Bourlard & Wellekens, 1989] and experimental [Bourlard & Morgan, 1989] results, MLP output values may be considered to be good estimates of MAP probabilities for pattern classification. Either these, or some other related quantity (such as the output normalized by the prior probability of the corresponding class), may be used in a Viterbi search to determine the best time-warped succession of states (speech sounds) to explain the observed speech measurements. This hybrid approach (MLP to estimate probabilities, HMM to incorporate them to recognize continuous speech as a succession of words) has the potential of exploiting the interpolating capabilities of MLPs while using a Dynamic Time Warping (DTW) procedure to capture the dynamics of speech. \n\nHowever, to achieve good performance at the word level, the following modifications of this basic scheme were necessary: \n\n\u2022  MLP training methods - a new cross-validation [Stone, 1977] training algorithm was designed in which the stopping criterion was based on performance for an independent validation set [Morgan & Bourlard, 1990]. In other words, training was stopped when performance on a second set of data began going down, and not when training error leveled off. This greatly improved generalization, which could be further tested on a third independent validation set. \n\n\u2022  probability estimation from the MLP outputs - in the original scheme [Bourlard & Wellekens, 1989], MLP outputs were used as MAP probabilities for the HMM directly. 
While this helped frame performance, it hurt word performance. This may have been due at least partly to a mismatch between the relative frequency of phonemes in the training sets and test (word recognition) sets. Division by the prior class probabilities as estimated from the training set removed this effect of the priors on the DTW. This led to a small decrease in frame classification performance, but a large (sometimes 10-20%) improvement in word recognition rates (see Table 1 and accompanying description). \n\n\u2022  word transition costs for the underlying HMM - word transition penalties had to be increased for larger contextual windows to avoid a large number of insertions; see Section 4. This is shown to be equivalent to keeping the same word transition cost but scaling the log probabilities down by a number which reflects the dependence of neighboring frames. A reasonable value for this can be determined from recognition on a small number of sentences (e.g., 50), choosing a value which results in insertions at most equal to the number of deletions. \n\n\u2022  segmentation of training data - much as with HMM systems, an iterative procedure was required to time-align the training labels in a manner that was statistically consistent with the recognition methods used. In our most recent experiments, we segmented the data using an iterative Viterbi alignment starting from a segmentation based on average phoneme durations, and terminated at the segmentation which led to the best performance on an independent test set. For one of our speakers, we had available a more accurate frame labeling (produced by an automatic but more complex procedure [Aubert, 1987]) to use as a starting point for the iteration, which led to even better performance. 
\n\n3  EXPERIMENTAL APPROACH \nWe have been using a speaker-dependent German database (available from our collaboration with Philips) called SPICOS [Ney & Noll, 1988]. The speech had been sampled at a rate of 16 kHz, and 30 points of smoothed, \"mel-scaled\" logarithmic spectra (over bands from 200 to 6400 Hz) were calculated every 10 ms from a 512-point FFT over a 25-ms window. For our experiments, the mel spectrum and the energy were vector-quantized to pointers into a single speaker-dependent table of prototypes. \n\nTwo independent sets of vocabularies for training and test are used. The training dataset consists of two sessions of 100 German sentences per speaker. These sentences are representative of the phoneme distribution in the German language and include 2430 phonemes in each session. The two sessions of 100 sentences are phonetically segmented on the basis of 50 phonemes, using a fully automated procedure [Aubert, 1987]. The test set consists of one session of 200 sentences per speaker. The recognition vocabulary contains 918 words (including the \"silence\" word) and the overlap between training and recognition is 51 words. Most of the latter are articles, prepositions and other structural words. Thus, the training and test are essentially vocabulary-independent. Initial tests used sentences from a single male speaker. The final algorithms were tested on an additional male and female speaker. \n\nThe acoustic vectors were coded on the basis of 132 prototype vectors by a simple binary representation with only one bit 'on'. Multiple frames were used as input to provide context to the network. In the experiments reported here, the context was 9 frames, while the size of the output layer was kept fixed at 50 units, 
corresponding to the 50 phonemes to be recognized. The input field contained 9 x 132 = 1188 units, and the total number of possible inputs was thus equal to 132^9. There were 26767 training patterns (from the first training session of 100 sentences) and 26702 independent test patterns (from the second training session of 100 sentences). Of course, this represented only a very small fraction of the possible inputs, and generalization was thus potentially difficult. Training was done by the classical \"error back-propagation\" algorithm, starting by minimizing an entropy criterion, and then the standard least-mean-square error criterion. In each iteration, the complete training set was presented, and the parameters were updated after each training pattern. To avoid overtraining of the MLP, improvement on a cross-validation set was checked after each iteration, and if classification performance was decreasing, the adaptation parameter of the gradient procedure was reduced; otherwise it was kept constant. Later on, this approach was systematized by splitting the data in three parts: one for training, one for cross-validation, and a third one absolutely independent of the training procedure for the actual validation. No significant difference was observed between classification rates for the last two data sets. \n\nIn [Bourlard et al., 1989] this procedure was shown to yield improved frame classification performance over simple ML and MAP estimates. However, acceptable word recognition performance was still difficult to reach. \n\n4  WORD RECOGNITION RESULTS \nThe output values of the MLP were evaluated for each frame, and (after division by the prior probability of each phoneme) were used as emission probabilities in a discrete HMM system. In this system, each phoneme was modeled with a single conditional density, repeated D/2 times, 
where D was a prior estimate of the duration of the phoneme. Only self-loops and sequential transitions were permitted. A Viterbi decoding was then used for recognition of the first hundred sentences of the test session (on which word entrance penalties were optimized), and our best results were validated by a further recognition on the second hundred sentences of the test set. Note that this same simplified HMM was used for both the ML reference system (estimating probabilities directly from relative frequencies) and the MLP system, and that the same input features were used for both. \n\nTable 1 shows the recognition rate (100% - error rate, where errors include insertions, deletions, and substitutions) for the first 100 sentences of the test session. All runs except the last were done with 20 hidden units in the MLP, as suggested by frame performance. Note the significant positive effect of division of the MLP outputs, which are trained to approximate MAP probabilities, by estimates of the prior probabilities for each class (denoted \"MLP/priors\" in Table 1). \n\nTable 1: Word Recognition, speaker m003 \n\nmethod | size of context | % correct (test) | % correct (validation) \nMLP | 1 | 27.3 | - \nMLP/priors | 1 | 49.7 | - \nMLP | 9 | 40.9 | - \nMLP/priors | 9 | 51.9 | - \nML | 1 | 52.6 | 52.2 \nMLP/priors (0 hidden) | 9 | 53.3 | 52.5 \n\nTable 2: Word Recognition using Viterbi segmentation, speaker m003 \n\nmethod | context | test \nMLP/priors (0 hidden) | 9 | 65.3 \nML | 1 | 56.9 \n\nWord transition probabilities were optimized for both the Maximum Likelihood and MLP style HMMs. 
This led to a word exit probability of 10^-8 for the ML and for 1-frame MLPs, and 10^-14 for an MLP with 9 frames of context. After these adjustments, performance was essentially the same for the two approaches. Performance on the last hundred sentences of the test session (shown in the last column of Table 1) validated that the two systems generalized equivalently despite these tunings. \n\nAn initial time alignment of the phonetic transcription with the data (for this speaker) had previously been calculated using a program incorporating speech-specific knowledge [Aubert, 1987]. This labeling had been used for the targets of the frame-based training described above. We then used this alignment as a \"bootstrap\" segmentation for an iterative Viterbi procedure, much as is done in conventional HMM systems. As with the MLP training, the data was divided into a training and cross-validation set, and the best segmentation (corresponding to the best validation set frame classification rate) was used for later training. For both cross-validation procedures, we switched to a training set of 150 sentences (two repetitions of 75 sentences) and a cross-validation set of 50 sentences (two repetitions of 25 each). Finally, since the best performance in Table 1 was achieved using no hidden layer, we continued our experiments using this simpler network, which also required only a simple training procedure (entropy error criterion only). Table 2 shows this performance for the full 200 recognition sentences (test + validation sets from Table 1). \n\nTwo of the more puzzling observations in this work were the need to increase word entrance penalties with the width of the input context and the difficulty of reflecting good frame performance at the word level. 
MLPs can make better frame level discriminations than simple statistical classifiers, because they can easily incorporate multiple sources of evidence (multiple frames, multiple features) without simplifying assumptions. However, when the input features within a contextual window are roughly independent, the Viterbi algorithm will already incorporate all of the context in choosing the best HMM state sequence explaining an utterance. If emission probabilities are estimated from the outputs of an MLP which has a 2c + 1 frame contextual input, the probability of observing a feature sequence $\\{f_1, f_2, \\ldots, f_N\\}$ (where $f_n$ represents the feature vector at time n) on a particular HMM state $q_k$ is estimated as: \n\n$$\\prod_{i=1}^{N} p(f_{i-c}, \\ldots, f_i, \\ldots, f_{i+c} \\mid q_k),$$ \n\nwhere Bayes' rule has already been used to convert the MLP outputs (which estimate MAP probabilities) into ML probabilities. If independence is assumed, and if boundary effects (context extending before frame 1 or after frame N) are ignored (assume $(2c+1) \\ll N$), this becomes: \n\n$$\\prod_{i=1}^{N} \\prod_{j=-c}^{c} p(f_{i+j} \\mid q_k) = \\prod_{i=1}^{N} [p(f_i \\mid q_k)]^{2c+1},$$ \n\nwhere the latter probability is just the classical Maximum Likelihood solution, raised to the power 2c + 1. Thus, if the features are independent over time, to keep the effect of transition costs the same as for the simple HMM, the log probabilities must be scaled down by the size of the contextual window. Note that, in the more realistic case where dependencies exist between frames, the optimal scaling factor will be less than 2c + 1, down to a minimum of 1 for the case in which frames are completely dependent (e.g., the same within a constant factor); the scaling factor should thus reflect the time correlation of the input features. Thus, 
if the features are assumed independent over time, there is no advantage to be gained by using an MLP to extract contextual information for the estimation of emission probabilities for an HMM Viterbi decoding. In general, the relation between the MLP and ML solutions will be more complex, because of interdependence over time of the input features. However, the above relation may give some insight as to the difficulty we have met in improving word recognition performance with a single discrete feature (despite large improvements at the frame level). More positively, our results show that the probabilities estimated by MLPs can be used at least as effectively as conventional estimates and that some advantage can be gained by providing more information for estimating these probabilities. \n\nWe have duplicated our recognition tests for two other speakers from the same database. In this case, we labeled each training set (from the original male plus a male and a female speaker) using a Viterbi iteration initialized from a time-alignment based on a simple estimate of average phoneme duration. This reduced all of the recognition scores, underlining the necessity of a good starting point for the Viterbi iteration. However, as can be seen from the Table 3 results (measured over the full 200 recognition sentences), the MLP-based methods appear to consistently offer at least some measurable improvement over the simpler estimation technique. In particular, the performance for the two systems differed significantly (p < 0.001) for two out of three speakers, as well as for a multispeaker comparison over the three speakers (in each case using a normal approximation to a binomial distribution for the null hypothesis). \n\nTable 3: Word Recognition for 3 speakers, simple initialization \n\nspeaker | MLE | MLP \nm003 | 54.4 | 59.7 \nm001 | 47.4 | 51.9 \nw010 | 54.2 | 54.3 \n\n5  CONCLUSION \nThese results show some of the improvement for MLPs over conventional HMMs which one might expect from the frame level results. MLPs can sometimes make better frame level discriminations than simple statistical classifiers, because they can easily incorporate multiple sources of evidence (multiple frames, multiple features), which is difficult to do in HMMs without major simplifying assumptions. In general, the relation between the MLP and ML word recognition is more complex. Part of the difficulty with good recognition may be due to our choice of discrete, vector-quantized features, for which no metric is defined over the prototype space. Despite these limitations, it now appears that the probabilities estimated by MLPs may offer improved word recognition through the incorporation of context in the estimation of emission probabilities. Furthermore, our new result shows the effectiveness of Viterbi segmentation in labeling training data for an MLP. This result appears to remove a major handicap of MLP use, i.e., the requirement for hand-labeled speech, and also offers the possibility to deal with more complex HMMs. \n\nAcknowledgments \n\nSupport from the International Computer Science Institute (ICSI) and Philips Research for this work is gratefully acknowledged. Chuck Wooters of ICSI and UCB provided much-needed assistance, and Xavier Aubert of Philips put together our SPICOS materials. \n\nReferences \n\nX. Aubert, (1988), \"Supervised Segmentation with Application to Speech Recognition\", in Proc. Eur. Conf. Speech Technology, Edinburgh, pp. 161-164. \n\nH. 
Bourlard, N. Morgan & C.J. Wellekens, (1989), \"Statistical Inference in Multilayer Perceptrons and Hidden Markov Models with Applications in Continuous Speech Recognition\", to appear in Neuro Computing, Algorithms, and Applications, NATO ASI Series. \n\nH. Bourlard & N. Morgan, (1989), \"Merging Multilayer Perceptrons and Hidden Markov Models: Some Experiments in Continuous Speech Recognition\", International Computer Science Institute TR-89-033. \n\nH. Bourlard & C.J. Wellekens, (1989), \"Links between Markov models and multilayer perceptrons\", to be published in IEEE Trans. on Pattern Analysis and Machine Intelligence, 1990. \n\nA. Gevins & N. Morgan, (1984), \"Ignorance-Based Systems\", Proc. IEEE Intl. Conf. on Acoustics, Speech, & Signal Processing, Vol. 3, 39A5.1-39A5.4, San Diego. \n\nS. Makino, T. Kawabata & K. Kido, (1983), \"Recognition of consonants based on the Perceptron Model\", Proc. IEEE Intl. Conf. on Acoustics, Speech, & Signal Processing, Vol. 2, pp. 738-741, Boston, Mass. \n\nN. Morgan & H. Bourlard, (1989), \"Generalization and Parameter Estimation in Feedforward Nets: Some Experiments\", Advances in Neural Information Processing Systems II, Morgan Kaufmann. \n\nH. Ney & A. Noll, (1988), \"Phoneme Modeling Using Continuous Mixture Densities\", Proc. IEEE Intl. Conf. on Acoustics, Speech, & Signal Processing, Vol. I, pp. 437-440, New York. \n\nL. Niles, H. Silverman, G. Tajchman & M. Bush, (1989), \"How Limited Training Data Can Allow a Neural Network Classifier to Outperform an 'Optimal' Statistical Classifier\", Proc. IEEE Intl. Conf. on Acoustics, Speech, & Signal Processing, Vol. I, pp. 17-20, Glasgow, Scotland. \n\nS.M. Peeling & R.K. 
Moore, (1988), \"Experiments in Isolated Digit Recognition Using the Multi-Layer Perceptron\", Royal Speech and Radar Establishment, Technical Report 4073, Malvern, Worcester. \n\nM. Stone, (1987), \"Cross-validation: a review\", Math. Operationsforsch. Statist. Ser. Statist., vol. 9, pp. 127-139. \n\nA. Waibel, T. Hanazawa, G. Hinton, K. Shikano & K. Lang, (1988), \"Phoneme Recognition: Neural Networks vs. Hidden Markov Models\", Proc. IEEE Intl. Conf. on Acoustics, Speech, & Signal Processing, Vol. I, pp. 107-110, New York. \n\nR. Watrous & L. Shastri, (1987), \"Learning phonetic features using connectionist networks: an experiment in speech recognition\", Proceedings of the First Intl. Conference on Neural Networks, IV-381-388, San Diego, CA. \n\n", "award": [], "sourceid": 222, "authors": [{"given_name": "Herv\u00e9", "family_name": "Bourlard", "institution": null}, {"given_name": "Nelson", "family_name": "Morgan", "institution": null}]}