{"title": "HMM Speech Recognition with Neural Net Discrimination", "book": "Advances in Neural Information Processing Systems", "page_first": 194, "page_last": 202, "abstract": null, "full_text": "194\n\nHuang and Lippmann\n\nHMM Speech Recognition with Neural Net Discrimination*\n\nWilliam Y. Huang and Richard P. Lippmann\n\nLincoln Laboratory, MIT\nRoom B-349\nLexington, MA 02173-9108\n\nABSTRACT\n\nTwo approaches were explored which integrate neural net classifiers with Hidden Markov Model (HMM) speech recognizers. Both attempt to improve speech pattern discrimination while retaining the temporal processing advantages of HMMs. One approach used neural nets to provide second-stage discrimination following an HMM recognizer. On a small vocabulary task, Radial Basis Function (RBF) and back-propagation neural nets reduced the error rate substantially (from 7.9% to 4.2% for the RBF classifier). In a larger vocabulary task, neural net classifiers did not reduce the error rate. They did, however, outperform Gaussian, Gaussian mixture, and k-nearest neighbor (KNN) classifiers. In another approach, neural nets functioned as low-level acoustic-phonetic feature extractors. When classifying phonemes based on single 10 msec. frames, discriminant RBF neural net classifiers outperformed Gaussian mixture classifiers. Performance, however, differed little when classifying phones by accumulating scores across all frames in phonetic segments using a single node HMM recognizer.\n\n*This work was sponsored by the Department of the Air Force and the Air Force Office of Scientific Research.\n\n[Figure 1 diagram. Surviving labels: Cepstral Sequence; Viterbi Segmentation; Node Averages; Second Stage Classifier; B ... D.]\n\nFigure 1: Second stage discrimination system. HMM recognition is based on the accumulated scores from each node. A second stage classifier can adjust the weights from each node to provide improved discrimination.\n\n1 Introduction\n\nThis paper describes some of our current efforts to integrate discriminant neural net classifiers into HMM speech recognizers. The goal of this work is to combine the temporal processing capabilities of the HMM approach with the superior recognition rates provided by discriminant classifiers. Although neural nets are well developed for static pattern classification, neural nets for dynamic pattern recognition require further research. Current conventional HMM recognizers rely on likelihood scores provided by non-discriminant classifiers, such as Gaussian mixture [11] and histogram [5] classifiers. Non-discriminant classifiers are sensitive to assumptions concerning the shape of the probability density function and the robustness of the Maximum Likelihood (ML) estimators. Discriminant classifiers have a number of potential advantages over non-discriminant classifiers on real world problems. They make fewer assumptions concerning underlying class distributions, can be robust to outliers, and can lead to efficient parallel analog VLSI implementation [4, 6, 7, 8]. Recent efforts in applying discriminant training to HMM recognizers have led to promising techniques, including Maximum Mutual Information (MMI) training [2] and corrective training [5]. These techniques maintain the same structure as in a conventional HMM recognizer but use a different overall error criterion to estimate parameters. 
We believe that a significant improvement in recognition rate will result if discriminant classifiers are included directly in the HMM structure.\n\nThis paper examines two integration strategies: second stage classification and discriminant pre-processing. In second stage classification, discussed in Sec. 2, classifiers are used to provide post-processing for an HMM isolated word recognizer. In discriminant pre-processing, discussed in Sec. 3, discriminant classifiers replace the maximum likelihood classifiers used in conventional HMM recognizers.\n\n2 Second Stage Classification\n\nHMM isolated-word recognition requires one Markov model per word. Recognition involves accumulating scores for an unknown input across the nodes in each word model, and selecting the word model that provides the maximum accumulated score. When discriminating between minimal pairs, such as those in the E-set vocabulary (the letters {BCDEGPTVZ}), it is desirable for recognition to focus on the nodes that correspond to the small portion of the utterance that differs between words. In the second stage classification approach, illustrated in Fig. 1, the HMMs at the first layer are the components of a fully-trained isolated-word HMM recognizer. The second stage classifier is provided with matching scores and duration from each HMM node. A simple second stage classifier which sums the matching scores of the nodes for each word would be equivalent to an HMM recognizer. It is hoped that discriminant classifiers can utilize the additional information provided by the node dependent scores and duration to deliver improved recognition rates.\n\nThe second stage system of Fig. 1 was evaluated using the 9 letter E-set vocabulary and the {BDG} vocabulary. 
Words were taken from the TI-46 Word database, which contains 10 training and 16 testing tokens per word per talker and 16 talkers. Evaluation was performed in the speaker dependent mode; thus, there were a total of 30 training and 48 testing tokens per talker for the {BDG}-set task and 90 training and 144 testing tokens per talker for the E-set task. Spectral pre-processing consisted of extracting the first 12 mel-scaled cepstral coefficients [10], ignoring the 0th cepstral coefficient (energy), for each 10 ms frame. An HMM isolated word recognizer was first trained using the forward-backward algorithm. Each word was modeled using 8 HMM nodes with 2 additional noise nodes at each end. During classification, each test word was segmented using the Viterbi decoding algorithm on all word models. The average matching score and duration of all non-noise nodes were used as a static pattern for the second stage classifier.\n\n2.1 Classifiers\n\nFour second stage classifiers were used: (1) Multi-layer perceptron (MLP) classifiers trained with back-propagation, (2) Gaussian mixture classifiers trained with the Expectation Maximization (EM) algorithm [9], (3) RBF classifiers [8] with weights trained using the pseudoinverse method computed via Singular Value Decomposition (SVD), and (4) KNN classifiers. Covariance matrices in the Gaussian mixture classifiers were constrained to be diagonal and tied to be the same between mixture components in all classes. The RBF classifiers were of the form\n\n$$\\text{Decide class } i = \\arg\\max_{i} \\sum_{j=1}^{J} w_{ij} \\exp\\left(-\\frac{\\|X - \\mu_j\\|^2}{2 h \\sigma_j^2}\\right) \\qquad (1)$$\n\nwhere\n\n$X$ = acoustic vector input,\n$i$ = class label,\n$J$ = number of centers,\n$w_{ij}$ = weight from $j$th center to $i$th class output,\n$\\mu_j, \\sigma_j^2$ = $j$th center and variance, and\n$h$ = spread factor.\n\nThe center locations ($\\mu_j$'s) were obtained from either k-means or Gaussian mixture clustering. The variances ($\\sigma_j^2$'s) were either the variances of the individual k-means clusters or those of the individual Gaussian mixture components, depending on which clustering algorithm was used. Results for k = 1 are reported for the KNN classifier because this provided best performance.\n\nThe Gaussian mixture classifier was selected as a reference conventional non-discriminant classifier. A Gaussian mixture classifier can provide good models for multimodal and non-Gaussian distributions by using many mixture components. It also includes, as a special case, the more common and well-known unimodal Gaussian classifier, which performs poorly when the input distribution is not Gaussian. Very few benchmarking studies have been performed to evaluate the relative performance of Gaussian mixture and neural net classifiers, although mixture models have been used successfully in HMM recognizers [11]. RBF classifiers were used because they train rapidly, and recent benchmarking studies show that they perform as well as MLP classifiers on speech problems [8].\n\n[Table 1 body lost in extraction. Surviving header fragments: GAUSSIAN; Mixtures per Class. Surviving notes: centers from Gaussian mixture clustering, h = 150; centers from k-means clustering, h = 150.]\n\nTable 1: Percentage errors from the second stage classifier, averaged over all 16 talkers.\n\n2.2 Results of Second Stage Classification\n\nTable 1 shows the error rates for the second stage system of Fig. 1, averaged over all talkers. 
The second stage system improved performance over the baseline HMM system when the vocabulary was small (B, D and G). Error rates decreased from 7.9% for the baseline HMM recognizer to 4.2% for the RBF second stage classifier. There was no improvement for the E-set vocabulary task. With the best RBF second stage classifier, the error rate rose from 11.3% for the baseline HMM to 12.8%. In the E-set results, MLP and RBF classifiers, with error rates of 13.4% and 12.8%, performed considerably better than the Gaussian (21.2%), Gaussian mixture (20.6%) and KNN classifiers (36.0%).\n\nThe second stage approach is effective for a very small vocabulary but not for a larger vocabulary task. This may be due to a combination of limited training data and the increased complexity of decision regions as vocabulary size and dimensionality grow. When the vocabulary size increased from 3 to 9, the input dimensionality of the classifiers scaled up by a factor of 3 (from 48 to 144) but the number of training tokens increased only by the same factor (from 30 to 90). It is, in general, possible for the number of training tokens required for good performance to scale up exponentially with the input dimensionality. MLP and RBF classifiers appear to be affected by this problem but not as strongly as Gaussian, Gaussian mixture, and KNN classifiers.\n\n3 Discriminant Pre-Processing\n\nSecond stage classifiers will not work well if the nodal matching scores do not lead to good discrimination. Current conventional HMM recognizers use non-discriminant classifiers based on ML estimators to generate these scores. In the discriminant pre-processing approach, the ML classifiers in an HMM recognizer are replaced by discriminant classifiers. 
\n\nAll the experiments in this section are based on the phonemes /b, d, dʒ/ from the speaker dependent TI-46 Word database. Spectral pre-processing consisted of extracting the first 12 mel-scaled cepstral coefficients and ignoring the 0th cepstral coefficient (energy), for each 10 ms frame. For multi-frame inputs, adjacent frames were 20 msec. apart (skipping every other frame). The database was segmented with a conventional high-performance continuous-observation HMM recognizer using forced Viterbi decoding on the correct word. The phonemes /b/, /d/ and /dʒ/ from the letters \"B\", \"D\" and \"G\" (/#_i/ context) were then extracted. This resulted in an average of 95 training and 158 testing frames per talker per word using the 10 training and 16 testing words per talker in the 16 talker database. Talker dependent results, averaged over all 16 talkers, are reported here.\n\nPreliminary experiments using MLP, RBF, KNN, Gaussian, and Gaussian mixture classifiers indicated that RBF classifiers with Gaussian basis functions and a spread factor of 50 consistently yielded close to best performance. RBF classifiers also provided much shorter training times than MLP classifiers. RBF classifiers (as in Eq. 1) with h = 50 were thus used in all experiments presented in this section. The parameters of the RBF classifiers were determined as described in Sec. 2.1 above.\n\nGaussian mixture classifiers were used as reference conventional non-discriminant classifiers. In the preliminary experiments, they also provided close to best performance, and outperformed KNN and unimodal Gaussian classifiers. Covariance matrices were constrained, as described in Sec. 2.1. 
Although full and independent covariance matrices were advantageous for the unimodal Gaussian classifier and Gaussian mixture classifiers with few mixture components, best performance was provided using many mixture components and constrained covariance matrices. A Gaussian \"tied-mixture\" classifier was also used. This is a Gaussian mixture classifier where all classes share the same mixture components but have different mixture weights. It is trained in two stages. In the first stage, class independent mixture centers are computed by k-means clustering, and mixture variances are the variances of the individual k-means clusters. In the second stage, the ML estimates of the class dependent mixture weights are computed while holding mixture components fixed.\n\n[Figure 2 plots: two panels; horizontal axis TOTAL NUMBER OF KMEANS CENTERS; one curve each for 1, 2, 3, 4, and 5 input frames.]\n\nFigure 2: Frame-level error rates for Gaussian tied-mixture and RBF classifiers as a function of the total number of unique centers. Multi-frame results had context frames adjoined together at the input. Centers for both classifiers were determined using k-means clustering.\n\n3.1 Frame Level Results\n\nError rates for classifying phonemes based on single frames are shown in Fig. 2 for the Gaussian tied-mixture classifier (left) and RBF classifier (right). These results were obtained using k-means centers. Superior frame-level error rates were consistently provided by the RBF classifier in all experimental variations of this study. 
This is expected since RBF classifiers use an objective function which is directly related to classification error, whereas the objective of non-discriminant classifiers, modeling the class dependent probability density functions, is only indirectly related to classification error.\n\n3.2 Phone Level Results\n\n[Figure 3 plots: three panels, (a) GAUSS. TIED MIX., (b) RBF, (c) WIDENED RBF; horizontal axis TOTAL NUMBER OF KMEANS CENTERS (25, 50, 75).]\n\nFigure 3: Phone-level error rates using (a) Gaussian tied-mixture, (b) RBF and (c) 5% widened RBF classifiers, as a function of the total number of unique centers. Gaussian classifier phone-level results were obtained by accumulating frame-level scores via multiplication. RBF classifier frame-level scores were accumulated via addition. Symbols are as in Fig. 2.\n\nIn a single node HMM, classifier scores for the frames in a phone segment are accumulated to obtain phone-level results. For conventional HMM recognizers that use non-discriminant classifiers, this score accumulation is done by assuming independent frames, which allows the frame-level scores to be multiplied together:\n\n$$\\text{Prob}(\\text{phone}) = \\text{Prob}(z_1, z_2, \\ldots, z_N) = \\text{Prob}(z_1)\\,\\text{Prob}(z_2) \\cdots \\text{Prob}(z_N) \\qquad (2)$$\n\nwhere $z_1, \\ldots, z_N$ are the input frames in an N-frame phone. Eq. 2 does not apply to discriminant classifiers. RBF classifier outputs are not constrained to lie between 0 and 1. They do not necessarily behave like probabilities and do not perform well when their frame scores are multiplied together. The RBF classifier's frame-level scores were thus accumulated, instead, by addition. 
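As an illustration of the two accumulation rules described above, the following Python sketch scores a three-frame phone both ways. It is a minimal sketch under stated assumptions, not the implementation used in these experiments: the function names, the three-class layout, and the numeric frame scores are all hypothetical, and the product of Eq. 2 is computed as a sum of logarithms, which selects the same class while avoiding numerical underflow.

```python
import math

# Hypothetical illustration only: names and numbers are assumptions,
# not taken from the experiments reported in this paper.

def accumulate_by_product(frame_probs):
    # Eq. 2: independent frames, so per-frame class probabilities multiply.
    # Summing logs is the same product in the log domain.
    # Requires every probability to be strictly positive.
    n_classes = len(frame_probs[0])
    return [sum(math.log(frame[c]) for frame in frame_probs)
            for c in range(n_classes)]

def accumulate_by_sum(frame_scores):
    # Discriminant outputs (e.g. RBF scores) need not lie in [0, 1],
    # so they are added across frames instead of multiplied.
    n_classes = len(frame_scores[0])
    return [sum(frame[c] for frame in frame_scores)
            for c in range(n_classes)]

def decide(phone_scores):
    # Pick the class with the largest accumulated score.
    return max(range(len(phone_scores)), key=lambda c: phone_scores[c])

# Three frames, three classes (made-up values).
gauss_frames = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.7, 0.2, 0.1]]
rbf_frames = [[0.9, -0.1, 0.2], [0.1, 0.8, 0.1], [1.1, 0.0, -0.1]]

gauss_choice = decide(accumulate_by_product(gauss_frames))
rbf_choice = decide(accumulate_by_sum(rbf_frames))
```

With these made-up scores both rules select class 0. The point is only that RBF outputs, which may be negative, can be summed across frames but cannot be passed through a logarithm or meaningfully multiplied.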
Phone-level error rates obtained by accumulating frame-level scores from the Gaussian tied-mixture and RBF classifiers are shown in Figs. 3(a) and (b). Best performance was provided by the Gaussian tied-mixture classifier with 50 k-means centers and no context frames (2.6% error rate, versus 3.9% for the RBF classifier with 75 centers and 1 context frame).\n\nThe good phone-level performance provided by the Gaussian tied-mixture classifier in Fig. 3(a) is partly due to the near correctness of the Gaussian mixture distribution assumption and the independent frames assumption (Eq. 2). To address the poor phone-level performance of the RBF classifier, we examine solutions that use smoothing to directly extend good frame-level results to acceptable phone-level performance. Smoothing was performed both by passing the classifier outputs through a sigmoid function¹ and by increasing the spread (h in Eq. 1) after RBF weights were trained. Increasing h was more effective.\n\n¹ The sigmoid function is of the form $y = 1/(1 + e^{-(x-.5)2})$ where $x$ is the input (an output from the RBF classifier) and $y$ is the output used for classification.\n\nIncreasing h has the effect of \"widening\" the basis functions. This smooths the discriminant functions produced by the RBF classifier to compensate for limited training data. If basis function widening occurs before weights are trained, then weight training will effectively compensate for the increase. This was verified in preliminary experiments, which showed that if h was increased before weights were trained, little difference in performance was observed as h varied from 50 to 200. Increasing h by 5% after weights were trained resulted in a slightly different frame-level performance (sometimes better, sometimes worse), but a significant improvement in phone-level results for all experimental variations of this study. In Fig. 3(c), a 5% widening of the basis function improved the performance of the baseline RBF classifier. It did not, however, improve performance over that provided by the Gaussian tied-mixture classifier without context frames at the input. The lowest error rate provided by the smoothed RBF is now 3.4% using 75 k-means centers and 2 context frames (compared with 2.6% for the Gaussian tied-mixture classifier with 50 centers and no context).\n\n[Figure 4 plot: horizontal axis NUMBER OF FRAMES; curves for GAUSS, RBF, and Smoothed RBF.]\n\nFigure 4: Phone-level error rates, as a function of the number of frames, for Gaussian mixture with 9 mixtures per class, and RBF classifiers with centers from the Gaussian mixture classifier (27 total centers for this 3 class task).\n\nError rates for the Gaussian mixture classifier with 9 mixtures per class are plotted versus the number of frames in Fig. 4, along with the results for RBF classifiers with centers taken from the Gaussian mixture classifier. Similar behavior was observed in all experimental variations of this study. There are three main observations: (1) the Gaussian mixture classifier without context frames provided best performance but degraded as the number of input frames increased, (2) RBF classifiers can outperform Gaussian mixture classifiers with many input frames, and (3) widening the basis functions after weights were trained improved the RBF classifier's performance.\n\n4 Summary\n\nTwo techniques were explored that integrated discriminant classifiers into HMM speech recognizers. 
In second-stage discrimination, an RBF second-stage classifier halved the error rates in a {BDG} vocabulary task but provided no performance improvement in an E-set vocabulary task. For integrating at the pre-processing level, RBF classifiers provided superior frame-level performance over conventional Gaussian mixture classifiers. At the phone-level, best performance was provided by a Gaussian mixture classifier with a single frame input; however, the RBF classifier outperformed the Gaussian mixture classifier when the input contained multiple context frames. Both sets of experiments indicated an ability for the RBF classifier to integrate the large amount of information provided by inputs with high dimensionality. They suggest that an HMM recognizer integrated with RBF and other discriminant classifiers may provide improved recognition by providing better frame-level discrimination and by utilizing features that are ignored by current \"state-of-the-art\" HMM speech recognizers. This is consistent with the results of Franzini [3] and Bourlard [1], who used many context frames in their implementation of discriminant pre-processing which embedded MLPs into HMM recognizers.\n\nCurrent efforts focus on studying techniques to improve the performance of discriminant classifiers for phones, words, and continuous speech. Approaches include accumulating scores from lower level speech units and using objective functions that depend on higher level speech units, such as phones and words. Work is also being performed to integrate discriminant classification algorithms into HMM recognizers using Viterbi training.\n\nReferences\n\n[1] H. Bourlard and N. Morgan. Merging multilayer perceptrons in hidden Markov models: Some experiments in continuous speech recognition. Technical Report TR-89-033, International Computer Science Institute, Berkeley, CA., July 1989.\n\n[2] Peter F. Brown. The Acoustic-Modeling Problem in Automatic Speech Recognition. PhD thesis, Carnegie Mellon University, May 1987.\n\n[3] Michael A. Franzini, Michael J. Witbrock, and Kai-Fu Lee. A connectionist approach to continuous speech recognition. In Proceedings of the IEEE ICASSP, May 1989.\n\n[4] William Y. Huang and Richard P. Lippmann. Comparisons between conventional and neural net classifiers. In 1st International Conference on Neural Networks, pages IV-485. IEEE, June 1987.\n\n[5] Kai-Fu Lee and Sanjoy Mahajan. Corrective and reinforcement learning for speaker-independent continuous speech recognition. Technical Report CMU-CS-89-100, Computer Science Department, Carnegie-Mellon University, January 1989.\n\n[6] Yuchun Lee and Richard Lippmann. Practical characteristics of neural network and conventional pattern classifiers on artificial and speech problems. In Advances in Neural Information Processing Systems 2, Denver, CO., 1989. IEEE, Morgan Kaufmann. In Press.\n\n[7] R. P. Lippmann. Review of neural networks for speech recognition. Neural Computation, 1(1):1-38, 1989.\n\n[8] Richard P. Lippmann. Pattern classification using neural networks. IEEE Communications Magazine, 27(11):47-63, Nov. 1989.\n\n[9] G. J. McLachlan. Mixture Models. Marcel Dekker, New York, N. Y., 1988.\n\n[10] D. B. Paul. A speaker-stress resistant HMM isolated word recognizer. In Proceedings of the IEEE ICASSP, pages 713-716, April 1987.\n\n[11] L. R. Rabiner, B.-H. Juang, S. E. Levinson, and M. M. Sondhi. Recognition of isolated digits using hidden Markov models with continuous mixture densities. AT&T Technical Journal, 64(6):1211-1233, 1985.", "award": [], "sourceid": 192, "authors": [{"given_name": "William", "family_name": "Huang", "institution": null}, {"given_name": "Richard", "family_name": "Lippmann", "institution": null}]}