{"title": "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters", "book": "Advances in Neural Information Processing Systems", "page_first": 211, "page_last": 217, "abstract": null, "full_text": "Training Stochastic Model Recognition Algorithms as Networks can lead to Maximum Mutual Information Estimation of Parameters \n\nJohn S. Bridle \n\nRoyal Signals and Radar Establishment \n\nGreat Malvern \n\nWorcs. \n\nUK \n\nWR14 3PS \n\nABSTRACT \n\nOne of the attractions of neural network approaches to pattern recognition is the use of a discrimination-based training method. We show that once we have modified the output layer of a multi-layer perceptron to provide mathematically correct probability distributions, and replaced the usual squared error criterion with a probability-based score, the result is equivalent to Maximum Mutual Information training, which has been used successfully to improve the performance of hidden Markov models for speech recognition. If the network is specially constructed to perform the recognition computations of a given kind of stochastic model based classifier then we obtain a method for discrimination-based training of the parameters of the models. Examples include an HMM-based word discriminator, which we call an 'Alphanet'. \n\n1 INTRODUCTION \n\nIt has often been suggested that one of the attractions of an adaptive neural network (NN) approach to pattern recognition is the availability of discrimination-based training (e.g. in Multilayer Perceptrons (MLPs) using Back-Propagation). 
Among the disadvantages of NN approaches are the lack of theory about what can be computed with any particular structure, what can be learned, how to choose a network architecture for a given task, and how to deal with data (such as speech) in which an underlying sequential structure is of the essence. There have been attempts to build internal dynamics into neural networks, using recurrent connections, so that they might deal with sequences and temporal patterns [1, 2], but there is a lack of relevant theory to inform the choice of network type. \n\nHidden Markov models (HMMs) are the basis of virtually all modern automatic speech recognition systems. They can be seen as an extension of the parametric statistical approach to pattern recognition, to deal (in a simple but principled way) with temporal patterning. Like most parametric models, HMMs are usually trained using within-class maximum-likelihood (ML) methods, and an EM algorithm due to Baum and Welch is particularly attractive (see for instance [3]). However, recently some success has been demonstrated using discrimination-based training methods, such as the so-called Maximum Mutual Information criterion [4] and Corrective Training [5]. \n\nThis paper addresses two important questions: \n\n\u2022 How can we design Neural Network architectures with at least the desirable properties of methods based on stochastic models (such as hidden Markov models)? \n\n\u2022 What is the relationship between the inherently discriminative neural network training and the analogous MMI training of stochastic models? \n\nWe address the first question in two steps. 
Firstly, to make sure that the outputs of our network have the simple mathematical properties of conditional probability distributions over class labels we recommend a generalisation of the logistic nonlinearity; this enables us (but does not require us) to replace the usual squared error criterion with a more appropriate one, based on relative entropy. Secondly, we also have the option of designing networks which exactly implement the recognition computations of a given stochastic model method. (The resulting 'network' may be rather odd, and not very 'neural', but this is engineering, not biology.) As a contribution to the investigation of the second question, we point out that optimising the relative entropy criterion is exactly equivalent to performing Maximum Mutual Information Estimation. \n\nBy way of illustration we describe three 'networks' which implement stochastic model classifiers, and show how discrimination training can help. \n\n2 TRAINABLE NETWORKS AS PARAMETERISED CONDITIONAL DISTRIBUTION FUNCTIONS \n\nWe consider a trainable network, when used for pattern classification, as a vector function Q(x, θ) from an input vector x to a set of indicators of class membership, {Q_j}, j = 1, ..., N. The parameters θ modify the transfer function. In a multi-layer perceptron, for instance, the parameters would be values of weights. Typically, we have a training set of pairs (x_t, c_t), t = 1, ..., T, of inputs and associated true class labels, and we have to find a value for θ which specialises the function so that it is consistent with the training set. 
A common procedure is to minimise E(θ), the sum of the squares of the differences between the network outputs and true class indicators, or targets: \n\nE(θ) = Σ_{t=1}^{T} Σ_{j=1}^{N} (Q_j(x_t, θ) - δ_{j,c_t})^2, \n\nwhere δ_{j,c} = 1 if j = c, otherwise 0. E and Q will be written without the θ argument where the meaning is clear, and we may drop the t subscript. \n\nIt is well known that the value of F(x) which minimises the expected value of (F(x) - y)^2 is the expected value of y given x. The expected value of δ_{j,c_t} is P(C = j | X = x_t), the probability that the class associated with x_t is the jth class. \n\nFrom now on we shall assume that the desired output of a classifier network is this conditional probability distribution over classes, given the input. \n\nThe outputs must satisfy certain simple constraints if they are to be interpretable as a probability distribution. For any input, the outputs must all be positive and they must sum to unity. The use of logistic nonlinearities at the outputs of the network ensures positivity, and also ensures that each output is less than unity. These constraints are appropriate for outputs that are to be interpreted as probabilities of Boolean events, but are not sufficient for 1-from-N classifiers. \n\nGiven a set of unconstrained values, V_j(x), we can ensure both conditions by using a Normalised Exponential transformation: \n\nQ_j(x) = e^{V_j(x)} / Σ_k e^{V_k(x)} \n\nThis transformation can be considered a multi-input generalisation of the logistic, operating on the whole output layer. It preserves the rank order of its input values, and is a differentiable generalisation of the 'winner-take-all' operation of picking the maximum value. For this reason we like to refer to it as softmax. 
Like the logistic, it has a simple implementation in transistor circuits [6]. \n\nIf the network is such that we can be sure the values we have are all positive, it may be more appropriate just to normalise them. In particular, if we can treat them as likelihoods of the data given the possible classes, L_j(x) = P(X = x | C = j), then normalisation produces the required conditional distribution (assuming equal prior probabilities for the classes). \n\n3 RELATIVE ENTROPY SCORING FOR CLASSIFIERS \n\nIn this section we introduce an information-theoretic criterion for training 1-from-N classifier networks, to replace the squared error criterion, both for its intrinsic interest and because of the link to discriminative training of stochastic models. \n\nChoosing the class with highest likelihood is justified by Bayes' rule, \n\nP(c | x) = P(x | c) P(c) / P(x), \n\nif we assume equal priors P(c) (this can be generalised) and see that the denominator P(x) = Σ_c P(x | c) P(c) is the same for all classes. \n\nIt is also usual to train such classifiers by maximising the data likelihood given the correct classes. Maximum Likelihood (ML) training is appropriate if we are choosing from a family of pdfs which includes the correct one. In most real-life applications of pattern classification we do not have knowledge of the form of the data distributions, although we may have some useful ideas. In that case ML may be a rather bad approach to pdf estimation for the purpose of pattern classification, because what matters is the relative densities. \n\nAn alternative is to optimise a measure of success in pattern classification, and this can make a big difference to performance, particularly when the assumptions about the form of the class pdfs are badly wrong. 
\n\nTo make the likelihoods produced by a SM classifier look like NN outputs we can simply normalise them: \n\nQ_j(x) = L_j(x) / Σ_k L_k(x) \n\nThen we can use Neural Network optimisation methods to adjust the parameters. \n\nThe average mutual information (MI) between random variables X and Y is a sum, weighted by the joint probability, of the MI of the joint events: \n\nI(X, Y) = Σ_{(x,y)} P(X=x, Y=y) log [ P(X=x, Y=y) / (P(X=x) P(Y=y)) ] \n\nFor discrimination training of sets of stochastic models, Bahl et al. suggest maximising the Mutual Information, I, between the training observations and the choice of the corresponding correct class. \n\nI(X, C) = Σ_t log [ P(C=c_t, X=x_t) / (P(C=c_t) P(X=x_t)) ] = Σ_t log [ P(C=c_t | X=x_t) P(X=x_t) / (P(C=c_t) P(X=x_t)) ]. \n\nP(C=c_t | X=x_t) should be read as the probability that we choose the correct class for the tth training example. If we are choosing classes according to the conditional distribution computed using parameters θ then P(C=c_t | X=x_t) = Q_{c_t}(x_t, θ), and \n\nI(X, C) = Σ_t log Q_{c_t}(x_t, θ) - Σ_t log P(C=c_t). \n\nIf the second term involving the priors is fixed, we are left with maximising \n\nΣ_t log Q_{c_t}(x_t, θ) = -J. \n\nThe RE-based score we use is J = - Σ_{t=1}^{T} Σ_{j=1}^{N} P_{jt} log Q_j(x_t), where P_{jt} is the probability of class j associated with input x_t in the training set. If as usual the training set specifies only one true class, c_t, for each x_t then P_{jt} = δ_{j,c_t} and \n\nJ = - Σ_{t=1}^{T} log Q_{c_t}(x_t), \n\nminus the sum of the logs of the outputs for the correct classes. \n\nJ can be derived from the Relative Entropy of distribution Q with respect to the true conditional distribution P, averaged over the input distribution: \n\n∫ dx P(X=x) G(Q | P), where G(Q | P) = - Σ_c P(c | x) log [ Q_c(x) / P(c | x) ]. \n\nRelative entropy is also known as discrimination information, cross entropy, asymmetric divergence, directed divergence, I-divergence, and Kullback-Leibler number. 
RE scoring is the basis for the Boltzmann Machine learning algorithm [7] and has also been proposed and used for adaptive networks with continuous-valued outputs [8, 9, 10, 11], but usually in the form appropriate to separate logistics and independent Boolean targets. An exception is [12]. \n\nThere is another way of thinking about this 'log-of-correct-output' score. Assume that the way we would use the outputs of the network is that, rather than choosing the class with the largest output, we choose randomly, picking from the distribution specified by the outputs. (Pick class j with probability Q_j.) The probability of choosing the class c_t for training sample x_t is simply Q_{c_t}(x_t). The probability of choosing the correct class labels for all the training set is Π_{t=1}^{T} Q_{c_t}(x_t). We simply seek to maximise this probability, or what is equivalent, to minimise minus its log: \n\nJ = - Σ_{t=1}^{T} log Q_{c_t}(x_t). \n\nIn order to compute the partial derivatives of J with respect to the parameters of the network, we first need ∂J/∂Q_j = -P_{jt}/Q_j. The details of the back-propagation depend on the form of the network, but if the final non-linearity is a normalised exponential (softmax), \n\nQ_j(x) = exp(V_j(x)) / Σ_k exp(V_k(x)), \n\nthen [6] \n\n∂J_t/∂V_j = Q_j(x_t) - δ_{j,c_t}. \n\nWe see that the derivative before the output nonlinearity is the difference between the corresponding output and a one-from-N target. We conclude that softmax output stages and 1-from-N RE scoring are natural partners. \n\n4 DISCRIMINATIVE TRAINING \n\nIn stochastic model (probability-density) based pattern classification we usually compute likelihoods of the data given models for each class, P(x | c), and choose the class with highest likelihood. 
\nSo  minimising our J  criterion is  also  maximising Bahl's mutual information.  (Also \nsee  [13).) \n\n5  STOCHASTIC MODEL CLASSIFIERS AS  NETWORKS \n5.1  EXAMPLE  ONEs  A  PAIR OF MULTIVARIATE  GAUSSIANS \n\nThe conditional distribution for  a  pair of multivariate  Gaussian densities  with  the \nsame  arbitrary  covariance  matrix  is  a  logistic  function  of a  weighted  sum  of the \ninput  coordinates  (plus  a  constant).  Therefore,  even  if we  make  such  incorrect \nassumptions as equal priors and spherical  unit covariances,  it is still possible to find \nvalues for  the  parameters of the  model  (the  positions of the  means of the assumed \ndistributions)  for  which  the  form  of the  conditional  distribution  is  correct.  (The \nmeans  may  be  far  from  the  means  of  the  true  distributions  and  from  the  data \nmeans.)  Of course  in  this  case  we  have  the  alternative  of using  a  weighted-sum \nlogistic, unit  to  compute  the  conditional  probability:  the  parameters  are  then  the \nweights. \n\n5.2  EXAMPLE TWO:  A  MULTI-CLASS  GAUSSIAN CLASSIFIER \n\nConsider a  model in which  the distributions  for  each  class  are  multi-variate Gaus(cid:173)\nsian,  with  equal  isotropic  unit  variances,  and  different  means,  {mj}.  The  prob(cid:173)\nability  distribution  over  class  labels,  given  an  observation  IB I  is  P( c  =  j  lIB)  = \ne 1'; / L\" e V\",  where  V;  =  -IIIB  - mj 112.  This  can  be  interpreted  as  a  one-layer \nfeed-forward  non-linear network.  The usual weighted sums are replaced  by squared \nEuclidean distances,  and  the  usual  logistic output non-linearities are  replaced  by a \nnormalised exponential. 
\n\nFor a particular two-dimensional 10-class problem, derived from Peterson and Barney's formant data, we have demonstrated [6] that training such a network can cause the m_j to move from their \"natural\" positions at the data means (the in-class maximum likelihood estimates), and this can improve classification performance on unseen data (from 68% correct to 78%). \n\n5.3 EXAMPLE THREE: ALPHANETS \n\nConsider a set of hidden Markov models (HMMs), one for each word, each parameterised by a set of state transition probabilities, {a^k_{ij}}, and observation likelihood functions {b^k_j(x)}, where a^k_{ij} is the probability that in model k state i will be followed by state j, and b^k_j(x) is the likelihood of model k emitting observation x from state j. For simplicity we insist that the end of the word pattern corresponds to state N of a model. \n\nThe likelihood, L_k(x_1^M), of model k generating a given sequence x_1^M = x_1, ..., x_M is a sum, over all sequences of states, of the joint likelihood of that state sequence and the data: \n\nL_k(x_1^M) = Σ_{s_1 ... s_M} Π_{t=2}^{M} a^k_{s_{t-1},s_t} b^k_{s_t}(x_t) with s_M = N. \n\nThis can be computed efficiently via the forward recursion [3], which we can think of as a recurrent network. (Note that t is used as a time index here.) \n\nIf the observation sequence x_1^M could only have come from one of a set of known, equally likely models, then the posterior probability that it was from model k is \n\nP(r = k | x_1^M) = Q_k(x_1^M) = L_k(x_1^M) / Σ_r L_r(x_1^M). \n\nThese numbers are the output of our special \"recurrent neural network\" for isolated word discrimination, which we call an \"Alphanet\" [14]. 
Backpropagation of partial derivatives of the J score has the form of the backward recurrence used in the Baum-Welch algorithm, but includes discriminative terms, and we obtain the gradient of the relative entropy/mutual information. \n\n6 CONCLUSIONS \n\nDiscrimination-based training is different from within-class parameter estimation, and it may be useful. (Also see [15].) Discrimination-based training for stochastic models and for networks are not distinct, and in some cases can be mathematically identical. \n\nThe notion of specially constructed 'network' architectures which implement stochastic model recognition algorithms provides a way to construct fertile hybrids. For instance, a Gaussian classifier (or an HMM classifier) can be preceded by a nonlinear transformation (perhaps based on semilinear logistics) and all the parameters of the system adjusted together. This seems a useful approach to automating the discovery of 'feature detectors'. \n\n\u00a9 British Crown Copyright 1990 \n\nReferences \n\n[1] R P Lippmann. Review of neural networks for speech recognition. Neural Computation, 1(1), 1989. \n\n[2] R L Watrous. Connectionist speech recognition using the temporal flow model. In Proc. IEEE Workshop on Speech Recognition, June 1988. \n\n[3] A B Poritz. Hidden Markov models: a guided tour. In Proc. IEEE Int. Conf. Acoustics Speech and Signal Processing, pages 7-13, 1988. \n\n[4] L R Bahl, P F Brown, P V de Souza, and R L Mercer. Maximum mutual information estimation of hidden Markov model parameters. In Proc. IEEE Int. Conf. Acoustics Speech and Signal Processing, pages 49-52, 1986. \n\n[5] L R Bahl, P F Brown, P V de Souza, and R L Mercer. 
A new algorithm for the estimation of HMM parameters. In Proc. IEEE Int. Conf. Acoustics Speech and Signal Processing, pages 493-496, 1988. \n\n[6] J S Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F Fougelman-Soulie and J Herault, editors, Neuro-computing: algorithms, architectures and applications, Springer-Verlag, 1989. \n\n[7] D H Ackley, G E Hinton, and T J Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147-168, 1985. \n\n[8] L Gillick. Probability scores for backpropagation networks. July 1987. Personal communication. \n\n[9] G E Hinton. Connectionist Learning Procedures. Technical Report CMU-CS-87-115, Carnegie Mellon University Computer Science Department, June 1987. \n\n[10] E B Baum and F Wilczek. Supervised learning of probability distributions by neural networks. In D Anderson, editor, Neural Information Processing Systems, pages 52-61, Am. Inst. of Physics, 1988. \n\n[11] S Solla, E Levin, and M Fleisher. Accelerated learning in layered neural networks. Complex Systems, January 1989. \n\n[12] E Yair and A Gersho. The Boltzmann Perceptron Network: a soft classifier. In D Touretzky, editor, Advances in Neural Information Processing Systems 1, San Mateo, CA: Morgan Kaufmann, 1989. \n\n[13] P S Gopalakrishnan, D Kanevsky, A Nadas, D Nahamoo, and M A Picheny. Decoder selection based on cross-entropies. In Proc. IEEE Int. Conf. Acoustics Speech and Signal Processing, pages 20-23, 1988. \n\n[14] J S Bridle. Alphanets: a recurrent 'neural' network architecture with a hidden Markov model interpretation. Speech Communication, Special Neurospeech issue, February 1990. 
\n\n[15] L Niles, H Silverman, G Tajchman, and M Bush. How limited training data can allow a neural network to out-perform an 'optimal' classifier. In Proc. IEEE Int. Conf. Acoustics Speech and Signal Processing, 1989. \n\n", "award": [], "sourceid": 195, "authors": [{"given_name": "John", "family_name": "Bridle", "institution": null}]}