{"title": "The Gamma MLP for Speech Phoneme Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 785, "page_last": 791, "abstract": null, "full_text": "The Gamma MLP for Speech Phoneme Recognition \n\nSteve Lawrence*, Ah Chung Tsoi, Andrew D. Back \n\n{lawrence,act,back}@elec.uq.edu.au \n\nDepartment of Electrical and Computer Engineering \n\nUniversity of Queensland \n\nSt. Lucia Qld 4072 Australia \n\n*http://www.neci.nj.nec.com/homepages/lawrence \n\nAbstract \n\nWe define a Gamma multi-layer perceptron (MLP) as an MLP with the usual synaptic weights replaced by gamma filters (as proposed by de Vries and Principe, 1992) and associated gain terms throughout all layers. We derive gradient descent update equations and apply the model to the recognition of speech phonemes. We find that both the inclusion of gamma filters in all layers, and the inclusion of synaptic gains, improve the performance of the Gamma MLP. We compare the Gamma MLP with TDNN, Back-Tsoi FIR MLP, and Back-Tsoi IIR MLP architectures, and a local approximation scheme. We find that the Gamma MLP results in a substantial reduction in error rates. \n\n1 INTRODUCTION \n\n1.1 THE GAMMA FILTER \n\nInfinite Impulse Response (IIR) filters have a significant advantage over Finite Impulse Response (FIR) filters in signal processing: the length of the impulse response is uncoupled from the number of filter parameters. The length of the impulse response is related to the memory depth of a system, and hence IIR filters allow a greater memory depth than FIR filters of the same order. However, IIR filters are not widely used in practical adaptive signal processing. 
This may be attributed to the fact that a) there could be instability during training and b) the gradient descent training procedures are not guaranteed to locate the global optimum in the possibly non-convex error surface (Shynk, 1989). \n\nDe Vries and Principe proposed using gamma filters (de Vries and Principe, 1992), a special case of IIR filters, at the input to an otherwise standard MLP. The gamma filter is designed to retain the uncoupling of memory depth from the number of parameters provided by IIR filters, but to have simple stability conditions. \n\nThe output of a neuron in a multi-layer perceptron is computed using¹ $y_k^l = f\\left(\\sum_{i=0}^{N_{l-1}} w_{ki}^l y_i^{l-1}\\right)$. De Vries and Principe consider adding short term memory with delays: $y_k^l = f\\left(\\sum_{i=0}^{N_{l-1}} \\sum_{j=0}^{K} g_{kij}^l(t-j) y_i^{l-1}(t-j)\\right)$, where $g_{kij}^l(t) = \\frac{(\\mu_{ki}^l)^j}{(j-1)!} t^{j-1} e^{-\\mu_{ki}^l t}$, $j = 1, \\ldots, K$. The depth of the memory is controlled by $\\mu$, and $K$ is the order of the filter. For the discrete time case, we obtain the recurrence relation: $z_0(t) = x(t)$ and $z_j(t) = (1 - \\mu) z_j(t-1) + \\mu z_{j-1}(t-1)$ for $j = 1, \\ldots, K$. In this form, the gamma filter can be interpreted as a cascaded series of filter modules, where each module is a first order IIR filter with the transfer function $\\frac{\\mu}{q - (1 - \\mu)}$, where $q z_j(t) \\triangleq z_j(t+1)$. We have a filter with $K$ poles, all located at $1 - \\mu$. Thus, the gamma filter may be considered as a low pass filter for $\\mu < 1$. The value of $\\mu$ can be fixed, or it can be adapted during training. \n\n2 NETWORK MODELS \n\nFigure 1: A gamma filter synapse with an associated gain term 'c'. \n\nWe have defined a gamma MLP as a multi-layer perceptron where every synapse contains a gamma filter and a gain term, as shown in figure 1. 
The motivation behind the inclusion of the gain term is discussed later. A separate $\\mu$ parameter is used for each filter. Update equations are derived in a manner analogous to the standard MLP and can be found in Appendix A. The model is defined as follows. \n\n¹ where $y_k^l$ is the output of neuron $k$ in layer $l$, $N_l$ is the number of neurons in layer $l$, $w_{ki}^l$ is the weight connecting neuron $k$ in layer $l$ to neuron $i$ in layer $l-1$, $y_0^l = 1$ (bias), and $f$ is commonly a sigmoid function. \n\nDefinition 1 A Gamma MLP with $L$ layers excluding the input layer $(0, 1, \\ldots, L)$, gamma filters of order $K$, and $N_0, N_1, \\ldots, N_L$ neurons per layer, is defined as \n\n$$y_k^l(t) = f\\left(x_k^l(t)\\right)$$ \n$$x_k^l(t) = \\sum_{i=0}^{N_{l-1}} c_{ki}^l(t) \\sum_{j=0}^{K} w_{kij}^l(t) z_{kij}^l(t)$$ \n$$z_{kij}^l(t) = (1 - \\mu_{ki}^l(t)) z_{kij}^l(t-1) + \\mu_{ki}^l(t) z_{ki(j-1)}^l(t-1), \\quad 1 \\le j \\le K$$ \n$$z_{kij}^l(t) = y_i^{l-1}(t), \\quad j = 0 \\qquad (1)$$ \n\nwhere $y(t)$ = neuron output, $c_{ki}^l$ = synaptic gain, $f(\\alpha) = \\frac{e^{\\alpha/2} - e^{-\\alpha/2}}{e^{\\alpha/2} + e^{-\\alpha/2}}$, $k = 1, 2, \\ldots, N_l$ (neuron index), $l = 0, 1, \\ldots, L$ (layer), and $z_{kij}^l|_{i=0} = 1$, $w_{kij}^l|_{i=0, j \\ne 0} = 0$, $c_{ki}^l|_{i=0} = 1$ (bias). \n\nFor comparison purposes, we have used the TDNN (Time Delay Neural Network) architecture², the Back-Tsoi FIR³ and IIR MLP architectures (Back and Tsoi, 1991a) where every synapse contains an FIR or IIR filter and a gain term, and the local approximation algorithm used by Casdagli (k-NN LA) (Casdagli, 1991)⁴. The Gamma MLP is a special case of the IIR MLP. \n\n3 TASK \n\n3.1 MOTIVATION \n\nAccurate speech recognition requires models which can account for a high degree of variability in the data. Large amounts of data may be available but it may be impractical to use all of the information in standard neural network models. 
\n\nHypothesis: As the complexity of a problem increases (higher dimensionality, greater variety of training data), the error surface of a neural network becomes more complex. It may contain a number of local minima⁵, many of which may be much worse than the global minimum. The training (parameter estimation) algorithms become \"stuck\" in local minima which may be increasingly poor compared to the global optimum. The problem suffers from the so-called \"curse of dimensionality\" and the difficulty in optimizing a function with limited control over the nature of the error surface. \n\n² We use TDNN to refer to an MLP with a time window of inputs, not the replicated architecture introduced by Lang (Lang et al., 1990). \n\n³ We distinguish the Back-Tsoi FIR network from the Wan FIR network in that the Wan architecture has no synaptic gains, and the update algorithms are different. The Back-Tsoi update algorithm has provided better convergence in previous experiments. \n\n⁴ Casdagli created an affine model of the following form for each test pattern: $y^j = a_0 + \\sum_{i=1}^{n} a_i x_i^j$, where $k$ is the number of neighbors, $j = 1, \\ldots, k$, and $n$ is the input dimension. The resulting model is used to find $y$ for the test pattern. \n\n⁵ We note that it can be difficult to distinguish a true local minimum from a long plateau in the standard backpropagation algorithm. 
\n\nWe can identify two main reasons why the application of the Gamma MLP may be superior to the standard TDNN for speech recognition: a) the gamma filtering operation allows consideration of the input data using different time resolutions and can account for more past history of the signal, which can only be accounted for in an FIR or TDNN system by increasing the dimensionality of the model, and b) the low pass filtering nature of the gamma filter may create a smoother function approximation task, and therefore a smoother error surface for gradient descent⁶. \n\n3.2 TASK DETAILS \n\n[Figure 2 diagram labels: Model Input Window; Target Function; Network Output 1; Network Output 2; Classification; Frames of RASTA data; Sequence End] \n\nFigure 2: PLP input data format and the corresponding network target functions for the phoneme \"aa\". \n\nOur data consists of phonemes extracted from the TIMIT database and organized as a number of sequences as shown in figure 2 (example for the phoneme \"aa\"). One model is trained for each phoneme. Note that the phonemes are classified in context, with a number of different contexts, and that the surrounding phonemes are labelled only as not belonging to the target phoneme class. Raw speech data was pre-processed into a sequence of frames using the RASTA-PLP v2.0 software⁷. We used the default options for PLP analysis. The analysis window (frame) was 20 ms. Each succeeding frame overlaps with the preceding frame by 10 ms. Nine PLP coefficients together with the signal power are extracted and used as features describing each frame of data. Phonemes used in the current tests were the vowel \"aa\" and the fricative \"s\". 
The phonemes were extracted from speakers coming from the same demographic region in the TIMIT database. Multiple speakers were used and the speakers used in the test set were not contained in the training set. The training set contained 4000 frames, where each phoneme is roughly 10 frames. The test set contained 2000 frames, and an additional validation set containing 2000 frames was used to control generalization. \n\n⁶ If we consider a very simple network and derive the relationship of the smoothness of the required function approximation to the smoothness of the error surface, this statement appears to be valid. However, it is difficult to show a direct relationship for general networks. \n\n⁷ Obtained from ftp://ftp.icsi.berkeley.edu/pub/speech/rasta2.0.tar.Z. \n\n4 RESULTS \n\nTwo outputs were used in the neural networks as shown by the target functions in figure 2, corresponding to the phoneme being present or not. A confidence criterion was used: $y_{\\max} \\times (y_{\\max} - y_{\\min})$ (for softmax outputs). The initial learning rate was 0.1, 10 hidden nodes were used, FIR and Gamma orders were 5 (6 taps), the TDNN and k-NN models had an input window of 6 steps in time, the tanh activation function was used, target outputs were scaled between -0.8 and 0.8, stochastic update was used, and initial weights were chosen from a set of candidates based on training set performance. The learning rate was varied over time according to the schedule: $\\eta = \\eta_0 / \\left( \\frac{n}{N/2} + \\frac{c_1}{\\max\\left(1, c_1 - \\frac{\\max(0, c_1 (n - c_2 N))}{(1 - c_2) N}\\right)} \\right)$, where $\\eta$ = learning rate, $\\eta_0$ = initial learning rate, $N$ = total epochs, $n$ = current epoch, $c_1 = 50$, $c_2 = 0.65$. This is similar to the schedule proposed in (Darken and Moody, 1991) with an additional term to decrease the learning rate towards zero over the final epochs⁸. \n\n| Train Error % | 2-NN | 5-NN | 1st layer | All layers | Gains 1st layer | Gains all layers | \n| FIR MLP | - | - | 17.6 (0.43) | 14.5 (1.5) | 27.2 (0.59) | 40.9 (19.8) | \n| Gamma MLP | - | - | 7.78 (0.39) | 5.73 (0.88) | 6.07 (0.12) | 5.63 (1.68) | \n| TDNN | - | - | - | - | - | 14.4 (0.86) | \n| k-NN LA | 0 | 0 | - | - | - | - | \n\n| Test Error % | 2-NN | 5-NN | 1st layer | All layers | Gains 1st layer | Gains all layers | \n| FIR MLP | - | - | 22.2 (0.97) | 20.4 (0.61) | 29 (0.14) | 41 (21) | \n| Gamma MLP | - | - | 14.7 (0.16) | 13.5 (0.33) | 12.8 (1.0) | 12.7 (0.50) | \n| TDNN | - | - | - | - | - | 24.5 (0.68) | \n| k-NN LA | 31 | 28.4 | - | - | - | - | \n\n| Test False +ve | 2-NN | 5-NN | 1st layer | All layers | Gains 1st layer | Gains all layers | \n| FIR MLP | - | - | 13.5 (0.67) | 11.4 (2.0) | 4.5 (0.77) | 31.3 (49.0) | \n| Gamma MLP | - | - | 7.94 (0.45) | 7.01 (0.47) | 6.83 (0.34) | 8.05 (1.8) | \n| TDNN | - | - | - | - | - | 13 (0.27) | \n| k-NN LA | 22.6 | 17.4 | - | - | - | - | \n\n| Test False -ve | 2-NN | 5-NN | 1st layer | All layers | Gains 1st layer | Gains all layers | \n| FIR MLP | - | - | 44.9 (2.6) | 44.1 (5.6) | 92.9 (2.4) | 66.4 (53) | \n| Gamma MLP | - | - | 32.2 (1.2) | 30.4 (2.2) | 28.4 (2.8) | 24.7 (4.4) | \n| TDNN | - | - | - | - | - | 54.6 (1.8) | \n| k-NN LA | 53 | 56.8 | - | - | - | - | \n\nTable 1: Results comparing the architectures and the use of filters in all layers and synaptic gains for the FIR and Gamma MLP models. The NMSE is followed by the standard deviation (in parentheses). The TDNN results are listed under an arbitrary column heading (gains and 1st layer/all layers does not apply). \n\nThe results of the simulations are shown in table 1⁹. Each result represents an average over four simulations with different random seeds; the standard deviation of the four individual results is also shown. 
The FIR and Gamma MLP networks have been tested both with and without synaptic gains, and with and without filters in the output layer synapses. These results are for the models trained on the \"s\" phoneme; results for the \"aa\" phoneme exhibit the same trend. \"Test false negative\" is probably the most important result here, and is shown graphically in figure 3. This is the percentage of times a true classification (i.e. the current phoneme is present) is incorrectly reported as false. \n\n⁸ Without this term we have encountered considerable parameter fluctuation over the last epoch. \n\n⁹ NMSE = $\\sum_{k=1}^{N} (d(k) - y(k))^2 / \\left( \\sum_{k=1}^{N} \\left( d(k) - \\left( \\sum_{k=1}^{N} d(k) \\right) / N \\right)^2 / N \\right)$. \n\n[Figure 3 plot: vertical axis ticks from 20 to 60; horizontal axis categories 2-NN, 5-NN, NG 1L, NG AL, G 1L, G AL] \n\nFigure 3: Percentage of false negative classifications on the test set. NG=No gains, G=Gains, 1L=filters in the first layer only, AL=filters in all layers. The error bars show plus and minus one standard deviation. The synaptic gains case for the FIR MLP is not shown as the poor performance compresses the remainder of the graph. Top to bottom, the lines correspond to: k-NN LA (left), TDNN, FIR MLP, and Gamma MLP. \n\nFrom the table we can see that the Gamma MLP performs significantly better than the FIR MLP or standard TDNN models for this problem. Synaptic gains and gamma filters in all layers improve the performance of the Gamma MLP, while the inclusion of synaptic gains presented difficulty for the FIR MLP. 
Results for the IIR MLP are not shown; we have been unable to obtain significant convergence¹⁰. We investigated values of k not listed in the table for the k-NN LA model, but it performed poorly in all cases. \n\n5 CONCLUSIONS \n\nWe have defined a Gamma MLP as an MLP with gamma filters and gain terms in every synapse. We have shown that the model performs significantly better on our speech phoneme recognition problem when compared to TDNN, Back-Tsoi FIR and IIR MLP architectures, and Casdagli's local approximation model. The percentage of times a phoneme is present but not recognized for the Gamma MLP was 44% lower than the closest competitor, the Back-Tsoi FIR MLP model. \n\nThe inclusion of gamma filters in all layers and the inclusion of synaptic gains improved the performance of the Gamma MLP. The improvement due to the inclusion of synaptic gains may be considered non-intuitive to many: we are adding degrees of freedom, but no additional representational power. The error surface will be different in each case, and the results indicate that the surface for the synaptic gains case is more amenable to gradient descent. One view of the situation is given by Back & Tsoi with their FIR and IIR MLP networks (Back and Tsoi, 1991b): from a signal processing perspective the response of each synapse is determined by pole-zero positions. With no synaptic gains, the weights determine both the static gain and the pole-zero positions of the synapses. In an experimental analysis performed by Back & Tsoi it was observed that some synapses devoted themselves to modeling the dynamics of the system in question, while others \"sacrificed\" themselves to provide the necessary static gains¹¹ to construct the required nonlinearity. \n\n¹⁰ Theoretically, the IIR MLP model is the most powerful model used here. Though it is prone to stability problems, the stability of the model can be, and was, controlled in the simulations performed here (basically, by reflecting poles that move outside the unit circle back inside). The most obvious hypothesis for the difficulty in training the model is related to the error surface and the nature of gradient descent. We expect the error surface to be considerably more complex for the IIR MLP model, and for gradient descent update to experience increased difficulty optimizing the function. \n\nAPPENDIX A: GAMMA MLP UPDATE EQUATIONS \n\n$$\\Delta w_{kij}^l(t) = -\\eta \\frac{\\partial J(t)}{\\partial w_{kij}^l(t)} = \\eta \\delta_k^l(t) c_{ki}^l(t) z_{kij}^l(t) \\qquad (2)$$ \n$$\\Delta c_{ki}^l(t) = -\\eta \\frac{\\partial J(t)}{\\partial c_{ki}^l(t)} = \\eta \\delta_k^l(t) \\sum_{j=0}^{K} w_{kij}^l(t) z_{kij}^l(t) \\qquad (3)$$ \n$$\\Delta \\mu_{ki}^l(t) = -\\eta \\frac{\\partial J(t)}{\\partial \\mu_{ki}^l(t)} = \\eta \\delta_k^l(t) c_{ki}^l(t) \\sum_{j=0}^{K} w_{kij}^l(t) \\alpha_{kij}^l(t) \\qquad (4)$$ \n$$\\alpha_{kij}^l(t) = 0, \\quad j = 0$$ \n$$\\alpha_{kij}^l(t) = (1 - \\mu_{ki}^l(t)) \\alpha_{kij}^l(t-1) + \\mu_{ki}^l(t) \\alpha_{ki(j-1)}^l(t-1) + z_{ki(j-1)}^l(t-1) - z_{kij}^l(t-1), \\quad 1 \\le j \\le K \\qquad (5)$$ \n$$\\delta_k^l(t) = -\\frac{\\partial J(t)}{\\partial x_k^l(t)} = f'(x_k^l(t)) \\sum_{p=1}^{N_{l+1}} \\delta_p^{l+1}(t) c_{pk}^{l+1}(t) \\sum_{j=0}^{K} w_{pkj}^{l+1}(t) \\beta_{pkj}^{l+1}(t), \\quad 1 \\le l < L \\qquad (6)$$ \n$$\\beta_{pkj}^{l+1}(t) = 1, \\quad j = 0$$ \n$$\\beta_{pkj}^{l+1}(t) = (1 - \\mu_{pk}^{l+1}(t)) \\beta_{pkj}^{l+1}(t-1) + \\mu_{pk}^{l+1}(t) \\beta_{pk(j-1)}^{l+1}(t-1), \\quad 1 \\le j \\le K \\qquad (7)$$ \n\nFor $l = L$, $\\delta_k^L(t) = -\\partial J(t) / \\partial x_k^L(t)$ is evaluated directly from the cost $J(t)$. \n\nAcknowledgments \n\nThis work has been partially supported by the Australian Research Council (ACT and ADB) and the Australian Telecommunications and Electronics Research Board (SL). \n\nReferences \n\nBack, A. and Tsoi, A. (1991a). FIR and IIR synapses, a new neural network architecture for time series modelling. Neural Computation, 3(3):337-350. \n\nBack, A. D. and Tsoi, A. C. (1991b). Analysis of hidden layer weights in a dynamic locally recurrent network. In Simula, O., editor, Proceedings International Conference on Artificial Neural Networks, ICANN-91, volume 1, pages 967-976, Espoo, Finland. \n\nCasdagli, M. (1991). 
Chaos and deterministic versus stochastic non-linear modelling. J. R. Statistical Society B, 54(2):302-328. \n\nDarken, C. and Moody, J. (1991). Note on learning rate schedules for stochastic optimization. In Neural Information Processing Systems 3, pages 832-838. Morgan Kaufmann. \n\nde Vries, B. and Principe, J. (1992). The gamma model - a new neural network for temporal processing. Neural Networks, 5(4):565-576. \n\nLang, K. J., Waibel, A. H., and Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:23-43. \n\nShynk, J. (1989). Adaptive IIR filtering. IEEE ASSP Magazine, pages 4-21. \n\n¹¹ The neurons were observed to have gone into saturation, providing a constant output. \n\n", "award": [], "sourceid": 1021, "authors": [{"given_name": "Steve", "family_name": "Lawrence", "institution": null}, {"given_name": "Ah", "family_name": "Tsoi", "institution": null}, {"given_name": "Andrew", "family_name": "Back", "institution": null}]}