{"title": "Speech Modelling Using Subspace and EM Techniques", "book": "Advances in Neural Information Processing Systems", "page_first": 796, "page_last": 802, "abstract": null, "full_text": "Speech Modelling Using Subspace and EM Techniques\n\nGavin Smith\nCambridge University Engineering Department\nCambridge CB2 1PZ, England\ngas1 oo3@eng.cam.ac.uk\n\nJoao FG de Freitas\nComputer Science Division\n487 Soda Hall, UC Berkeley\nCA 94720-1776, USA\njfgf@cs.berkeley.edu 1\n\nTony Robinson\nCambridge University Engineering Department\nCambridge CB2 1PZ, England\najr@eng.cam.ac.uk\n\nMahesan Niranjan\nComputer Science, Sheffield University\nSheffield S1 4DP, England\nm.niranjan@dcs.shef.ac.uk\n\nAbstract\n\nThe speech waveform can be modelled as a piecewise-stationary linear\nstochastic state space system, and its parameters can be estimated using\nan expectation-maximisation (EM) algorithm. One problem is the initialisation\nof the EM algorithm. Standard initialisation schemes can lead\nto poor formant trajectories, yet these trajectories are important\nfor vowel intelligibility. The aim of this paper is to investigate the\nsuitability of subspace identification methods to initialise EM.\nThe paper compares the subspace state space system identification\n(4SID) method with the EM algorithm. The 4SID and EM methods are\nsimilar in that they both estimate a state sequence (using Kalman filters\nand Kalman smoothers respectively), and then estimate parameters\n(using least-squares and maximum likelihood respectively). The similarity\nof 4SID and EM motivates the use of 4SID to initialise EM. Also,\n4SID is non-iterative and requires no initialisation, whereas EM is iterative\nand requires initialisation. However, 4SID is sub-optimal compared\nto EM in a probabilistic sense.\n
During experiments on real speech, 4SID\nmethods compare favourably with conventional initialisation techniques.\nThey produce smoother formant trajectories, have greater frequency resolution,\nand produce higher likelihoods.\n\n1 Work done while in Cambridge Engineering Dept., UK.\n\n1 Introduction\n\nThis paper models speech using a stochastic state space model, where model parameters\nare estimated using the expectation-maximisation (EM) technique. One problem is the\ninitialisation of the EM algorithm. Standard initialisation schemes can lead to poor formant\ntrajectories. These trajectories are, however, important for vowel intelligibility. This paper\ninvestigates the suitability of subspace state space system identification (4SID) techniques\n[10,11], which are popular in system identification, for EM initialisation.\n\nSpeech is split into fixed-length, overlapping frames. Overlap encourages temporally\nsmoother parameter transitions between frames. Due to the slow non-stationary behaviour\nof speech, each frame of speech is assumed quasi-stationary and represented as a linear\ntime-invariant stochastic state space (SS) model.\n\n$$x_{t+1} = A x_t + w_t \\qquad (1)$$\n$$y_t = C x_t + v_t \\qquad (2)$$\n\nThe system order is $p$. $x_t \\in \\mathbb{R}^{p \\times 1}$ is the state vector. $A \\in \\mathbb{R}^{p \\times p}$ and $C \\in \\mathbb{R}^{1 \\times p}$ are\nsystem parameters. The output $y_t \\in \\mathbb{R}$ is the speech signal at the microphone. Process\nand observation noises are modelled as white zero-mean Gaussian stationary noises\n$w_t \\in \\mathbb{R}^{p \\times 1} \\sim \\mathcal{N}(0, Q)$ and $v_t \\in \\mathbb{R} \\sim \\mathcal{N}(0, R)$ respectively. The problem definition is to\nestimate parameters $\\theta = (A, C, Q, R)$ from speech $y_t$ only.\n\nThe structure of the paper is as follows. The theory section describes EM and 4SID applied\nto the parameter estimation of the above SS model.\n
The similarity of 4SID and EM motivates\nthe use of 4SID to initialise EM. Experiments on real speech then compare 4SID with\nmore conventional initialisation methods. The discussion then compares 4SID with EM.\n\n2 Theory\n\n2.1 The Expectation-Maximisation (EM) Technique\n\nGiven a sequence of $N$ observations $y_{1:N}$ of a signal such as speech, the maximum likelihood\nestimate for the parameters is $\\hat{\\theta}_{ML} = \\arg\\max_\\theta p(y_{1:N} \\mid \\theta)$. EM breaks the\nmaximisation of this potentially difficult likelihood function down into an iterative maximisation\nof a simpler likelihood function, generating a new estimate $\\theta_k$ each iteration.\nRewriting $p(y_{1:N} \\mid \\theta)$ in terms of a hidden state sequence $x_{1:N}$, and taking expectations\nover $p(x_{1:N} \\mid y_{1:N}, \\theta_k)$, gives\n\n$$\\log p(y_{1:N} \\mid \\theta) = \\log p(x_{1:N}, y_{1:N} \\mid \\theta) - \\log p(x_{1:N} \\mid y_{1:N}, \\theta) \\qquad (3)$$\n$$\\log p(y_{1:N} \\mid \\theta) = E_k[\\log p(x_{1:N}, y_{1:N} \\mid \\theta)] - E_k[\\log p(x_{1:N} \\mid y_{1:N}, \\theta)] \\qquad (4)$$\n\nIterative maximisation of the first expectation in equation 4 guarantees an increase in\n$\\log p(y_{1:N} \\mid \\theta)$.\n\n$$\\theta_{k+1} = \\arg\\max_\\theta E_k[\\log p(x_{1:N}, y_{1:N} \\mid \\theta)] \\qquad (5)$$\n\nThis converges to a local or global maximum depending on the initial parameter estimate\n$\\theta_0$. Refer to [8] for more details. EM can thus be applied to the stochastic state space\nmodel of equations 1 and 2 to determine optimal parameters $\\theta$. An explanation is given\nin [3]. The EM algorithm applied to the SS system consists of two stages per iteration.\nFirstly, given current parameter estimates, states are estimated using a Kalman smoother.\nSecondly, given these states, new parameters are estimated by maximising the expected\nlog likelihood function. We employ the Rauch-Tung-Striebel formulation of the Kalman\nsmoother [2].\n\n2.2 The State-Space Model\n\nEquations 1 and 2 can be cast in block matrix form and are termed the state sequence and\nblock output equations respectively [10].\n
Note that the use of blocking and fixed-length\nsignals applies restrictions to the general model in section 1. $i > p$ is the block size.\n\n$$X_{i+1,i+j} = A^i X_{1,j} + \\Delta_i^w W_{1|i} \\qquad (6)$$\n$$Y_{1|i} = \\Gamma_i X_{1,j} + H_i^w W_{1|i} + V_{1|i} \\qquad (7)$$\n\n$X_{i+1,i+j}$ is a state sequence matrix; its columns are the state vectors from time $(i+1)$ to\n$(i+j)$. $X_{1,j}$ is similarly defined. $Y_{1|i}$ is a Hankel matrix of outputs from time 1 to $(i+j-1)$.\n$W$ and $V$ are similarly defined. $\\Delta_i^w$ is a reversed extended controllability-type matrix, $\\Gamma_i$\nis the extended observability matrix and $H_i^w$ is a Toeplitz matrix. These are all defined\nbelow, where $I_{p \\times p}$ is an identity matrix.\n\n$$X_{1,j} \\overset{\\mathrm{def}}{=} \\begin{bmatrix} x_1 & x_2 & x_3 & \\cdots & x_j \\end{bmatrix}, \\qquad \\Delta_i^w \\overset{\\mathrm{def}}{=} \\begin{bmatrix} A^{i-1} & A^{i-2} & \\cdots & I \\end{bmatrix}$$\n\n$$\\Gamma_i \\overset{\\mathrm{def}}{=} \\begin{bmatrix} C \\\\ CA \\\\ \\vdots \\\\ CA^{i-1} \\end{bmatrix}, \\qquad Y_{1|i} \\overset{\\mathrm{def}}{=} \\begin{bmatrix} y_1 & y_2 & \\cdots & y_j \\\\ y_2 & y_3 & \\cdots & y_{j+1} \\\\ \\vdots & \\vdots & & \\vdots \\\\ y_i & y_{i+1} & \\cdots & y_{i+j-1} \\end{bmatrix}$$\n\n$$H_i^w \\overset{\\mathrm{def}}{=} \\begin{bmatrix} 0 & 0 & \\cdots & 0 \\\\ C & 0 & \\cdots & 0 \\\\ \\vdots & \\ddots & \\ddots & \\vdots \\\\ CA^{i-2} & \\cdots & C & 0 \\end{bmatrix}$$\n\nA sequence of outputs can be separated into two block output equations containing past\nand future outputs, denoted with subscripts $p$ and $f$ respectively. With $Y_p \\overset{\\mathrm{def}}{=} Y_{1|i}$, $Y_f \\overset{\\mathrm{def}}{=} Y_{i+1|2i}$\nand similarly for $W$ and $V$, and $X_p \\overset{\\mathrm{def}}{=} X_{1,j}$ and $X_f \\overset{\\mathrm{def}}{=} X_{i+1,i+j}$, past and\nfuture are related by the equations\n\n$$X_f = A^i X_p + \\Delta_i^w W_p \\qquad (8)$$\n$$Y_p = \\Gamma_i X_p + H_i^w W_p + V_p \\qquad (9)$$\n$$Y_f = \\Gamma_i X_f + H_i^w W_f + V_f \\qquad (10)$$\n\n2.3 Subspace State Space System Identification (4SID) Techniques\n\nComments throughout this section on 4SID are largely taken from the work of Van Overschee\nand De Moor [10]. 4SID methods are related to instrumental variable (IV) methods\n[11]. 4SID algorithms are composed of two stages. Stage one involves the low-rank approximation\nand estimation of the extended observability matrix directly from the output\ndata. For example, consider the future output block equation 10.\n
$Y_f$ undergoes an orthogonal\nprojection onto the row space of $Y_p$. This is denoted by $Y_f / Y_p = Y_f Y_p^T (Y_p Y_p^T)^\\dagger Y_p$,\nwhere $\\dagger$ denotes the Moore-Penrose inverse.\n\n$$Y_f / Y_p = \\Gamma_i X_f / Y_p + H_i^w W_f / Y_p + V_f / Y_p = \\Gamma_i X_f / Y_p \\qquad (11)$$\n\nThe noise terms drop out because future noise is uncorrelated with past outputs, so their\nprojections onto the row space of $Y_p$ vanish as $j \\to \\infty$.\n\nStage two involves estimation of system parameters. The singular value decomposition of\n$Y_f / Y_p$ allows the observability and state sequence matrices to be estimated to within a\nsimilarity transform from the column and row spaces respectively. From these two matrices,\nsystem parameters $(A, C, Q, R)$ can be determined by least-squares.\n\nTwo points are worth noting. Firstly, the orthogonal projection from stage one coincides\nwith a minimum error between true data $Y_f$ and its linear prediction from $Y_p$ in\nthe Frobenius norm. Greater flexibility is obtained by weighting the projection with matrices\n$W_1$ and $W_2$ and analysing $W_1 (Y_f / Y_p) W_2$ instead. 4SID and IV methods differ\nwith respect to these weighting matrices. Weighting is similar to prefiltering the observations\nprior to analysis to preferentially weight some frequency range, as is common in\nidentification theory [6]. Secondly, the state estimates from stage two can be considered as\noutputs from a parallel bank of Kalman filters, each one estimating a state from the previous\n$i$ observations, and initialised using zero conditions.\n\nThe particular subspace algorithm and software used in this paper is the sto-pos algorithm\nas detailed in [10]. Although this algorithm introduces a small bias into some of the parameter\nestimates, it guarantees positive realness of the covariance sequence, which in turn\nguarantees the definition of a forward innovations model.\n\n3 Experiments\n\nExperiments are conducted on the phrase \"in arithmetic\", spoken by an adult male. The\nspeech waveform is obtained from the Eurom 0 database [4] and sampled at 16 kHz.\n
The\nspeech waveform is divided into fixed-length, overlapping frames, the mean is subtracted\nand then a Hamming window is applied. Frames are 15 ms in duration, shifted 7.5 ms\neach frame. Speech is modelled as detailed in section 1. All models are order 8. A frame\nis assumed silent, and no analysis is done, when the mean energy per sample is less than an\nempirically defined threshold.\n\nFor the EM algorithm, a modified version of the software in [3] is used. The initial state\nvector and covariance matrix are set to zero and identity respectively, and 50 iterations are\napplied. $Q$ is updated by taking its diagonal only in the M-step for numerical stability (see\n[3]).\n\nIn these experiments, three schemes are compared at initialising parameters for the EM\nalgorithm, that is, the estimation of $\\theta_0$. These schemes are compared in terms of their\nformant trajectories relative to the spectrogram and their likelihoods. The three schemes\nare\n\n\u2022 4SID. This is the subspace method in section 2.3 with block size 16.\n\u2022 ARMA. This estimates $\\theta_0$ using the customised Matlab armax function\u00b9, which\nmodels the speech waveform as an autoregressive moving average (ARMA) process,\nwith order 8 polynomials.\n\u2022 AR(1). This uses a simplistic method, and models the speech waveform as a\nfirst order autoregressive (AR) process with some randomness introduced into the\nestimation. It still initialises all parameters fully\u00b2.\n\n\u00b9 armax minimises a robustified quadratic prediction error criterion using an iterative Gauss-Newton\nalgorithm, initialised using a four-stage least-squares instrumental variables algorithm [7].\n\nResults are shown in Figures 1 and 2. Figure 1 shows the speech waveform, spectrogram\nand formant trajectories for EM with all three initialisation schemes.\n
Here formant frequencies\nare derived from the phase of the positive phase eigenvalues of $A$ after 50 iterations of\nEM. Comparison with the spectrogram shows that, for this order 8 model, 4SID-EM produces\nthe best formant trajectories. Figure 2 shows plots of the mean likelihood against\nEM iteration number for each initialisation scheme. 4SID-EM gives greater likelihoods\nthan ARMA-EM and AR(1)-EM. The difference in formant trajectories between 4SID-EM\nand ARMA-EM, despite both achieving high likelihoods, demonstrates the multi-modality of the\nlikelihood function. For AR(1)-EM, a few frames were not estimated due to numerical\ninstability.\n\n4 Discussion\n\nBoth the 4SID and EM algorithms employ similar methodologies: states are first estimated\nusing a Kalman device, and then these states are used to estimate system parameters according\nto similar criteria. However, in EM, states are estimated using past, present and\nfuture observations with a Kalman smoother; system parameters are then estimated using\nmaximum likelihood (ML). In 4SID, by contrast, states are estimated using the previous $i$\nobservations only, with non-steady state Kalman filters. System parameters are then estimated\nusing least-squares (LS) subject to a positive realness constraint for the covariance\nsequence. Refer also to [5] for a similar comparison.\n\n4SID algorithms are sub-optimal for three reasons. Firstly, states are estimated using only\npartial observation sequences. Secondly, the LS criterion is only an approximation to the\nML criterion. Thirdly, the positive realness constraint introduces bias. A positive realness\nconstraint is necessary due to a finite amount of data and any inadequacy of the SS model.\nBecause of this sub-optimality, 4SID methods are used to initialise rather than replace EM in\nthese experiments.\n\n4SID methods also have some advantages.\n
Firstly, they are linear and non-iterative, and\ndo not suffer from the disadvantages typical of iterative algorithms (including EM) such\nas sensitivity to initial conditions, convergence to local minima, and the definition of convergence\ncriteria. Secondly, they require little prior parameterisation except the definition\nof the system order, which can be determined in situ from observation of the singular values\nof the orthogonal projection. Thirdly, the use of the SVD gives numerical robustness\nto the algorithms. Fourthly, they have higher frequency resolution than prediction error\nminimisation methods such as ARMA and AR [1].\n\n5 Conclusions\n\n4SID methods can be used to initialise EM, giving better formant tracks, higher likelihoods\nand better frequency resolution than more conventional initialisation methods. In the future\nwe hope to compare 4SID methods with EM in a principled probabilistic manner, investigate\nweighting matrices further, and apply these methods to speech enhancement. Further\nwork is reported by Smith et al. in [9], and similar work by Grivel et al. in [5].\n\n\u00b2 Presented in the software in [3], this method is best used when the dimensions of the state space\nand observations are the same.\n\nAcknowledgements\n\nWe are grateful for the use of 4SID software supplied with [10] and the EM software of\n
Zoubin Ghahramani [3]. Gavin Smith is supported by the Schiff Foundation, Cambridge\nUniversity. At the time of writing, Nando de Freitas was supported by two University of\nthe Witwatersrand Merit Scholarships, a Foundation for Research Development Scholarship\n(South Africa), an ORS award and a Trinity College External Research Studentship\n(Cambridge).\n\n[Figure 1 plots; horizontal axis: time / s]\n\nFigure 1: (a) Time waveform and (b) spectrogram for \"in arithmetic\". Formant trajectories\nare estimated using EM and a SS model initialised with three different schemes: (d) 4SID,\n(e) ARMA and (f) AR(1).\n\n[Figure 2 plot; horizontal axis: iteration number]\n\nFigure 2: Likelihood convergence plots for EM and the SS model initialised with\n4SID [- -], ARMA [-] and AR(1) [-.] for the experiments in Figure 1. Plots are the mean\nover all frames that were analysed.\n\n6 References\n\n[1] Arun, K.S. & Kung, S.Y. (1990) Balanced Approximation of Stochastic Systems.\nSIAM Journal on Matrix Analysis and Applications, vol. 11, no. 1, pp. 42-68.\n\n[2] Gelb, A., ed. (1974) Applied Optimal Estimation. Cambridge, MA: MIT Press.\n\n[3] Ghahramani, Z. & Hinton, G. (1996) Parameter Estimation for Linear Dynamical Systems,\nTech. rep. CRG-TR-96-2, Dept. of Computer Science, Univ. of Toronto. Software\nat www.gatsby.ucl.ac.uk/~zoubin/software.html.\n\n[4] Grice, M. & Barry, W. (1989) Multi-lingual Speech Input/Output: Assessment, Methodology\nand Standardization, Tech. rep., University College, London, ESPRIT Project 1541\n(SAM), extension phase final report.\n\n[5] Grivel, E., Gabrea, M. & Najim, M. (1999) Subspace State Space Model Identification\nFor Speech Enhancement, Paper 1622, ICASSP'99.\n\n[6] Ljung, L. (1987) System Identification: Theory for the User. Englewood Cliffs, NJ:\nPrentice-Hall, Inc.\n
\n[7] Ljung, L. (1991) System Identification Toolbox For Use With MATLAB. 24 Prime Park\nWay, Natick, MA, USA: The MathWorks, Inc.\n\n[8] McLachlan, G.J. & Krishnan, T. (1997) The EM Algorithm and Extensions. John Wiley\nand Sons Inc.\n\n[9] Smith, G.A., Robinson, A.J. & Niranjan, M. (2000) A Comparison Between the EM\nand Subspace Algorithms for the Time-Invariant Linear Dynamical System. Tech. rep.\nCUED/F-INFENG/TR.366, Engineering Dept., Cambridge Univ., UK.\n\n[10] Van Overschee, P. & De Moor, B. (1996) Subspace Identification for Linear Systems:\nTheory, Implementation, Applications. Dordrecht, Netherlands: Kluwer Academic\nPublishers.\n\n[11] Viberg, M., Wahlberg, B. & Ottersten, B. (1997) Analysis of State Space System\nIdentification Methods Based on Instrumental Variables and Subspace Fitting. Automatica,\nvol. 33, no. 9, pp. 1603-1616.\n", "award": [], "sourceid": 1787, "authors": [{"given_name": "Gavin", "family_name": "Smith", "institution": null}, {"given_name": "Jo\u00e3o", "family_name": "de Freitas", "institution": null}, {"given_name": "Tony", "family_name": "Robinson", "institution": null}, {"given_name": "Mahesan", "family_name": "Niranjan", "institution": null}]}