{"title": "Reconstruction of Sequential Data with Probabilistic Models and Continuity Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 414, "page_last": 420, "abstract": null, "full_text": "Reconstruction of Sequential Data with Probabilistic Models and Continuity Constraints \n\nMiguel A. Carreira-Perpi\u00f1\u00e1n \n\nDept. of Computer Science, University of Sheffield, UK \n\nmiguel@dcs.shef.ac.uk \n\nAbstract \n\nWe consider the problem of reconstructing a temporal discrete sequence \nof multidimensional real vectors when part of the data is missing, under \nthe assumption that the sequence was generated by a continuous process. \nA particular case of this problem is multivariate regression, which \nis very difficult when the underlying mapping is one-to-many. We propose \nan algorithm based on a joint probability model of the variables \nof interest, implemented using a nonlinear latent variable model. Each \npoint in the sequence is potentially reconstructed as any of the modes \nof the conditional distribution of the missing variables given the present \nvariables (computed using an exhaustive mode search in a Gaussian mixture). \nMode selection is determined by a dynamic programming search \nthat minimises a geometric measure of the reconstructed sequence, derived \nfrom continuity constraints. We illustrate the algorithm with a toy \nexample and apply it to a real-world inverse problem, the acoustic-to-articulatory \nmapping. The results show that the algorithm outperforms \nconditional mean imputation and multilayer perceptrons. \n\n1 Definition of the problem \n\nConsider a mobile point following a continuous trajectory in a subset of $\\mathbb{R}^D$. Imagine \nthat it is possible to obtain a finite number of measurements of the position of the point. 
\nSuppose that these measurements are corrupted by noise and that sometimes some or all of \nthe variables are missing. The problem considered here is to reconstruct the sequence from \nthe part of it which is observed. In the particular case where the present variables and the \nmissing ones are the same for every point, the problem is one of multivariate regression. \nIf the pattern of missing variables is more general, the problem is one of missing data \nreconstruction. \n\nConsider the problem of regression. If the present variables uniquely identify the missing \nones at every point of the data set, the problem can be adequately solved by a universal \nfunction approximator, such as a multilayer perceptron. In a probabilistic framework, the \nconditional mean of the missing variables given the present ones will minimise the average \nsquared reconstruction error [3]. However, if the underlying mapping is one-to-many, there \nwill be regions in the space for which the present variables do not uniquely identify the \nmissing ones. In this case, the conditional mean mapping will fail, since it will give a \ncompromise value, an average of the correct ones. Inverse problems, where the inverse \nof a mapping is one-to-many, are of this type. They include the acoustic-to-articulatory \nmapping in speech [15], where different vocal tract shapes may produce the same acoustic \nsignal, or the robot arm problem [2], where different configurations of the joint angles may \nplace the hand in the same position. \n\nIn some situations, data reconstruction is a means to some other objective, such as classification \nor inference. Here, we deal solely with data reconstruction of temporally continuous \nsequences according to the squared error. Our algorithm does not apply to data sets that \neither lack continuity (e.g. 
discrete variables) or have lost it (e.g. due to undersampling or \nshuffling). \n\nWe follow a statistical learning approach: we attempt to reconstruct the sequence by learning \nthe mapping from a training set drawn from the probability distribution of the data, \nrather than by solving a physical model of the system. Our algorithm can be described \nbriefly as follows. First, a joint density model of the data is learned in an unsupervised way \nfrom a sample of the data\u00b9. Then, pointwise reconstruction is achieved by computing all \nthe modes of the conditional distribution of the missing variables given the present ones \nat the current point. In principle, any of these modes is potentially a plausible reconstruction. \nWhen reconstructing a sequence, we repeat this mode search for every point in the \nsequence, and then find the combination of modes that minimises a geometric sequence \nmeasure, using dynamic programming. The sequence measure is derived from local continuity \nconstraints, e.g. the curve length. \n\nThe algorithm is detailed in \u00a72 to \u00a74. We illustrate it with a 2D toy problem in \u00a75 and apply \nit to an acoustic-to-articulatory-like problem in \u00a76. \u00a77 discusses the results and compares \nthe approach with previous work. \n\nOur notation is as follows. We represent the observed variables in vector form as \n$t = (t_1, \\dots, t_D) \\in \\mathbb{R}^D$. A data set (possibly a temporal sequence) is represented as $\\{t_n\\}_{n=1}^N$. \nGroups of variables are represented by sets of indices $\\mathcal{I}, \\mathcal{J} \\subseteq \\{1, \\dots, D\\}$, so that if \n$\\mathcal{I} = \\{1, 7, 3\\}$, then $t_{\\mathcal{I}} = (t_1, t_7, t_3)$. \n\n2 Joint generative modelling using latent variables \n\nOur starting point is a joint probability model of the observed variables $p(t)$. 
From it, we \ncan compute conditional distributions of the form $p(t_{\\mathcal{J}} \\mid t_{\\mathcal{I}})$ and, by picking representative \npoints, derive a (multivalued) mapping $t_{\\mathcal{I}} \\to t_{\\mathcal{J}}$. Thus, in contrast to other approaches, \ne.g. [6], we adopt multiple pointwise imputation. In \u00a74 we show how to obtain a single \nreconstructed sequence of points. \n\nAlthough density estimation requires more parameters than mapping approximation, it has \na fundamental advantage [6]: the density model represents the relation between any variables, \nwhich allows any missing/present variable combination to be chosen. A mapping approximator \ntreats some variables asymmetrically as inputs (present) and the rest as outputs \n(missing) and cannot easily deal with other relations. \n\nThe existence of functional relationships (even one-to-many) between the observed variables \nindicates that the data must span a low-dimensional manifold in the data space. This \nsuggests the use of latent variable models for modelling the joint density. However, it is \npossible to use other kinds of density models. \n\n\u00b9 In our examples we only use complete training data (i.e., with no missing data), but it is perfectly \npossible to estimate a probability model with incomplete training data by using an EM algorithm [6]. \n\nIn latent variable modelling the assumption is that the observed high-dimensional data $t$ \nis generated from an underlying low-dimensional process defined by a small number $L$ \nof latent variables $x = (x_1, \\dots, x_L)$ [1]. The latent variables are mapped by a fixed \ntransformation into a $D$-dimensional data space and noise is added there. A particular \nmodel is specified by three parametric elements: a prior distribution in latent space $p(x)$, a \nsmooth mapping $f$ from latent space to data space and a noise model in data space $p(t \\mid x)$. 
\nMarginalising the joint probability density function $p(t, x)$ over the latent space gives the \ndistribution in data space, $p(t)$. Given an observed sample in data space $\\{t_n\\}_{n=1}^N$, a parameter \nestimate can be found by maximising the log-likelihood, typically using an EM \nalgorithm. We consider the following latent variable models, both of which allow easy \ncomputation of conditional distributions of the form $p(t_{\\mathcal{J}} \\mid t_{\\mathcal{I}})$: \n\nFactor analysis [1], in which the mapping is linear, the prior in latent space is unit Gaussian \nand the noise model is diagonal Gaussian. The density in data space is then \nGaussian with a constrained covariance matrix. We use it as a baseline for comparison \nwith more sophisticated models. \n\nThe generative topographic mapping (GTM) [4] is a nonlinear latent variable model, \nwhere the mapping is a generalised linear model, the prior in latent space is discrete \nuniform and the noise model is isotropic Gaussian. The density in data space \nis then a constrained mixture of isotropic Gaussians. \n\nIn latent variable models that sample the latent space prior distribution (like GTM), the \nmixture centroids in data space (associated with the latent space samples) are not trainable \nparameters. We can then improve the density model at a higher computational cost with no \ngeneralisation loss by increasing the number of mixture components. Note that the number \nof components required will depend exponentially on the intrinsic dimensionality of the \ndata (ideally coincident with that of the latent space, $L$) and not on the observed one, $D$. \n\n3 Exhaustive mode finding \n\nGiven a conditional distribution $p(t_{\\mathcal{J}} \\mid t_{\\mathcal{I}})$, we consider all its modes as plausible predictions \nfor $t_{\\mathcal{J}}$. This requires an exhaustive mode search in the space of $t_{\\mathcal{J}}$. 
For Gaussian \nmixtures, we do this by using a maximisation algorithm starting from each centroid\u00b2, such \nas a fixed-point iteration or gradient ascent combined with quadratic optimisation [5]. In \nthe particular case where all variables are missing, rather than performing a mode search, \nwe return as predictions all the component centroids. It is also possible to obtain error \nbars at each mode by locally approximating the density function by a normal distribution. \nHowever, if the dimensionality of $t_{\\mathcal{J}}$ is high, the error bars become very wide due to the \ncurse of dimensionality. \n\nAn advantage of multiple pointwise imputation is the easy incorporation of extra constraints \non the missing variables. Such constraints might include keeping only those modes that lie \nin an interval dependent on the present variables [8] or discarding low-probability (spurious) \nmodes, which speeds up the reconstruction algorithm and may make it more robust. \n\nA faster way to generate representative points of $p(t_{\\mathcal{J}} \\mid t_{\\mathcal{I}})$ is simply to draw a fixed number \nof samples from it, which may also give robustness to poor density models. However, in \npractice this resulted in a higher reconstruction error. \n\n\u00b2 Actually, given a value of $t_{\\mathcal{I}}$, most centroids have negligible posterior probability and can be \nremoved from the mixture with practically no loss of accuracy. Thus, a large number of mixture \ncomponents may be used without excessively degrading the computational efficiency. \n\n4 Continuity constraints and dynamic programming (DP) search \n\nApplication of the exhaustive mode search to the conditional distribution at every point \nof the sequence produces one or more candidate reconstructions per point. To select a \nsingle reconstructed sequence, we define a local continuity constraint: consecutive points \nin time should also lie nearby in data space. That is, if $\\delta$ is some suitable distance in $\\mathbb{R}^D$, \n$\\delta(t_n, t_{n+1})$ should be small. Then we define a global geometric measure $\\Delta$ for a sequence \n$\\{t_n\\}_{n=1}^N$ as $\\Delta(\\{t_n\\}_{n=1}^N) \\stackrel{\\text{def}}{=} \\sum_{n=1}^{N-1} \\delta(t_n, t_{n+1})$. We take $\\delta$ as the Euclidean distance, so \n$\\Delta$ becomes simply the length of the sequence (considered as a polygonal line). Finding the \nsequence of modes with minimal $\\Delta$ is efficiently achieved by dynamic programming. 
\n\n[Figure: reconstructed trajectory in the ($t_1$, $t_2$) plane; legend: trajectory, factor an., mean, dpmode.] \n\nAverage squared reconstruction error \n\nMissing pattern | Factor analysis | MLP (a) | GTM mean | GTM dpmode | GTM cmode \n$t_2$ | 3.8902 | 0.2046 | 0.2044 | 0.2168 | 0.2168 \n$t_1$ | 4.3226 | 2.5126 | 2.4224 | 0.0522 | 0.0522 \n$t_1$ or $t_2$ | 4.2020 | - | 1.2963 | 0.1305 | 0.1305 \n10% | 1.0983 | - | 0.3970 | 0.0253 | 0.0251 \n50% | 6.2914 | - | 4.6530 | 0.1176 | 0.0771 \n90% | 21.4942 | - | 20.7877 | 2.2261 | 0.0643 \n\n(a) The MLP cannot be applied to varying patterns of missing data. \n\nTable 1: Trajectory reconstruction for a 2D problem. The table gives the average squared \nreconstruction error when $t_2$ is missing (row 1), $t_1$ is missing (row 2), exactly one variable \nper point is missing at random (row 3) or a percentage of the values are missing at random \n(rows 4-6). The graph shows the reconstructed trajectory when $t_1$ is missing: factor analysis \n(straight, dotted line), mean (thick, dashed), dpmode (superimposed on the trajectory). \n\n5 Results with a toy problem \n\nTo illustrate the algorithm, we generated a 2D data set from the curve \n$(t_1, t_2) = (x, x + 3 \\sin(x))$ for $x \\in [-2\\pi, 2\\pi]$, with normal isotropic noise (standard deviation 0.2) added. \nThus, the mapping $t_1 \\to t_2$ is one-to-one but the inverse one, $t_2 \\to t_1$, is multivalued. 
\nOne-dimensional factor analysis (6 parameters) and GTM models (21 parameters) were \nestimated from a 1000-point sample, as well as two 48-hidden-unit multilayer perceptrons \n(98 parameters), one for each mapping. For GTM we tried several strategies to select points \nfrom the conditional distribution: mean (the conditional mean), dpmode (the mode selected \nby dynamic programming) and cmode (the closest mode to the actual value of the missing \nvariable). The cmode, unknown in practice, is used here to compute a lower bound on \nthe error achievable by any mode-based strategy. Other strategies, such as picking the global \nmode, a random mode or using a local (greedy) search instead of dynamic programming, \ngave worse results than the dpmode. \n\nTable 1 shows the results for reconstructing a 100-point trajectory. The nonlinear nature of \nthe problem causes factor analysis to break down in all cases. For the one-to-one mapping \ncase ($t_2$ missing) all the other methods perform well and recover the original trajectory, \nwith mean attaining the lowest error, as predicted by the theory\u00b3. For the one-to-many case \n($t_1$ missing, see the graph accompanying Table 1), both the MLP and the mean are unable to track more than one branch \nof the mapping, but the dpmode still recovers the original mapping. For random missing \npatterns\u2074, the dpmode is able to cope well with high amounts of missing data. \n\n\u00b3 A combined strategy could retain the optimality of the mean in the one-to-one case and the \nadvantage of the modes in the one-to-many case, by choosing the conditional mean (rather than the \nmode) when the conditional distribution is unimodal, and all the modes otherwise. \n\nThe consistently low error of the cmode shows that the modes contain important information \nabout the possible options to predict the missing values. The performance of the \ndpmode, close to that of the cmode even for large amounts of missing data, shows that \napplication of the continuity constraint allows that information to be recovered. 
\n\n6 Results with real speech data \n\nWe report a preliminary experiment using acoustic and electropalatographic (EPG) data\u2075 \nfor the utterance \"Put your hat on the hatrack and your coat in the cupboard\" (speaker FG) \nfrom the ACCOR database [10]. 12th-order perceptual linear prediction coefficients [7] \nplus the log-energy were computed at 200 Hz from its acoustic waveform. The EPG data \nconsisted of 62-bit frames sampled at 200 Hz, which we consider as 62-dimensional vectors \nof real numbers. No further preprocessing of the data was carried out. Thus, the resulting \nsequence consisted of over 600 75-dimensional real vectors. We constructed a training set \nby picking, in random order, 80% of these vectors. The whole utterance was used for the \nreconstruction test. \n\nMissing pattern | Factor analysis | GTM mean | GTM dpmode | GTM cmode \nPLP | 0.9165 | 0.6217 | 0.6250 | 0.4587 \nEPG | 3.7177 | 2.3729 | 2.0613 | 1.0538 \n10% | 0.2046 | 0.0947 | 0.0903 | 0.0841 \n50% | 1.1285 | 0.7540 | 0.6527 | 0.6023 \nblocks | 0.1950 | 0.1669 | 0.1005 | 0.0925 \n\nTable 2: Average squared reconstruction error for an utterance. The last row corresponds \nto a missing pattern of square blocks totalling 10% of the utterance. 
\n\nWe trained two density models: a 9-dimensional factor analysis (825 parameters) and a \ntwo-dimensional\u2076 GTM (3676 parameters) with a $20 \\times 20$ grid (resulting in a mixture of \n400 isotropic Gaussians in the 75-dimensional data space). Table 2 confirms again that the \nlinear method (factor analysis) fares worst (despite its use of a latent space of dimension \n$L = 9$). The dpmode almost always attains a lower error than the conditional mean, with up \nto a 40% improvement (the improvement grows with the amount of missing data). When a shuffled \nversion of the utterance (thus having lost its continuity) was reconstructed, the error of the \ndpmode was consistently higher than that of the mean, indicating that the application of the \ncontinuity constraint was responsible for the error decrease. \n\n\u2074 Note that the nature of the missing pattern (missing at random, missing completely at random, \netc. [9]) does not matter for reconstruction, although it does for estimation. \n\n\u2075 An EPG datum is the (binary) contact pattern between the tongue and the palate at selected \nlocations in the latter. Note that it is an incomplete articulatory representation of speech. \n\n\u2076 A latent space of 2 dimensions is clearly too low for this data, but the computational complexity \nof GTM prevents the use of a higher one. Still, its nonlinear character compensates partly for this. 
\n\n7 Discussion \n\nUsing a joint probability model allows flexible construction of predictive distributions for \nthe missing data: varying patterns of missing data and multiple pointwise imputations are \npossible, as opposed to standard function approximators. We have shown that the modes of \nthe conditional distribution of the missing variables given the present ones are potentially \nplausible reconstructions of the missing values, and that the application of local continuity \nconstraints (when they hold) can help to recover the actually plausible ones. \n\nPrevious work. The key aspects of our approach are the use of a joint density model \n(learnt in an unsupervised way), the exhaustive mode search, the definition of a geometric \ntrajectory measure derived from continuity constraints and its implementation by dynamic \nprogramming. Several of these ideas have been applied earlier in the literature, which we \nreview briefly. \n\nThe use of the joint density model for prediction is the basis of the statistical technique \nof multiple imputation [9]. Here, several versions of the complete data set are generated \nfrom the appropriate conditional distributions, analysed by standard complete-data methods \nand the results combined to produce inferences that incorporate missing-data uncertainty. \nGhahramani and Jordan [6] also proposed the use of the joint density model to generate a \nsingle estimate of the missing variables and applied it to a classification problem. \n\nConditional distributions have been approximated by MLPs rather than by density estimation \n[16], but this lacks flexibility with respect to varying patterns of missing data and requires an extra \nmodel of the input variables distribution (unless assumed uniform). \n\nRohwer and van der Rest [12] introduce a cost function with a description length interpretation \nwhose minimum is approximated by the densest mode of a distribution. A neural \nnetwork trained with this cost function can learn one branch of a multivariate mapping, but \nis unable to select other branches which may be correct at a given time. \n\nContinuity constraints implemented via dynamic programming have been used for the \nacoustic-to-articulatory mapping problem [15]. 
Reasonable results (better than using an \nMLP to approximate the mapping) can be obtained using a large codebook of acoustic and \narticulatory vectors. Rahim et al. [11] achieve similar quality with much lower computational \nrequirements using an assembly of MLPs, each one trained in a different area of the \nacoustic-articulatory space, to locally approximate the mapping. However, clustering the \nspace is heuristic (with no guarantee that the mapping is one-to-one in each region) and \ntraining the assembly is difficult. It also lacks flexibility with respect to varying missingness patterns. \n\nA number of trajectory measures have been used in the robot arm problem literature [2] and \nminimised by dynamic programming, such as the energy, torque, acceleration, jerk, etc. \n\nTemporal modelling. It is important to remark that our approach does not attempt to \nmodel the temporal evolution of the system. The joint probability model is estimated statically. \nThe temporal aspect of the data appears indirectly and a posteriori through the application \nof the continuity constraints to select a trajectory\u2077. In this respect, our approach \ndiffers from that of dynamical systems or from models based on Markovian assumptions, \nsuch as hidden Markov models or other trajectory models [13, 14]. However, the fact that \nthe duration or speed of the trajectory plays no role in the algorithm may make it invariant \nto time warping (e.g. robust to fast/slow speech styles). \n\nChoice of density model. The fact that the modes are a key aspect of our approach makes \nit sensitive to the density model. With finite mixtures, spurious modes can appear as ripple \nsuperimposed on the density function in regions where the mixture components are sparsely \ndistributed and have little interaction. Such modes can lead the DP search to a wrong \ntrajectory. 
Possible solutions are to improve the density model (perhaps by increasing the \nnumber of components, see \u00a72, or by regularisation), to smooth the conditional distribution \nor to look for bumps (regions of high probability mass) instead of modes. \n\n\u2077 However, the method may be derived by assuming a distribution over the whole sequence with a \nnormal, Markovian dependence between adjacent frames. \n\nComputational cost. The DP search has complexity $O(N M^2)$, where $M$ is an average \nof the number of modes per sequence point and $N$ the number of points in the sequence. In \nour experiments $M$ is usually small and the DP search is fast even for long sequences. The \nbottleneck of the reconstruction part of the algorithm is obtaining the modes of the conditional \ndistribution for every point in the sequence when there are many missing variables. \n\nFurther work. We envisage more thorough experiments using data from the Wisconsin \nX-ray microbeam database and comparing with recurrent MLPs or an MLP committee, \nwhich may be more suitable for multivalued mappings. Extensions of our algorithm include \ndifferent geometric measures (e.g. curvature-based rather than length-based), different \nstrategies for multiple pointwise imputation (e.g. bump searching) or multidimensional \nconstraints (e.g. temporal and spatial). Other practical applications include audiovisual \nmappings for speech, hippocampal place cell reconstruction and wind vector retrieval from \nscatterometer data. \n\nAcknowledgments \n\nWe thank Steve Renals for useful conversations and for comments about this paper. \n\nReferences \n\n[1] D. J. Bartholomew. Latent Variable Models and Factor Analysis. Charles Griffin & Company \nLtd., London, 1987. \n\n[2] N. Bernstein. The Coordination and Regulation of Movements. Pergamon, Oxford, 1967. \n\n[3] C. M. Bishop. 
Neural Networks for Pattern Recognition. Oxford University Press, 1995. \n\n[4] C. M. Bishop, M. Svens\u00e9n, and C. K. I. Williams. GTM: The generative topographic mapping. \nNeural Computation, 10(1):215-234, Jan. 1998. \n\n[5] M. A. Carreira-Perpi\u00f1\u00e1n. Mode-finding in Gaussian mixtures. Technical Report CS-99-03, \nDept. of Computer Science, University of Sheffield, UK, Mar. 1999. Available online at \nhttp://www.dcs.shef.ac.uk/~miguel/papers/cs-99-03.html. \n\n[6] Z. Ghahramani and M. I. Jordan. Supervised learning from incomplete data via an EM approach. \nIn NIPS 6, pages 120-127, 1994. \n\n[7] H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. J. Acoustic Soc. Amer., \n87(4):1738-1752, Apr. 1990. \n\n[8] L. Josifovski, M. Cooke, P. Green, and A. Vizinho. State based imputation of missing data for \nrobust speech recognition and speech enhancement. In Proc. Eurospeech 99, pages 2837-2840, \n1999. \n\n[9] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, \nNew York, London, Sydney, 1987. \n\n[10] A. Marchal and W. J. Hardcastle. ACCOR: Instrumentation and database for the cross-language \nstudy of coarticulation. Language and Speech, 36(2, 3):137-153, 1993. \n\n[11] M. G. Rahim, C. C. Goodyear, W. B. Kleijn, J. Schroeter, and M. M. Sondhi. On the use of \nneural networks in articulatory speech synthesis. J. Acoustic Soc. Amer., 93(2):1109-1121, Feb. \n1993. \n\n[12] R. Rohwer and J. C. van der Rest. Minimum description length, regularization, and multimodal \ndata. Neural Computation, 8(3):595-609, Apr. 1996. \n\n[13] S. Roweis. Constrained hidden Markov models. In NIPS 12 (this volume), 2000. \n\n[14] L. K. Saul and M. G. Rahim. Markov processes on curves for automatic speech recognition. In \nNIPS 11, pages 751-757, 1999. \n\n[15] J. Schroeter and M. M. Sondhi. 
Techniques for estimating vocal-tract shapes from the speech \nsignal. IEEE Trans. Speech and Audio Process., 2(1):133-150, Jan. 1994. \n\n[16] V. Tresp, R. Neuneier, and S. Ahmad. Efficient methods for dealing with missing data in \nsupervised learning. In NIPS 7, pages 689-696, 1995. \n", "award": [], "sourceid": 1660, "authors": [{"given_name": "Miguel", "family_name": "Carreira-Perpi\u00f1\u00e1n", "institution": null}]}