{"title": "Family Discovery", "book": "Advances in Neural Information Processing Systems", "page_first": 402, "page_last": 408, "abstract": null, "full_text": "Family Discovery\n\nStephen M. Omohundro\n\nNEC Research Institute\n\n4 Independence Way, Princeton, NJ 08540\n\nom@research.nj.nec.com\n\nAbstract\n\n\"Family discovery\" is the task of learning the dimension and structure\nof a parameterized family of stochastic models. It is especially\nappropriate when the training examples are partitioned into\n\"episodes\" of samples drawn from a single parameter value. We\npresent three family discovery algorithms based on surface learning\nand show that they significantly improve performance over two\nalternatives on a parameterized classification task.\n\n1 INTRODUCTION\n\nHuman listeners improve their ability to recognize speech by identifying the accent\nof the speaker. \"Might\" in an American accent is similar to \"mate\" in an Australian\naccent. By first identifying the accent, discrimination between these two words is\nimproved. We can imagine locating a speaker in a \"space of accents\" parameterized\nby features like pitch, vowel formants, \"r\"-strength, etc. This paper considers the\ntask of learning such parameterized models from data.\n\nMost speech recognition systems train hidden Markov models on labelled speech\ndata. Speaker-dependent systems train on speech from a single speaker.\nSpeaker-independent systems are usually similar, but are trained on speech from many\ndifferent speakers in the hope that they will then recognize them all. This kind of\ntraining ignores speaker identity and is likely to result in confusion between pairs of\nwords which are given the same pronunciation by speakers with different accents. 
\n\nSpeaker-independent recognition systems could more closely mimic the human approach\nby using a learning paradigm we call \"family discovery\". The system would\nbe trained on speech data partitioned into \"episodes\" for each speaker. From this\ndata, the system would construct a parameterized family of models representing different\naccents. The learning algorithms presented in this paper could determine the\ndimension and structure of the parameterization. Given a sample of new speech,\nthe best-fitting accent model would be used for recognition.\n\n[Figure 1: The structure of the three family discovery algorithms. Panels: Affine Family, Affine Patch Family, Coupled Map Family.]\n\nThe same paradigm applies to many other recognition tasks. For example, an OCR\nsystem could learn a parameterized family of font models (Revow et al., 1994).\nGiven new text, the system would identify the document's font parameters and use\nthe corresponding character recognizer.\n\nIn general, we use \"family discovery\" to refer to the task of learning the dimension\nand structure of a parameterized family of stochastic models. The methods we\npresent are equally applicable to parameterized density estimation, classification,\nregression, manifold learning, reinforcement learning, clustering, stochastic grammar\nlearning, and other stochastic settings. Here we only discuss classification and\nprimarily consider training examples which are explicitly partitioned into episodes.\n\nThis approach fits naturally into the neural network literature on \"meta-learning\"\n(Schmidhuber, 1995) and \"network transfer\" (Pratt, 1994). It may also be considered\nas a particular case of the \"bias learning\" framework proposed by Baxter at\nthis conference (Baxter, 1996). 
\n\nThere are two primary alternatives to family discovery: 1) try to fit a single model\nto the data from all episodes or 2) use separate models for each episode. The first\napproach ignores the information that the different training sets came from distinct\nmodels. The second approach eliminates the possibility of inductive generalization\nfrom one set to another.\n\nIn Section 2, we present three algorithms for family discovery based on techniques\nfor \"surface learning\" (Bregler and Omohundro, 1994 and 1995). As shown in Figure\n1, the three alternative representations of the family are: 1) a single affine subspace\nof the parameter space, 2) a set of local affine patches smoothly blended together,\nand 3) a pair of coupled maps from the parameter space into the model space and\nback. In Section 3, we compare these three approaches to the two alternatives on a\nparameterized classification task.\n\n2 THE FIVE ALGORITHMS\n\nLet the space of all classifiers under consideration be parameterized by $\\theta$ and assume\nthat different values of $\\theta$ correspond to different classifiers (i.e., it is identifiable). For\nexample, $\\theta$ might represent the means, covariances, and class priors of a classifier\nwith normal class-conditional densities. $\\theta$-space will typically have a much higher\ndimension than the parameterized family we are seeking. We write $p_\\theta(x)$ for the\ntotal probability that the classifier $\\theta$ assigns to a labelled or unlabelled example $x$.\n\nThe true models are drawn from a $d$-dimensional family parameterized by $\\gamma$. Let the\ntraining set be partitioned into $N$ episodes where episode $i$ consists of $N_i$ training\nexamples $t_{ij}$, $1 \\le j \\le N_i$, drawn from a single underlying model with parameter\n$\\theta_i^*$. A family discovery learning algorithm uses this training data to estimate the\nunderlying parameterized family. 
\n\nFrom a parameterized family, we may define the projection operator $P$ from $\\theta$-space\nto itself which takes each $\\theta$ to the closest member of the family. Using this projection\noperator, we may define a \"family prior\" on $\\theta$-space which dies off exponentially\nwith the squared distance of a model from the family: $m_P(\\theta) \\propto e^{-(\\theta - P(\\theta))^2}$. Each\nof the family discovery algorithms chooses a family so as to maximize the posterior\nprobability of the training data with respect to this prior. If the data is very\nsparse, this MAP approximation to a full Bayesian solution can be supplemented\nby \"Occam\" terms (MacKay, 1995) or by using a Monte Carlo approximation.\n\nThe outer loop of each of the algorithms performs the optimization of the fit of the\ndata by re-estimation in a manner similar to the Expectation Maximization (EM)\napproach (Jordan and Jacobs, 1994). First, the training data in each episode $i$ is\nindependently fit by a model $\\theta_i$. Then the dimension of the family is determined\nas described later and the family projection operator $P$ is chosen to maximize the\nprobability that the episode models $\\theta_i$ came from that family, $\\prod_i m_P(\\theta_i)$. The\nepisode models $\\theta_i$ are then re-estimated including the new prior probability $m_P$.\nThese newly re-estimated models are influenced by the other episodes through $m_P$\nand so exhibit training set \"transfer\". The re-estimation loop is repeated until\nnothing changes.\n\nThe learned family can then be used to classify a set of $N_{\\mathrm{test}}$ unlabelled test examples\n$x_k$, $1 \\le k \\le N_{\\mathrm{test}}$, drawn from a model $\\theta^*_{\\mathrm{test}}$ in the family. First, the parameter\n$\\theta_{\\mathrm{test}}$ is estimated by selecting the member of the family with the highest likelihood\non the test samples. This model is then used to perform the classification. 
A good\napproximation to the best-fit family member is often to take the image of the best-fit\nmodel in the entire $\\theta$-space under the projection operator $P$.\n\nIn the next five sections, we describe the two alternative approaches and the three\nfamily discovery algorithms. They differ only in their choice of family representation\nas encoded in the projection operator $P$.\n\n2.1 The Single Model Approach\n\nThe first alternative approach is to train a single model on all of the training data.\nIt selects $\\theta$ to maximize the total likelihood $L(\\theta) = \\prod_{i=1}^{N} \\prod_{j=1}^{N_i} p_\\theta(t_{ij})$. New test\ndata is classified by this single selected model.\n\n2.2 The Separate Models Approach\n\nThe second alternative approach fits separate models for each training episode. It\nchooses $\\theta_i$ for $1 \\le i \\le N$ to maximize the episode likelihood $L_i(\\theta_i) = \\prod_{j=1}^{N_i} p_{\\theta_i}(t_{ij})$.\nGiven new test data, it determines which of the individual models $\\theta_i$ fits best and\nclassifies the data with it.\n\n2.3 The Affine Algorithm\n\nThe affine model represents the underlying model family as an affine subspace of\nthe model parameter space. The projection operator $P_{\\mathrm{affine}}$ projects a parameter\nvector $\\theta$ orthogonally onto the affine subspace. The subspace is determined by\nselecting the top principal vectors in a principal components analysis of the best-fit\nepisode model parameters. As described in (Bregler & Omohundro, 1994) the\ndimension is chosen by looking for a gap in the principal values.\n\n2.4 The Affine Patch Algorithm\n\nThe second family discovery algorithm is based on the \"surface learning\" procedure\ndescribed in (Bregler and Omohundro, 1994). The family is represented by\na collection of local affine patches which are blended together using Gaussian influence\nfunctions. 
The projection mapping $P_{\\mathrm{patch}}$ is a smooth convex combination\nof projections onto the affine patches, $P_{\\mathrm{patch}}(\\theta) = \\sum_{\\alpha=1}^{k} I_\\alpha(\\theta) A_\\alpha(\\theta)$, where $A_\\alpha$ is\nthe projection operator for an affine patch and $I_\\alpha(\\theta) = G_\\alpha(\\theta) / \\sum_\\beta G_\\beta(\\theta)$ is a normalized\nGaussian blending function.\n\nThe patches are initialized using k-means clustering on the episode models to choose\n$k$ patch centers. A local principal components analysis is performed on the episode\nmodels which are closest to each center. The family dimension is determined by\nexamining how the principal values scale as successive nearest neighbors are considered.\nEach patch may be thought of as a \"pancake\" lying in the surface. Dimensions\nwhich belong to the surface grow quickly as more neighbors are considered while\ndimensions across the surface grow only because of the curvature of the surface.\n\nThe Gaussian influence functions and the affine patches are then updated by the\nEM algorithm (Jordan and Jacobs, 1994). With the affine patches held fixed, the\nGaussians $G_\\alpha$ are refit to the errors each patch makes in approximating the episode\nmodels. Then with the Gaussians held fixed, the affine patches $A_\\alpha$ are refit to the\nepisode models weighted by the corresponding Gaussian $G_\\alpha$. Similar patches\nmay be merged together to form a more parsimonious model.\n\n2.5 The Coupled Map Algorithm\n\nThe affine patch approach has the virtue that it can represent topologically complex\nfamilies (e.g., families representing physical objects might naturally be parameterized\nby the rotation group, which is topologically a projective space). It cannot, however,\nprovide an explicit parameterization of the family which is useful in some applications\n(e.g., optimization searches). The third family discovery algorithm therefore\nattempts to directly learn a parameterization of the model family.
\n\nRecall that the model parameters define $\\theta$-space, while the family parameters define\n$\\gamma$-space. We represent a family by a mapping $G$ from $\\theta$-space to $\\gamma$-space together\nwith a mapping $F$ from $\\gamma$-space back to $\\theta$-space. The projection operation\nis $P_{\\mathrm{map}}(\\theta) = F(G(\\theta))$. The map $G(\\theta)$ defines the family parameter $\\gamma$ on the full\n$\\theta$-space.\n\nThis representation is similar to an \"auto-associator\" network in which we attempt\nto \"encode\" the best-fit episode parameters $\\theta_i$ in the lower dimensional $\\gamma$-space\nby the mapping $G$ in such a way that they can be correctly reconstructed by the\nfunction $F$. Unfortunately, if we try to train $F$ and $G$ using back-propagation on\nthe identity error function, we get no training data away from the family. There is\nno reason for $G$ to project points away from the family to the closest family member.\nWe can rectify this by training $F$ and $G$ iteratively. First an arbitrary $G$ is chosen\nand $F$ is trained to send the images $\\gamma_i = G(\\theta_i)$ back to $\\theta_i$. $G$ is trained, however,\non images under $F$ corrupted by additive spherical Gaussian noise! This provides\nsamples away from the family and on average the training signal sends each point\nin $\\theta$-space to the closest family member.\n\nTo avoid iterative training, our experiments used a simpler approach. $G$ was taken to\nbe the affine projection operator defined by a global principal components analysis\nof the best-fit episode model parameters. Once $G$ is defined, $F$ is chosen to minimize\nthe difference between $F(G(\\theta_i))$ and $\\theta_i$ for each best-fit episode parameter $\\theta_i$.\n\nAny form of trainable nonlinear mapping could be used for $F$ (e.g., backprop neural\nnetworks or radial basis function networks). 
We represent $F$ as a mixture of experts\n(Jordan and Jacobs, 1994) where each expert is an affine mapping and the mixture\ncoefficients are Gaussians. The mapping is trained by the EM algorithm.\n\n3 ALGORITHM COMPARISON\n\nTo compare these five algorithms, we consider a two-class classification task with\nunit-variance normal class-conditional distributions on a 5-dimensional feature\nspace. The means of the class distributions are parameterized by a nonlinear two-parameter\nfamily:\n\n$m_1 = (\\gamma_1 + \\frac{1}{2} \\cos\\phi) \\hat{e}_1 + (\\gamma_2 + \\frac{1}{2} \\sin\\phi) \\hat{e}_2$\n$m_2 = (\\gamma_1 - \\frac{1}{2} \\cos\\phi) \\hat{e}_1 + (\\gamma_2 - \\frac{1}{2} \\sin\\phi) \\hat{e}_2$\n\nwhere $0 \\le \\gamma_1, \\gamma_2 \\le 10$ and $\\phi = (\\gamma_1 + \\gamma_2)/3$. The class means are kept at a unit\ndistance apart, ensuring significant class overlap over the whole family. The angle\n$\\phi$ varies with the parameters so that the correct classification boundary changes\norientation over the family. This choice of parameters introduces sufficient nonlinearity\nin the task to distinguish the nonlinear algorithms from the linear one.\n\nFigure 2 shows the comparative performance of the 5 algorithms. The x-axis is the\ntotal number of training examples. Each set of examples consisted of approximately\n$N = \\sqrt{x}$ episodes of approximately $N_i = \\sqrt{x}$ examples each. The classifier parameters\nfor an episode were drawn uniformly from the classifier family. The episode\ntraining examples were then sampled from the chosen classifier according to the\nclassifier's distribution. Each of the 5 algorithms was then trained on these examples.\nThe number of patches in the surface patch algorithm and the number of affine\ncomponents in the surface map algorithm were both taken to be the square root of\nthe number of training episodes.\n\n[Figure 2: A comparison of the 5 family discovery algorithms on the classification task. x-axis: Number of Examples (400 to 2000); y-axis ticks from 0.34 to 0.52. Legend: Single model, Separate models, Affine family, Affine Patch family, Map Mixture family.]\n\nThe y-axis shows the percentage correct for each algorithm on an independent test\nset. Each test set consisted of 50 episodes of 50 examples each. The algorithms\nwere presented with unlabelled data and their classification predictions were then\ncompared with the correct classification label.\n\nThe results show significant improvement through the use of family discovery for\nthis classification task. The single model approach performed significantly worse\nthan any of the other approaches, especially for larger numbers of episodes (where\nthe family discovery becomes possible). The separate model approach improves with\nthe number of episodes, but is nearly always bested by the approaches which take\nexplicit account of the underlying parameterized family. Because of the nonlinearity\nin this task, the simple affine model performs more poorly than the two nonlinear\nmethods. It is simple to implement, however, and may well be the method of choice\nwhen the parameters aren't so nonlinear. From this data, there is not a clear winner\nbetween the surface patch and surface map approaches.\n\n4 TRAINING SET DISCOVERY\n\nThroughout this paper, we have assumed that the training set was partitioned into\nepisodes by the teacher. Agents interacting with the world may not be given this\nexplicit information. For example, a speech recognition system may not be told 
For  example,  a  speech  recognition  system  may  not  be  told \nwhen it is  conversing with a new speaker.  Similarly, a character recognition system \n\n\f408 \n\ns. M.  OMOHUNDRO \n\nwould probably not be given explicit information about font  changes.  Learners can \nsometimes use the data itself to detect these changes, however.  In many situations \nthere is  a  strong prior that successive events  are likely to have come from  a  single \nmodel  with  only  occasional  model  changes.  The  EM  algorithm  is  often  used  for \nsegmenting  unlabelled  speech.  It  may  be  used  in  a  similar  manner  to  find  the \ntraining set  episode  boundaries.  First,  a  clustering algorithm  is  used  to  partition \nthe  training  examples  into  episodes.  A  parameterized  family  is  then  fit  to  these \nepisodes.  The data is  then repartitioned according to the similarity of the induced \nfamily parameters and the process is repeated until it converges.  A similar approach \nmay  be  applied  when  the  model  parameters  vary  slowly  with  time  rather  than \noccasionally jumping discontinously. \n\nAcknowledgements \n\nI'd  like  to  thank  Chris  Bregler  for  work  on  the  affine  patch  approach  to  surface \nlearning,  Alexander Linden  for  suggesting  coupled  maps  for  surface  learning,  and \nPeter Blicher for  discussions. \n\nReferences \n\nBaxter, J.  (1995)  Learning model bias.  This volume. \n\nBregler,  C.  &  Omohundro,  S.  (1994)  Surface learning with applications to lipread(cid:173)\ning.  In J.  Cowan,  G.  Tesauro  and  J.  Alspector  (eds.),  Advances  in  Neural  Infor(cid:173)\nmation  Processing  Systems  6,  pp.  43-50.  San  Francisco,  CA:  Morgan  Kaufmann \nPublishers. \n\nBregler,  C.  &  Omohundro,  S.  (1995)  Nonlinear image interpolation using  manifold \nlearning.  In  G.  Tesauro,  D.  Touretzky  and  T.  Leen  (eds.),  Advances  in  Neural \nInformation  Processing  Systems  7.  Cambridge, MA:  MIT Press. 
\n\nBregler, C. & Omohundro, S. (1995) Nonlinear manifold learning for visual speech\nrecognition. In W. Grimson (ed.), Proceedings of the Fifth International Conference\non Computer Vision.\n\nJordan, M. & Jacobs, R. (1994) Hierarchical mixtures of experts and the EM algorithm.\nNeural Computation, 6:181-214.\n\nMacKay, D. (1995) Probable networks and plausible predictions - a review of practical\nBayesian methods for supervised neural networks. Network, to appear.\n\nPratt, L. (1994) Experiments on the transfer of knowledge between neural networks.\nIn S. Hanson, G. Drastal, and R. Rivest (eds.), Computational Learning Theory and\nNatural Learning Systems, Constraints and Prospects, pp. 523-560. Cambridge,\nMA: MIT Press.\n\nRevow, M., Williams, C. and Hinton, G. (1994) Using generative models for handwritten\ndigit recognition. Technical report, University of Toronto.\n\nSchmidhuber, J. (1995) On learning how to learn learning strategies. Technical\nReport FKI-198-94, Fakultät für Informatik, Technische Universität München.", "award": [], "sourceid": 1050, "authors": [{"given_name": "Stephen", "family_name": "Omohundro", "institution": null}]}