{"title": "Learning the Structure of Similarity", "book": "Advances in Neural Information Processing Systems", "page_first": 3, "page_last": 9, "abstract": null, "full_text": "Learning the structure of similarity\n\nJoshua B. Tenenbaum\n\nDepartment of Brain and Cognitive Sciences\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\njbt@psyche.mit.edu\n\nAbstract\n\nThe additive clustering (ADCLUS) model (Shepard & Arabie, 1979) treats the similarity of two stimuli as a weighted additive measure of their common features. Inspired by recent work in unsupervised learning with multiple cause models, we propose a new, statistically well-motivated algorithm for discovering the structure of natural stimulus classes using the ADCLUS model, which promises substantial gains in conceptual simplicity, practical efficiency, and solution quality over earlier efforts. We also present preliminary results with artificial data and two classic similarity data sets.\n\n1 INTRODUCTION\n\nThe capacity to judge one stimulus, object, or concept as similar to another is thought to play a pivotal role in many cognitive processes, including generalization, recognition, categorization, and inference. Consequently, modeling subjective similarity judgments in order to discover the underlying structure of stimulus representations in the brain/mind holds a central place in contemporary cognitive science. 
Mathematical models of similarity can be divided roughly into two families: spatial models, in which stimuli correspond to points in a metric (typically Euclidean) space and similarity is treated as a decreasing function of distance; and set-theoretic models, in which stimuli are represented as members of salient subsets (presumably corresponding to natural classes or features in the world) and similarity is treated as a weighted sum of common and distinctive subsets.\n\nSpatial models, fit to similarity judgment data with familiar multidimensional scaling (MDS) techniques, have yielded concise descriptions of homogeneous, perceptual domains (e.g. three-dimensional color space), often revealing the salient dimensions of stimulus variation (Shepard, 1980). Set-theoretic models are more general, in principle able to accommodate discrete conceptual structures typical of higher-level cognitive domains, as well as dimensional stimulus structures more common in perception (Tversky, 1977). In practice, however, the utility of set-theoretic models is limited by the hierarchical clustering techniques that underlie conventional methods for discovering the discrete features or classes of stimuli. Specifically, hierarchical clustering requires that any two classes of stimuli correspond to disjoint or properly inclusive subsets, while psychologically natural classes may correspond in general to arbitrarily overlapping subsets of stimuli. For example, the subjective similarity of two countries results from the interaction of multiple geographic and cultural factors, and there is no reason a priori to expect the subsets of communist, African, or French-speaking nations to be either disjoint or properly inclusive. 
\n\nIn this paper we consider the additive clustering (ADCLUS) model (Shepard & Arabie, 1979), the simplest instantiation of Tversky's (1977) general contrast model that accommodates the arbitrarily overlapping class structures associated with multiple causes of similarity. Here, the similarity of two stimuli is modeled as a weighted additive measure of their common clusters:\n\n$$\\hat{s}_{ij} = \\sum_{k=1}^{K} w_k f_{ik} f_{jk} + c, \\tag{1}$$\n\nwhere $\\hat{s}_{ij}$ is the reconstructed similarity of stimuli $i$ and $j$, the weight $w_k$ captures the salience of cluster $k$, and the binary indicator variable $f_{ik}$ equals 1 if stimulus $i$ belongs to cluster $k$ and 0 otherwise. The additive constant $c$ is necessary because the similarity data are assumed to be on an interval scale.¹ As with conventional clustering models, ADCLUS recovers a system of discrete subsets of stimuli, weighted by salience, and the similarity of two stimuli increases with the number (and weight) of their common subsets. ADCLUS, however, makes none of the structural assumptions (e.g. that any two clusters are disjoint or properly inclusive) which limit the applicability of conventional set-theoretic models. Unfortunately this flexibility also makes the problem of fitting the ADCLUS model to an observed similarity matrix exceedingly difficult.\n\nPrevious attempts to fit the model have followed a heuristic strategy to minimize a squared-error energy function,\n\n$$E = \\sum_{i \\neq j} (s_{ij} - \\hat{s}_{ij})^2 = \\sum_{i \\neq j} \\Big(s_{ij} - \\sum_k w_k f_{ik} f_{jk}\\Big)^2, \\tag{2}$$\n\nby alternately solving for the best cluster configurations $f_{ik}$ given the current weights $w_k$ and solving for the best weights given the current clusters (Shepard & Arabie, 1979; Arabie & Carroll, 1980). 
This strategy is appealing because, given the cluster configuration, finding the optimal weights becomes a simple linear least-squares problem.² However, finding good cluster configurations is a difficult problem in combinatorial optimization, and this step has always been the weak point in previous work. The original ADCLUS (Shepard & Arabie, 1979) and later MAPCLUS (Arabie & Carroll, 1980) algorithms employ ad hoc techniques of combinatorial optimization that sometimes yield unexpected or uninterpretable final results. Certainly, no rigorous theory exists that would explain why these approaches fail to discover the underlying structure of a stimulus set when they do.\n\n¹ In the remainder of this paper, we absorb $c$ into the sum over $k$, taking the sum over $k = 0, \\ldots, K$, defining $w_0 \\equiv c$, and fixing $f_{i0} = 1$ $(\\forall i)$.\n\n² Strictly speaking, because the weights are typically constrained to be nonnegative, more elaborate techniques than standard linear least-squares procedures may be required.\n\nEssentially, the ADCLUS model is so challenging to fit because it generates similarities from the interaction of many independent underlying causes. Viewed this way, modeling the structure of similarity looks very similar to the multiple-cause learning problems that are currently a major focus of study in the neural computation literature (Ghahramani, 1995; Hinton, Dayan, et al., 1995; Saund, 1995; Neal, 1992). Here we propose a novel approach to additive clustering, inspired by the progress and promise of work on multiple-cause learning within the Expectation-Maximization (EM) framework (Ghahramani, 1995; Neal, 1992). 
Our EM approach still makes use of the basic insight behind earlier approaches, that finding $\\{w_k\\}$ given $\\{f_{ik}\\}$ is easy, but obtains better performance from treating the unknown cluster memberships probabilistically as hidden variables (rather than parameters of the model), and perhaps more importantly, provides a rigorous and well-understood theory. Indeed, it is natural to consider $\\{f_{ik}\\}$ as \"unobserved\" features of the stimuli, complementing the observed data $\\{s_{ij}\\}$ in the similarity matrix. Moreover, in some experimental paradigms, one or more of these features may be considered observed data, if subjects report using (or are requested to use) certain criteria in their similarity judgments.\n\n2 ALGORITHM\n\n2.1 Maximum likelihood formulation\n\nWe begin by formulating the additive clustering problem in terms of maximum likelihood estimation with unobserved data. Treating the cluster weights $w = \\{w_k\\}$ as model parameters and the unobserved cluster memberships $f = \\{f_{ik}\\}$ as hidden causes for the observed similarities $s = \\{s_{ij}\\}$, it is natural to consider a hierarchical generative model for the \"complete data\" (including observed and unobserved components) of the form $p(s, f|w) = p(s|f, w) p(f|w)$. In the spirit of earlier approaches to ADCLUS that seek to minimize a squared-error energy function, we take $p(s|f, w)$ to be gaussian with common variance $\\sigma^2$:\n\n$$p(s|f, w) \\propto \\exp\\Big\\{-\\frac{1}{2\\sigma^2} \\sum_{i \\neq j} (s_{ij} - \\hat{s}_{ij})^2\\Big\\} = \\exp\\Big\\{-\\frac{1}{2\\sigma^2} \\sum_{i \\neq j} \\Big(s_{ij} - \\sum_k w_k f_{ik} f_{jk}\\Big)^2\\Big\\}. \\tag{3}$$\n\nNote that $\\log p(s|f, w)$ is equivalent to $-E/(2\\sigma^2)$ (ignoring an additive constant), where $E$ is the energy defined above. 
In general, priors $p(f|w)$ over the cluster configurations may be useful to favor larger or smaller clusters, induce a dependence between cluster size and cluster weight, or bias particular kinds of class structures, but only uniform priors are considered here. In this case $-E/(2\\sigma^2)$ also gives the \"complete data\" log-likelihood $\\log p(s, f|w)$.\n\n2.2 The EM algorithm for additive clustering\n\nGiven this probabilistic model, we can now appeal to the EM algorithm as the basis for a new additive clustering technique. EM calls for iterating the following two-step procedure, in order to obtain successive estimates of the parameters $w$ that are guaranteed never to decrease in likelihood (Dempster et al., 1977). In the E-step, we calculate\n\n$$Q(w|w^{(n)}) = \\sum_{f'} p(f'|s, w^{(n)}) \\log p(s, f'|w) = \\frac{1}{2\\sigma^2} \\langle -E \\rangle_{s, w^{(n)}}. \\tag{4}$$\n\n$Q(w|w^{(n)})$ is, up to the constant factor $-1/(2\\sigma^2)$, the expected value of $E$ as a function of $w$, averaged over all possible configurations $f'$ of the $NK$ binary cluster memberships, given the observed data $s$ and the current parameter estimates $w^{(n)}$. In the M-step, we maximize $Q(w|w^{(n)})$ with respect to $w$ to obtain $w^{(n+1)}$.\n\nEach cluster configuration $f'$ contributes to the mean energy in proportion to its probability under the gaussian generative model in (3). Thus the number of configurations making significant contributions depends on the model variance $\\sigma^2$. For large $\\sigma^2$, the probability is spread over many configurations. In the limiting case $\\sigma^2 \\to 0$, only the most likely configuration contributes, making EM effectively equivalent to the original approaches presented in Section 1 that use only the single best cluster configuration to solve for the best cluster weights at each iteration. 
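
As a rough illustration of how the variance controls the spread of probability over configurations, the following sketch enumerates every configuration of a toy problem and normalizes exp(-E / (2 * var)) over them, as in (3) with a uniform prior. It is an example under stated assumptions and not the implementation used in this paper: the problem size (4 stimuli, 2 clusters), the weights, the additive constant, and all function names are hypothetical, and exhaustive enumeration is feasible only at this scale.

```python
import itertools
import math

def reconstructed(f, w, c, i, j):
    # Equation (1): weighted count of the clusters shared by stimuli i and j.
    return sum(wk * f[i][k] * f[j][k] for k, wk in enumerate(w)) + c

def energy(s, f, w, c):
    # Equation (2): squared error summed over ordered pairs i != j.
    n = len(s)
    return sum((s[i][j] - reconstructed(f, w, c, i, j)) ** 2
               for i in range(n) for j in range(n) if i != j)

def posterior(s, w, c, n, k, var):
    # Enumerate all 2**(n*k) binary configurations (toy sizes only) and
    # weight each one by exp(-E / (2 * var)), then normalize.
    configs = []
    weights = []
    for bits in itertools.product((0, 1), repeat=n * k):
        f = [bits[i * k:(i + 1) * k] for i in range(n)]
        configs.append(f)
        weights.append(math.exp(-energy(s, f, w, c) / (2.0 * var)))
    total = sum(weights)
    return configs, [x / total for x in weights]

# Hypothetical toy problem: clusters {0, 1, 2} and {2, 3} over 4 stimuli.
W = (0.5, 0.3)
C = 0.1
F_TRUE = [(1, 0), (1, 0), (1, 1), (0, 1)]
S = [[reconstructed(F_TRUE, W, C, i, j) for j in range(4)] for i in range(4)]

def max_probability(var):
    # Probability of the single most likely configuration at this variance.
    _, probs = posterior(S, W, C, 4, 2, var)
    return max(probs)

for var in (10.0, 0.1, 0.001):
    print(var, round(max_probability(var), 4))
```

With a large variance the most likely configuration carries only a small share of the probability, while with a very small variance nearly all of it sits on the single best configuration, which is the regime in which EM behaves like the earlier alternating schemes.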
\n\nIn line with the basic insight embodied less rigorously in these earlier algorithms, the M-step still reduces to a simple (constrained) linear least-squares problem, because the mean energy\n\n$$\\langle E \\rangle = \\sum_{i \\neq j} \\Big( s_{ij}^2 - 2 s_{ij} \\sum_k w_k \\langle f_{ik} f_{jk} \\rangle + \\sum_{kl} w_k w_l \\langle f_{ik} f_{jk} f_{il} f_{jl} \\rangle \\Big),$$\n\nlike the energy $E$, is quadratic in the weights $w_k$. The E-step, which amounts to computing the expectations $m_{ijk} = \\langle f_{ik} f_{jk} \\rangle$ and $m_{ijkl} = \\langle f_{ik} f_{jk} f_{il} f_{jl} \\rangle$, is much more involved, because the required sums over all possible cluster configurations $f'$ are intractable for any realistic case. We approximate these calculations using Gibbs sampling, a Monte Carlo method that has been successfully applied to learning similar generative models with hidden variables (Ghahramani, 1995; Neal, 1992).³\n\nFinally, the algorithm should produce not only estimates of the cluster weights, but also a final cluster configuration that may be interpreted as the psychologically natural features or classes of the relevant domain. Consider the expected cluster memberships $p_{ik} = \\langle f_{ik} \\rangle_{s, w^{(n)}}$, which give the probability that stimulus $i$ belongs to cluster $k$, given the observed similarity matrix and the current estimates of the weights. Only when all $p_{ik}$ are close to 0 or 1, i.e. when $\\sigma^2$ is small enough that all the probability becomes concentrated on the most likely cluster configuration and its neighbors, can we fairly assert which stimuli belong to which classes.\n\n2.3 Simulated annealing\n\nTwo major computational bottlenecks hamper the efficiency of the algorithm as described so far. First, Gibbs sampling may take a very long time to converge to the equilibrium distribution, particularly when $\\sigma^2$ is small relative to the typical energy difference between neighboring cluster configurations. Second, the likelihood surfaces for realistic data sets are typically riddled with local maxima. 
We solve both problems by annealing on the variance. That is, we run Gibbs sampling using an effective variance $\\sigma_{eff}^2$, initially much greater than the assumed model variance $\\sigma^2$, and decrease $\\sigma_{eff}^2$ towards $\\sigma^2$ according to the following two-level scheme. We anneal within the $n$th iteration of EM to speed the convergence of the Gibbs sampling E-step (Neal, 1993), by lowering $\\sigma_{eff}^2$ from some high starting value down to a target $\\sigma_{targ}^2(n)$ for the $n$th EM iteration. We also anneal between iterations of EM to avoid local maxima (Rose et al., 1990), by initializing $\\sigma_{targ}^2(0)$ at a high value and taking $\\sigma_{targ}^2(n) \\to \\sigma^2$ as $n$ increases.\n\n³ We generally also approximate $m_{ijkl} \\approx m_{ijk} m_{ijl}$, which usually yields satisfactory results with much greater efficiency.\n\n3 RESULTS\n\nIn all of the examples below, one run of the algorithm consisted of 100-200 iterations of EM, annealed both within and between iterations. Within each E-step, 10-100 cycles of Gibbs sampling were carried out at the target temperature $\\sigma_{targ}^2$ while the statistics for $m_{ik}$ and $m_{ijk}$ were recorded. These recorded cycles were preceded by 20-200 unrecorded cycles, during which the system was annealed from a higher temperature (e.g. $8\\sigma_{targ}^2$) down to $\\sigma_{targ}^2$, to ensure that statistics were collected as close to equilibrium as possible. The precise numbers of recorded and unrecorded iterations were chosen as a compromise between the need for longer samples as the number of hidden variables is increased and the need to keep computation times practical.\n\nTable 1: Classes and weights recovered for the integers 0-9.\n\nRank  Weight  Stimuli in class  Interpretation\n1     .444    2 4 8             powers of two\n2     .345    0 1 2             small numbers\n3     .331    3 6 9             multiples of three\n4     .291    6 7 8 9           large numbers\n5     .255    2 3 4 5 6         middle numbers\n6     .216    1 3 5 7 9         odd numbers\n7     .214    1 2 3 4           smallish numbers\n8     .172    4 5 6 7 8         largish numbers\n\nVariance accounted for = 90.9% with 8 clusters (additive constant = .148).\n\n3.1 Artificial data\n\nWe first report results with artificial data, for which the true cluster memberships and weights are known, to verify that the algorithm does in fact find the desired structure. We generated 10 data sets by randomly assigning each of 12 stimuli independently and with probability 1/2 to each of 8 classes, and choosing random weights for the classes uniformly from [0.1, 0.6]. These numbers are broadly typical of the real data sets we examine later in this section. We then calculated the observed similarities from (1), added a small amount of random noise (with standard deviation equal to 5% of the mean noise-free similarity), and symmetrized the similarity matrix.\n\nThe crucial free parameter is $K$, the assumed number of stimulus classes. When the algorithm was configured with the correct number of clusters ($K = 8$), the original classes and weights were recovered during the first run of the algorithm on all 10 data sets, after an average of 58 EM iterations (low 30, high 92). When the algorithm was configured with $K = 7$ clusters, one fewer than the correct number, the seven classes with highest weight were recovered on 9/10 first runs. 
On these runs, the recovered weights and true weights had a mean correlation of 0.948 ($p < .05$ on each run). When configured with $K = 5$, the first run recovered either four of the top five classes (6/10 trials) or three of the top five (4/10 trials). When configured with too many clusters ($K = 12$), the algorithm typically recovered only 8 clusters with significantly non-zero weights, corresponding to the 8 correct classes. Comparable results are not available for ADCLUS or MAPCLUS, but at least we can be satisfied that our algorithm achieves a basic level of competence and robustness.\n\n3.2 Judged similarities of the integers 0-9\n\nShepard et al. (1975) had subjects judge the similarities of the integers 0 through 9, in terms of the \"abstract concepts\" of the numbers. We analyzed the similarity matrix (Shepard, personal communication) obtained by pooling data across subjects and across three conditions of stimulus presentation (verbal, written-numeral, and written-dots). We chose this data set because it illustrates the power of additive clustering to capture a complex, overlapping system of classes, and also because it serves to compare the performance of our algorithm with the original ADCLUS algorithm. Observe first that two kinds of classes emerge in the solution. Classes 1, 3, and 6 represent familiar arithmetic concepts (e.g. \"multiples of three\", \"odd numbers\"), while the remaining classes correspond to subsets of consecutive integers and thus together represent the dimension of numerical magnitude. In general, both arithmetic properties and numerical magnitude contribute to judged similarity, as every number has features of both types (e.g. 9 is a \"large\" \"odd\" \"multiple of three\"), except for 0, whose only property is \"small.\" Clearly an overlapping clustering model is necessary here to accommodate the multiple causes of similarity.\n\nTable 2: Classes and weights recovered for the 16 consonant phonemes.\n\nRank  Weight  Stimuli in class  Interpretation\n1     .800    f θ               front unvoiced fricatives\n2     .572    d g               back voiced stops\n3     .463    p k               unvoiced stops (omitting t)\n4     .424    b v ð             front voiced\n5     .357    p t k             unvoiced stops\n6     .292    m n               nasals\n7     .169    d g v ð z ʒ       voiced (omitting b)\n8     .132    p t k f θ s       unvoiced (omitting ʃ)\n\nVariance accounted for = 90.2% with 8 clusters (additive constant = .047).\n\nThe best solution reported for these data using the original ADCLUS algorithm consisted of 10 classes, accounting for 83.1% of the variance of the data (Shepard & Arabie, 1979).⁴ Several of the clusters in this solution differed by only one or two members (e.g. three of the clusters were {0,1}, {0,1,2}, and {0,1,2,3,4}), which led us to suspect that a better fit might be obtained with fewer than 10 classes. Table 1 shows the best solution found in five runs of our algorithm, accounting for 90.9% of the variance with eight classes. Compared with our solution, the original ADCLUS solution leaves almost twice as much residual variance unaccounted for (16.9% versus 9.1%), and with 10 classes, is also less parsimonious. 
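
The need for an overlapping model on these data can be checked directly against the requirement, stated in Section 1, that hierarchical clusterings contain only disjoint or nested classes. The sketch below is a hypothetical helper, not part of the algorithm: the class memberships are transcribed from Table 1, and the function names are assumptions made for illustration.

```python
from itertools import combinations

# Class memberships transcribed from Table 1 (rank -> stimuli in class).
CLASSES = {
    1: {2, 4, 8},
    2: {0, 1, 2},
    3: {3, 6, 9},
    4: {6, 7, 8, 9},
    5: {2, 3, 4, 5, 6},
    6: {1, 3, 5, 7, 9},
    7: {1, 2, 3, 4},
    8: {4, 5, 6, 7, 8},
}

def crossing_pairs(classes):
    # Pairs of class labels that overlap without one containing the other.
    bad = []
    for (a, sa), (b, sb) in combinations(sorted(classes.items()), 2):
        if sa & sb and not (sa <= sb or sb <= sa):
            bad.append((a, b))
    return bad

def is_hierarchical(classes):
    # Hierarchical clustering admits only disjoint or nested classes.
    return not crossing_pairs(classes)

print(is_hierarchical(CLASSES))
print(crossing_pairs(CLASSES))
```

For instance, classes 1 and 2 share only the integer 2, so neither contains the other, and no hierarchical clustering could represent both.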
\n\n3.3 Confusions between 16 consonant phonemes\n\nFinally, we examine Miller & Nicely's (1955) classic data on the confusability of 16 consonant phonemes, collected under varying signal/noise conditions with the original intent of identifying the features of English phonology (compiled and reprinted in Carroll & Wish, 1974). Note that the recovered classes have reasonably natural interpretations in terms of the basic features of phonological theory, and a very different overall structure from the classes recovered in the previous example. Quite significantly, the classes respect a hierarchical structure almost perfectly, with class 3 included in class 5, classes 1 and 5 included in class 8, and so on. Only the absence of /b/ in class 7 violates the strict hierarchy.\n\nThese data also provide the only convenient opportunity to compare our algorithm with the MAPCLUS approach to additive clustering (Arabie & Carroll, 1980). The published MAPCLUS solution accounts for 88.3% of the variance in these data, using eight clusters. Arabie & Carroll (1980) report being \"substantively perturbed\" (p. 232) that their algorithm does not recover a distinct cluster for the nasals /m n/, which have been considered a very salient subset in both traditional phonology (Miller & Nicely, 1955) and other clustering models (Shepard, 1980). Table 2 presents our eight-cluster solution, accounting for 90.2% of the variance. While this represents only a marginal improvement, our solution does contain a cluster for the nasals, as expected on theoretical grounds.\n\n⁴ Variance accounted for $= 1 - E / \\sum_{i \\neq j} (s_{ij} - \\bar{s})^2$, where $\\bar{s}$ is the mean of the set $\\{s_{ij}\\}$.
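
The variance-accounted-for measure of footnote 4 can be sketched as follows. This is an illustrative reading of the formula rather than the code behind the reported percentages: it assumes the mean is taken over the off-diagonal entries only, and the function name and the toy matrix are hypothetical.

```python
def variance_accounted_for(s, s_hat):
    # 1 - E / sum_{i != j} (s_ij - mean)^2, using off-diagonal entries only.
    n = len(s)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    mean = sum(s[i][j] for i, j in pairs) / len(pairs)
    residual = sum((s[i][j] - s_hat[i][j]) ** 2 for i, j in pairs)
    total = sum((s[i][j] - mean) ** 2 for i, j in pairs)
    return 1.0 - residual / total

# Hypothetical 3 x 3 similarity matrix (the diagonal is ignored).
S_TOY = [[0.0, 0.2, 0.4],
         [0.2, 0.0, 0.6],
         [0.4, 0.6, 0.0]]
FLAT = [[0.4] * 3 for _ in range(3)]

print(variance_accounted_for(S_TOY, S_TOY))
print(variance_accounted_for(S_TOY, FLAT))
```

A perfect reconstruction scores 1, and a reconstruction that predicts the mean similarity for every pair scores approximately 0, which gives a baseline for reading the 83.1%, 88.3%, 90.2%, and 90.9% figures quoted in Section 3.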
\n\n3.4 Conclusion\n\nThese examples show that ADCLUS can discover meaningful representations of stimuli with arbitrarily overlapping class structures (arithmetic properties), as well as dimensional structure (numerical magnitude) or hierarchical structure (phoneme families) when appropriate. We have argued that modeling similarity should be a natural application of learning generative models with multiple hidden causes, and in that spirit, presented a new probabilistic formulation of the ADCLUS model and an algorithm based on EM that promises better results than previous approaches. We are currently pursuing several extensions: enriching the generative model, e.g. by incorporating significant prior structure, and improving the fitting process, e.g. by developing efficient and accurate mean field approximations. More generally, we hope this work illustrates how sophisticated techniques of computational learning can be brought to bear on foundational problems of structure discovery in cognitive science.\n\nAcknowledgements\n\nI thank P. Dayan, W. Richards, S. Gilbert, Y. Weiss, A. Hershowitz, and M. Bernstein for many helpful discussions, and Roger Shepard for generously supplying inspiration and unpublished data. The author is a Howard Hughes Medical Institute Predoctoral Fellow.\n\nReferences\n\nArabie, P. & Carroll, J. D. (1980). MAPCLUS: A mathematical programming approach to fitting the ADCLUS model. Psychometrika 45, 211-235.\n\nCarroll, J. D. & Wish, M. (1974). Multidimensional perceptual models and measurement methods. In Handbook of Perception, Vol. 2. New York: Academic Press, 391-447.\n\nDempster, A. P., Laird, N. M., & Rubin, D. B. (1977). 
Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). J. Roy. Stat. Soc. B39, 1-38.\n\nGhahramani, Z. (1995). Factorial learning and the EM algorithm. In G. Tesauro, D. S. Touretzky, & T. K. Leen (eds.), Advances in Neural Information Processing Systems 7. Cambridge, MA: MIT Press, 617-624.\n\nHinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The \"wake-sleep\" algorithm for unsupervised neural networks. Science 268, 1158-1161.\n\nMiller, G. A. & Nicely, P. E. (1955). An analysis of perceptual confusions among some English consonants. J. Ac. Soc. Am. 27, 338-352.\n\nNeal, R. M. (1992). Connectionist learning of belief networks. Artif. Intell. 56, 71-113.\n\nNeal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, U. of Toronto.\n\nRose, K., Gurewitz, E., & Fox, G. (1990). Statistical mechanics and phase transitions in clustering. Physical Review Letters 65, 945-948.\n\nSaund, E. (1995). A multiple cause mixture model for unsupervised learning. Neural Computation 7, 51-71.\n\nShepard, R. N. & Arabie, P. (1979). Additive clustering: Representation of similarities as combinations of discrete overlapping properties. Psychological Review 86, 87-123.\n\nShepard, R. N., Kilpatrick, D. W., & Cunningham, J. P. (1975). The internal representation of numbers. Cognitive Psychology 7, 82-138.\n\nShepard, R. N. (1980). Multidimensional scaling, tree-fitting, and clustering. Science 210, 390-398.\n\nTversky, A. (1977). Features of similarity. Psychological Review 84, 327-352.", "award": [], "sourceid": 1052, "authors": [{"given_name": "Joshua", "family_name": "Tenenbaum", "institution": null}]}