{"title": "Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization", "book": "Advances in Neural Information Processing Systems", "page_first": 914, "page_last": 920, "abstract": null, "full_text": "Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization\n\nThomas Hofmann\n\nDepartment of Computer Science\nBrown University, Providence, RI\n\nhofmann@cs.brown.edu, www.cs.brown.edu/people/th\n\nAbstract\n\nThe project pursued in this paper is to develop from first information-geometric principles a general method for learning the similarity between text documents. Each individual document is modeled as a memoryless information source. Based on a latent class decomposition of the term-document matrix, a low-dimensional (curved) multinomial subfamily is learned. From this model a canonical similarity function, known as the Fisher kernel, is derived. Our approach can be applied to unsupervised and supervised learning problems alike. In particular, this covers interesting cases where both labeled and unlabeled data are available. Experiments in automated indexing and text categorization verify the advantages of the proposed method.\n\n1 Introduction\n\nThe computer-based analysis and organization of large document repositories is one of today's great challenges in machine learning, a key problem being the quantitative assessment of document similarities. A reliable similarity measure would provide answers to questions like: How similar are two text documents and which documents match a given query best? 
At a time when searching huge on-line (hyper-)text collections like the World Wide Web is becoming more and more popular, the relevance of these and related questions need not be further emphasized.\n\nThe focus of this work is on data-driven methods that learn a similarity function based on a training corpus of text documents without requiring domain-specific knowledge. Since we do not assume that labels for text categories, document classes, or topics, etc. are given at this stage, this is by definition an unsupervised learning problem. In fact, the general problem of learning object similarities precedes many \"classical\" unsupervised learning methods like data clustering that already presuppose the availability of a metric or similarity function. In this paper, we develop a framework for learning similarities between text documents from first principles. In doing so, we try to build a bridge from the foundations of statistics in information geometry [13, 1] to real-world applications in information retrieval and text learning, namely ad hoc retrieval and text categorization. Although the developed general methodology is not limited to text documents, we will for the sake of concreteness restrict our attention exclusively to this domain.\n\n2 Latent Class Decomposition\n\nMemoryless Information Sources  Assume we have available a set of documents $\\mathcal{D} = \\{d_1, \\ldots, d_N\\}$ over some fixed vocabulary of words (or terms) $\\mathcal{W} = \\{w_1, \\ldots, w_M\\}$. In an information-theoretic perspective, each document $d_i$ can be viewed as an information source, i.e. a probability distribution over word sequences. 
Following common practice in information retrieval, we will focus on the more restricted case where text documents are modeled on the level of single word occurrences. This means that we adopt the bag-of-words view and treat documents as memoryless information sources.\u00b9\n\nA. Modeling assumption: Each document is a memoryless information source.\n\nThis assumption implies that each document can be represented by a multinomial probability distribution $P(w_j|d_i)$, which denotes the (unigram) probability that a generic word occurrence in document $d_i$ will be $w_j$. Correspondingly, the data can be reduced to some simple sufficient statistics, which are counts $n(d_i, w_j)$ of how often a word $w_j$ occurred in a document $d_i$. The rectangular $N \\times M$ matrix with coefficients $n(d_i, w_j)$ is also called the term-document matrix.\n\nLatent Class Analysis  Latent class analysis is a decomposition technique for contingency tables (cf. [5, 3] and the references therein) that has been applied to language modeling [15] (\"aggregate Markov model\") and in information retrieval [7] (\"probabilistic latent semantic analysis\"). In latent class analysis, an unobserved class variable $z_k \\in \\mathcal{Z} = \\{z_1, \\ldots, z_K\\}$ is associated with each observation, i.e. with each word occurrence $(d_i, w_j)$. The joint probability distribution over $\\mathcal{D} \\times \\mathcal{W}$ is a mixture model that can be parameterized in two equivalent ways:\n\n$$P(d_i, w_j) = \\sum_{k=1}^{K} P(z_k) P(d_i|z_k) P(w_j|z_k) = P(d_i) \\sum_{k=1}^{K} P(w_j|z_k) P(z_k|d_i). \\tag{1}$$\n\nThe latent class model (1) introduces a conditional independence assumption, namely that $d_i$ and $w_j$ are independent conditioned on the state of the associated latent variable. 
Since the cardinality of $z_k$ is typically smaller than the number of documents/words in the collection, $z_k$ acts as a bottleneck variable in predicting words conditioned on the context of a particular document.\n\nTo give the reader a more intuitive understanding of the latent class decomposition, we have visualized a representative subset of 16 \"factors\" from a $K = 64$ latent class model fitted from the Reuters21578 collection (cf. Section 4) in Figure 1. Intuitively, the learned parameters seem to be very meaningful in that they represent identifiable topics and capture the corresponding vocabulary quite well.\n\nFigure 1: 16 selected factors from a 64 factor decomposition of the Reuters21578 collection. The displayed terms are the 10 most probable words in the class-conditional distribution $P(w_j|z_k)$ for 16 selected states $z_k$ after the exclusion of stop words. Each line lists the terms of one factor.\n\ngovernment, tax, budget, cut, spending, cuts, deficit, taxes, reform, billion\ntrading, exchange, futures, stock, options, index, contracts, market, london, exchanges\npresident, chairman, executive, chief, officer, vice, company, named, board, director\namerican, general, motors, chrysler, gm, car, ford, test, cars, motor\nbanks, debt, brazil, new, loans, dlrs, bankers, bank, payments, billion\ntrade, japan, japanese, ec, states, united, officials, community, european, imports\npct, january, february, rise, rose, 1986, december, year, fell, prices\noil, crude, energy, petroleum, prices, bpd, barrels, barrel, exploration, price\nunion, air, workers, strike, airlines, aircraft, port, boeing, employees, airline\nvs, cts, net, loss, mln, shr, qtr, revs, profit, note\nmarks, currency, dollar, german, bundesbank, central, mark, west, dollars, dealers\nareas, weather, area, normal, good, crop, damage, caused, affected, people\ngold, steel, plant, mining, copper, tons, silver, metal, production, ounces\nfood, drug, study, aids, product, treatment, company, environmental, products, approval\nbillion, dlrs, year, surplus, deficit, foreign, current, trade, account, reserves\nhouse, reagan, president, administration, congress, white, secretary, told, volcker, reagans\n\nBy using the latent class decomposition to model a collection of memoryless sources, we implicitly assume that the overall collection will help in estimating parameters for individual sources, an assumption which has been validated in our experiments.\n\nB. Modeling assumption: Parameters for a collection of memoryless information sources are estimated by latent class decomposition.\n\n\u00b9 Extensions to the more general case are possible, but beyond the scope of this paper.\n\nParameter Estimation  The latent class model has an important geometrical interpretation: the parameters $\\phi_j^k \\equiv P(w_j|z_k)$ define a low-dimensional subfamily of the multinomial family, $S(\\phi) \\equiv \\{\\pi \\in [0;1]^M : \\pi_j = \\sum_k \\psi_k \\phi_j^k \\text{ for some } \\psi \\in [0;1]^K, \\sum_k \\psi_k = 1\\}$, i.e. all multinomials $\\pi$ that can be obtained by convex combinations from the set of \"basis\" vectors $\\{\\phi^k : 1 \\le k \\le K\\}$. For given $\\phi$-parameters, each $\\psi^i$, $\\psi_k^i \\equiv P(z_k|d_i)$, will define a unique multinomial distribution $\\pi^i \\in S(\\phi)$. 
Since $S(\\phi)$ defines a submanifold on the multinomial simplex, it corresponds to a curved exponential subfamily.\u00b2 We would like to emphasize that we propose to learn both the parameters within the family (the $\\psi$'s or mixing proportions $P(z_k|d_i)$) and the parameters that define the subfamily (the $\\phi$'s or class-conditionals $P(w_j|z_k)$).\n\nThe standard procedure for maximum likelihood estimation in latent variable models is the Expectation Maximization (EM) algorithm. In the E-step one computes posterior probabilities for the latent class variables,\n\n$$P(z_k|d_i, w_j) = \\frac{P(z_k) P(d_i|z_k) P(w_j|z_k)}{\\sum_l P(z_l) P(d_i|z_l) P(w_j|z_l)} = \\frac{P(z_k) P(d_i|z_k) P(w_j|z_k)}{P(d_i, w_j)}. \\tag{2}$$\n\nThe M-step formulae can be written compactly as\n\n$$\\left. \\begin{array}{l} P(d_i|z_k) \\\\ P(w_j|z_k) \\\\ P(z_k) \\end{array} \\right\\} \\propto \\sum_{n=1}^{N} \\sum_{m=1}^{M} n(d_n, w_m) P(z_k|d_n, w_m) \\times \\left\\{ \\begin{array}{l} \\delta_{in} \\\\ \\delta_{jm} \\\\ 1 \\end{array} \\right. \\tag{3}$$\n\nwhere $\\delta$ denotes the Kronecker delta.\n\nRelated Models  As demonstrated in [7], the latent class model can be viewed as a probabilistic variant of Latent Semantic Analysis [2], a dimension reduction technique based on Singular Value Decomposition. It is also closely related to the non-negative matrix decomposition discussed in [12], which uses a Poisson sampling model and has been motivated by imposing non-negativity constraints on a decomposition by PCA. The relationship of the latent class model to clustering models like distributional clustering [14] has been investigated in [8]. [6] presents yet another approach to dimension reduction for multinomials, which is based on spherical models, a different type of curved exponential subfamily than the one presented here, which is affine in the mean-value parameterization. 
\n\n\u00b2 Notice that graphical models with latent variables are in general stratified exponential families [4], yet in our case the geometry is simpler. The geometrical view also illustrates the well-known identifiability problem in latent class analysis. The interested reader is referred to [3]. As a practical remedy, we have used a Bayesian approach with conjugate (Dirichlet) prior distributions over all multinomials, which for the sake of clarity is not described in this paper since it is very technical but nevertheless rather straightforward.\n\n3 Fisher Kernel and Information Geometry\n\nThe Fisher Kernel  We follow the work of [9] to derive kernel functions (and hence similarity functions) from generative data models. This approach yields a uniquely defined and intrinsic (i.e. coordinate invariant) kernel, called the Fisher kernel. One important implication is that yardsticks used for statistical models carry over to the selection of appropriate similarity functions. In spite of the purely unsupervised manner in which a Fisher kernel can be learned, the latter is also very useful in supervised learning, where it provides a way to take advantage of additional unlabeled data. This is important in text learning, where digital document databases and the World Wide Web offer a huge background text repository.\n\nAs a starting point, we partition the data log-likelihood into contributions from the various documents. The average log-probability of a document $d_i$, i.e. 
the log-probability of all the word occurrences in $d_i$ normalized by document length, is given by\n\n$$l(d_i) = \\sum_{j=1}^{M} \\hat{P}(w_j|d_i) \\log \\sum_{k=1}^{K} P(w_j|z_k) P(z_k|d_i), \\qquad \\hat{P}(w_j|d_i) \\equiv \\frac{n(d_i, w_j)}{\\sum_m n(d_i, w_m)}, \\tag{4}$$\n\nwhich is up to constants the negative Kullback-Leibler divergence between the empirical distribution $\\hat{P}(w_j|d_i)$ and the model distribution represented by (1).\n\nIn order to derive the Fisher kernel, we have to compute the Fisher scores $u(d_i; \\theta)$, i.e. the gradient of $l(d_i)$ with respect to $\\theta$, as well as the Fisher information $I(\\theta)$ in some parameterization $\\theta$ [13]. The Fisher kernel at $\\hat{\\theta}$ is then given by [9]\n\n$$\\mathcal{K}(d_i, d_n) = u(d_i; \\hat{\\theta})^{T} I(\\hat{\\theta})^{-1} u(d_n; \\hat{\\theta}). \\tag{5}$$\n\nComputational Considerations  For computational reasons we propose to approximate the (inverse) information matrix by the identity matrix, thereby making additional assumptions about information orthogonality. More specifically, we use a variance stabilizing parameterization for multinomials, the square-root parameterization, which yields an isometric embedding of multinomial families on the positive part of a hypersphere [11]. In this parameterization, the above approximation will be exact for the multinomial family (disregarding the normalization constraint). We conjecture that it will also provide a reasonable approximation in the case of the subfamily defined by the latent class model.\n\nC. Simplifying assumption: The Fisher information in the square-root parameterization can be approximated by the identity matrix.\n\nInterpretation of Results  Instead of going through the details of the derivation, which is postponed to the end of this section, it is revealing to relate the results back to our main problem of defining a similarity function between text documents. 
We will have a closer look at the two contributions resulting from different sets of parameters. The contribution which stems from (square-root transformed) parameters $P(z_k)$ is (in a simplified version) given by\n\n$$\\mathcal{K}(d_i, d_n) = \\sum_k P(z_k|d_i) P(z_k|d_n) / P(z_k). \\tag{6}$$\n\n$\\mathcal{K}$ is a weighted inner product in the low-dimensional factor representation of the documents by mixing weights $P(z_k|d_i)$. This part of the kernel thus computes a \"topical\" overlap between documents and is thereby able to capture synonyms, i.e., words with an identical or similar meaning, as well as words referring to the same topic. Notice that it is not required that $d_i$ and $d_n$ actually have (many) terms in common in order to get a high similarity score.\n\nThe contribution due to the parameters $P(w_j|z_k)$ is of a very different type. Again using the approximation of the Fisher matrix, we arrive at the inner product\n\n$$\\mathcal{K}(d_i, d_n) = \\sum_j \\hat{P}(w_j|d_i) \\hat{P}(w_j|d_n) \\sum_k \\frac{P(z_k|d_i, w_j) P(z_k|d_n, w_j)}{P(w_j|z_k)}. \\tag{7}$$\n\n$\\mathcal{K}$ also has a very appealing interpretation: It essentially computes an inner product between the empirical distributions of $d_i$ and $d_n$, a scheme that is very popular in the context of information retrieval in the vector space model. However, common words only contribute if they are explained by the same factor(s), i.e., if the respective posterior probabilities overlap. This makes it possible to capture words with multiple meanings, so-called polysemes. For example, in the factors displayed in Figure 1 the term \"president\" occurs twice (as the president of a company and as the president of the US). Depending on the document the word occurs in, the posterior probability will be high for either one of the factors, but typically not for both. 
Hence, the same term used in different contexts and with different meanings will generally not increase the similarity between documents, a distinction that is absent in the naive inner product, which corresponds to the degenerate case of $K = 1$.\n\nSince the choice of $K$ determines the coarseness of the identified \"topics\" and different resolution levels possibly contribute useful information, we have combined models by a simple additive combination of the derived inner products. This combination scheme has experimentally proven to be very effective and robust.\n\nD. Modeling assumption: Similarities derived from latent class decompositions at different levels of resolution are additively combined.\n\nIn summary, the emergence of important language phenomena like synonymy and polysemy from information-geometric principles is very satisfying and proves in our opinion that interesting similarity functions can be rigorously derived, without specific domain knowledge and based on few, explicitly stated assumptions (A-D).\n\nTechnical Derivation  Define $\\rho_{jk} \\equiv 2\\sqrt{P(w_j|z_k)}$, then\n\n$$\\frac{\\partial l(d_i)}{\\partial \\rho_{jk}} = \\frac{\\partial l(d_i)}{\\partial P(w_j|z_k)} \\frac{\\partial P(w_j|z_k)}{\\partial \\rho_{jk}} = \\sqrt{P(w_j|z_k)} \\, \\frac{\\hat{P}(w_j|d_i) P(z_k|d_i)}{P(w_j|d_i)} = \\frac{\\hat{P}(w_j|d_i) P(z_k|d_i, w_j)}{\\sqrt{P(w_j|z_k)}}.$$\n\nSimilarly we define $\\rho_k = 2\\sqrt{P(z_k)}$. Applying Bayes' rule to substitute $P(z_k|d_i)$ in $l(d_i)$ (i.e. $P(z_k|d_i) = P(z_k) P(d_i|z_k) / P(d_i)$) yields\n\n$$\\frac{\\partial l(d_i)}{\\partial \\rho_k} = \\frac{\\partial l(d_i)}{\\partial P(z_k)} \\frac{\\partial P(z_k)}{\\partial \\rho_k} = \\sqrt{P(z_k)} \\, \\frac{P(d_i|z_k)}{P(d_i)} \\sum_j \\frac{\\hat{P}(w_j|d_i)}{P(w_j|d_i)} P(w_j|z_k) = \\frac{P(z_k|d_i)}{\\sqrt{P(z_k)}} \\sum_j \\frac{\\hat{P}(w_j|d_i)}{P(w_j|d_i)} P(w_j|z_k) \\approx \\frac{P(z_k|d_i)}{\\sqrt{P(z_k)}}.$$\n\nThe last (optional) approximation step makes sense whenever $\\hat{P}(w_j|d_i) \\approx P(w_j|d_i)$. Notice that we have ignored the normalization constraints, which would yield a (reactive) term that is constant for each multinomial. 
Experimentally, we have observed no deterioration in performance by making these additional simplifications.\n\nMethod   Medline  Cranfield  CACM  CISI\nVSM         44.3       29.9  17.9  12.7\nVSM++       67.2       37.9  27.5  20.3\n\nTable 1: Average precision results for the vector space baseline method (VSM) and the Fisher kernel approach (VSM++) for 4 standard test collections, Medline, Cranfield, CACM, and CISI.\n\nSetting   Method  earn   acq  money  grain  crude  average  improv.\n20x sub   SVM     5.51  7.67   3.25   2.06   2.50     4.20        -\n          SVM++   4.56  5.37   2.08   1.71   1.53     3.05   +27.4%\n          kNN     5.91  9.64   3.24   2.54   2.42     4.75        -\n          kNN++   5.05  7.80   3.11   2.35   1.95     4.05   +14.7%\n10x sub   SVM     4.88  5.54   2.38   1.71   1.88     3.27        -\n          SVM++   4.11  4.84   2.08   1.42   1.45     2.78   +15.0%\n          kNN     5.51  9.23   2.64   2.55   2.42     4.47        -\n          kNN++   4.94  7.47   2.42   2.28   1.88     3.79   +15.2%\n5x sub    SVM     4.09  4.40   2.10   1.32   1.46     2.67        -\n          SVM++   3.64  4.15   1.78   0.98   1.19     2.35   +12.1%\n          kNN     5.13  8.70   2.27   2.40   2.23     4.14        -\n          kNN++   4.74  6.99   2.22   2.18   1.74     3.57   +13.7%\nall data  SVM     2.92  3.21   1.20   0.77   0.92     1.81        -\n(10x cv)  SVM++   2.98  3.15   1.21   0.76   0.86     1.79    +0.6%\n          kNN     4.17  6.69   1.78   1.73   1.42     3.16        -\n          kNN++   4.07  5.34   1.73   1.58   1.18     2.78   +12.0%\n\nTable 2: Classification errors for k-nearest neighbors (kNN) and SVMs (SVM) with the naive kernel and with the Fisher kernel (++) (derived from $K = 1$ and $K = 64$ models) on the 5 most frequent categories of the Reuters21578 corpus (earn, acq, money-fx, grain, and crude) at different subsampling levels. 
\n\n4 Experimental Results\n\nWe have applied the proposed method to ad hoc information retrieval, where the goal is to return a list of documents, ranked with respect to a given query. This obviously involves computing similarities between documents and queries. In a follow-up series of experiments to the ones reported in [7], where kernels $\\mathcal{K}(d_i, d_n) = \\sum_k P(z_k|d_i) P(z_k|d_n)$ and $\\mathcal{K}(d_i, d_n) = \\sum_j P(w_j|d_i) P(w_j|d_n)$ have been proposed in an ad hoc manner, we have been able to obtain a rigorous theoretical justification as well as some additional improvements. Average precision-recall values for four standard test collections reported in Table 1 show that substantial performance gains can be achieved with the help of a generative model (cf. [7] for details on the conducted experiments).\n\nTo demonstrate the utility of our method for supervised learning problems, we have applied it to text categorization, using a standard data set in the evaluation, the Reuters21578 collection of news stories. We have tried to boost the performance of two classifiers that are known to be highly competitive for text categorization: the k-nearest neighbor method and Support Vector Machines (SVMs) with a linear kernel [10]. Since we are particularly interested in a setting where the generative model is trained on a larger corpus of unlabeled data, we have run experiments where the classifier was only trained on a subsample (at subsampling factors 20x, 10x, 5x). The results are summarized in Table 2. Free parameters of the base classifiers have been optimized in extensive simulations with held-out data. The results indicate that substantial performance gains can be achieved over the standard k-nearest neighbor method at all subsampling levels. 
For SVMs the gain is substantial on the subsampled data collections (between 12% and 27% relative error reduction), but insignificant for SVMs trained on all data. This seems to indicate that the generative model does not provide any extra information if the SVM classifier is trained on the same data. However, notice that many interesting applications in text categorization operate in the small sample limit with lots of unlabeled data. Examples include the definition of personalized news categories by just a few examples, the classification and/or filtering of email, on-line topic spotting and tracking, and many more.\n\n5 Conclusion\n\nWe have presented an approach to learn the similarity of text documents from first principles. Based on a latent class model, we have been able to derive a similarity function that is theoretically satisfying, intuitively appealing, and shows substantial performance gains in the conducted experiments. Finally, we have made a contribution to the relationship between unsupervised and supervised learning as initiated in [9] by showing that generative models can help to exploit unlabeled data for classification problems.\n\nReferences\n\n[1] Shun'ichi Amari. Differential-geometrical methods in statistics. Springer-Verlag, Berlin, New York, 1985.\n\n[2] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391-407, 1990.\n\n[3] M. J. Evans, Z. Gilula, and I. Guttman. Latent class analysis of two-way contingency tables by Bayesian methods. Biometrika, 76(3):557-563, 1989.\n\n[4] D. Geiger, D. Heckerman, H. King, and C. Meek. Stratified exponential families: Graphical models and model selection. Technical Report MSR-TR-98-31, Microsoft Research, 1998.\n\n[5] Z. Gilula and S. J. Haberman. 
Canonical analysis of contingency tables by maximum likelihood. Journal of the American Statistical Association, 81(395):780-788, 1986.\n\n[6] A. Gous. Exponential and Spherical Subfamily Models. PhD thesis, Stanford, Statistics Department, 1998.\n\n[7] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR), pages 50-57, 1999.\n\n[8] T. Hofmann, J. Puzicha, and M. I. Jordan. Unsupervised learning from dyadic data. In Advances in Neural Information Processing Systems 11. MIT Press, 1999.\n\n[9] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11. MIT Press, 1999.\n\n[10] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In International Conference on Machine Learning (ECML), 1998.\n\n[11] R. E. Kass and P. W. Vos. Geometrical foundations of asymptotic inference. Wiley, New York, 1997.\n\n[12] D. Lee and S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.\n\n[13] M. K. Murray and J. W. Rice. Differential geometry and statistics. Chapman & Hall, London, New York, 1993.\n\n[14] F. C. N. Pereira, N. Z. Tishby, and L. Lee. Distributional clustering of English words. In Proceedings of the ACL, pages 183-190, 1993.\n\n[15] L. Saul and F. Pereira. Aggregate and mixed-order Markov models for statistical language processing. In Proceedings of the 2nd International Conference on Empirical Methods in Natural Language Processing, 1997.\n", "award": [], "sourceid": 1654, "authors": [{"given_name": "Thomas", "family_name": "Hofmann", "institution": null}]}