{"title": "Discovering High Order Features with Mean Field Modules", "book": "Advances in Neural Information Processing Systems", "page_first": 509, "page_last": 515, "abstract": null, "full_text": "Discovering high order features with mean field modules\n\nConrad C. Galland and Geoffrey E. Hinton\n\nPhysics Dept. and Computer Science Dept.\n\nUniversity of Toronto\n\nToronto, Canada\n\nM5S 1A4\n\nABSTRACT\n\nA new form of the deterministic Boltzmann machine (DBM) learning procedure is presented which can efficiently train network modules to discriminate between input vectors according to some criterion. The new technique directly utilizes the free energy of these \"mean field modules\" to represent the probability that the criterion is met, the free energy being readily manipulated by the learning procedure. Although conventional deterministic Boltzmann learning fails to extract the higher order feature of shift at a network bottleneck, combining the new mean field modules with the mutual information objective function rapidly produces modules that perfectly extract this important higher order feature without direct external supervision.\n\n1  INTRODUCTION\n\nThe Boltzmann machine learning procedure (Hinton and Sejnowski, 1986) can be made much more efficient by using a mean field approximation in which stochastic binary units are replaced by deterministic real-valued units (Peterson and Anderson, 1987). Deterministic Boltzmann learning can be used for \"multicompletion\" tasks in which the subsets of the units that are treated as input or output are varied from trial to trial (Peterson and Hartman, 1988). 
In this respect it resembles other learning procedures that also involve settling to a stable state (Pineda, 1987). Using the multicompletion paradigm, it should be possible to force a network to explicitly extract important higher order features of an ensemble of training vectors by forcing the network to pass the information required for correct completions through a narrow bottleneck. In back-propagation networks with two or three hidden layers, the use of bottlenecks sometimes allows the learning to explicitly discover important underlying features (Hinton, 1986), and our original aim was to demonstrate that the same idea could be used effectively in a DBM with three hidden layers. The initial simulations using conventional techniques were not successful, but when we combined a new type of DBM learning with a new objective function, the resulting network extracted the crucial higher order features rapidly and perfectly.\n\n2  THE MULTI-COMPLETION TASK\n\nFigure 1 shows a network in which the input vector is divided into 4 parts. A1 is a random binary vector. A2 is generated by shifting A1 either to the right or to the left by one \"pixel\", using wraparound. B1 is also a random binary vector, and B2 is generated from B1 by using the same shift as was used to generate A2 from A1. This means that any three of A1, A2, B1, B2 uniquely specify the fourth (we filter out the ambiguous cases where this is not true). To perform correct completion, the network must explicitly represent the shift in the single unit that connects its two halves. Shift is a second order property that cannot be extracted without hidden units.\n\nFigure 1. [Network diagram; input groups A1, A2, B1 and B2.] 
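The construction of the training cases can be made concrete with a short sketch. It is an illustration only: the four-bit width is taken from the simulations reported in the paper, while the function names (shift, unambiguous_vectors, module_cases, paired_cases) and the reading of the ambiguity filter as discarding every vector whose left and right shifts coincide are assumptions, not the authors' code.

```python
from itertools import product

N_BITS = 4  # vector width; the simulations use four-bit vectors


def shift(bits, direction):
    '''Circularly shift a tuple of bits by one position.

    direction = +1 shifts right, direction = -1 shifts left (wraparound).
    '''
    n = len(bits)
    return tuple(bits[(i - direction) % n] for i in range(n))


def unambiguous_vectors(n_bits=N_BITS):
    '''Binary vectors whose left and right shifts differ (assumed filter).'''
    return [v for v in product((0, 1), repeat=n_bits)
            if shift(v, +1) != shift(v, -1)]


def module_cases(n_bits=N_BITS):
    '''All (A1, A2, direction) triples for a single module.'''
    return [(v, shift(v, d), d)
            for v in unambiguous_vectors(n_bits) for d in (+1, -1)]


def paired_cases(n_bits=N_BITS):
    '''All (A1, A2, B1, B2, direction) cases that share one shift.'''
    vecs = unambiguous_vectors(n_bits)
    return [(a, shift(a, d), b, shift(b, d), d)
            for a in vecs for b in vecs for d in (+1, -1)]


print(len(module_cases()), len(paired_cases()))
```

Under these assumptions the enumeration yields 24 cases for a single module and 288 cases for the paired task, which agrees with the case counts quoted for the simulations.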
\n\n3  SIMULATIONS USING STANDARD DETERMINISTIC BOLTZMANN LEARNING\n\nThe following discussion assumes familiarity with the deterministic Boltzmann learning procedure, details of which can be obtained from Hinton (1989). During the positive phase of learning, each of the 288 possible sets of shift-matched four-bit vectors was clamped onto inputs A1, A2 and B1, B2, while in the negative phase, one of the four was allowed to settle unclamped. The weights were changed after each training case using the on-line version of the DBM learning procedure. The choice of which input not to clamp changed systematically throughout the learning process so that each was left unclamped equally often. This technique, although successful in problems with only one hidden layer, could not train the network to correctly perform the multicompletion task where any of the four input layers would settle to the correct state when the other three were clamped. As a result, the single central unit failed to extract shift. In general, the DBM learning procedure, like its stochastic predecessor, seems to have difficulty learning tasks in multi-hidden layer nets. This failure led to the development of the new procedure which, in one form, manages to correctly extract shift without the need for many hidden layers or direct external supervision. 
\n\n4  A NEW LEARNING PROCEDURE FOR MEAN FIELD MODULES\n\nA DBM with unit states in the range [-1,1] has free energy\n\n$$F = -\\sum_{i < j} y_i y_j w_{ij} + T \\sum_i \\left[ \\frac{1+y_i}{2} \\log\\frac{1+y_i}{2} + \\frac{1-y_i}{2} \\log\\frac{1-y_i}{2} \\right] \\tag{1}$$\n\nThe DBM settles to a free energy minimum, $F^*$, at a non-zero temperature, where the states of the units are given by\n\n$$y_i = \\tanh\\left( \\frac{1}{T} \\sum_j y_j w_{ij} \\right) \\tag{2}$$\n\nAt the minimum, the derivative of $F^*$ with respect to a particular weight (assuming $T = 1$) is given by (Hinton, 1989)\n\n$$\\frac{\\partial F^*}{\\partial w_{ij}} = -y_i y_j \\tag{3}$$\n\nSuppose that we want a network module to discriminate between input vectors that \"fit\" some criterion and input vectors that don't. Instead of using a net with an output unit that indicates the degree of fit, we could view the negative of the mean field free energy of the whole module as a measure of how happy it is with the clamped input vector. From this standpoint, we can define the probability that input vector $\\alpha$ fits the criterion as\n\n$$p_\\alpha = \\frac{1}{1 + e^{F^*_\\alpha}} \\tag{4}$$\n\nwhere $F^*_\\alpha$ is the equilibrium free energy of the module with vector $\\alpha$ clamped on the inputs.\n\nSupervised training can be performed by using the cross-entropy error function (Hinton, 1987):\n\n$$C = -\\sum_{\\alpha}^{N_+} \\log(p_\\alpha) - \\sum_{\\beta}^{N_-} \\log(1 - p_\\beta) \\tag{5}$$\n\nwhere the first sum is over the $N_+$ input cases that fit the criterion, and the second is over the $N_-$ cases that don't. The cross-entropy expression is used to specify error derivatives for $p_\\alpha$ and hence for $F^*_\\alpha$. Error derivatives for each weight can then be obtained by using equation (3), and the module is trained by gradient descent to have high free energy for the \"negative\" training cases and low free energy for the \"positive\" cases. 
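Equations (2) to (5) can be sketched for the simplest kind of module, one whose hidden units receive connections only from clamped inputs. This is a minimal sketch under stated assumptions, not the authors' implementation: inputs are coded as +1 and -1, there are no biases and no input-to-input weights, the free energy is taken to be the standard mean field form for units in [-1, 1], T is fixed at 1 for the update, and all function names, the learning rate lr and the example weights are hypothetical.

```python
import math


def settle_hidden(x, W, T=1.0):
    '''Equation (2) for hidden units that only see the clamped inputs x.

    W[i][j] is the (hypothetical) weight between input i and hidden unit j.
    Because the inputs are clamped, one pass reaches the fixed point.
    '''
    return [math.tanh(sum(x[i] * W[i][j] for i in range(len(x))) / T)
            for j in range(len(W[0]))]


def entropy_term(y):
    '''Per-unit term q log q + (1 - q) log(1 - q) with q = (1 + y) / 2.'''
    total = 0.0
    for q in ((1.0 + y) / 2.0, (1.0 - y) / 2.0):
        if q > 0.0:  # treat 0 * log(0) as 0 for clamped units at +1 or -1
            total += q * math.log(q)
    return total


def free_energy(x, h, W, T=1.0):
    '''Assumed mean field free energy for units with states in [-1, 1].'''
    energy = -sum(x[i] * h[j] * W[i][j]
                  for i in range(len(x)) for j in range(len(h)))
    return energy + T * sum(entropy_term(y) for y in list(x) + list(h))


def fit_probability(F):
    '''Equation (4): p = 1 / (1 + exp(F)).'''
    return 1.0 / (1.0 + math.exp(F))


def online_update(x, W, fits, lr=0.1):
    '''One on-line step of gradient descent on the cross-entropy (T = 1).

    Uses dF*/dw_ij = -y_i y_j, so that
      d log(p)/dw_ij     = (1 - p) * y_i * y_j   for a case that fits,
      d log(1 - p)/dw_ij = -p * y_i * y_j        for a case that does not.
    Returns the probability computed before the weights were changed.
    '''
    h = settle_hidden(x, W)
    p = fit_probability(free_energy(x, h, W))
    scale = (1.0 - p) if fits else -p
    for i in range(len(x)):
        for j in range(len(h)):
            W[i][j] += lr * scale * x[i] * h[j]
    return p


x = [1, -1, 1, 1]                      # clamped input, coded as +1 / -1
W = [[0.5], [0.2], [-0.3], [0.1]]      # 4 inputs, 1 hidden unit
before = online_update(x, W, fits=True)
after = fit_probability(free_energy(x, settle_hidden(x, W), W))
print(before < after)
```

In this reduced setting the free energy is always negative, so p stays above 0.5; the sketch shows the direction of a single update and is not a trainable shift detector.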
\n\nThus, for each positive case,\n\n$$\\frac{\\partial \\log(p_\\alpha)}{\\partial w_{ij}} = -\\frac{e^{F^*_\\alpha}}{1 + e^{F^*_\\alpha}} \\frac{\\partial F^*_\\alpha}{\\partial w_{ij}} = -\\frac{1}{1 + e^{-F^*_\\alpha}} (-y_i y_j)$$\n\nFor each negative case,\n\n$$\\frac{\\partial \\log(1 - p_\\beta)}{\\partial w_{ij}} = \\frac{1}{1 + e^{F^*_\\beta}} \\frac{\\partial F^*_\\beta}{\\partial w_{ij}} = \\frac{1}{1 + e^{F^*_\\beta}} (-y_i y_j)$$\n\nTo test the new procedure, we trained a shift detecting module, composed of the input units A1 and A2 and the hidden units HA from figure 1, to have low free energy for all and only the right shifts. Each weight was changed in an on-line fashion according to\n\n$$\\Delta w_{ij} = \\epsilon \\frac{1}{1 + e^{-F^*_\\alpha}} y_i y_j$$\n\nfor each right shifted case, and\n\n$$\\Delta w_{ij} = -\\epsilon \\frac{1}{1 + e^{F^*_\\beta}} y_i y_j$$\n\nfor each left shifted case. Only 10 sweeps through the 24 possible training cases were required to successfully train the module to detect shift. The training was particularly easy because the hidden units only receive connections from the input units which are always clamped, so the network settles to a free energy minimum in one iteration. Details of the simulations are given in Galland and Hinton (1990).\n\n5  MAXIMIZING MUTUAL INFORMATION BETWEEN MEAN FIELD MODULES\n\nAt first sight, the new learning procedure is inherently supervised, so how can it be used to discover that shift is an important underlying feature? One method is to use two modules that each supervise the other. The most obvious way of implementing this idea quickly creates modules that always agree because they are always \"on\". 
If, however, we try to maximize the mutual information between the stochastic binary variables represented by the free energies of the modules, there is a strong pressure for each binary variable to have high entropy across cases because the mutual information between binary variables A and B is:\n\n$$I(A; B) = H_A + H_B - H_{AB} \\tag{6}$$\n\nwhere $H_{AB}$ is the entropy of the joint distribution of A and B over the training cases, and $H_A$ and $H_B$ are the entropies of the individual distributions.\n\nConsider two mean field modules with associated stochastic binary variables $A, B \\in \\{0, 1\\}$. For a given case $\\alpha$,\n\n$$p(A^\\alpha = 1) = \\frac{1}{1 + e^{F_{A,\\alpha}}} \\tag{7}$$\n\nwhere $F_{A,\\alpha}$ is the free energy of the A module with the training case $\\alpha$ clamped on the input.\n\nWe can compute the probability that the A module is on or off by averaging over the input sample distribution, with $P^\\alpha$ being the prior probability of an input case $\\alpha$:\n\n$$p(A=1) = \\sum_\\alpha P^\\alpha \\, p(A^\\alpha = 1)$$\n\n$$p(A=0) = 1 - p(A=1)$$\n\nSimilarly, we can compute the four possible values in the joint probability distribution of A and B:\n\n$$p(A=1, B=1) = \\sum_\\alpha P^\\alpha \\, p(A^\\alpha = 1) \\, p(B^\\alpha = 1)$$\n\n$$p(A=0, B=1) = p(B=1) - p(A=1, B=1)$$\n\n$$p(A=1, B=0) = p(A=1) - p(A=1, B=1)$$\n\n$$p(A=0, B=0) = 1 - p(B=1) - p(A=1) + p(A=1, B=1)$$\n\nUsing equation (3), the partial derivatives of the various individual and joint probability functions with respect to a weight $w_{ik}$ in the A module are readily calculated. 
\n\n$$\\frac{\\partial p(A=1)}{\\partial w_{ik}} = \\sum_\\alpha P^\\alpha \\frac{\\partial p(A^\\alpha = 1)}{\\partial w_{ik}} \\tag{8}$$\n\n$$\\frac{\\partial p(A=1, B=1)}{\\partial w_{ik}} = \\sum_\\alpha P^\\alpha \\frac{\\partial p(A^\\alpha = 1)}{\\partial w_{ik}} p(B^\\alpha = 1) \\tag{9}$$\n\nThe entropy of the stochastic binary variable A is\n\n$$H_A = -\\langle \\log p(A) \\rangle = -\\sum_{a=0,1} p(A=a) \\log p(A=a)$$\n\nThe entropy of the joint distribution is given by\n\n$$H_{AB} = -\\langle \\log p(A, B) \\rangle = -\\sum_{a,b} p(A=a, B=b) \\log p(A=a, B=b)$$\n\nThe partial derivative of $I(A; B)$ with respect to a single weight $w_{ik}$ in the A module can now be computed; since $H_B$ does not depend on $w_{ik}$, we need only differentiate $H_A$ and $H_{AB}$. As shown in Galland and Hinton (1990), the derivative is given by\n\n$$\\frac{\\partial I(A; B)}{\\partial w_{ik}} = \\frac{\\partial H_A}{\\partial w_{ik}} - \\frac{\\partial H_{AB}}{\\partial w_{ik}} = \\sum_\\alpha P^\\alpha \\left( p(A^\\alpha = 1) - 1 \\right) p(A^\\alpha = 1) (y_i y_k) \\left[ \\log\\frac{p(A=1)}{p(A=0)} - p(B^\\alpha = 1) \\log\\frac{p(A=1, B=1)}{p(A=0, B=1)} - p(B^\\alpha = 0) \\log\\frac{p(A=1, B=0)}{p(A=0, B=0)} \\right]$$\n\nThe above derivation is drawn from Becker and Hinton (1989), who show that mutual information can be used as a learning signal in back-propagation nets. We can now perform gradient ascent in $I(A; B)$ for each weight in both modules using a two-pass procedure, the probabilities across cases being accumulated in the first pass.\n\nThis approach was applied to a system of two mean field modules (the left and right halves of figure 1 without the connecting central unit) to detect shift. As in the multi-completion task, random binary vectors were clamped onto inputs A1, A2 and B1, B2 related only by shift. Hence, the only way the two modules can provide mutual information to each other is by representing the shift. Maximizing the mutual information between them created perfect shift detecting modules in only 10 two-pass sweeps through the 288 training cases. 
That is, after training, each module was found to have low free energy for either left or right shifts, and high free energy for the other. Details of the simulations are again given in Galland and Hinton (1990).\n\n6  SUMMARY\n\nStandard deterministic Boltzmann learning failed to extract high order features in a network bottleneck. We then explored a variant of DBM learning in which the free energy of a module represents a stochastic binary variable. This variant can efficiently discover that shift is an important feature without using external supervision, provided we use an architecture and an objective function that are designed to extract higher order features which are invariant across space.\n\nAcknowledgements\n\nWe would like to thank Sue Becker for many helpful comments. This research was supported by grants from the Ontario Information Technology Research Center and the National Science and Engineering Research Council of Canada. Geoffrey Hinton is a fellow of the Canadian Institute for Advanced Research.\n\nReferences\n\nBecker, S. and Hinton, G. E. (1989). Spatial coherence as an internal teacher for a neural network. Technical Report CRG-TR-89-7, University of Toronto.\n\nGalland, C. C. and Hinton, G. E. (1990). Experiments on discovering high order features with mean field modules. University of Toronto Connectionist Research Group Technical Report, forthcoming.\n\nHinton, G. E. (1986). Learning distributed representations of concepts. Proceedings of the Eighth Annual Conference of the Cognitive Science Society, Amherst, Mass.\n\nHinton, G. E. (1987). Connectionist learning procedures. Technical Report CMU-CS-87-115, Carnegie Mellon University.\n\nHinton, G. E. 
(1989). Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1.\n\nHinton, G. E. and Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In Rumelhart, D. E., McClelland, J. L., and the PDP group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, MIT Press, Cambridge, MA.\n\nHopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences U.S.A., 81, 3088-3092.\n\nPeterson, C. and Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1, 995-1019.\n\nPeterson, C. and Hartman, E. (1988). Explorations of the mean field theory learning algorithm. Technical Report ACA-ST/HI-065-88, Microelectronics and Computer Technology Corporation, Austin, TX.\n\nPineda, F. J. (1987). Generalization of backpropagation to recurrent neural networks. Phys. Rev. Lett., 18, 2229-2232.\n", "award": [], "sourceid": 260, "authors": [{"given_name": "Conrad", "family_name": "Galland", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}