{"title": "Empirical Entropy Manipulation for Real-World Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 851, "page_last": 857, "abstract": null, "full_text": "Empirical Entropy Manipulation for Real-World Problems\n\nPaul Viola*, Nicol N. Schraudolph, Terrence J. Sejnowski\n\nComputational Neurobiology Laboratory\nThe Salk Institute for Biological Studies\n10010 North Torrey Pines Road\nLa Jolla, CA 92037-1099\nviola@salk.edu\n\nAbstract\n\nNo finite sample is sufficient to determine the density, and therefore the entropy, of a signal directly. Some assumption about either the functional form of the density or about its smoothness is necessary. Both amount to a prior over the space of possible density functions. By far the most common approach is to assume that the density has a parametric form.\n\nBy contrast we derive a differential learning rule called EMMA that optimizes entropy by way of kernel density estimation. Entropy and its derivative can then be calculated by sampling from this density estimate. The resulting parameter update rule is surprisingly simple and efficient.\n\nWe will show how EMMA can be used to detect and correct corruption in magnetic resonance images (MRI). This application is beyond the scope of existing parametric entropy models.\n\n1 Introduction\n\nInformation theory is playing an increasing role in unsupervised learning and visual processing. For example, Linsker has used the concept of information maximization to produce theories of development in the visual cortex (Linsker, 1988). Becker and Hinton have used information theory to motivate algorithms for visual processing (Becker and Hinton, 1992). Bell and Sejnowski have used information maximization\n\n* Author to whom correspondence should be addressed. 
Current address: M.I.T., 545 Technology Square, Cambridge, MA 02139.\n\nto solve the \"cocktail party\" or signal separation problem (Bell and Sejnowski, 1995). In order to simplify analysis and implementation, each of these techniques makes specific assumptions about the nature of the signals used, typically that the signals are drawn from some parametric density. In practice, such assumptions are very inflexible.\n\nIn this paper we will derive a procedure that can effectively estimate and manipulate the entropy of a wide variety of signals using non-parametric densities. Our technique is distinguished by its simplicity, flexibility and efficiency.\n\nWe will begin with a discussion of principal components analysis (PCA) as an example of a simple parametric entropy manipulation technique. After pointing out some of PCA's limitations, we will then derive a more powerful non-parametric entropy manipulation procedure. Finally, we will show that the same entropy estimation procedure can be used to tackle a difficult visual processing problem.\n\n1.1 Parametric Entropy Estimation\n\nTypically parametric entropy estimation is a two step process. We are given a parametric model for the density of a signal and a sample. First, from the space of possible density functions the most probable is selected. This often requires a search through parameter space. Second, the entropy of the most likely density function is evaluated.\n\nParametric techniques can work well when the assumed form of the density matches the actual data. Conversely, when the parametric assumption is violated the resulting algorithms are incorrect. The most common assumption, that the data follow the Gaussian density, is especially restrictive. 
An entropy maximization technique that assumes that data is Gaussian, but operates on data drawn from a non-Gaussian density, may in fact end up minimizing entropy.\n\n1.2 Example: Principal Components Analysis\n\nThere are a number of signal processing and learning problems that can be formulated as entropy maximization problems. One prominent example is principal components analysis (PCA). Given a random variable $X$, a vector $v$ can be used to define a new random variable, $Y_v = X \\cdot v$, with variance $\\mathrm{Var}(Y_v) = E[(X \\cdot v - E[X \\cdot v])^2]$. The principal component $\\hat{v}$ is the unit vector for which $\\mathrm{Var}(Y_v)$ is maximized.\n\nIn practice neither the density of $X$ nor $Y_v$ is known. The projection variance is computed from a finite sample, $A$, of points from $X$,\n\n$$\\mathrm{Var}(Y_v) \\approx \\mathrm{Var}_A(Y_v) \\equiv E_A[(X \\cdot v - E_A[X \\cdot v])^2], \\tag{1}$$\n\nwhere $\\mathrm{Var}_A(Y_v)$ and $E_A[\\cdot]$ are shorthand for the empirical variance and mean evaluated over $A$. Oja has derived an elegant on-line rule for learning $\\hat{v}$ when presented with a sample of $X$ (Oja, 1982).\n\nUnder the assumption that $X$ is Gaussian it is easily proven that $Y_{\\hat{v}}$ has maximum entropy. Moreover, in the absence of noise, $Y_{\\hat{v}}$ contains maximal information about $X$. However, when $X$ is not Gaussian $Y_{\\hat{v}}$ is generally not the most informative projection.\n\n2 Estimating Entropy with Parzen Densities\n\nWe will now derive a general procedure for manipulating and estimating the entropy of a random variable from a sample. Given a sample of a random variable $X$, we can construct another random variable $Y = F(X, v)$. The entropy, $h(Y)$, is a function of $v$ and can be manipulated by changing $v$. The following derivation assumes that $Y$ is a vector random variable. 
The joint entropy of two random variables, $h(W_1, W_2)$, can be evaluated by constructing the vector random variable $Y = [W_1, W_2]^T$ and evaluating $h(Y)$.\n\nRather than assume that the density has a parametric form, whose parameters are selected using maximum likelihood estimation, we will instead use Parzen window density estimation (Duda and Hart, 1973). In the context of entropy estimation, the Parzen density estimate has three significant advantages over maximum likelihood parametric density estimates: (1) it can model the density of any signal provided the density function is smooth; (2) since the Parzen estimate is computed directly from the sample, there is no search for parameters; (3) the derivative of the entropy of the Parzen estimate is simple to compute.\n\nThe form of the Parzen estimate constructed from a sample $A$ of size $N_A$ is\n\n$$P^*(y, A) = \\frac{1}{N_A} \\sum_{y_A \\in A} R(y - y_A) = E_A[R(y - y_A)], \\tag{2}$$\n\nwhere the Parzen estimator is constructed with the window function $R(\\cdot)$ which integrates to 1. We will assume that the Parzen window function is a Gaussian density function. This will simplify some analysis, but it is not necessary. Any differentiable function could be used. Another good choice is the Cauchy density.\n\nUnfortunately evaluating the entropy integral\n\n$$h(Y) \\approx -E_Y[\\log P^*(Y, A)] = -\\int_{-\\infty}^{\\infty} p(y) \\log P^*(y, A)\\, dy,$$\n\nwhere $p(y)$ denotes the true density of $Y$, is inordinately difficult. This integral can however be approximated as a sample mean:\n\n$$h(Y) \\approx h^*(Y) \\equiv -E_B[\\log P^*(y_B, A)], \\tag{3}$$\n\nwhere $E_B[\\cdot]$ is the sample mean taken over the sample $B$. The sample mean converges toward the true expectation at a rate proportional to $1/\\sqrt{N_B}$ ($N_B$ is the size of $B$). To reiterate, two samples can be used to estimate the entropy of a density: the first is used to estimate the density, the second is used to estimate the entropy\u00b9. 
We call $h^*(Y)$ the EMMA estimate of entropy\u00b2.\n\nOne way to extremize entropy is to use the derivative of entropy with respect to $v$. This may be expressed as\n\n$$\\frac{d}{dv} h(Y) \\approx \\frac{d}{dv} h^*(Y) = -\\frac{1}{N_B} \\sum_{y_B \\in B} \\frac{\\sum_{y_A \\in A} \\frac{d}{dv} g_\\psi(y_B - y_A)}{\\sum_{y_A \\in A} g_\\psi(y_B - y_A)} \\tag{4}$$\n\n$$= \\frac{1}{N_B} \\sum_{y_B \\in B} \\sum_{y_A \\in A} W_y(y_B, y_A) \\frac{d}{dv} \\frac{1}{2} D_\\psi(y_B - y_A), \\tag{5}$$\n\nwhere\n\n$$W_y(y_1, y_2) \\equiv \\frac{g_\\psi(y_1 - y_2)}{\\sum_{y_A \\in A} g_\\psi(y_1 - y_A)}, \\tag{6}$$\n\n$D_\\psi(y) \\equiv y^T \\psi^{-1} y$, and $g_\\psi(y)$ is a multi-dimensional Gaussian with covariance $\\psi$. $W_y(y_1, y_2)$ is an indicator of the degree of match between its arguments, in a \"soft\" sense. It will approach one if $y_1$ is significantly closer to $y_2$ than to any other element of $A$. To reduce entropy the parameters $v$ are adjusted such that there is a reduction in the average squared distance between points which $W_y$ indicates are nearby.\n\n\u00b9 Using a procedure akin to leave-one-out cross-validation a single sample can be used for both purposes.\n\n\u00b2 EMMA is a random but pronounceable subset of the letters in the words \"Empirical entropy Manipulation and Analysis\".\n\n2.1 Stochastic Maximization Algorithm\n\nBoth the calculation of the EMMA entropy estimate and its derivative involve a double summation. As a result the cost of evaluation is quadratic in sample size: $O(N_A N_B)$. While an accurate estimate of empirical entropy could be obtained by using all of the available data (at great cost), a stochastic estimate of the entropy can be obtained by using a random subset of the available data (at quadratically lower cost). This is especially critical in entropy manipulation problems, where the derivative of entropy is evaluated many hundreds or thousands of times. 
Without the quadratic savings that arise from using smaller samples, entropy manipulation would be impossible (see (Viola, 1995) for a discussion of these issues).\n\n2.2 Estimating the Covariance\n\nIn addition to the learning rate $\\lambda$, the covariance matrices of the Parzen window functions, $g_\\psi$, are important parameters of EMMA. These parameters may be chosen so that they are optimal in the maximum likelihood sense. For simplicity, we assume that the covariance matrices are diagonal, $\\psi = \\mathrm{DIAG}(\\sigma_1^2, \\sigma_2^2, \\ldots)$. Following a derivation almost identical to the one described in Section 2 we can derive an equation analogous to (4),\n\n$$\\frac{d}{d\\sigma_k} h^*(Y) = -\\frac{1}{N_B} \\sum_{y_B \\in B} \\sum_{y_A \\in A} W_y(y_B, y_A) \\frac{1}{\\sigma_k} \\left( \\frac{[y]_k^2}{\\sigma_k^2} - 1 \\right), \\tag{7}$$\n\nwhere $[y]_k$ is the $k$th component of the vector $y = y_B - y_A$. The optimal, or most likely, $\\psi$ minimizes $h^*(Y)$. In practice both $v$ and $\\psi$ are adjusted simultaneously; for example, while $v$ is adjusted to maximize $h^*(Y_v)$, $\\psi$ is adjusted to minimize $h^*(Y_v)$.\n\n3 Principal Components Analysis and Information\n\nAs a demonstration, we can derive a parameter estimation rule akin to principal components analysis that truly maximizes information. This new EMMA based component analysis (ECA) manipulates the entropy of the random variable $Y_v = X \\cdot v$ under the constraint that $|v| = 1$. For any given value of $v$ the entropy of $Y_v$ can be estimated from two samples of $X$ as $h^*(Y_v) = -E_B[\\log E_A[g_\\psi(x_B \\cdot v - x_A \\cdot v)]]$, where $\\psi$ is the variance of the Parzen window function. Moreover we can estimate the derivative of entropy:\n\n$$\\frac{d}{dv} h^*(Y_v) = \\frac{1}{N_B} \\sum_{B} \\sum_{A} W_y(y_B, y_A)\\, \\psi^{-1} (y_B - y_A)(x_B - x_A),$$\n\nwhere $y_A = x_A \\cdot v$ and $y_B = x_B \\cdot v$. 
The derivative may be decomposed into parts which can be understood more easily. Ignoring the weighting function $W_y \\psi^{-1}$ we are left with the derivative of some unknown function $f(Y_v)$:\n\n$$\\frac{d}{dv} f(Y_v) = \\frac{1}{N_B N_A} \\sum_{B} \\sum_{A} (y_B - y_A)(x_B - x_A). \\tag{8}$$\n\nWhat then is $f(Y_v)$? The derivative of the squared difference between samples is $\\frac{d}{dv} (y_B - y_A)^2 = 2 (y_B - y_A)(x_B - x_A)$. So we can see that\n\n$$f(Y_v) = \\frac{1}{2 N_B N_A} \\sum_{B} \\sum_{A} (y_B - y_A)^2$$\n\nis one half the expectation of the squared difference between pairs of trials of $Y_v$.\n\n[Figure 1: See text for description. Legend: ECA-MIN, ECA-MAX, BCM, BINGO, PCA.]\n\nRecall that PCA searches for the projection, $Y_v$, that has the largest sample variance. Interestingly, $f(Y_v)$ is precisely the sample variance. Without the weighting term $W_y \\psi^{-1}$, ECA would find exactly the same vector that PCA does: the maximum variance projection vector. However because of $W_y$, the derivative of ECA does not act on all points of $A$ and $B$ equally. Pairs of points that are far apart are forced no further apart. Another way of interpreting ECA is as a type of robust variance maximization. Points that might best be interpreted as outliers, because they are very far from the body of other points, play a very small role in the optimization. This robust nature stands in contrast to PCA which is very sensitive to outliers.\n\nFor densities that are Gaussian, the maximum entropy projection is the first principal component. In simulations ECA effectively finds the same projection as PCA, and it does so with speeds that are comparable to Oja's rule. 
ECA can be used to find both the entropy maximizing (ECA-MAX) and minimizing (ECA-MIN) axes. For more complex densities the PCA axis is very different from the entropy maximizing axis. To provide some intuition regarding the behavior of ECA we have run ECA-MAX, ECA-MIN, Oja's rule, and two related procedures, BCM and BINGO, on the same density. BCM is a learning rule that was originally proposed to explain development of receptive field patterns in visual cortex (Bienenstock, Cooper and Munro, 1982). More recently it has been argued that the rule finds projections that are far from Gaussian (Intrator and Cooper, 1992). Under a limited set of conditions this is equivalent to finding the minimum entropy projection. BINGO was proposed to find axes along which there is a bimodal distribution (Schraudolph and Sejnowski, 1993).\n\nFigure 1 displays a 400 point sample and the projection axes discussed above. The density is a mixture of two clusters. Each cluster has high kurtosis in the horizontal direction. The oblique axis projects the data so that it is most uniform and hence has the highest entropy; ECA-MAX finds this axis. Along the vertical axis the data is clustered and has low entropy; ECA-MIN finds this axis. The vertical axis also has the highest variance. Contrary to published accounts, the first principal component can in fact correspond to the minimum entropy projection. BCM, while it may find minimum entropy projections for some densities, is attracted to the kurtosis along the horizontal axis. For this distribution BCM neither minimizes nor maximizes entropy. Finally, BINGO successfully discovers that the vertical axis is very bimodal.\n\n
Figure 2: At left: A slice from an MRI scan of a head. Center: The scan after correction. Right: The density of pixel values in the MRI scan before and after correction.\n\n4 Applications\n\nEMMA has proven useful in a number of applications. In object recognition EMMA has been used to align 3D shape models with video images (Viola and Wells III, 1995). In the area of medical imaging EMMA has been used to register data that arises from differing medical modalities such as magnetic resonance images, computed tomography images, and positron emission tomography (Wells, Viola and Kikinis, 1995).\n\n4.1 MRI Processing\n\nIn addition, EMMA can be used to process magnetic resonance images (MRI). An MRI is a 2 or 3 dimensional image that records the density of tissues inside the body. In the head, as in other parts of the body, there are a number of distinct tissue classes including: bone, water, white matter, grey matter, and fat. In principle the density of pixel values in an MRI should be clustered, with one cluster for each tissue class. In reality MRI signals are corrupted by a bias field, a multiplicative offset that varies slowly in space. The bias field results from unavoidable variations in magnetic field (see (Wells III et al., 1994) for an overview of this problem).\n\nBecause the densities of each tissue type cluster together tightly, an uncorrupted MRI should have relatively low entropy. Corruption from the bias field perturbs the MRI image, increasing the values of some pixels and decreasing others. The bias field acts like noise, adding entropy to the pixel density. 
We use EMMA to find a low-frequency correction field that, when applied to the image, makes the pixel density have a lower entropy. The resulting corrected image will have a tighter clustering than the original density.\n\nCall the uncorrupted scan $s(x)$; it is a function of a spatial random variable $x$. The corrupted scan, $c(x) = s(x) + b(x)$, is a sum of the true scan and the bias field. There are physical reasons to believe $b(x)$ is a low order polynomial in the components of $x$. EMMA is used to minimize the entropy of the corrected signal, $h(c(x) - b(x, v))$, where $b(x, v)$, a third order polynomial with coefficients $v$, is an estimate for the bias corruption.\n\nFigure 2 shows an MRI scan and a histogram of pixel intensity before and after correction. The difference between the two scans is quite subtle: the uncorrected scan is brighter at top right and dimmer at bottom left. This non-homogeneity makes constructing automatic tissue classifiers difficult. In the histogram of the original scan white and grey matter tissue classes are confounded into a single peak ranging from about 0.4 to 0.6. The histogram of the corrected scan shows much better separation between these two classes. For images like this the correction field takes between 20 and 200 seconds to compute on a Sparc 10.\n\n5 Conclusion\n\nWe have demonstrated a novel entropy manipulation technique working on problems of significant complexity and practical importance. Because it is based on non-parametric density estimation it is quite flexible, requiring no strong assumptions about the nature of signals. The technique is widely applicable to problems in signal processing, vision and unsupervised learning. 
The resulting algorithms are computationally efficient.\n\nAcknowledgements\n\nThis research was supported by the Howard Hughes Medical Institute.\n\nReferences\n\nBecker, S. and Hinton, G. E. (1992). A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355:161-163.\n\nBell, A. J. and Sejnowski, T. J. (1995). An information-maximisation approach to blind separation. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing, volume 7, Denver 1994. MIT Press, Cambridge.\n\nBienenstock, E., Cooper, L., and Munro, P. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2.\n\nDuda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis. Wiley, New York.\n\nIntrator, N. and Cooper, L. N. (1992). Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks, 5:3-17.\n\nLinsker, R. (1988). Self-organization in a perceptual network. IEEE Computer, pages 105-117.\n\nOja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15:267-273.\n\nSchraudolph, N. N. and Sejnowski, T. J. (1993). Unsupervised discrimination of clustered data via optimization of binary information gain. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing, volume 5, pages 499-506, Denver 1992. Morgan Kaufmann, San Mateo.\n\nViola, P. A. (1995). Alignment by Maximization of Mutual Information. PhD thesis, Massachusetts Institute of Technology. MIT AI Laboratory TR 1548.\n\nViola, P. A. and Wells III, W. M. (1995). 
Alignment by maximization of mutual information. In Fifth Intl. Conf. on Computer Vision, pages 16-23, Cambridge, MA. IEEE.\n\nWells, W., Viola, P., and Kikinis, R. (1995). Multi-modal volume registration by maximization of mutual information. In Proceedings of the Second International Symposium on Medical Robotics and Computer Assisted Surgery, pages 55-62. Wiley.\n\nWells III, W., Grimson, W., Kikinis, R., and Jolesz, F. (1994). Statistical Gain Correction and Segmentation of MRI Data. In Proceedings of the Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, Wash. IEEE, Submitted.", "award": [], "sourceid": 1040, "authors": [{"given_name": "Paul", "family_name": "Viola", "institution": null}, {"given_name": "Nicol", "family_name": "Schraudolph", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}