{"title": "Learning Sparse Codes with a Mixture-of-Gaussians Prior", "book": "Advances in Neural Information Processing Systems", "page_first": 841, "page_last": 847, "abstract": null, "full_text": "Learning sparse codes with a mixture-of-Gaussians prior\n\nBruno A. Olshausen\nDepartment of Psychology and Center for Neuroscience, UC Davis\n1544 Newton Ct.\nDavis, CA 95616\nbaolshausen@ucdavis.edu\n\nK. Jarrod Millman\nCenter for Neuroscience, UC Davis\n1544 Newton Ct.\nDavis, CA 95616\nkjmillman@ucdavis.edu\n\nAbstract\n\nWe describe a method for learning an overcomplete set of basis functions for the purpose of modeling sparse structure in images. The sparsity of the basis function coefficients is modeled with a mixture-of-Gaussians distribution. One Gaussian captures non-active coefficients with a small-variance distribution centered at zero, while one or more other Gaussians capture active coefficients with a large-variance distribution. We show that when the prior is in such a form, there exist efficient methods for learning the basis functions as well as the parameters of the prior. The performance of the algorithm is demonstrated on a number of test cases and also on natural images. The basis functions learned on natural images are similar to those obtained with other methods, but the sparse form of the coefficient distribution is much better described. Also, since the parameters of the prior are adapted to the data, no assumption about sparse structure in the images need be made a priori; rather, it is learned from the data.\n\n1 Introduction\n\nThe general problem we address here is that of learning a set of basis functions for representing natural images efficiently.\n
Previous work using a variety of optimization schemes has established that the basis functions which best code natural images in terms of sparse, independent components resemble a Gabor-like wavelet basis in which the basis functions are spatially localized, oriented and bandpass in spatial-frequency [1, 2, 3, 4]. In order to tile the joint space of position, orientation, and spatial-frequency in a manner that yields useful image representations, it has also been advocated that the basis set be overcomplete [5], where the number of basis functions exceeds the dimensionality of the images being coded. A major challenge in learning overcomplete bases, though, comes from the fact that the posterior distribution over the coefficients must be sampled during learning. When the posterior is sharply peaked, as it is when a sparse prior is imposed, then conventional sampling methods become especially cumbersome.\n\nOne approach to dealing with the problems associated with overcomplete codes and sparse priors is suggested by the form of the resulting posterior distribution over the coefficients averaged over many images. Shown below is the posterior distribution of one of the coefficients in a 4x overcomplete representation. The sparse prior that was imposed in learning was a Cauchy distribution and is overlaid (dashed line). It would seem that the coefficients do not fit this imposed prior very well, and instead want to occupy one of two states: an inactive state in which the coefficient is set nearly to zero, and an active state in which the coefficient takes on some significant non-zero value along a continuum. This suggests that the appropriate choice of prior is one that is capable of capturing these two discrete states.
\n\nFigure 1: Posterior distribution of coefficients with Cauchy prior overlaid.\n\nOur approach to modeling this form of sparse structure uses a mixture-of-Gaussians prior over the coefficients. A set of binary or ternary state variables determines whether the coefficient is in the active or inactive state, and the coefficient is then Gaussian distributed with a variance and mean that depend on the state variable. An important advantage of this approach, with regard to the sampling problems mentioned above, is that the use of Gaussian distributions allows an analytical solution for integrating over the posterior distribution for a given setting of the state variables. The only sampling that needs to be done then is over the binary or ternary state variables. We show here that this problem is a tractable one. This approach differs from that taken previously by Attias [6] in that we do not use variational methods to approximate the posterior, but rather we rely on sampling to adequately characterize the posterior distribution over the coefficients.\n\n2 Mixture-of-Gaussians model\n\nAn image, $I(x, y)$, is modeled as a linear superposition of basis functions, $\\phi_i(x, y)$, with coefficients $a_i$, plus Gaussian noise $\\nu(x, y)$:\n\n$$I(x, y) = \\sum_i a_i \\phi_i(x, y) + \\nu(x, y) \\qquad (1)$$\n\nIn what follows this will be expressed in vector-matrix notation as $I = \\Phi a + \\nu$.\n\nThe prior probability distribution over the coefficients is factorial, with the distribution over each coefficient $a_i$ modeled as a mixture-of-Gaussians distribution with either two or three Gaussians (fig. 2). A set of binary or ternary state variables $s_i$ then determines which Gaussian is used to describe the coefficients.
\nThe total prior over both sets of variables, $a$ and $s$, is of the form\n\n$$P(a, s) = \\prod_i P(a_i | s_i) P(s_i) \\qquad (2)$$\n\nwhere $P(s_i)$ determines the probability of being in the active or inactive states, and $P(a_i | s_i)$ is a Gaussian distribution whose mean and variance are determined by the current state $s_i$.\n\nFigure 2: Mixture-of-Gaussians prior, $P(a_i)$ plotted against $a_i$. Panels: two Gaussians (binary state variables) and three Gaussians (ternary state variables).\n\nThe total image probability is then given by\n\n$$P(I | \\theta) = \\sum_s P(s | \\theta) \\int P(I | a, \\theta) P(a | s, \\theta) \\, da \\qquad (3)$$\n\nwhere\n\n$$P(I | a, \\theta) = \\frac{1}{Z_{\\lambda_N}} e^{-\\frac{\\lambda_N}{2} |I - \\Phi a|^2} \\qquad (4)$$\n\n$$P(a | s, \\theta) = \\frac{1}{Z_{\\Lambda_a(s)}} e^{-\\frac{1}{2} (a - \\mu(s))^T \\Lambda_a(s) (a - \\mu(s))} \\qquad (5)$$\n\n$$P(s | \\theta) = \\frac{1}{Z_{\\Lambda_s}} e^{-\\frac{1}{2} s^T \\Lambda_s s} \\qquad (6)$$\n\nand the parameters $\\theta$ include $\\lambda_N$, $\\Phi$, $\\Lambda_a(s)$, $\\mu(s)$, and $\\Lambda_s$. $\\Lambda_a(s)$ is a diagonal inverse covariance matrix with elements $\\Lambda_a(s)_{ii} = \\lambda_{a_i}(s_i)$. (The notations $\\Lambda_a(s)$ and $\\mu(s)$ are used here to explicitly reflect the dependence of the means and variances of the $a_i$ on $s_i$.) $\\Lambda_s$ is also diagonal (for now) with elements $\\Lambda_{s,ii} = \\lambda_{s_i}$. The model is illustrated graphically in figure 3.\n\nFigure 3: Image model, with state variables $s_i$ (binary or ternary).\n\n3 Learning\n\nThe objective function for learning the parameters of the model is the average log-likelihood:\n\n$$\\mathcal{L} = \\langle \\log P(I | \\theta) \\rangle \\qquad (7)$$\n\nMaximizing this objective will minimize the lower bound on coding length.\n\nLearning is accomplished via gradient ascent on the objective, $\\mathcal{L}$. The learning rules for the parameters $\\Lambda_s$, $\\Lambda_a(s)$, $\\mu(s)$ and $\\Phi$ are given by:\n\n$$\\frac{\\partial \\mathcal{L}}{\\partial \\lambda_{s_i}} \\propto \\frac{1}{2} \\left[ \\langle s_i \\rangle_{P(s_i | \\theta)} - \\langle s_i \\rangle_{P(s | I, \\theta)} \\right] \\qquad (8)$$\n\n$$\\frac{\\partial \\mathcal{L}}{\\partial \\lambda_{a_i}(u)} \\propto \\frac{1}{2} \\left[ \\frac{\\langle \\delta(s_i - u) \\rangle_{P(s | I, \\theta)}}{\\lambda_{a_i}(u)} - \\langle \\delta(s_i - u) \\left( K_{ii}(u) - 2 \\hat{a}_i(u) \\mu_i(u) + \\mu_i^2(u) \\right) \\rangle_{P(s | I, \\theta)} \\right] \\qquad (9)$$\n\n$$\\frac{\\partial \\mathcal{L}}{\\partial \\mu_i(u)} \\propto \\lambda_{a_i}(u) \\langle \\delta(s_i - u) (\\hat{a}_i(u) - \\mu_i(u)) \\rangle_{P(s | I, \\theta)} \\qquad (10)$$\n\n$$\\frac{\\partial \\mathcal{L}}{\\partial \\Phi} \\propto \\lambda_N \\left[ I \\langle \\hat{a}(s)^T \\rangle_{P(s | I, \\theta)} - \\Phi \\langle K(s) \\rangle_{P(s | I, \\theta)} \\right] \\qquad (11)$$\n\nwhere $u$ takes on values 0, 1 (binary) or -1, 0, 1 (ternary) and $K(s) = H^{-1}(s) + \\hat{a}(s) \\hat{a}(s)^T$. ($\\hat{a}$ and $H$ are defined in eqs. 15 and 16 in the next section.) Note that in these expressions we have dropped the outer brackets averaging over images simply to reduce clutter.\n\nThus, for each image we must sample from the posterior $P(s | I, \\theta)$ in order to collect the appropriate statistics needed for learning. These statistics must be accumulated over many different images, and then the parameters are updated according to the rules above. Note that this approach differs from that of Attias [6] in that we do not attempt to sum over all states, $s$, or to use the variational approximation to approximate the posterior. Instead, we are effectively summing only over those states that are most probable according to the posterior. We conjecture that this scheme will work in practice because the posterior has significant probability only for a small fraction of states $s$, and so it can be well-characterized by a relatively small number of samples. Next we present an efficient method for Gibbs sampling from the posterior.\n\n4 Sampling and inference\n\nIn order to sample from the posterior $P(s | I, \\theta)$, we first cast it in Boltzmann form:\n\n$$P(s | I, \\theta) \\propto e^{-E(s)} \\qquad (12)$$\n\nwhere\n\n$$E(s) = -\\log P(s, I | \\theta) = -\\log \\left[ P(s | \\theta) \\int P(I | a, \\theta) P(a | s, \\theta) \\, da \\right]$$\n\n$$\\quad = \\frac{1}{2} s^T \\Lambda_s s + \\log Z_{\\Lambda_a(s)} + E_{a|s}(\\hat{a}, s) + \\frac{1}{2} \\log \\det H(s) + \\mathrm{const.} \\qquad (13)$$\n\nand\n\n$$E_{a|s}(a, s) = \\frac{\\lambda_N}{2} |I - \\Phi a|^2 + \\frac{1}{2} (a - \\mu(s))^T \\Lambda_a(s) (a - \\mu(s)) \\qquad (14)$$\n\n$$\\hat{a} = \\arg\\min_a E_{a|s}(a, s) \\qquad (15)$$\n\n$$H(s) = \\nabla\\nabla_a E_{a|s}(a, s) = \\lambda_N \\Phi^T \\Phi + \\Lambda_a(s) \\qquad (16)$$\n\nGibbs sampling on $P(s | I, \\theta)$ can be performed by flipping state variables $s_i$ according to\n\n$$P(s_i \\leftarrow s^\\alpha) = \\frac{1}{1 + e^{\\Delta E(s_i \\leftarrow s^\\alpha)}} \\quad \\text{(binary)} \\qquad (17)$$\n\n$$P(s_i \\leftarrow s^\\alpha) = \\frac{1}{1 + e^{\\Delta E(s_i \\leftarrow s^\\alpha)} \\left[ 1 + e^{-\\Delta E(s_i \\leftarrow s^\\beta)} \\right]} \\quad \\text{(ternary)} \\qquad (18)$$\n\nwhere $s^\\alpha = \\bar{s}_i$ in the binary case, and $s^\\alpha$ and $s^\\beta$ are the two alternative states in the ternary case. $\\Delta E(s_i \\leftarrow s^\\alpha)$ denotes the change in $E(s)$ due to changing $s_i$ to $s^\\alpha$ and is given by:\n\n(19)\n\nwhere $\\Delta s_i = s^\\alpha - s_i$, $\\Delta\\lambda_{a_i} = \\lambda_{a_i}(s^\\alpha) - \\lambda_{a_i}(s_i)$, $J = H^{-1}$, and $v_i = \\lambda_{a_i}(s_i) \\mu_i(s_i)$. Note that all computations for considering a change of state are local and involve only terms with index $i$. Thus, deciding whether or not to change state can be computed quickly. However, if a change of state is accepted, then we must update $J$. Using the Sherman-Morrison formula, this can be kept to an $O(N^2)$ computation:\n\n$$J \\leftarrow J - \\left[ \\frac{\\Delta\\lambda_{a_k}}{1 + \\Delta\\lambda_{a_k} J_{kk}} \\right] J_k J_k^T \\qquad (20)$$\n\nAs long as accepted state changes are rare (which we have found to be the case for sparse distributions), Gibbs sampling may be performed quickly and efficiently. In addition, $H$ and $J$ are generally very sparse matrices, so as the system is scaled up the number of elements of $\\hat{a}$ that are affected by a flip of $s_i$ will be relatively few.\n\nIn order to code images under this model, a single state of the coefficients must be chosen for a given image. We use for this purpose the MAP estimator:\n\n$$\\hat{a} = \\arg\\max_a P(a | I, \\hat{s}, \\theta) \\qquad (21)$$\n\n$$\\hat{s} = \\arg\\max_s P(s | I, \\theta) \\qquad (22)$$\n\nMaximizing the posterior distribution over $s$ is accomplished by assigning a temperature,\n\n$$P(s | I, \\theta) \\propto e^{-E(s)/T} \\qquad (23)$$\n\nand gradually lowering it until there are no more state changes.\n\n5 Results\n\n5.1 Test cases\n\nWe first trained the algorithm on a number of test cases containing known forms of both sparse and non-sparse (bi-modal) structure, using both critically sampled (complete) and 2x overcomplete basis sets. The training sets consisted of 6x6 pixel image patches that were created by a sparse superposition of basis functions (36 or 72) with $P(|s_i| = 1) = 0.2$, $\\lambda_{a_i}(0) = 1000$, and $\\lambda_{a_i}(1) = 10$. The results of these test cases confirm that the algorithm is capable of correctly extracting both sparse and non-sparse structure from data; they are not shown here for lack of space.\n\n5.2 Natural images\n\nWe trained the algorithm on 8x8 image patches extracted from pre-whitened natural images. In all cases, the basis functions were initialized to random functions (white noise) and the prior was initialized to be Gaussian (both Gaussians of roughly equal variance). Shown in figure 4a, b are the results for a set of 128 basis functions (2x overcomplete) in the two-Gaussian case. In the three-Gaussian case, the prior was initialized to be platykurtic (all three Gaussians of equal variance but offset at three different positions). Thus, in this case the sparse form of the prior emerged completely from the data. The resulting priors for two of the coefficients are shown in figure 4c, with the posterior distribution averaged over many images overlaid. For some of the coefficients the posterior distribution matches the mixture-of-Gaussians prior well, but for others the tails appear more Laplacian in form. Also, it appears that the extra complexity offered by having three Gaussians is not utilized: both Gaussians move to the center position and have about the same mean.\n
When a non-sparse, bimodal prior is imposed, the basis function solution does not become localized, oriented, and bandpass as it does with sparse priors.\n\n5.3 Coding efficiency\n\nWe evaluated the coding efficiency by quantizing the coefficients to different levels and calculating the total coefficient entropy as a function of the distortion introduced by quantization. This was done for basis sets containing 48, 64, 96, and 128 basis functions. At high SNRs the overcomplete basis sets yield better coding efficiency, despite the fact that there are more coefficients to code. However, the point at which this occurs appears to be well beyond the point where errors are no longer perceptually noticeable (around 14 dB).\n\n6 Conclusions\n\nWe have shown here that both the prior and basis functions of our image model can be adapted to natural images. Without sparseness being imposed, the model both seeks distributions that are sparse and learns the appropriate basis functions for this distribution. Our conjecture that a small number of samples allows the posterior to be sufficiently characterized appears to hold. In all cases here, averages were collected over 40 Gibbs sweeps, with 10 sweeps for initialization. The algorithm proved capable of extracting the structure in challenging datasets in high dimensional spaces.\n\nThe overcomplete image codes have the lowest coding cost at high SNR levels, but at levels that appear higher than is practically useful. On the other hand, the sum of marginal entropies likely underestimates the true entropy of the coefficients considerably, as there are certainly statistical dependencies among the coefficients. So it may still be the case that the overcomplete bases will show a win at lower SNRs when these dependencies are included in the model (through the coupling term $\\Lambda_s$).\n\nFigure 4: An overcomplete set of 128 basis functions (a) and priors (b, vertical axis is log-probability) learned from natural images. c, Two of the priors learned from a three-Gaussian mixture using 64 basis functions, with the posterior distribution averaged over many coefficients overlaid. d, Rate distortion curve comparing the coding efficiency of different learned basis sets.\n\nAcknowledgments\n\nThis work was supported by NIH grant R29-MH057921.\n\nReferences\n\n[1] Olshausen BA, Field DJ (1997) \"Sparse coding with an overcomplete basis set: A strategy employed by V1?\" Vision Research, 37: 3311-3325.\n\n[2] Bell AJ, Sejnowski TJ (1997) \"The independent components of natural images are edge filters,\" Vision Research, 37: 3327-3338.\n\n[3] van Hateren JH, van der Schaaff A (1997) \"Independent component filters of natural images compared with simple cells in primary visual cortex,\" Proc. Royal Soc. Lond. B, 265: 359-366.\n\n[4] Lewicki MS, Olshausen BA (1999) \"A probabilistic framework for the adaptation and comparison of image codes,\" JOSA A, 16(7): 1587-1601.\n\n[5] Simoncelli EP, Freeman WT, Adelson EH, Heeger DJ (1992) \"Shiftable multiscale transforms,\" IEEE Transactions on Information Theory, 38(2): 587-607.\n\n[6] Attias H (1999) \"Independent factor analysis,\" Neural Computation, 11: 803-852.
", "award": [], "sourceid": 1668, "authors": [{"given_name": "Bruno", "family_name": "Olshausen", "institution": null}, {"given_name": "K.", "family_name": "Millman", "institution": null}]}