{"title": "Symplectic Nonlinear Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 437, "page_last": 443, "abstract": null, "full_text": "Symplectic Nonlinear Component Analysis \n\nLucas C. Parra \n\nSiemens Corporate Research \n\n755 College Road East, Princeton, NJ 08540 \n\nlucas@scr.siemens.com \n\nAbstract \n\nStatistically independent features can be extracted by finding a factorial representation of a signal distribution. Principal Component Analysis (PCA) accomplishes this for linearly correlated, Gaussian distributed signals. Independent Component Analysis (ICA), formalized by Comon (1994), extracts features in the case of linearly statistically dependent but not necessarily Gaussian distributed signals. Nonlinear Component Analysis, finally, should find a factorial representation for nonlinearly statistically dependent signals. For this task, this paper proposes a novel feed-forward, information conserving, nonlinear map: the explicit symplectic transformation. It also solves the problem of non-Gaussian output distributions by considering single coordinate higher order statistics. \n\n1 Introduction \n\nIn previous papers, Deco and Brauer (1994) and Parra, Deco, and Miesbach (1995) suggest volume conserving transformations and factorization as the key elements for a nonlinear version of Independent Component Analysis. As a general class of volume conserving transformations, Parra et al. (1995) propose the symplectic transformation. It was defined by an implicit nonlinear equation, which leads to a complex relaxation procedure for the function recall. In this paper an explicit form of the symplectic map is proposed, which overcomes these computational problems. \n\n
In order to correctly measure the factorization criterion for non-Gaussian output distributions, higher order statistics have to be considered. In the linear case, Comon (1994) includes higher order cumulants of the output distribution. Deco and Brauer (1994) consider multivariate higher order moments and use them in the case of nonlinear volume conserving transformations. But the calculation of multi-coordinate higher moments is computationally expensive. \n\nThe factorization criterion for statistical independence can be expressed in terms of minimal mutual information. Considering only volume conserving transformations allows one to concentrate on single coordinate statistics, which leads to an important reduction of computational complexity. So far, this approach (Deco & Schurman, 1994; Parra et al., 1995) has been restricted to second order statistics. The present paper discusses the use of higher order cumulants for the estimation of the single coordinate output distributions. The single coordinate entropies measured by the proposed technique match the entropies of the sampled data more accurately. This in turn leads to better factorization results. \n\n2 Statistical Independence \n\nMore general than the decorrelation used in PCA, the goal is to extract statistically independent features from a signal distribution p(x). We look for a deterministic transformation on $\\mathbb{R}^n$, y = f(x), which generates a factorial representation $p(y) = \\prod_i p(y_i)$, or at least a representation where the individual coordinates $p(y_i)$ of the output variable y are \"as factorial as possible\". This can be accomplished by minimizing the mutual information MI[P(y)], \n\n$$0 \\le MI[P(y)] = \\sum_{i=1}^{n} H[P(y_i)] - H[P(y)], \\qquad (1)$$ \n\nsince MI[P(y)] = 0 holds if and only if p(y) is factorial. 
The mutual information can be used as a measure of \"independence\". The entropies H in definition (1) are defined as usual by $H[P(y)] = -\\int_{-\\infty}^{\\infty} p(y) \\ln p(y) \\, dy$. \n\nAs in linear PCA we select volume conserving transformations, but now without restricting ourselves to linearity. In the noise-free case of reversible transformations, volume conservation implies conservation of entropy from the input x to the output y, i.e. H[P(y)] = H[P(x)] = const (see Papoulis, 1991). The minimization of the mutual information (1) then reduces to the minimization of the single coordinate output entropies H[P(y_i)]. This substantially reduces the complexity of the problem, since no multi-coordinate statistics are required. \n\nFigure 1: LEFT (panel title: Edgeworth approximation to second and fourth order): Dotted line: exponential distribution with additive Gaussian noise (noise-variance/decay-constant = 0.2), sampled with 1000 data points. Dashed line: Gaussian approximation, equivalent to the Edgeworth approximation to second order. Solid line: Edgeworth approximation including terms up to fourth order. RIGHT: Structure of the volume conserving explicit symplectic map. \n\n2.1 Measuring the Entropy with Cumulants \n\nWith an upper bound minimization criterion the task of measuring entropies can be avoided (Parra et al., 1995): \n\n$$H[P(y_i)] \\le \\frac{1}{2} \\ln(2 \\pi e) + \\ln \\sigma_i \\qquad (2)$$ \n\nThe minimization of the individual output coordinate entropies H[P(y_i)] then simplifies to the minimization of the output variances $\\sigma_i^2$. For the validity of that approach it is crucial that the map y = f(x) transforms the arbitrary input distribution p(x) into a Gaussian output distribution. 
But volume conserving and continuous maps cannot transform arbitrary distributions into Gaussians. To overcome this problem one includes statistics of higher than second order in the optimization criterion. \n\nComon (1994) suggests using the Edgeworth expansion of a probability distribution. This leads to an analytic expression of the entropy in terms of measurable higher order cumulants. The Edgeworth expansion expresses the multiplicative correction to the best Gaussian approximation of the distribution in the orthonormal basis of Hermite polynomials $h_\\alpha(y)$. The expansion coefficients are basically given by the cumulants $c_\\alpha$ of the distribution p(y). For a zero-mean distribution with variance $\\sigma^2$ the Edgeworth expansion reads (see Kendall & Stuart, 1969) \n\n$$p(y) = \\frac{1}{\\sqrt{2 \\pi} \\sigma} e^{-\\frac{y^2}{2 \\sigma^2}} f(y) \\qquad (3)$$ \n\nNote that by truncating this expansion at a certain order we obtain an approximation $p_{app}(y)$ which is not strictly positive. Figure 1, left, shows a sampled exponential distribution with additive Gaussian noise. \n\nBy cutting expansion (3) at fourth order, and further expanding the logarithm in the definition of the entropy up to sixth order, Comon (1994) approximates the entropy by \n\n$$H[P(y)_{app}] \\approx \\frac{1}{2} \\ln(2 \\pi e) + \\ln \\sigma - \\frac{1}{12} \\frac{c_3^2}{\\sigma^6} - \\frac{1}{48} \\frac{c_4^2}{\\sigma^8} - \\frac{7}{48} \\frac{c_3^4}{\\sigma^{12}} + \\frac{1}{8} \\frac{c_3^2 c_4}{\\sigma^6 \\sigma^4} \\qquad (4)$$ \n\nWe suggest using this expression to minimize the single coordinate entropies in the definition of the mutual information (1). \n\n2.2 Measuring the Entropy by Estimating an Approximation \n\nNote that (4) could only be obtained by truncating the expansion (3). It is therefore limited to fourth order statistics, which might not be enough for a satisfactory approximation. 
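As a rough illustration of how the fourth-order approximation (4) can be evaluated from data, the following Python sketch estimates the cumulants from samples. The function name entropy_fourth_order, the use of NumPy, and the plain sample-moment estimators are assumptions made for this sketch. The coefficients are those of the standard fourth-order cumulant expansion that (4) is understood to state, so this is not the author's implementation.

```python
import numpy as np

def entropy_fourth_order(y):
    # Hypothetical helper: cumulant-based entropy approximation in the
    # form of equation (4), using simple sample-moment estimators.
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                        # the expansion assumes zero mean
    var = np.mean(y ** 2)                   # sigma^2
    c3 = np.mean(y ** 3)                    # third cumulant (zero-mean case)
    c4 = np.mean(y ** 4) - 3.0 * var ** 2   # fourth cumulant (zero-mean case)
    k3 = c3 / var ** 1.5                    # c3 / sigma^3
    k4 = c4 / var ** 2                      # c4 / sigma^4
    # Gaussian part: 0.5 * ln(2 pi e) + ln(sigma)
    gauss = 0.5 * np.log(2.0 * np.pi * np.e) + 0.5 * np.log(var)
    # Higher order corrections, written with standardized cumulants
    return (gauss - k3 ** 2 / 12.0 - k4 ** 2 / 48.0
            - 7.0 * k3 ** 4 / 48.0 + k3 ** 2 * k4 / 8.0)
```

For a large uniform sample of unit width, the value returned by this sketch falls below the Gaussian bound 0.5 ln(2 pi e) + ln(sigma), which is the qualitative behaviour reported in the uniform column of Table 1.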
Besides, the additional approximation of the logarithm is accurate only for small corrections to the best Gaussian approximation, i.e. for $f(y) \\approx 1$. For distributions with non-Gaussian tails the correction terms might be rather large and even negative, as noted above. We therefore suggest, as an alternative, measuring the entropy by estimating the logarithm of the approximated distribution $\\ln p_{app}(y)$ with the given data points $y_\\nu$, using the Edgeworth approximation (3) for $p_{app}(y)$, \n\n$$H[P(y)] \\approx -\\frac{1}{N} \\sum_{\\nu=1}^{N} \\ln p_{app}(y_\\nu) = const + \\ln \\sigma - \\frac{1}{N} \\sum_{\\nu=1}^{N} \\ln f(y_\\nu) \\qquad (5)$$ \n\nFurthermore, we suggest correcting the truncated expansion $p_{app}$ by setting $f_{app}(y) \\to 0$ for all $f_{app}(y) < 0$. For the entropy measurement (5) there is in principle no limitation to any specific order. \n\nIn Table 1 the different measures of entropy are compared. The values in the row labeled 'partition' are measured by counting the numbers n(i) of data points falling in equidistant intervals i of width $\\Delta y$ and summing $-p(i) \\Delta y \\ln p(i)$ over all intervals, with $p(i) \\Delta y = n(i)/N$. This gives good results compared to the theoretical values only because of the relatively large sample size. These values are presented here in order to have a reliable estimate for the case of the exponential distribution, where cumulant methods tend to fail. \n\nThe results for the exponential distribution show the difficulty of the measurement proposed by Comon, whereas the estimation measurement given by equation (5) is stable even when considering the (for this case) unreliable 5th and 6th order cumulants. The results for the symmetric-triangular and uniform distributions demonstrate the insensitivity of the Gaussian upper bound for the example of Figure 2. A uniform square distribution is rotated by an angle $\\alpha$. 
On the abscissa and ordinate a triangular or a uniform distribution is observed for the angles $\\alpha = \\pi/4$ or $\\alpha = 0$, respectively. The approximation of the single coordinate entropies with a Gaussian measure is the same in both cases, whereas measurements including higher order statistics correctly detect minimal entropy (at fixed total information) for the uniform distribution at $\\alpha = 0$. \n\n3 Explicit Symplectic Transformation \n\nDifferent ways of realizing a volume conserving transformation that guarantees H[P(y)] = H[P(x)] have been proposed (Deco & Schurman, 1994; Parra et al., 1995). \n\nMeasured entropy of sampled distributions | Gauss | uniform | triangular symmetric | exponential + Gauss noise \npartition | 1.35 \u00b1 .02 | .024 \u00b1 .006 | .14 \u00b1 .02 | 1.31 \u00b1 .03 \nGaussian upper bound (2) | 1.415 \u00b1 .02 | .18 \u00b1 .016 | .18 \u00b1 .02 | 1.53 \u00b1 .04 \nComon, eq. (4) | 1.414 \u00b1 .02 | .14 \u00b1 .015 | .17 \u00b1 .02 | 3.0 \u00b1 2.5 \nEstimate (5) - 4th order | 1.414 \u00b1 .02 | .13 \u00b1 .015 | .17 \u00b1 .02 | 1.39 \u00b1 .05 \nEstimate (5) - 6th order | 1.414 \u00b1 .02 | .092 \u00b1 .001 | .16 \u00b1 .02 | 1.3 \u00b1 .5 \ntheoretical value | 1.419 | .0 | .153 | \n\nTable 1: Entropy values for different distributions sampled with N = 1000 data points and the different estimation methods explained in the text. The standard deviations are obtained by multiple repetition of the experiment. \n\nA general class of volume conserving transformations are the symplectic maps (Abraham & Marsden, 1978). An interesting and, for our purpose, important fact is that any symplectic transformation can be expressed in terms of a scalar function. 
And in turn any scalar function defines a symplectic map. In (Parra et al., 1995) a non-reflecting symplectic transformation has been presented. But its implicit definition results in the need to solve a nonlinear equation for each data point. This leads to time consuming computations which in practice limit the applications to low dimensional problems (n ~ 10). In this work reflecting symplectic transformations with an explicit definition are used to define a \"feed-forward\" volume conserving map. The input and output spaces are each divided into two partitions $x = (x_1, x_2)$ and $y = (y_1, y_2)$, with $x_1, x_2, y_1, y_2 \\in \\mathbb{R}^{n/2}$. \n\n(6) \n\nThe structure of this symplectic map is represented in Figure 1, right. Two scalar functions $P: \\mathbb{R}^{n/2} \\to \\mathbb{R}$ and $Q: \\mathbb{R}^{n/2} \\to \\mathbb{R}$ can be chosen arbitrarily. Note that for quadratic functions equation (6) represents a linear transformation. In order to have a general transformation we introduce for each of these scalar functions a 3-layer perceptron with nonlinear hidden units and a single linear output unit: \n\n(7) \n\nThe scalar functions P and Q are parameterized by the network parameters $w_1, w_2 \\in \\mathbb{R}^m$ and $W_1, W_2 \\in \\mathbb{R}^m \\times \\mathbb{R}^{n/2}$. The hidden-unit nonlinear activation function g applies to each component of the vectors $W_1 y_1$ and $W_2 x_2$, respectively. Because of the structure of equation (6), the output coordinates $y_1$ depend only additively on the input coordinates $x_1$. To obtain a more general nonlinear dependence a second symplectic layer has to be added. \n\nTo obtain factorial distributions the parameters of the map have to be trained. The approximations of the single coordinate entropies (4) or (5) are inserted in the mutual information optimization criterion (1). 
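The explicit map described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the signs of the two gradient terms, the choice of tanh for the activation g, and the names scalar_grad, symplectic_map and numerical_jacobian are all hypothetical and not taken from the paper. What the sketch does show is the structural point made in the text: y1 is x1 plus a term depending only on x2, y2 is x2 plus a term depending only on y1, and a map of this form conserves volume.

```python
import numpy as np

def scalar_grad(w, W, z):
    # Gradient with respect to z of the scalar S(z) = w . tanh(W z),
    # a one-hidden-layer network with a single linear output unit.
    h = np.tanh(W @ z)
    return W.T @ (w * (1.0 - h ** 2))

def symplectic_map(x, params):
    # x has even length n; params = (w1, W1, w2, W2),
    # with w of shape (m,) and W of shape (m, n/2).
    w1, W1, w2, W2 = params
    half = x.size // 2
    x1, x2 = x[:half], x[half:]
    y1 = x1 - scalar_grad(w2, W2, x2)   # P acts on x2 (sign is an assumption)
    y2 = x2 + scalar_grad(w1, W1, y1)   # Q acts on y1 (sign is an assumption)
    return np.concatenate([y1, y2])

def numerical_jacobian(f, x, eps=1e-6):
    # Central-difference Jacobian, used only to check volume conservation.
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        d = np.zeros(n)
        d[j] = eps
        J[:, j] = (f(x + d) - f(x - d)) / (2.0 * eps)
    return J
```

Each of the two steps is a shear whose Jacobian is triangular with unit diagonal, so the determinant of the full Jacobian equals one regardless of the network parameters. Evaluating numerical_jacobian on symplectic_map at a random point is a quick way to check this numerically.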
These approximations are expressed through moments in terms of the measured output data points. Therefore, the gradient of these expressions with respect to the parameters of the map can be computed in principle. For that matter different kinds of averages need to be computed. Even though the computational complexity is not substantially increased compared with the efficient minimum variance criterion (2), the complexity of the algorithm increases considerably. Therefore, we applied an optimization algorithm that does not require any gradient information. The simple stochastic and parallel update algorithm ALOPEX (Unnikrishnan & Venugopal, 1994) was used. \n\nFigure 2: Sampled 2-dimensional square uniform distribution rotated by $\\pi/4$. Solid lines represent the directions found by any of the higher order techniques explained in the text. Dashed lines represent directions calculated by linear PCA. (This result is arbitrary and varies with noise.) \n\n4 Experiments \n\nAs explained above, finding the correct statistically independent directions of a rotated two-dimensional uniform distribution causes problems for techniques which include only second order statistics. The statistically independent coordinates are simply the axes parallel to the edges of the distribution (see Figure 2). A rotation, i.e. a linear transformation, suffices for this task. The covariance matrix of the data is diagonal (in fact proportional to the identity) for any rotation of the square distribution and, hence, does not provide any information about the correct orientation of the square. 
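The statement about the covariance matrix can be checked numerically. The following Python sketch, which is an illustration and not part of the original experiments, samples a uniform unit square, rotates it by several angles, and prints the covariance matrix together with the excess kurtosis of the first coordinate. The sample size, the seed, and the helper names rotate and excess_kurtosis are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Uniform distribution on a unit square centred at the origin
square = rng.uniform(-0.5, 0.5, size=(100000, 2))

def rotate(points, alpha):
    # Rotate every row of points by the angle alpha (in radians)
    c, s = np.cos(alpha), np.sin(alpha)
    R = np.array([[c, -s], [s, c]])
    return points @ R.T

def excess_kurtosis(y):
    # Fourth standardized moment minus 3 (zero for a Gaussian)
    y = y - y.mean()
    return np.mean(y ** 4) / np.mean(y ** 2) ** 2 - 3.0

for alpha in (0.0, np.pi / 8, np.pi / 4):
    y = rotate(square, alpha)
    cov = np.cov(y, rowvar=False)
    print(alpha, np.round(cov, 3).tolist(), round(excess_kurtosis(y[:, 0]), 2))
```

The covariance stays close to the identity divided by 12 for every angle, whereas the fourth-order statistic changes with the angle and is therefore able to distinguish the orientations.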
It is well known that PCA fails to find the statistically independent coordinates in the case of non-Gaussian distributions. Similarly, the Gaussian upper bound technique (2) is not capable of minimizing the mutual information in this case. Instead, with any one of the higher order criteria explained in the previous section one finds the appropriate coordinates for any linearly transformed multi-dimensional uniform distribution. This has been observed empirically for a series of setups. In these experiments the symplectic map was restricted to linearity by using quadratic scalar functions. \n\nFigure 3: Symplectic map trained with 4th and 2nd order statistics, corresponding to equations (5) and (2), respectively. Left: input distribution. The line at the center of the distribution gives the nonlinearly transformed noiseless signal distributed according to the distribution shown in Figure 1. Center and right: output distribution of the symplectic map corresponding to the 2nd order (center) and 4th order (right) criterion. \n\nThe second example shows that the proposed technique in fact finds nonlinear relations between the input coordinates. A one-dimensional signal distributed according to the distribution of Figure 1 was nonlinearly transformed into a two-dimensional signal and corrupted with additive noise, leading to the distribution shown in Figure 3, left. The task of finding statistically independent coordinates has been tackled by an explicit symplectic transformation with n = 2 and m = 6. Figure 3 shows the different results for the optimization according to the Gaussian upper bound criterion (2) and the approximated entropy criterion (5). 
Obviously, considering higher order statistics in fact improves the result by finding the better representation of the nonlinear dependency. \n\nReferences \n\nAbraham, R., & Marsden, J. (1978). Foundations of Mechanics. The Benjamin-Cummings Publishing Company, Inc., London. \n\nComon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287-314. \n\nDeco, G., & Brauer, W. (1994). Higher Order Statistical Decorrelation by Volume Conserving Nonlinear Maps. Neural Networks, ? submitted. \n\nDeco, G., & Schurman, B. (1994). Learning Time Series Evolution by Unsupervised Extraction of Correlations. Physical Review E, ? submitted. \n\nKendall, M. G., & Stuart, A. (1969). The Advanced Theory of Statistics (3rd edition), Vol. 1. Charles Griffin and Company Limited, London. \n\nPapoulis, A. (1991). Probability, Random Variables, and Stochastic Processes. Third Edition, McGraw-Hill, New York. \n\nParra, L., Deco, G., & Miesbach, S. (1995). Redundancy reduction with information-preserving nonlinear maps. Network, 6(1), 61-72. \n\nUnnikrishnan, K. P., & Venugopal, K. P. (1994). Alopex: A Correlation-Based Learning Algorithm for Feedforward and Recurrent Neural Networks. Neural Computation, 6(3), 469-490. \n", "award": [], "sourceid": 1080, "authors": [{"given_name": "Lucas", "family_name": "Parra", "institution": null}]}