{"title": "Online Independent Component Analysis with Local Learning Rate Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 789, "page_last": 795, "abstract": null, "full_text": "Online Independent Component Analysis\nWith Local Learning Rate Adaptation\n\nNicol N. Schraudolph\nXavier Giannakopoulos\n\nnic@idsia.ch\nxavier@idsia.ch\n\nIDSIA, Corso Elvezia 36\n6900 Lugano, Switzerland\n\nhttp://www.idsia.ch/\n\nAbstract\n\nStochastic meta-descent (SMD) is a new technique for online adaptation of local learning rates in arbitrary twice-differentiable systems. Like matrix momentum it uses full second-order information while retaining O(n) computational complexity by exploiting the efficient computation of Hessian-vector products. Here we apply SMD to independent component analysis, and employ the resulting algorithm for the blind separation of time-varying mixtures. By matching individual learning rates to the rate of change in each source signal's mixture coefficients, our technique is capable of simultaneously tracking sources that move at very different, a priori unknown speeds.\n\n1 Introduction\n\nIndependent component analysis (ICA) methods are typically run in batch mode in order to keep the stochasticity of the empirical gradient low. Often this is combined with a global learning rate annealing scheme that negotiates the tradeoff between fast convergence and good asymptotic performance. For time-varying mixtures, this must be replaced by a learning rate adaptation scheme. Adaptation of a single, global learning rate, however, facilitates the tracking only of sources whose mixing coefficients change at comparable rates [1], resp. switch all at the same time [2].
\nIn cases where some sources move much faster than others, or switch at different times, individual weights in the unmixing matrix must adapt at different rates in order to achieve good performance.\n\nWe apply stochastic meta-descent (SMD), a new online adaptation method for local learning rates [3, 4], to an extended Bell-Sejnowski ICA algorithm [5] with natural gradient [6] and kurtosis estimation [7] modifications. The resulting algorithm is capable of separating and tracking a time-varying mixture of 10 sources whose unknown mixing coefficients change at different rates.\n\n2 The SMD Algorithm\n\nGiven a sequence $\\vec{x}_0, \\vec{x}_1, \\ldots$ of data points, we minimize the expected value of a twice-differentiable loss function $f_{\\vec{w}}(\\vec{x})$ with respect to its parameters $\\vec{w}$ by stochastic gradient descent:\n\n$$\\vec{w}_{t+1} = \\vec{w}_t + \\vec{p}_t \\cdot \\vec{\\delta}_t, \\quad \\text{where} \\quad \\vec{\\delta}_t \\equiv -\\frac{\\partial f_{\\vec{w}_t}(\\vec{x}_t)}{\\partial \\vec{w}} \\tag{1}$$\n\nand $\\cdot$ denotes component-wise multiplication. The local learning rates $\\vec{p}$ are best adapted by exponentiated gradient descent [8, 9], so that they can cover a wide dynamic range while staying strictly positive:\n\n$$\\begin{aligned} \\ln \\vec{p}_t &= \\ln \\vec{p}_{t-1} - \\mu\\, \\frac{\\partial f_{\\vec{w}_t}(\\vec{x}_t)}{\\partial \\ln \\vec{p}} \\\\ \\vec{p}_t &= \\vec{p}_{t-1} \\cdot \\exp(\\mu\\, \\vec{\\delta}_t \\cdot \\vec{v}_t), \\quad \\text{where} \\quad \\vec{v}_t \\equiv \\frac{\\partial \\vec{w}_t}{\\partial \\ln \\vec{p}} \\end{aligned} \\tag{2}$$\n\nand $\\mu$ is a global meta-learning rate. This approach rests on the assumption that each element of $\\vec{p}$ affects $f_{\\vec{w}}(\\vec{x})$ only through the corresponding element of $\\vec{w}$. With considerable variation, (2) forms the basis of most local rate adaptation methods found in the literature.\n\nIn order to avoid an expensive exponentiation [10] for each weight update, we typically use the linearization $e^u \\approx 1 + u$, valid for small $|u|$, giving\n\n$$\\vec{p}_t = \\vec{p}_{t-1} \\cdot \\max(\\varrho,\\, 1 + \\mu\\, \\vec{\\delta}_t \\cdot \\vec{v}_t) \\tag{3}$$\n\nwhere we constrain the multiplier to be at least (typically) $\\varrho = 0.1$ as a safeguard against unreasonably small (or negative) values. For the meta-level gradient descent to be stable, $\\mu$ must in any case be chosen such that the multiplier for $\\vec{p}$ does not stray far from unity; under these conditions we find the linear approximation (3) quite sufficient.\n\nDefinition of $\\vec{v}$. The gradient trace $\\vec{v}$ should accurately measure the effect that a change in local learning rate has on the corresponding weight. It is tempting to consider only the immediate effect of a change in $\\vec{p}_t$ on $\\vec{w}_{t+1}$: declaring $\\vec{w}_t$ and $\\vec{\\delta}_t$ in (1) to be independent of $\\vec{p}_t$, one then quickly arrives at\n\n$$\\vec{v}_{t+1} \\equiv \\frac{\\partial \\vec{w}_{t+1}}{\\partial \\ln \\vec{p}_t} = \\vec{p}_t \\cdot \\vec{\\delta}_t \\tag{4}$$\n\nHowever, this common approach [11, 12, 13, 14, 15] fails to take into account the incremental nature of gradient descent: a change in $\\vec{p}$ affects not only the current update of $\\vec{w}$, but also future ones. Some authors account for this by setting $\\vec{v}$ to an exponential average of past gradients [2, 11, 16]; we found empirically that the method of Almeida et al. [15] can indeed be improved by this approach [3]. While such averaging serves to reduce the stochasticity of the product $\\vec{\\delta}_t \\cdot \\vec{\\delta}_{t-1}$ implied by (3) and (4), the average remains one of immediate, single-step effects.\n\nBy contrast, Sutton [17, 18] models the long-term effect of $\\vec{p}$ on future weight updates in a linear system by carrying the relevant partials forward through time, as is done in real-time recurrent learning [19]. This results in an iterative update rule for $\\vec{v}$, which we have extended to nonlinear systems [3, 4]. We define $\\vec{v}$ as an exponential average of the effect of all past changes in $\\vec{p}$ on the current weights:\n\n$$\\vec{v}_{t+1} \\equiv (1 - \\lambda) \\sum_{i=0}^{\\infty} \\lambda^i\\, \\frac{\\partial \\vec{w}_{t+1}}{\\partial \\ln \\vec{p}_{t-i}} \\tag{5}$$\n\nThe forgetting factor $0 \\le \\lambda \\le 1$ is a free parameter of the algorithm. Inserting (1) into (5) gives\n\n(6)\n\nwhere $H_t$ denotes the instantaneous Hessian of $f_{\\vec{w}}(\\vec{x})$ at time $t$. The approximation in (6) assumes that $(\\forall i > 0)\\ \\partial \\vec{p}_t / \\partial \\vec{p}_{t-i} = 0$; this signifies a certain dependence on an appropriate choice of meta-learning rate $\\mu$. Note that there is an efficient O(n) algorithm to calculate $H_t \\vec{v}_t$ without ever having to compute or store the matrix $H_t$ itself [20]; we shall elaborate on this technique for the case of independent component analysis below.\n\nMeta-level conditioning. The gradient descent in $\\vec{p}$ at the meta-level (2) may of course suffer from ill-conditioning just like the descent in $\\vec{w}$ at the main level (1); the meta-descent in fact squares the condition number when $\\vec{v}$ is defined as the previous gradient, or an exponential average of past gradients. Special measures to improve conditioning are thus required to make meta-descent work in non-trivial systems.\n\nMany researchers [11, 12, 13, 14] use the sign function to radically normalize the $\\vec{p}$-update. Unfortunately such a nonlinearity does not preserve the zero-mean property that characterizes stochastic gradients in equilibrium; in particular, it will translate any skew in the equilibrium distribution into a non-zero mean change in $\\vec{p}$. This causes convergence to non-optimal step sizes, and renders such methods unsuitable for online learning. Notably, Almeida et al. [15] avoid this pitfall by using a running estimate of the gradient's stochastic variance as their meta-normalizer.
\nIn addition to modeling the long-term effect of a change in local learning rate, our iterative gradient trace serves as a highly effective conditioner for the meta-descent: the fixpoint of (6) is given by\n\n$$\\vec{v}_t = [\\lambda H_t + (1 - \\lambda)\\, \\mathrm{diag}(1/\\vec{p}_t)]^{-1}\\, \\vec{\\delta}_t \\tag{7}$$\n\nwhich is a modified Newton step that, for typical values of $\\lambda$ (i.e., close to 1), scales with the inverse of the gradient. Consequently, we can expect the product $\\vec{\\delta}_t \\cdot \\vec{v}_t$ in (2) to be a very well-conditioned quantity. Experiments with feedforward multi-layer perceptrons [3, 4] have confirmed that SMD does not require explicit meta-level normalization, and converges faster than alternative methods.\n\n3 Application to ICA\n\nWe now apply the SMD technique to independent component analysis, using the Bell-Sejnowski algorithm [5] as our base method. The goal is to find an unmixing matrix $W_t$ which, up to scaling and permutation, provides a good linear estimate $\\vec{y}_t \\equiv W_t \\vec{x}_t$ of the independent sources $\\vec{s}_t$ present in a given mixture signal $\\vec{x}_t$. The mixture is generated linearly according to $\\vec{x}_t = A_t \\vec{s}_t$, where $A_t$ is an unknown (and unobservable) full rank matrix.\n\nWe include the well-known natural gradient [6] and kurtosis estimation [7] modifications to the basic algorithm, as well as a matrix $P_t$ of local learning rates. The resulting online update for the weight matrix $W_t$ is\n\n(8)\n\nwhere the gradient $D_t$ is given by\n\n$$D_t \\equiv \\frac{\\partial f_{W_t}(\\vec{x}_t)}{\\partial W_t} = ([\\vec{y}_t \\pm \\tanh(\\vec{y}_t)]\\, \\vec{y}_t^T - I)\\, W_t \\tag{9}$$\n\nwith the sign for each component of the $\\tanh(\\vec{y}_t)$ term depending on its current kurtosis estimate.
\nFollowing Pearlmutter [20], we now define the differentiation operator\n\n$$R_{V_t}(g(W_t)) \\equiv \\left. \\frac{\\partial g(W_t + r V_t)}{\\partial r} \\right|_{r=0} \\tag{10}$$\n\nwhich describes the effect on $g$ of a perturbation of the weights in the direction of $V_t$. We can use $R_{V_t}$ to efficiently calculate the Hessian-vector product\n\n(11)\n\nwhere \"vec\" is the operator that concatenates all columns of a matrix into a single column vector. Since $R_{V_t}$ is a linear operator, we have\n\n$$R_{V_t}(W_t) = V_t \\tag{12}$$\n$$R_{V_t}(\\vec{y}_t) = R_{V_t}(W_t \\vec{x}_t) = V_t \\vec{x}_t \\tag{13}$$\n$$R_{V_t}(\\tanh(\\vec{y}_t)) = \\mathrm{diag}(\\tanh'(\\vec{y}_t))\\, V_t \\vec{x}_t \\tag{14}$$\n\nand so forth (cf. [20]). Starting from (9), we apply the $R_{V_t}$ operator to obtain\n\n$$\\begin{aligned} H_t * V_t &= R_{V_t}\\left[([\\vec{y}_t \\pm \\tanh(\\vec{y}_t)]\\, \\vec{y}_t^T - I)\\, W_t\\right] \\\\ &= ([\\vec{y}_t \\pm \\tanh(\\vec{y}_t)]\\, \\vec{y}_t^T - I)\\, V_t + R_{V_t}([\\vec{y}_t \\pm \\tanh(\\vec{y}_t)]\\, \\vec{y}_t^T - I)\\, W_t \\\\ &= ([\\vec{y}_t \\pm \\tanh(\\vec{y}_t)]\\, \\vec{y}_t^T - I)\\, V_t + \\left[(I \\pm \\mathrm{diag}[\\tanh'(\\vec{y}_t)])\\, V_t \\vec{x}_t\\, \\vec{y}_t^T + [\\vec{y}_t \\pm \\tanh(\\vec{y}_t)]\\, (V_t \\vec{x}_t)^T\\right] W_t \\end{aligned} \\tag{15}$$\n\nIn conjunction with the matrix versions of our learning rate update (3)\n\n(16)\n\nand gradient trace (6)\n\n(17)\n\nthis constitutes our SMD-ICA algorithm.\n\n4 Experiment\n\nThe algorithm was tested on an artificial problem where 10 sources follow elliptic trajectories according to\n\n$$\\vec{x}_t = (A_{\\text{base}} + A_1 \\sin(\\omega t) + A_2 \\cos(\\omega t))\\, \\vec{s}_t \\tag{18}$$\n\nwhere $A_{\\text{base}}$ is a normally distributed mixing matrix, as are $A_1$ and $A_2$, whose columns represent the axes of the ellipses on which the sources travel. The velocities $\\omega$ are normally distributed around a mean of one revolution for every 6000 data samples. All sources are supergaussian.\n\nThe ICA-SMD algorithm was implemented with only online access to the data, including on-line whitening [21].\n
Whenever the condition number of the estimated whitening matrix exceeded a large threshold (set to 350 here), updates (16) and (17) were disabled to prevent the algorithm from diverging. Other parameter settings were $\\mu = 0.1$, $\\lambda = 0.999$, and $\\varrho = 0.2$.\n\nRuns that did not separate the 10 sources unambiguously were discarded. Figure 1 shows the performance index from [6] (the lower the better, zero being the ideal case) along with the condition number of the mixing matrix, showing that the algorithm is robust to a temporary confusion in the separation. The ordinate represents 3000 data samples, divided into mini-batches of 10 each for efficiency.\n\nFigure 2 shows the match between an actual mixing column and its estimate, in the subspace spanned by the elliptic trajectory. The singularity occurring halfway through does not damage performance. Globally the algorithm remains stable as long as degenerate inputs are handled correctly.\n\nFigure 1: Global view of the quality of separation. (Plot legend: error index; cond(A)/20.)\n\nFigure 2: Projection of a column from the mixing matrix. Arrows link the exact point with its estimate; the trajectory proceeds from lower right to upper left. (Plot legend: estimation error.)\n\n5 Conclusions\n\nOnce SMD-ICA has found a separating solution, we find it possible to simultaneously track ten sources that move independently at very different, a priori unknown speeds.\n
To continue tracking over extended periods it is necessary to handle momentary singularities, through online estimation of the number of sources or some other heuristic solution. SMD's adaptation of local learning rates can then facilitate continuous, online use of ICA in rapidly changing environments.\n\nAcknowledgments\n\nThis work was supported by the Swiss National Science Foundation under grants number 2000-052678.97/1 and 2100-054093.98.\n\nReferences\n\n[1] J. Karhunen and P. Pajunen, \"Blind source separation and tracking using nonlinear PCA criterion: A least-squares approach\", in Proc. IEEE Int. Conf. on Neural Networks, Houston, Texas, 1997, pp. 2147-2152.\n\n[2] N. Murata, K.-R. M\u00fcller, A. Ziehe, and S.-i. Amari, \"Adaptive on-line learning in changing environments\", in Advances in Neural Information Processing Systems, M. C. Mozer, M. I. Jordan, and T. Petsche, Eds. 1997, vol. 9, pp. 599-605, The MIT Press, Cambridge, MA.\n\n[3] N. N. Schraudolph, \"Local gain adaptation in stochastic gradient descent\", in Proceedings of the 9th International Conference on Artificial Neural Networks, Edinburgh, Scotland, 1999, pp. 569-574, IEE, London, ftp://ftp.idsia.ch/pub/nic/smd.ps.gz.\n\n[4] N. N. Schraudolph, \"Online learning with adaptive local step sizes\", in Neural Nets - WIRN Vietri-99; Proceedings of the 11th Italian Workshop on Neural Nets, M. Marinaro and R. Tagliaferri, Eds., Vietri sul Mare, Salerno, Italy, 1999, Perspectives in Neural Computing, pp. 151-156, Springer Verlag, Berlin.\n\n[5] A. J. Bell and T. J. Sejnowski, \"An information-maximization approach to blind separation and blind deconvolution\", Neural Computation, 7(6):1129-1159, 1995.\n\n[6] S.-i. Amari, A. Cichocki, and H. H. Yang, \"A new learning algorithm for blind signal separation\", in Advances in Neural Information Processing Systems, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds. 1996, vol. 8, pp. 757-763, The MIT Press, Cambridge, MA.\n\n[7] M. Girolami and C. Fyfe, \"Generalised independent component analysis through unsupervised learning with emergent bussgang properties\", in Proc. IEEE Int. Conf. on Neural Networks, Houston, Texas, 1997, pp. 1788-1791.\n\n[8] J. Kivinen and M. K. Warmuth, \"Exponentiated gradient versus gradient descent for linear predictors\", Tech. Rep. UCSC-CRL-94-16, University of California, Santa Cruz, June 1994.\n\n[9] J. Kivinen and M. K. Warmuth, \"Additive versus exponentiated gradient updates for linear prediction\", in Proc. 27th Annual ACM Symposium on Theory of Computing, New York, NY, May 1995, pp. 209-218, The Association for Computing Machinery.\n\n[10] N. N. Schraudolph, \"A fast, compact approximation of the exponential function\", Neural Computation, 11(4):853-862, 1999.\n\n[11] R. Jacobs, \"Increased rates of convergence through learning rate adaptation\", Neural Networks, 1:295-307, 1988.\n\n[12] T. Tollenaere, \"SuperSAB: fast adaptive back propagation with good scaling properties\", Neural Networks, 3:561-573, 1990.\n\n[13] F. M. Silva and L. B. Almeida, \"Speeding up back-propagation\", in Advanced Neural Computers, R. Eckmiller, Ed., Amsterdam, 1990, pp. 151-158, Elsevier.\n\n[14] M. Riedmiller and H. Braun, \"A direct adaptive method for faster backpropagation learning: The RPROP algorithm\", in Proc. International Conference on Neural Networks, San Francisco, CA, 1993, pp. 586-591, IEEE, New York.\n\n[15] L. B. Almeida, T. Langlois, J. D. Amaral, and A. Plakhov, \"Parameter adaptation in stochastic optimization\", in On-Line Learning in Neural Networks, D. Saad, Ed., Publications of the Newton Institute, chapter 6. Cambridge University Press, 1999, ftp://146.193.2.131/pub/lba/papers/adsteps.ps.gz.\n\n[16] M. E. Harmon and L. C. Baird III, \"Multi-player residual advantage learning with general function approximation\", Tech. Rep. WL-TR-1065, Wright Laboratory, WL/AACF, 2241 Avionics Circle, Wright-Patterson Air Force Base, OH 45433-7308, 1996, http://www.leemon.com/papers/sim_tech/sim_tech.ps.gz.\n\n[17] R. S. Sutton, \"Adapting bias by gradient descent: an incremental version of delta-bar-delta\", in Proc. 10th National Conference on Artificial Intelligence. 1992, pp. 171-176, The MIT Press, Cambridge, MA, ftp://ftp.cs.umass.edu/pub/anv/pub/sutton/sutton-92a.ps.gz.\n\n[18] R. S. Sutton, \"Gain adaptation beats least squares?\", in Proc. 7th Yale Workshop on Adaptive and Learning Systems, 1992, pp. 161-166, ftp://ftp.cs.umass.edu/pub/anv/pub/sutton/sutton-92b.ps.gz.\n\n[19] R. Williams and D. Zipser, \"A learning algorithm for continually running fully recurrent neural networks\", Neural Computation, 1:270-280, 1989.\n\n[20] B. A. Pearlmutter, \"Fast exact multiplication by the Hessian\", Neural Computation, 6(1):147-160, 1994.\n\n[21] J. Karhunen, E. Oja, L. Wang, R. Vigario, and J. Joutsensalo, \"A class of neural networks for independent component analysis\", IEEE Trans. on Neural Networks, 8(3):486-504, 1997.\n", "award": [], "sourceid": 1648, "authors": [{"given_name": "Nicol", "family_name": "Schraudolph", "institution": null}, {"given_name": "Xavier", "family_name": "Giannakopoulos", "institution": null}]}