{"title": "Iterative Construction of Sparse Polynomial Approximations", "book": "Advances in Neural Information Processing Systems", "page_first": 1064, "page_last": 1071, "abstract": null, "full_text": "Iterative Construction of Sparse Polynomial Approximations\n\nTerence D. Sanger\nMassachusetts Institute of Technology\nRoom E25-534\nCambridge, MA 02139\ntds@ai.mit.edu\n\nRichard S. Sutton\nGTE Laboratories Incorporated\n40 Sylvan Road\nWaltham, MA 02254\nsutton@gte.com\n\nChristopher J. Matheus\nGTE Laboratories Incorporated\n40 Sylvan Road\nWaltham, MA 02254\nmatheus@gte.com\n\nAbstract\n\nWe present an iterative algorithm for nonlinear regression based on construction of sparse polynomials. Polynomials are built sequentially from lower to higher order. Selection of new terms is accomplished using a novel look-ahead approach that predicts whether a variable contributes to the remaining error. The algorithm is based on the tree-growing heuristic in LMS Trees which we have extended to approximation of arbitrary polynomials of the input features. In addition, we provide a new theoretical justification for this heuristic approach. The algorithm is shown to discover a known polynomial from samples, and to make accurate estimates of pixel values in an image-processing task.\n\n1 INTRODUCTION\n\nLinear regression attempts to approximate a target function by a model that is a linear combination of the input features. Its approximation ability is thus limited by the available features. We describe a method for adding new features that are products or powers of existing features. Repeated addition of new features leads to the construction of a polynomial in the original inputs, as in (Gabor 1961). 
Because there is an infinite number of possible product terms, we have developed a new method for predicting the usefulness of entire classes of features before they are included. The resulting nonlinear regression will be useful for approximating functions that can be described by sparse polynomials.\n\nFigure 1: Network depiction of linear regression on a set of features $x_i$.\n\n2 THEORY\n\nLet $\\{x_i\\}_{i=1}^{n}$ be the set of features already included in a model that attempts to predict the function $f$. The output of the model is a linear combination\n\n$$\\hat{f} = \\sum_{i=1}^{n} c_i x_i$$\n\nwhere the $c_i$'s are coefficients determined using linear regression. The model can also be depicted as a single-layer network as in figure 1. The approximation error is $e = f - \\hat{f}$, and we will attempt to minimize $E[e^2]$ where $E$ is the expectation operator.\n\nThe algorithm incrementally creates new features that are products of existing features. At each step, the goal is to select two features $x_p$ and $x_q$ already in the model and create a new feature $x_p x_q$ (see figure 2). Even if $x_p x_q$ does not decrease the approximation error, it is still possible that $x_p x_q x_r$ will decrease it for some $x_r$. So in order to decide whether to create a new feature that is a product with $x_p$, the algorithm must \"look-ahead\" to determine if there exists any polynomial $a$ in the $x_i$'s such that inclusion of $a x_p$ would significantly decrease the error. If no such polynomial exists, then we do not need to consider adding any features that are products with $x_p$. 
\n\nDefine the inner product between two polynomials $a$ and $b$ as $\\langle a \\mid b \\rangle = E[ab]$ where the expected value is taken with respect to a probability measure $\\mu$ over the (zero-mean) input values. The induced norm is $\\|a\\|^2 = E[a^2]$, and let $P$ be the set of polynomials with finite norm. $\\{P, \\langle \\cdot \\mid \\cdot \\rangle\\}$ is then an infinite-dimensional linear vector space. The Weierstrass approximation theorem proves that $P$ is dense in the set of all square-integrable functions over $\\mu$, and thus justifies the assumption that any function of interest can be approximated by a member of $P$.\n\nAssume that the error $e$ is a polynomial in $P$. In order to test whether $a x_p$ participates in $e$ for any polynomial $a \\in P$, we write\n\n$$e = a_p x_p + b_p$$\n\nwhere $a_p$ and $b_p$ are polynomials, and $a_p$ is chosen to minimize $\\|a_p x_p - e\\|^2 = E[(a_p x_p - e)^2]$. The orthogonality principle then shows that $a_p x_p$ is the projection of the polynomial $e$ onto the linear subspace of polynomials $x_p P$. Therefore, $b_p$ is orthogonal to $x_p P$, so that $E[b_p g] = 0$ for all $g$ in $x_p P$.\n\nWe now write\n\n$$E[e^2] = E[a_p^2 x_p^2] + 2E[a_p x_p b_p] + E[b_p^2] = E[a_p^2 x_p^2] + E[b_p^2]$$\n\nsince $E[a_p x_p b_p] = 0$ by orthogonality. If $a_p x_p$ were included in the model, it would thus reduce $E[e^2]$ by $E[a_p^2 x_p^2]$, so we wish to choose $x_p$ to maximize $E[a_p^2 x_p^2]$. Unfortunately, we have no direct measurement of $a_p$.\n\nFigure 2: Incorporation of a new product term into the model.\n\n3 METHODS\n\nAlthough $E[a_p^2 x_p^2]$ cannot be measured directly, Sanger (1991) suggests choosing $x_p$ to maximize $E[e^2 x_p^2]$ instead, which is directly measurable. Moreover, note that\n\n$$E[e^2 x_p^2] = E[a_p^2 x_p^4] + 2E[a_p x_p^3 b_p] + E[x_p^2 b_p^2] = E[a_p^2 x_p^4]$$\n\nand thus $E[e^2 x_p^2]$ is related to the desired but unknown value $E[a_p^2 x_p^2]$. 
Perhaps better would be to use\n\n$$\\frac{E[e^2 x_p^2]}{E[x_p^2]} = \\frac{E[a_p^2 x_p^4]}{E[x_p^2]}$$\n\nwhich can be thought of as the regression of $(a_p^2 x_p^2) x_p$ against $x_p$.\n\nMore recently, (Sutton and Matheus 1991) suggest using the regression coefficients of $e^2$ against $x_i^2$ for all $i$ as the basis for comparison. The regression coefficients $w_i$ are called \"potentials\", and lead to a linear approximation of the squared error:\n\n$$e^2 \\approx \\sum_{i=1}^{n} w_i x_i^2 \\qquad (1)$$\n\nIf a new term $a_p x_p$ were included in the model of $f$, then the squared error would be $b_p^2$ which is orthogonal to any polynomial in $x_p P$ and in particular to $x_p^2$. Thus the coefficient of $x_p^2$ in (1) would be zero after inclusion of $a_p x_p$, and $w_p E[x_p^2]$ is an approximation to the decrease in mean-squared error $E[e^2] - E[b_p^2]$ which we can expect from inclusion of $a_p x_p$. We thus choose $x_p$ by maximizing $w_p E[x_p^2]$.\n\nThis procedure is a form of look-ahead which allows us to predict the utility of a high-order term $a_p x_p$ without actually including it in the regression. This is perhaps most useful when the term is predicted to make only a small contribution for the optimal $a_p$, because in this case we can drop from consideration any new features that include $x_p$.\n\nWe can choose a different variable $x_q$ similarly, and test the usefulness of incorporating the product $x_p x_q$ by computing a \"joint potential\" $w_{pq}$ which is the regression of the squared error against the model including a new term $x_p^2 x_q^2$. The joint potential attempts to predict the magnitude of the term $E[a_{pq}^2 x_p^2 x_q^2]$.\n\nWe now use this method to choose a single new feature $x_p x_q$ to include in the model. For all pairs $x_i x_j$ such that $x_i$ and $x_j$ individually have high potentials, we perform a third regression to determine the joint potentials of the product terms $x_i x_j$. 
Any term with a high joint potential is likely to participate in $f$. We choose to include the new term $x_p x_q$ with the largest joint potential. In the network model, this results in the construction of a new unit that computes the product of $x_p$ and $x_q$, as in figure 2. The new unit is incorporated into the regression, and the resulting error $e$ will be orthogonal to this unit and all previous units. Iteration of this technique leads to the successive addition of new regression terms and the successive decrease in mean-squared error $E[e^2]$. The process stops when the residual mean-squared error drops below a chosen threshold, and the final model consists of a sparse polynomial in the original inputs.\n\nWe have implemented this algorithm both in a non-iterative version that computes coefficients and potentials based on a fixed data set, and in an iterative version that uses the LMS algorithm (Widrow and Hoff 1960) to compute both coefficients and potentials incrementally in response to continually arriving data. In the iterative version, new terms are added at fixed intervals and are chosen by maximizing over the potentials approximated by the LMS algorithm. The growing polynomial is efficiently represented as a tree-structure, as in (Sanger 1991a).\n\nAlthough the algorithm involves three separate regressions, each is over only $O(n)$ terms, and thus the iterative version of the algorithm is only of $O(n)$ complexity per input pattern processed.\n\n4 RELATION TO OTHER ALGORITHMS\n\nApproximation of functions over a fixed monomial basis is not a new technique (Gabor 1961, for example). However, it performs very poorly for high-dimensional input spaces, since the set of all monomials (even of very low order) can be prohibitively large. 
This has led to a search for methods which allow the generation of sparse polynomials. A recent example and bibliography are provided in (Grigoriev et al. 1990), which describes an algorithm applicable to finite fields (but not to real-valued random variables).\n\nFigure 3: Products of hidden units in a sigmoidal feedforward network lead to a polynomial in the hidden units themselves.\n\nThe GMDH algorithm (Ivakhnenko 1971, Ikeda et al. 1976, Barron et al. 1984) incrementally adds new terms to a polynomial by forming a second (or higher) order polynomial in 2 (or more) of the current terms, and including this polynomial as a new term if it correlates with the error. Since GMDH does not use look-ahead, it risks avoiding terms which would be useful at future steps. For example, if the polynomial to be approximated is $xyz$ where all three variables are independent, then no polynomial in $x$ and $y$ alone will correlate with the error, and thus the term $xy$ may never be included. However, $x^2 y^2$ does correlate with $x^2 y^2 z^2$, so the look-ahead algorithm presented here would include this term, even though the error did not decrease until a later step. Although GMDH can be extended to test polynomials of more than 2 variables, it will always be testing a finite-order polynomial in a finite number of variables, so there will always exist target functions which it will not be able to approximate.\n\nAlthough look-ahead avoids this problem, it is not always useful. For practical purposes, we may be interested in the best $N$th-order approximation to a function, so it may not be helpful to include terms which participate in monomials of order greater than $N$, even if these monomials would cause a large decrease in error. 
For example, the best 2nd-order approximation to $x^2 + y^{1000} + z^{1000}$ may be $x^2$, even though the other two terms contribute more to the error. In practice, some combination of both infinite look-ahead and GMDH-type heuristics may be useful.\n\n5 APPLICATION TO OTHER STRUCTURES\n\nThese methods have a natural application to other network structures. The inputs to the polynomial network can be sinusoids (leading to high-dimensional Fourier representations), Gaussians (leading to high-dimensional Radial Basis Functions) or other appropriate functions (Sanger 1991a, Sanger 1991b). Polynomials can even be applied with sigmoidal networks as input, so that\n\n$$x_i = \\sigma\\left(\\sum_j s_{ij} z_j\\right)$$\n\nwhere the $z_j$'s are the original inputs, and the $s_{ij}$'s are the weights to a sigmoidal hidden unit whose value is the polynomial term $x_i$. The last layer of hidden units in a multilayer network is considered to be the set of input features $x_i$ to a linear output unit, and we can compute the potentials of these features to determine the hidden unit $x_p$ that would most decrease the error if $a_p x_p$ were included in the model (for the optimal polynomial $a_p$). But $a_p$ can now be approximated using a subnetwork of any desired type. This subnetwork is used to add a new hidden unit $\\hat{a}_p x_p$ that is the product of $x_p$ with the subnetwork output $\\hat{a}_p$, as in figure 3.\n\nIn order to train the $\\hat{a}_p$ subnetwork iteratively using gradient descent, we need to compute the effect of changes in $\\hat{a}_p$ on the network error $\\mathcal{E} = E[(f - \\hat{f})^2]$. We have\n\n$$\\frac{\\partial \\mathcal{E}}{\\partial \\hat{a}_p} = -2E[(f - \\hat{f}) s_{\\hat{a}_p x_p} x_p]$$\n\nwhere $s_{\\hat{a}_p x_p}$ is the weight from the new hidden unit to the output. Without loss of generality we can set $s_{\\hat{a}_p x_p} = 1$ by including this factor within $\\hat{a}_p$. 
Thus the error term for iteratively training the subnetwork $\\hat{a}_p$ is\n\n$$(f - \\hat{f}) x_p$$\n\nwhich can be used to drive a standard backpropagation-type gradient descent algorithm. This gives a method for constructing new hidden nodes and a learning algorithm for training these nodes. The same technique can be applied to deeper layers in a multilayer network.\n\n6 EXAMPLES\n\nWe have applied the algorithm to approximation of known polynomials in the presence of irrelevant noise variables, and to a simple image-processing task.\n\nFigure 4 shows the results of applying the algorithm to 200 samples of the polynomial $2 + 3x_1x_2 + 4x_3x_4x_5$ with 4 irrelevant noise variables. The algorithm correctly finds the true polynomial in 4 steps, requiring about 5 minutes on a Symbolics Lisp Machine. Note that although the error did not decrease after cycle 1, the term $x_4 x_5$ was incorporated since it would be useful in a later step to reduce the error as part of $x_3x_4x_5$ in cycle 2.\n\nThe image processing task is to predict a pixel value on the succeeding scan line from a 2x5 block of pixels on the preceding 2 scan lines. If successful, the resulting polynomial can be used as part of a DPCM image coding strategy. The network was trained on random blocks from a single face image, and tested on a different image. Figure 5 shows the original training and test images, the pixel predictions, and remaining error. Figure 6 shows the resulting 55-term polynomial. Learning this polynomial required less than 10 minutes on a Sun Sparcstation 1. 
\n\n200 samples of y = 2 + 3 x1 x2 + 4 x3 x4 x5 with 4 additional irrelevant inputs, x6-x9\n\nOriginal MSE: 1.0\n\nCycle 1:\nMSE:        0.967\nTerms:           x1     x2     x3     x4     x5     x6     x7     x8     x9\nCoeffs:       -0.19   0.14   0.24   0.31   0.17   0.48   0.03   0.05   0.58\nPotentials:    0.22   0.24   0.25   0.32   0.33   0.01   0.08   0.01   0.05\nTop Pairs:  (5 4) (5 3) (4 3) (4 4)\nNew Term:   x10 = x4 x5\n\nCycle 2:\nMSE:        0.966\nTerms:           x1     x2     x3     x4     x5     x6     x7     x8     x9    x10\nCoeffs:       -0.19   0.14   0.24   0.30   0.18   0.48   0.03   0.05   0.57   0.05\nPotentials:    0.25   0.22   0.25   0.05   0.02   0.03   0.08   0.02   0.03   0.47\nTop Pairs:  (10 3) (10 1) (10 2) (10 10)\nNew Term:   x11 = x10 x3 = x3 x4 x5\n\nCycle 3:\nMSE:        0.349\nTerms:           x1     x2     x3     x4     x5     x6     x7     x8     x9    x10    x11\nCoeffs:        0.04  -0.26   0.09   0.37  -0.04   0.27   0.10   0.22   0.42  -0.26   4.07\nPotentials:    0.52   0.59   0.03   0.02  -0.08   0.03  -0.05  -0.06   0.05  -0.05   0.05\nTop Pairs:  (2 1) (2 9) (2 2) (1 9)\nNew Term:   x12 = x1 x2\n\nCycle 4:\nMSE:        0.000\nTerms:           x1     x2     x3     x4     x5     x6     x7     x8     x9    x10    x11    x12\nCoeffs:       -0.00  -0.00  -0.00   0.00  -0.00   0.00   0.00   0.00   0.00  -0.00   4.00   3.00\n\nSolution:   2 + 3 x1 x2 + 4 x3 x4 x5\n\nFigure 4: A simple example of polynomial learning.\n\nFigure 5: Original, predicted, and error images. The top row is the training image (RMS error 8.4), and the bottom row is the test image (RMS error 9.4). 
\n\n-40.1z0 + -23.9z1 + -5.4z2 + -17.1z3 +\n(1.1z5 + 2.4z8 + -1.1z2 + -1.5z0 + -2.0z1 + 1.3z4 + 2.3z6 + 3.1z7 + -25.6)z4 +\n(\n  (-2.9z9 + 3.0z8 + -2.9z4 + -2.8z3 + -2.9z2 + -1.9z5 + -6.3z0 + -5.2z1 + 2.5z6 + 6.7z7 + 1.1)z9 +\n  (3.9z8 + z5 + 3.3z4 + 1.6z3 + 1.1z2 + 2.9z6 + 5.0z7 + 16.1)z8 +\n  -2.3z3 + -2.1z2 + -1.6z1 + 1.1z4 + 2.1z6 + 3.5z7 + 28.6)z5 +\n87.1z6 + 128.1z7 + 80.5z8 +\n(\n  (-2.6z9 + -2.4z5 + -4.5z0 + -3.9z1 + 3.4z6 + 7.3z7 + -2.5)z9 +\n  21.7z8 + -16.0z4 + -12.1z3 + -8.8z2 + 31.4)z9 +\n2.6\n\nFigure 6: 55-term polynomial used to generate figure 5.\n\nAcknowledgments\n\nWe would like to thank Richard Brandau for his helpful comments and suggestions on an earlier draft of this paper. This report describes research done both at GTE Laboratories Incorporated, in Waltham MA, and at the laboratory of Dr. Emilio Bizzi in the department of Brain and Cognitive Sciences at MIT. T. Sanger was supported during this work by a National Defense Science and Engineering Graduate Fellowship, and by NIH grants 5R37AR26710 and 5R01NS09343 to Dr. Bizzi.\n\nReferences\n\nBarron R. L., Mucciardi A. N., Cook F. J., Craig J. N., Barron A. R., 1984, Adaptive learning networks: Development and application in the United States of algorithms related to GMDH, In Farlow S. J., ed., Self-Organizing Methods in Modeling, pages 25-65, Marcel Dekker, New York.\nGabor D., 1961, A universal nonlinear filter, predictor, and simulator which optimizes itself by a learning process, Proc. IEE, 108B:422-438.\nGrigoriev D. Y., Karpinski M., Singer M. F., 1990, Fast parallel algorithms for sparse polynomial interpolation over finite fields, SIAM J. 
Computing, 19(6):1059-1063.\nIkeda S., Ochiai M., Sawaragi Y., 1976, Sequential GMDH algorithm and its application to river flow prediction, IEEE Trans. Systems, Man, and Cybernetics, SMC-6(7):473-479.\nIvakhnenko A. G., 1971, Polynomial theory of complex systems, IEEE Trans. Systems, Man, and Cybernetics, SMC-1(4):364-378.\nSanger T. D., 1991a, Basis-function trees as a generalization of local variable selection methods for function approximation, In Lippmann R. P., Moody J. E., Touretzky D. S., eds., Advances in Neural Information Processing Systems 3, pages 700-706, Morgan Kaufmann, Proc. NIPS'90, Denver CO.\nSanger T. D., 1991b, A tree-structured adaptive network for function approximation in high dimensional spaces, IEEE Trans. Neural Networks, 2(2):285-293.\nSutton R. S., Matheus C. J., 1991, Learning polynomial functions by feature construction, In Proc. Eighth Intl. Workshop on Machine Learning, Chicago.\nWidrow B., Hoff M. E., 1960, Adaptive switching circuits, In IRE WESCON Conv. Record, Part 4, pages 96-104.\n", "award": [], "sourceid": 538, "authors": [{"given_name": "Terence", "family_name": "Sanger", "institution": null}, {"given_name": "Richard", "family_name": "Sutton", "institution": null}, {"given_name": "Christopher", "family_name": "Matheus", "institution": null}]}