{"title": "A Comparison of Projection Pursuit and Neural Network Regression Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 1159, "page_last": 1166, "abstract": null, "full_text": "A Comparison of Projection Pursuit and Neural Network Regression Modeling\n\nJenq-Neng Hwang, Hang Li\nInformation Processing Laboratory\nDept. of Elect. Engr., FT-10\nUniversity of Washington\nSeattle, WA 98195\n\nMartin Maechler, R. Douglas Martin, Jim Schimert\nDepartment of Statistics\nMail Stop: GN-22\nUniversity of Washington\nSeattle, WA 98195\n\nAbstract\n\nTwo projection-based feedforward network learning methods for model-free regression problems are studied and compared in this paper: one is the popular back-propagation learning (BPL); the other is the projection pursuit learning (PPL). Unlike the totally parametric BPL method, the PPL non-parametrically estimates unknown nonlinear functions sequentially (neuron-by-neuron and layer-by-layer) at each iteration while jointly estimating the interconnection weights. In terms of learning efficiency, both methods have comparable training speed when based on a Gauss-Newton optimization algorithm, while the PPL is more parsimonious. In terms of learning robustness toward noise outliers, the BPL is more sensitive to the outliers.\n\n1 INTRODUCTION\n\nThe back-propagation learning (BPL) networks have been used extensively for essentially two distinct problem types, namely model-free regression and classification, which have no a priori assumption about the unknown functions to be identified other than imposing a certain degree of smoothness. 
The projection pursuit learning (PPL) networks have also been proposed for both types of problems (Friedman85 [3]), but to date there appears to have been much less actual use of PPLs for both regression and classification than of BPLs. In this paper, we shall concentrate on regression modeling applications of BPLs and PPLs since the regression setting is one in which some fairly deep theory is available for PPLs in the case of low-dimensional regression (Donoho89 [2], Jones87 [6]).\n\nA multivariate model-free regression problem can be stated as follows: given $n$ pairs of vector observations, $(\\mathbf{y}_l, \\mathbf{x}_l) = (y_{l1}, \\ldots, y_{lq}; x_{l1}, \\ldots, x_{lp})$, which have been generated from unknown models\n\n$$y_{li} = g_i(\\mathbf{x}_l) + \\epsilon_{li}, \\quad l = 1, 2, \\ldots, n; \\quad i = 1, 2, \\ldots, q \\qquad (1)$$\n\nwhere $\\{\\mathbf{y}_l\\}$ are called the multivariable \"response\" vector and $\\{\\mathbf{x}_l\\}$ are called the \"independent variables\" or the \"carriers\". The $\\{g_i\\}$ are unknown smooth non-parametric (model-free) functions from $p$-dimensional Euclidean space to the real line, i.e., $g_i: R^p \\to R$, $\\forall i$. The $\\{\\epsilon_{li}\\}$ are random variables with zero mean, $E[\\epsilon_{li}] = 0$, and independent of $\\{\\mathbf{x}_l\\}$. Often the $\\{\\epsilon_{li}\\}$ are assumed to be independent and identically distributed (iid) as well.\n\nThe goal of regression is to generate the estimators, $\\hat{g}_1, \\hat{g}_2, \\ldots, \\hat{g}_q$, to best approximate the unknown functions, $g_1, g_2, \\ldots, g_q$, so that they can be used for prediction of a new $\\mathbf{y}$ given a new $\\mathbf{x}$: $\\hat{y}_i = \\hat{g}_i(\\mathbf{x})$, $\\forall i$.\n\n2 A TWO-LAYER PERCEPTRON AND BACK-PROPAGATION LEARNING\n\nSeveral recent results have shown that a two-layer (one hidden layer) perceptron with sigmoidal nodes can in principle represent any Borel-measurable function to any desired accuracy, assuming \"enough\" hidden neurons are used. 
This, along with the fact that theoretical results are known for the PPL in the analogous two-layer case, justifies focusing on the two-layer perceptron for our studies here.\n\n2.1 MATHEMATICAL FORMULATION\n\nA two-layer perceptron can be mathematically formulated as follows:\n\n$$u_k = \\sum_{j=1}^{p} w_{kj} x_j - \\theta_k = \\mathbf{w}_k^T \\mathbf{x} - \\theta_k, \\quad k = 1, 2, \\ldots, m$$\n\n$$\\hat{y}_i = \\sum_{k=1}^{m} \\beta_{ik} f_k(u_k) = \\sum_{k=1}^{m} \\beta_{ik} f_k(\\mathbf{w}_k^T \\mathbf{x} - \\theta_k) \\qquad (2)$$\n\nwhere $u_k$ denotes the weighted sum input of the $k$th neuron in the hidden layer; $\\theta_k$ denotes the bias of the $k$th neuron in the hidden layer; $w_{kj}$ denotes the input-layer weight linked between the $k$th hidden neuron and the $j$th neuron of the input layer (or $j$th element of the input vector $\\mathbf{x}$); $\\beta_{ik}$ denotes the output-layer weight linked between the $i$th output neuron and the $k$th hidden neuron; $f_k$ is the nonlinear activation function, which is usually assumed to be a fixed monotonically increasing (logistic) sigmoidal function, $\\sigma(u) = 1/(1 + e^{-u})$.\n\nThe above formulation defines quite explicitly the parametric representation of functions which are being used to approximate $\\{g_i(\\mathbf{x}), i = 1, 2, \\ldots, q\\}$. A simple reparametrization allows us to write $\\hat{g}_i(\\mathbf{x})$ in the form:\n\n$$\\hat{g}_i(\\mathbf{x}) = \\sum_{k=1}^{m} \\beta_{ik} \\sigma\\left(\\frac{\\boldsymbol{\\alpha}_k^T \\mathbf{x} - \\mu_k}{s_k}\\right) \\qquad (3)$$\n\nwhere $\\boldsymbol{\\alpha}_k$ is a unit length version of weight vector $\\mathbf{w}_k$. This formulation reveals how $\\{\\hat{g}_i\\}$ are built up as a linear combination of sigmoids evaluated at translated (by $\\mu_k$) and scaled (by $s_k$) projections of $\\mathbf{x}$ onto the unit length vector $\\boldsymbol{\\alpha}_k$.\n\n2.2 BACK-PROPAGATION LEARNING AND ITS VARIATIONS\n\nHistorically, the training of a multilayer perceptron uses back-propagation learning (BPL). There are two common types of BPL: the batch one and the sequential one. 
The batch BPL updates the weights after the presentation of the complete set of training data. Hence, a training iteration incorporates one sweep through all the training patterns. On the other hand, the sequential BPL adjusts the network parameters as training patterns are presented, rather than after a complete pass through the training set. The sequential approach is a form of Robbins-Monro stochastic approximation.\n\nWhile the two-layer perceptron provides a very powerful nonparametric modeling capability, the BPL training can be slow and inefficient since only the first derivative (or gradient) information about the training error is utilized. To speed up the training process, several second-order optimization algorithms, which take advantage of second derivative (or Hessian matrix) information, have been proposed for training perceptrons (Hwang90 [4]). For example, the Gauss-Newton method is also used in the PPL (Friedman85 [3]).\n\nThe fixed nonlinear nodal (sigmoidal) function is a monotone nondecreasing differentiable function with a very simple first derivative form, and possesses nice properties for numerical computation. However, it does not interpolate/extrapolate efficiently in a wide variety of regression applications. Several attempts have been made to improve the choice of nonlinear nodal functions; e.g., the Gaussian or bell-shaped function, the locally tuned radial basis functions, and semi-parametric (non-fixed nodal function) nonlinear functions used in PPLs and hidden Markov models.\n\n2.3 RELATIONSHIP TO KERNEL APPROXIMATION AND DATA SMOOTHING\n\nIt is instructive to compare the two-layer perceptron approximation in Eq. (3) with the well-known kernel method for regression. A kernel $K(\\cdot)$ 
is a non-negative symmetric function which integrates to unity. Most kernels are also unimodal, with mode at the origin, $K(t_1) \\geq K(t_2)$, $0 < t_1 < t_2$. A kernel estimate of $g_i(\\mathbf{x})$ has the form\n\n$$\\hat{g}_{K,i}(\\mathbf{x}) = \\sum_{l=1}^{n} y_{li} \\frac{1}{h^q} K\\left(\\frac{\\|\\mathbf{x} - \\mathbf{x}_l\\|}{h^q}\\right), \\qquad (4)$$\n\nwhere $h$ is a bandwidth parameter and $q$ is the dimension of the $\\mathbf{y}_l$ vector. Typically a good value of $h$ will be chosen by a data-based cross-validation method. Consider for a moment the special case of the kernel approximator and the two-layer perceptron in Eq. (3) respectively, with scalar $y_l$ and $x_l$, i.e., with $p = q = 1$ (hence unit length interconnection weight $\\alpha = 1$ by definition):\n\n$$\\hat{g}_K(x) = \\sum_{l=1}^{n} y_l \\frac{1}{h} K\\left(\\frac{\\|x - x_l\\|}{h}\\right) = \\sum_{l=1}^{n} \\frac{y_l}{h} K\\left(\\frac{x - x_l}{h}\\right) \\qquad (5)$$\n\n$$\\hat{g}_\\sigma(x) = \\sum_{k=1}^{m} \\beta_k \\sigma\\left(\\frac{x - \\mu_k}{s_k}\\right) \\qquad (6)$$\n\nThis reveals some important connections between the two approaches.\n\nSuppose that for $\\hat{g}_\\sigma(x)$, we set $\\sigma = K$, i.e., $\\sigma$ is a kernel and in fact identical to the kernel $K$, and that $\\beta_k, \\mu_k, s_k = s$ have been chosen (trained), say by BPL. That is, all $\\{s_k\\}$ are constrained to a single unknown parameter value $s$. In general, $m < n$, or even $m$ is a modest fraction of $n$ when the unknown function $g(x)$ is reasonably smooth. Furthermore, suppose that $h$ has been chosen by cross validation. Then one can expect $\\hat{g}_K(x) \\approx \\hat{g}_\\sigma(x)$, particularly in the event that the $\\{\\mu_k\\}$ are close to the observed values $\\{x_l\\}$ and $x$ is close to a specific $\\mu_k$ value (relative to $h$). However, in this case where we force $s_k = s$, one might expect $\\hat{g}_K(x)$ to be a somewhat better estimate overall than $\\hat{g}_\\sigma(x)$, since the former is more local in character. 
\nOn the other hand, when one removes the restriction $s_k = s$, then BPL leads to a local bandwidth selection, and in this case one may expect $\\hat{g}_\\sigma(x)$ to provide better approximation than $\\hat{g}_K(x)$ when the function $g(x)$ has considerably varying curvature, $g''(x)$, and/or considerably varying error variance for the noise $\\epsilon_{li}$ in Eq. (1). The reason is that a fixed bandwidth kernel estimate cannot cope as well with changing curvature and/or noise variance as can a good smoothing method which uses a good local bandwidth selection method. A small caveat is in order: if $m$ is fairly large, the estimation of a separate bandwidth for each kernel location, $\\mu_k$, may cause some increased variability in $\\hat{g}_\\sigma(x)$ by virtue of using many more parameters than are needed to adequately represent a nearly optimal local bandwidth selection method. Typically a nearly optimal local bandwidth function will have some degree of smoothness, which reflects smoothly varying curvature and/or noise variance, and a good local bandwidth selection method should reflect the smoothness constraints. This is the case in the high-quality \"supersmoother\", designed for applications like the PPL (to be discussed), which uses cross-validation to select bandwidth locally (Friedman85 [3]), and combines this feature with considerable speed.\n\nThe above arguments are probably equally valid without the restriction $\\sigma = K$, because two sigmoids of opposite signs (via choice of two $\\{\\beta_k\\}$) that are appropriately shifted will approximate a kernel up to a scaling to enforce unity area. 
However, there is a novel aspect: one can have a separate local bandwidth for each half of the kernel, thereby using an asymmetric kernel, which might improve the approximation capabilities relative to symmetric kernels with a single local bandwidth in some situations.\n\nIn the multivariate case, the curse of dimensionality will often render useless the kernel approximator $\\hat{g}_{K,i}(\\mathbf{x})$ given by Eq. (4). Instead one might consider using a projection pursuit kernel (PPK) approximator:\n\n$$\\hat{g}_{PPK,i}(\\mathbf{x}) = \\sum_{l=1}^{n} \\sum_{k=1}^{m} y_{li} \\frac{1}{h_k} K\\left(\\frac{\\boldsymbol{\\alpha}_k^T \\mathbf{x} - \\boldsymbol{\\alpha}_k^T \\mathbf{x}_l}{h_k}\\right) \\qquad (7)$$\n\nwhere a different bandwidth $h_k$ is used for each direction $\\boldsymbol{\\alpha}_k$. In this case, the similarities and differences between the PPK estimate and the BPL estimate $\\hat{g}_{\\sigma,i}(\\mathbf{x})$ become evident.\n\nThe main difference between the two methods is that PPK performs explicit smoothing in each direction $\\boldsymbol{\\alpha}_k$ using a kernel smoother, whereas BPL does implicit smoothing with both $\\beta_k$ (replacing $y_{li}/h_k$) and $\\mu_k$ (replacing $\\boldsymbol{\\alpha}_k^T \\mathbf{x}_l$) being determined by nonlinear least squares optimization. In both PPK and BPL, the $\\boldsymbol{\\alpha}_k$ and $h_k$ are determined by nonlinear optimization (cross-validation choices of bandwidth parameters are inherently nonlinear optimization problems) (Friedman85 [3]).\n\n3 PROJECTION PURSUIT LEARNING NETWORKS\n\nThe projection pursuit learning (PPL) is a statistical procedure proposed for multivariate data analysis using a two-layer network given in Eq. (2). This procedure derives its name from the fact that it interprets high dimensional data through well-chosen lower-dimensional projections. The \"pursuit\" part of the name refers to optimization with respect to the projection directions. 
\n\n3.1 COMPARATIVE STRUCTURES OF PPL AND BPL\n\nSimilar to a BPL perceptron, a PPL network forms projections of the data in directions determined from the interconnection weights. However, unlike a BPL perceptron, which employs a fixed set of nonlinear (sigmoidal) functions, a PPL non-parametrically estimates the nonlinear nodal functions based on a nonlinear optimization approach which involves use of a one-dimensional data smoother (e.g., a least squares estimator followed by a variable window span data averaging mechanism) (Friedman85 [3]). Therefore, it is important to note that a PPL network is a semi-parametric learning network, which consists of both parametrically and non-parametrically estimated elements. This is in contrast to a BPL perceptron, which is a completely parametric model.\n\n3.2 LEARNING STRATEGIES OF PPL\n\nIn comparison with a batch BPL, which employs either 1st-order gradient descent or 2nd-order Newton-like methods to estimate the weights of all layers simultaneously after all the training patterns are presented, a PPL learns neuron-by-neuron and layer-by-layer cyclically after all the training patterns are presented. Specifically, it applies linear least squares to estimate the output-layer weights, a one-dimensional data smoother to estimate the nonlinear nodal functions of each hidden neuron, and the Gauss-Newton nonlinear least squares method to estimate the input-layer weights.\n\nThe PPL procedure uses the batch learning technique to iteratively minimize the mean squared error, $E$, over all the training data. 
All the parameters to be estimated are hierarchically divided into $m$ groups (each associated with one hidden neuron), and each group, say the $k$th group, is further divided into three subgroups: the output-layer weights, $\\{\\beta_{ik}, i = 1, \\ldots, q\\}$, connected to the $k$th hidden neuron; the nonlinear function, $f_k(u)$, of the $k$th hidden neuron; and the input-layer weights, $\\{w_{kj}, j = 1, \\ldots, p\\}$, connected to the $k$th hidden neuron. The PPL starts from updating the parameters associated with the first hidden neuron (group) by updating each subgroup, $\\{\\beta_{i1}\\}$, $f_1(u)$, and $\\{w_{1j}\\}$ consecutively (layer-by-layer) to minimize the mean squared error $E$. It then updates the parameters associated with the second hidden neuron by consecutively updating $\\{\\beta_{i2}\\}$, $f_2(u)$, and $\\{w_{2j}\\}$. A complete updating pass ends at the updating of the parameters associated with the $m$th (the last) hidden neuron by consecutively updating $\\{\\beta_{im}\\}$, $f_m(u)$, and $\\{w_{mj}\\}$. Repeated updating passes are made over all the groups until convergence (i.e., in our studies of Section 4, we use the stopping criterion that $|E^{(new)} - E^{(old)}| / E^{(old)}$ be smaller than a prespecified small constant, $\\xi = 0.005$).\n\n4 LEARNING EFFICIENCY IN BPL AND PPL\n\nHaving discussed the \"parametric\" BPL and the \"semi-parametric\" PPL from structural, computational, and theoretical viewpoints, we have also made a more practical comparison of learning efficiency via a simulation study. For simplicity of comparison, we confine the simulations to the two-dimensional univariate case, i.e., $p = 2$, $q = 1$. This is an important situation in practice, because the models can be visualized graphically as functions $y = g(x_1, x_2)$. 
\n\n4.1 PROTOCOLS OF THE SIMULATIONS\n\nNonlinear Functions: There are five nonlinear functions $g^{(j)}: [0,1]^2 \\to R$ investigated (Maechler90 [7]), which are scaled such that the standard deviation is 1 (for a large regular grid of 2500 points on $[0,1]^2$), and translated to make the range nonnegative.\n\nTraining and Test Data: Two independent variables (carriers) $(x_{l1}, x_{l2})$ were generated from the uniform distribution $U([0,1]^2)$, i.e., the abscissa values $\\{(x_{l1}, x_{l2})\\}$ were generated as uniform random variates on $[0,1]$ and independent from each other. We generated 225 pairs $\\{(x_{l1}, x_{l2})\\}$ of abscissa values, and used this same set for experiments of all five different functions, thus eliminating an unnecessary extra random component of the simulation. In addition to one set of noiseless training data, another set of noisy training data was also generated by adding iid Gaussian noise.\n\nAlgorithm Used: The PPL simulations were conducted using the S-Plus package (S-Plus90 [1]) implementation of PPL, where 3 and 5 hidden neurons were tried (with 5 and 7 maximum working hidden neurons used separately to avoid overfitting). The S-Plus implementation is based on the Friedman code (Friedman85 [3]), which uses a Gauss-Newton method for updating the lower layer weights. To obtain a fair comparison, the BPL was implemented using a batch Gauss-Newton method (rather than the usual gradient descent, which is slower) on two-layer perceptrons with linear output neurons and nonlinear sigmoidal hidden neurons (Hwang90 [4], Hwang91 [5]), where 5 and 10 hidden neurons were tried. 
\n\nIndependent Test Data Set: The assessment of performance was done by comparing the fitted models with the \"true\" function counterparts on a large independent test set. Throughout all the simulations, we used the same set of test data for performance assessment, i.e., $\\{g^{(j)}(x_{l1}, x_{l2})\\}$, of size $N = 10000$, namely a regularly spaced grid on $[0,1]^2$, defined by its marginals.\n\n4.2 SIMULATION RESULTS IN LEARNING EFFICIENCY\n\nTo summarize the simulation results in learning efficiency, we focused on three chosen aspects: accuracy, parsimony, and speed.\n\nLearning Accuracy: The accuracy, determined by the absolute $L_2$ error measure on the independent test data, is quite comparable for both learning methods, whether trained on noiseless or noisy data (Hwang91 [5]). Note that our comparisons are based on 5 & 10 hidden neurons of BPLs and 3 & 5 hidden neurons of PPLs. The reason for choosing different numbers of hidden neurons will be explained in the learning parsimony section.\n\nLearning Parsimony: In comparison with BPL, the PPL is more parsimonious in training all types of nonlinear functions, i.e., in order to achieve comparable accuracy to the BPLs for two-layer perceptrons, the PPLs require fewer hidden neurons (more parsimonious) to approximate the desired true function (Hwang91 [5]). Several factors may contribute to this favorable performance. First and foremost, the data-smoothing technique creates more pertinent nonlinear nodal functions, so the network adapts more efficiently to the observation data without using too many terms (hidden neurons) of interpolative projections. 
Secondly, the batch Gauss-Newton BPL updates all the weights in the network simultaneously while the PPL updates cyclically (neuron-by-neuron and layer-by-layer), which allows the most recent updating information to be used in the subsequent updating. That is, more important projection directions can be determined first so that the less important projections can have an easier search (the same argument used in favoring the Gauss-Seidel method over the Jacobi method in an iterative linear equation solver).\n\nLearning Speed: As we reported earlier (Maechler90 [7]), the PPL took much less time (1-2 orders of magnitude speedup) in achieving accuracy comparable with that of the sequential gradient descent BPL. Interestingly, when compared with the batch Gauss-Newton BPL, the PPL took a quite similar amount of time over all the simulations (under the same number of hidden neurons and the same convergence threshold $\\xi = 0.005$). In all simulations, both the BPLs and PPLs can converge in under 100 iterations most of the time.\n\n5 SENSITIVITY TO OUTLIERS\n\nBoth BPLs and PPLs are types of nonlinear least squares estimators. Hence, like all least squares procedures, they are sensitive to outliers. The outliers may come from large errors in measurements, generated by heavy tailed deviations from a Gaussian distribution for the noise $\\epsilon_{li}$ in Eq. (1).\n\nIn the presence of additive Gaussian noise without outliers, most functions can be well approximated by 5-10 hidden neurons using BPL or with 3-5 hidden neurons using PPL. 
When the Gaussian noise is altered by adding one outlier, the BPL with 5-10 hidden neurons can still approximate the desired function reasonably well in general, at the sacrifice of magnified error in the vicinity of the outlier. If the number of outliers increases to 3 in the same corner, the BPL can only get a \"distorted\" approximation of the desired function. On the other hand, the PPL with 5 hidden neurons can successfully approximate the desired function and remove the single outlier. In the case of three outliers, the PPL using simple data smoothing techniques can no longer keep its robustness in accuracy of approximation.\n\nAcknowledgements\n\nThis research was partially supported through grants from the National Science Foundation under Grant No. ECS-9014243.\n\nReferences\n\n[1] S-Plus Users Manual (Version 3.0). Statistical Science Inc., Seattle, WA, 1990.\n\n[2] D.L. Donoho and I.M. Johnstone. Projection-based approximation and a duality with kernel methods. The Annals of Statistics, Vol. 17, No. 1, pp. 58-106, 1989.\n\n[3] J.H. Friedman. Classification and multiple regression through projection pursuit. Technical Report No. 12, Department of Statistics, Stanford University, January 1985.\n\n[4] J.N. Hwang and P.S. Lewis. From nonlinear optimization to neural network learning. In Proc. 24th Asilomar Conf. on Signals, Systems, & Computers, pp. 985-989, Pacific Grove, CA, November 1990.\n\n[5] J.N. Hwang, H. Li, D. Martin, J. Schimert. The learning parsimony of projection pursuit and back-propagation networks. In 25th Asilomar Conf. on Signals, Systems, & Computers, Pacific Grove, CA, November 1991.\n\n[6] L.K. Jones. On a conjecture of Huber concerning the convergence of projection pursuit regression. 
The Annals of Statistics, Vol. 15, No. 2, pp. 880-882, 1987.\n\n[7] M. Maechler, D. Martin, J. Schimert, M. Csoppenszky and J.N. Hwang. Projection pursuit learning networks for regression. In Proc. 2nd Int'l Conf. Tools for AI, pp. 350-358, Washington D.C., November 1990.\n", "award": [], "sourceid": 578, "authors": [{"given_name": "Jenq-Neng", "family_name": "Huang", "institution": null}, {"given_name": "Hang", "family_name": "Li", "institution": null}, {"given_name": "Martin", "family_name": "Maechler", "institution": null}, {"given_name": "R.", "family_name": "Martin", "institution": null}, {"given_name": "Jim", "family_name": "Schimert", "institution": null}]}