{"title": "Optimal Kernel Shapes for Local Linear Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 540, "page_last": 546, "abstract": null, "full_text": "Optimal Kernel Shapes for Local Linear Regression \n\nDirk Ormoneit \n\nTrevor Hastie \n\nDepartment of Statistics \n\nStanford University \n\nStanford, CA 94305-4065 \normoneit@stat.stanford.edu \n\nAbstract \n\nLocal linear regression performs very well in many low-dimensional forecasting problems. In high-dimensional spaces, its performance typically decays due to the well-known \"curse-of-dimensionality\". A possible way to approach this problem is by varying the \"shape\" of the weighting kernel. In this work we suggest a new, data-driven method for estimating the optimal kernel shape. Experiments using an artificially generated data set and data from the UC Irvine repository show the benefits of kernel shaping. \n\n1  Introduction \n\nLocal linear regression has attracted considerable attention in both the statistical and the machine learning literature as a flexible tool for nonparametric regression analysis [Cle79, FG96, AMS97]. Like most statistical smoothing approaches, local modeling suffers from the so-called \"curse-of-dimensionality\", the well-known fact that the proportion of the training data that lie in a fixed-radius neighborhood of a point decreases to zero at an exponential rate with increasing dimension of the input space. Due to this problem, the bandwidth of a weighting kernel must be chosen very large so as to contain a reasonable sample fraction. As a result, the estimates produced are typically highly biased. One possible way to reduce the bias of local linear estimates is to vary the \"shape\" of the weighting kernel. In this work, we suggest a method for estimating the optimal kernel shape using the training data. 
For this purpose, we parameterize the kernel in terms of a suitable \"shape matrix\", $L$, and minimize the mean squared forecasting error with respect to $L$. For such an approach to be meaningful, the \"size\" of the weighting kernel must be constrained during the minimization to avoid overfitting. We propose a new, entropy-based measure of the kernel size as a constraint. By analogy to the nearest neighbor approach to bandwidth selection [FG96], the suggested measure is adaptive with regard to the local data density. In addition, it leads to an efficient gradient descent algorithm for the computation of the optimal kernel shape. Experiments using an artificially generated data set and data from the UC Irvine repository show that kernel shaping can improve the performance of local linear estimates substantially. \n\nThe remainder of this work is organized as follows. In Section 2 we briefly review local linear models and introduce our notation. In Section 3 we formulate an objective function for kernel shaping, and in Section 4 we discuss entropic neighborhoods. Section 5 describes our experimental results and Section 6 presents conclusions. \n\n2  Local Linear Models \n\nConsider a nonlinear regression problem where a continuous response $y \\in \\mathbb{R}$ is to be predicted based on a $d$-dimensional predictor $x \\in \\mathbb{R}^d$. Let $D \\equiv \\{(x_t, y_t), t = 1, \\ldots, T\\}$ denote a set of training data. To estimate the conditional expectation $f(x_0) \\equiv E[y | x_0]$, we consider the local linear expansion $f(x) \\approx \\alpha_0 + (x - x_0)'\\beta_0$ in the neighborhood of $x_0$. In detail, we minimize the weighted least squares criterion \n\n$$C(\\alpha, \\beta; x_0) \\equiv \\sum_{t=1}^{T} (y_t - \\alpha - (x_t - x_0)'\\beta)^2 \\, k(x_t, x_0) \\tag{1}$$ \n\nto determine estimates of the parameters $\\alpha_0$ and $\\beta_0$. 
Here $k(x_t, x_0)$ is a non-negative weighting kernel that assigns more weight to residuals in the neighborhood of $x_0$ than to residuals distant from $x_0$. In multivariate problems, a standard way of defining $k(x_t, x_0)$ is by applying a univariate, non-negative \"mother kernel\" $\\phi(z)$ to the distance measure $\\|x_t - x_0\\|_\\Omega \\equiv \\sqrt{(x_t - x_0)'\\Omega(x_t - x_0)}$: \n\n$$k(x_t, x_0) \\equiv \\frac{\\phi(\\|x_t - x_0\\|_\\Omega)}{\\sum_{s=1}^{T} \\phi(\\|x_s - x_0\\|_\\Omega)}. \\tag{2}$$ \n\nHere $\\Omega$ is a positive definite $d \\times d$ matrix determining the relative importance assigned to different directions of the input space. For example, if $\\phi(z)$ is a standard normal density, $k(x_t, x_0)$ is a normalized multivariate Gaussian with mean $x_0$ and covariance matrix $\\Omega^{-1}$. Note that $k(x_t, x_0)$ is normalized so as to satisfy $\\sum_{t=1}^{T} k(x_t, x_0) = 1$. Even though this restriction is not directly relevant to the estimation of $\\alpha_0$ and $\\beta_0$, it will be needed in our discussion of entropic neighborhoods in Section 4. \n\nUsing the shorthand notation $\\hat{\\gamma}(x_0, \\Omega) \\equiv (\\hat{\\alpha}_0, \\hat{\\beta}_0')'$, the solution of the minimization problem (1) may be written conveniently as \n\n$$\\hat{\\gamma}(x_0, \\Omega) = (X'WX)^{-1} X'WY, \\tag{3}$$ \n\nwhere $X$ is the $T \\times (d + 1)$ design matrix with rows $(1, (x_t - x_0)')$, $Y$ is the vector of response values, and $W$ is a $T \\times T$ diagonal matrix with entries $W_{t,t} = k(x_t, x_0)$. The resulting local linear fit at $x_0$ using the inverse covariance matrix $\\Omega$ is simply $\\hat{f}(x_0; \\Omega) \\equiv \\hat{\\alpha}_0$. Obviously, $\\hat{f}(x_0; \\Omega)$ depends on $\\Omega$ through the definition of the weighting kernel (2). In the discussion below, our focus is on choices of $\\Omega$ that lead to favorable estimates of the unknown function value $f(x_0)$. \n\n3  Kernel Shaping \n\nThe local linear estimates resulting from different choices of $\\Omega$ vary considerably in practice. 
A common strategy is to choose $\\Omega$ proportional to the inverse sample covariance matrix. The remaining problem of finding the optimal scaling factor is equivalent to the problem of bandwidth selection in univariate smoothing [FG96, BBB99]. For example, in practical applications the bandwidth is frequently chosen as a function of the distance between $x_0$ and its $k$th nearest neighbor [FG96]. In this paper, we take a different viewpoint and argue that optimizing the \"shape\" of the weighting kernel is at least as important as optimizing the bandwidth. More specifically, for a fixed \"volume\" of the weighting kernel, the bias of the estimate can be reduced drastically by shrinking the kernel in directions of large nonlinear variation of $f(x)$, and stretching it in directions of small nonlinear variation. This idea is illustrated using the example shown in Figure 1. The plotted function is sigmoidal along an index vector $\\kappa$, and constant in directions orthogonal to $\\kappa$. Therefore, a \"shaped\" weighting kernel is shrunk in the direction $\\kappa$ and stretched orthogonally to $\\kappa$, minimizing the exposure of the kernel to the nonlinear variation. \n\nFigure 1: Left: Example of a single index model of the form $y = g(x'\\kappa)$ with $\\kappa = (1, 1)$ and $g(z) = \\tanh(3z)$. Right: The contours of $g(z)$ are straight lines orthogonal to $\\kappa$. \n\nTo distinguish formally the metric and the bandwidth of the weighting kernel, we rewrite $\\Omega$ as follows: \n\n$$\\Omega \\equiv \\lambda \\cdot (LL' + I). \\tag{4}$$ \n\nHere $\\lambda$ corresponds to the inverse bandwidth, and $L$ may be interpreted as a metric or shape matrix. Below we suggest an algorithm which is designed to minimize the bias with respect to the kernel metric. 
Clearly, for such an approach to be meaningful, we need to restrict the \"volume\" of the weighting kernel; otherwise, the bias of the estimate could be minimized trivially by choosing a zero bandwidth. For example, we might define $\\lambda$ contingent on $L$ so as to satisfy $|\\Omega| = c$ for some constant $c$. A serious disadvantage of this idea is that, by contrast to the nearest neighbor approach, $|\\Omega|$ is independent of the design. As a more appropriate alternative, we define $\\lambda$ in terms of a measure of the number of neighboring observations. In detail, we fix the volume of $k(x_t, x_0)$ in terms of the \"entropy\" of the weighting kernel. Then, we choose $\\lambda$ so as to satisfy the resulting entropy constraint. Given this definition of the bandwidth, we determine the metric of $k(x_t, x_0)$ by minimizing the mean squared prediction error \n\n$$C(L; D) \\equiv \\sum_{t=1}^{T} (y_t - \\hat{f}(x_t; \\Omega))^2 \\tag{5}$$ \n\nwith respect to $L$. In this way, we obtain an approximation of the optimal kernel shape because the expectation of $C(L; D)$ differs from the bias only by a variance term which is independent of $L$. Details of the entropic neighborhood criterion and of the numerical minimization procedure are described next. \n\n4  Entropic Neighborhoods \n\nWe mentioned previously that, for a given shape matrix $L$, we choose the bandwidth parameter $\\lambda$ in (4) so as to fulfill a volume constraint on the weighting kernel. For this purpose, we interpret the kernel weights $k(x_t, x_0)$ as probabilities. In particular, as $k(x_t, x_0) > 0$ and $\\sum_t k(x_t, x_0) = 1$ by definition (2), we can formulate the local entropy of $k(x_t, x_0)$: \n\n$$H(\\Omega) \\equiv -\\sum_{t=1}^{T} k(x_t, x_0) \\log k(x_t, x_0). \\tag{6}$$ \n\nThe entropy of a probability distribution is typically thought of as a measure of uncertainty. 
In the context of the weighting kernel $k(x_t, x_0)$, $H(\\Omega)$ can be used as a smooth measure of the \"size\" of the neighborhood that is used for averaging. To see this, note that in the extreme case where equal weights are placed on all observations in $D$, the entropy is maximized. At the other extreme, if the single nearest neighbor of $x_0$ is assigned the entire weight of one, the entropy attains its minimum value zero. Thus, fixing the entropy at a constant value $c$ is similar to fixing the number $k$ in the nearest neighbor approach. Besides justifying (6), the correspondence between $k$ and $c$ can also be used to derive a more intuitive volume parameter than the entropy level $c$. We specify $c$ in terms of a hypothetical weighting kernel that places equal weight on the $k$ nearest neighbors of $x_0$ and zero weight on the remaining observations. Note that the entropy of this hypothetical kernel is $\\log k$. Thus, it is natural to characterize the size of an entropic neighborhood in terms of $k$, and then to determine $\\lambda$ by numerically solving the nonlinear equation system (for details, see [OH99]) \n\n$$H(\\Omega) = \\log k. \\tag{7}$$ \n\nMore precisely, we report the number of neighbors in terms of the equivalent sample fraction $p \\equiv k/T$ to aid intuition. This idea is illustrated in Figure 2 using a one- and a two-dimensional example. The equivalent sample fractions are $p = 30\\%$ and $p = 50\\%$, respectively. Note that in both cases the weighting kernel is wider in regions with few observations, and narrower in regions with many observations. As a consequence, the number of observations within contours of equal weighting remains approximately constant across the input space. 
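\n\nAs a brief check of the claim above that the hypothetical kernel has entropy $\\log k$ (a worked step added for illustration; the index set $N_k(x_0)$ of the $k$ nearest neighbors of $x_0$ is notation assumed for this sketch and is not taken from the original derivation): substituting $k(x_t, x_0) = 1/k$ for $t \\in N_k(x_0)$ and $k(x_t, x_0) = 0$ otherwise into (6), with the usual convention $0 \\log 0 = 0$, gives \n\n$$H = -\\sum_{t \\in N_k(x_0)} \\frac{1}{k} \\log \\frac{1}{k} = k \\cdot \\frac{1}{k} \\log k = \\log k.$$ \n\nThe two extremes mentioned above follow as special cases: $k = 1$ yields $H = 0$, and $k = T$ yields the maximum $H = \\log T$. 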
\n\nFigure 2: Left: Univariate weighting kernel $k(\\cdot, x_0)$ evaluated at $x_0 = 0.3$ and $x_0 = 0.7$ based on a sample data set of 100 observations (indicated by the bars at the bottom). Right: Multivariate weighting kernel $k(\\cdot, x_0)$ based on a sample data set of 200 observations. The two ellipsoids correspond to 95% contours of a weighting kernel evaluated at $(0.3, 0.3)'$ and $(0.6, 0.6)'$. \n\nTo summarize, we define the value of $\\lambda$ by fixing the equivalent sample fraction parameter $p$, and subsequently minimize the prediction error on the training set with respect to the shape matrix $L$. Note that we allow for the possibility that $L$ may be of reduced rank $l \\le d$ as a means of controlling the number of free parameters. As a minimization procedure, we use a variant of gradient descent that accounts for the entropy constraint. In particular, our algorithm relies on the fact that (7) is differentiable with respect to $L$. Due to space limitations, the interested reader is referred to [OH99] for a formal derivation of the involved gradients and for a detailed description of the optimization procedure. \n\n5  Experiments \n\nIn this section we compare kernel shaping to standard local linear regression using a fixed spherical kernel in two examples. First, we evaluate the performance using a simple toy problem which allows us to estimate confidence intervals for the prediction accuracy using Monte Carlo simulation. Second, we investigate a data set from the machine learning database at UC Irvine [BKM98]. \n\n5.1  Mexican Hat Function \n\nIn our first example, we employ Monte Carlo simulation to evaluate the performance of kernel shaping in a five-dimensional regression problem. 
For this purpose, 20 sets of 500 data points each are generated independently according to the model \n\n$$y = \\cos\\left(5\\sqrt{x_1^2 + x_2^2}\\right) \\cdot \\exp(-(x_1^2 + x_2^2)). \\tag{8}$$ \n\nHere the predictor variables $x_1, \\ldots, x_5$ are drawn according to a five-dimensional standard normal distribution. Note that, even though the regression is carried out in a five-dimensional predictor space, $y$ is really only a function of the variables $x_1$ and $x_2$. In particular, as dimensions three through five do not contribute any information with regard to the value of $y$, kernel shaping should effectively discard these variables. Note also that there is no noise in this example. \n\nFigure 3: Left: \"True\" Mexican hat function. Middle: Local linear estimate using a spherical kernel ($p = 2\\%$). Right: Local linear estimate using kernel shaping ($p = 2\\%$). Both estimates are based on a training set consisting of 500 data points. \n\nFigure 3 shows a plot of the true function, the spherical estimate, and the estimate using kernel shaping as functions of $x_1$ and $x_2$. The true function has the familiar \"Mexican hat\" shape, which is recovered by the estimates to different degrees. We evaluate the local linear estimates for values of the equivalent neighborhood fraction parameter $p$ in the range from 1% to 15%. Note that, to warrant a fair comparison, we used the entropic neighborhood also to determine the bandwidth of the spherical estimate. For each value of $p$, 20 models are estimated using the 20 artificially generated training sets, and subsequently their performance is evaluated on the training set and on the test set of $31 \\times 31$ grid points shown in Figure 3. The shape matrix $L$ has maximal rank $l = 5$ in this experiment. Our results for local linear regression using the spherical kernel and kernel shaping are summarized in Table 1. 
Performance is measured in terms of the mean $R^2$-value of the 20 models, and standard deviations are reported in parentheses. \n\nAlgorithm          Parameter   Training R2      Test R2 \nspherical kernel   p = 1%      0.961 (0.005)    0.215 (0.126) \nspherical kernel   p = 2%      0.871 (0.014)    0.293 (0.082) \nspherical kernel   p = 5%      0.680 (0.029)    0.265 (0.043) \nspherical kernel   p = 10%     0.507 (0.038)    0.213 (0.030) \nspherical kernel   p = 20%     0.341 (0.039)    0.164 (0.021) \nkernel shaping     p = 1%      0.995 (0.001)    0.882 (0.024) \nkernel shaping     p = 2%      0.984 (0.002)    0.909 (0.017) \nkernel shaping     p = 5%      0.923 (0.009)    0.836 (0.023) \nkernel shaping     p = 15%     0.628 (0.035)    0.517 (0.035) \n\nTable 1: Performances in the toy problem. The results for kernel shaping were obtained using 200 gradient descent steps with step size a = 0.2. \n\nThe results in Table 1 indicate that the optimal performance on the test set is obtained using the parameter value $p = 2\\%$ both for kernel shaping ($R^2 = 0.909$) and for the spherical kernel ($R^2 = 0.293$). Given the large difference between the $R^2$ values, we conclude that kernel shaping clearly outperforms the spherical kernel on this data set. \n\nFigure 4: The eigenvectors of the estimate of $\\Omega$ obtained on the first of 20 training sets. The graphs are ordered from left to right by increasing eigenvalues (decreasing extension of the kernel in that direction): 0.76, 0.76, 0.76, 33.24, 34.88. \n\nFinally, Figure 4 shows the eigenvectors of the optimized $\\Omega$ on the first of the 20 training sets. The eigenvectors are arranged according to the size of the corresponding eigenvalues. 
Note that the two rightmost eigenvectors, which correspond to the directions of minimum kernel extension, span exactly the $x_1$-$x_2$-space where the true function lives. The kernel is stretched in the remaining directions, effectively discarding nonlinear contributions from $x_3$, $x_4$, and $x_5$. \n\n5.2  Abalone Database \n\nThe task in our second example is to predict the age of abalone based on several measurements. More specifically, the response variable is obtained by counting the number of rings in the shell in a time-consuming procedure. Preferably, the age of the abalone could be predicted from alternative measurements that may be obtained more easily. In the data set, eight candidate measurements including sex, dimensions, and various weights are reported as predictor variables, along with the number of rings of the abalone. We normalize these variables to zero mean and unit variance prior to estimation. Overall, the data set consists of 4177 observations. To prevent possible artifacts resulting from the order of the data records, we randomly draw 2784 observations as a training set and use the remaining 1393 observations as a test set. Our results are summarized in Table 2 using various settings for the rank $l$, the equivalent fraction parameter $p$, and the gradient descent step size a. The optimal choice for $p$ is 20% both for kernel shaping ($R^2 = 0.582$) and for the spherical kernel ($R^2 = 0.572$). Note that the performance improvement due to kernel shaping is negligible in this experiment. \n\nKernel             Parameters                   Training R2   Test R2 \nspherical kernel   p = 0.05                     0.752         0.543 \nspherical kernel   p = 0.10                     0.686         0.564 \nspherical kernel   p = 0.20                     0.639         0.572 \nspherical kernel   p = 0.50                     0.595         0.565 \nspherical kernel   p = 0.70                     0.581         0.552 \nspherical kernel   p = 0.90                     0.568         0.533 \nkernel shaping     l = 5, p = 0.20, a = 0.5     0.705         0.575 \nkernel shaping     l = 5, p = 0.20, a = 0.2     0.698         0.577 \nkernel shaping     l = 2, p = 0.10, a = 0.2     0.729         0.574 \nkernel shaping     l = 2, p = 0.20, a = 0.2     0.663         0.582 \nkernel shaping     l = 2, p = 0.50, a = 0.2     0.603         0.571 \nkernel shaping     l = 2, p = 0.20, a = 0.5     0.669         0.582 \n\nTable 2: Results using the Abalone database after 200 gradient descent steps. \n\n6  Conclusions \n\nWe introduced a data-driven method to improve the performance of local linear estimates in high dimensions by optimizing the shape of the weighting kernel. In our experiments we found that kernel shaping clearly outperformed local linear regression using a spherical kernel in a five-dimensional toy example, and led to a small performance improvement in a second, real-world example. To explain the results of the second experiment, we note that kernel shaping aims at exploiting global structure in the data. Thus, the absence of a larger performance improvement may suggest simply that no corresponding structure prevails in that data set. That is, even though optimal kernel shapes exist locally, they may vary across the predictor space so that they cannot be approximated by any particular global shape. Preliminary experiments using a localized variant of kernel shaping did not lead to significant performance improvements. 
\n\nAcknowledgments \n\nThe work of Dirk Ormoneit was supported by a grant of the Deutsche Forschungsgemeinschaft (DFG) as part of its post-doctoral program. Trevor Hastie was partially supported by NSF grant DMS-9803645 and NIH grant R01-CA-72028-01. Carrie Grimes pointed us to misleading formulations in earlier drafts of this work. \n\nReferences \n\n[AMS97] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11:11-73, 1997. \n\n[BBB99] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive least squares algorithm. In M. J. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11. The MIT Press, 1999. \n\n[BKM98] C. Blake, E. Keogh, and C. J. Merz. UCI Repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. \n\n[Cle79] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74:829-836, 1979. \n\n[FG96] J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. Chapman & Hall, 1996. \n\n[OH99] D. Ormoneit and T. Hastie. Optimal kernel shapes for local linear regression. Tech. report 1999-11, Department of Statistics, Stanford University, 1999. \n", "award": [], "sourceid": 1755, "authors": [{"given_name": "Dirk", "family_name": "Ormoneit", "institution": null}, {"given_name": "Trevor", "family_name": "Hastie", "institution": null}]}