{"title": "Rodeo: Sparse Nonparametric Regression in High Dimensions", "book": "Advances in Neural Information Processing Systems", "page_first": 707, "page_last": 714, "abstract": null, "full_text": "Rodeo: Sparse Nonparametric Regression in High Dimensions\n\nJohn Lafferty School of Computer Science Carnegie Mellon University\n\nLarry Wasserman Department of Statistics Carnegie Mellon University\n\nAbstract\nWe present a method for nonparametric regression that performs bandwidth selection and variable selection simultaneously. The approach is based on the technique of incrementally decreasing the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, our approach avoids the curse of dimensionality, achieving the optimal minimax rate of convergence, up to logarithmic factors, as if the relevant variables were known in advance. The method--called rodeo (regularization of derivative expectation operator)--conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces hard with soft thresholding effectively solves a sequence of lasso problems.\n\n1\n\nIntroduction\n\nEstimating a high dimensional regression function is notoriously difficult due to the \"curse of dimensionality.\" Minimax theory precisely characterizes the curse. Let Yi = m(Xi ) + i , i = 1, . . . , n where Xi = (Xi (1), . . . , Xi (d)) Rd is a d-dimensional covariate, m : Rd R is the unknown function to estimate, and i N (0, 2 ). Then if m is in W2 (c), the d-dimensional Sobolev ball of order two and radius c, it is well known that (1) lim inf n4/(4+d) inf sup R(mn , m) > 0 ,\nn\n\n( where R(mn , m) = Em mn (x) - m(x))2 dx is the risk of the estimate mn constructed on a sample of size n (Gyorfi et al. 2002). 
Thus, the best rate of convergence is n^{−4/(4+d)}, which is impractically slow if d is large.\n\nHowever, for some applications it is reasonable to expect that the true function only depends on a small number of the total covariates. Suppose that m satisfies such a sparseness condition, so that m(x) = m(xR) where xR = (xj : j ∈ R), R ⊂ {1, . . . , d} is a subset of the d covariates, of size r = |R| ≪ d. We call {xj}_{j ∈ R} the relevant variables. Under this sparseness assumption we can hope to achieve the better minimax convergence rate of n^{−4/(4+r)} if the r relevant variables can be isolated. Thus, we are faced with the problem of variable selection in nonparametric regression. A large body of previous work has addressed this fundamental problem, and has led to a variety of methods to combat the curse of dimensionality. Many of these are based on very clever, though often heuristic, techniques. For additive models of the form f(x) = Σj fj(xj), standard methods like stepwise selection, Cp and AIC can be used (Hastie et al. 2001). For spline models, Zhang et al. (2005) use likelihood basis pursuit, essentially the lasso adapted to the spline setting. CART (Breiman et al. 1984) and MARS (Friedman 1991) effectively perform variable selection as part of their function fitting. More recently, Li et al. (2005) use independence testing for variable selection and Buhlmann and Yu (2005) introduced a boosting approach. While these methods have met with varying degrees of empirical success, they can be challenging to implement and demanding computationally. Moreover, these methods are typically difficult to analyze theoretically, and so often come with no formal guarantees. Indeed, the theoretical analysis of sparse parametric estimators such as the lasso (Tibshirani 1996) is difficult, and only recently has significant progress been made on this front (Donoho 2004; Fu and Knight 2000). 
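To make the gap between the two rates concrete, the following short sketch (our own illustration, not from the paper) compares the sample sizes at which the minimax rates n^{−4/(4+d)} and n^{−4/(4+r)} reach a given error level; the tolerance ε = 0.1 is an arbitrary choice, while d = 20 and r = 2 match the paper's later synthetic experiment.

```python
import math

def samples_needed(eps, dim):
    """Sample size n (as a float) at which the minimax rate
    n**(-4/(4+dim)) first reaches the error level eps,
    i.e. n = eps**(-(4+dim)/4)."""
    return eps ** (-(4 + dim) / 4.0)

eps = 0.1
full = samples_needed(eps, dim=20)   # curse of dimensionality: rate n^(-1/6)
sparse = samples_needed(eps, dim=2)  # sparse case: rate n^(-2/3)
print(full, sparse)
```

With these choices the full-dimensional rate needs on the order of a million samples while the sparse rate needs a few dozen, which is the sense in which isolating the r relevant variables defeats the curse.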
In this paper we present a new approach to sparse nonparametric function estimation that is both computationally simple and amenable to theoretical analysis. We call the general framework rodeo, for regularization of derivative expectation operator. It is based on the idea that bandwidth and variable selection can be simultaneously performed by computing the infinitesimal change in a nonparametric estimator as a function of the smoothing parameters, and then thresholding these derivatives to effectively get a sparse estimate. As a simple version of this principle we use hard thresholding, effectively carrying out a sequence of hypothesis tests. A modified version that replaces testing with soft thresholding effectively solves a sequence of lasso problems. The potential appeal of this approach is that it can be based on relatively simple and theoretically well understood nonparametric techniques such as local linear smoothing, leading to methods that are simple to implement and can be used in high dimensional problems. Moreover, we show that the rodeo can achieve near optimal minimax rates of convergence, and therefore circumvents the curse of dimensionality when the true function is indeed sparse. When applied in one dimension, our method yields a locally optimal bandwidth. We present experiments on both synthetic and real data that demonstrate the effectiveness of the new approach.\n\n2 Rodeo: The Main Idea\n\nThe key idea in our approach is as follows. Fix a point x and let m̂h(x) denote an estimator of m(x) based on a vector of smoothing parameters h = (h1, . . . , hd). If c is a scalar, then we write h = c to mean h = (c, . . . , c). Let M(h) = E(m̂h(x)) denote the mean of m̂h(x). For now, assume that x = Xi is one of the observed data points and that m̂0(x) = Yi. In that case, m(x) = M(0) = E(Yi). If P = (h(t) : 0 ≤ t ≤ 1) is a smooth path through the set of smoothing parameters with h(0) = 0 and h(1) = 1 (or any other fixed, large bandwidth) then\n\nm(x) = M(0) = M(1) − ∫₀¹ (dM(h(s))/ds) ds = M(1) − ∫₀¹ ⟨D(h(s)), dh(s)/ds⟩ ds,\n\nwhere D(h) = ∇M(h) = (∂M/∂h1, . . . , ∂M/∂hd)ᵀ is the gradient of M(h) and dh(s)/ds is the derivative of h(s) along the path. A biased, low variance estimator of M(1) is m̂1(x). An unbiased estimator of D(h) is\n\nZ(h) = (∂m̂h(x)/∂h1, . . . , ∂m̂h(x)/∂hd)ᵀ.   (2)\n\nThe naive estimator\n\nm̂(x) = m̂1(x) − ∫₀¹ ⟨Z(h(s)), dh(s)/ds⟩ ds   (3)\n\nis identically equal to m̂0(x) = Yi, which has poor risk since the variance of Z(h) is large for small h. However, our sparsity assumption on m suggests that there should be paths for which D(h) is also sparse. Along such a path, we replace Z(h) with an estimator D̂(h) that makes use of the sparsity assumption. Our estimate of m(x) is then\n\nm̂(x) = m̂1(x) − ∫₀¹ ⟨D̂(h(s)), dh(s)/ds⟩ ds.   (4)\n\nTo implement this idea we need to do two things: (i) we need to find a sparse path and (ii) we need to take advantage of this sparseness when estimating D along that path. The key observation is that if xj is irrelevant, then we expect that changing the bandwidth hj for that variable should cause only a small change in the estimator m̂h(x). Conversely, if xj is relevant, then we expect that changing the bandwidth hj for that variable should cause a large change in the estimator. Thus, Zj = ∂m̂h(x)/∂hj should discriminate between relevant and irrelevant covariates. To simplify the procedure, we can replace the continuum of bandwidths with a discrete set where each hj ∈ B = {h0, βh0, β²h0, . . .} for some 0 < β < 1.\n\nFigure 1: The bandwidths for the relevant variables (h2) are shrunk along the rodeo path toward the optimal bandwidth, approximating the ideal path, while the bandwidths for the irrelevant variables (h1) are kept relatively large. The simplest rodeo algorithm shrinks the bandwidths in discrete steps h0, βh0, β²h0, . . . for some 0 < β < 1.\n\n
Moreover, we can proceed in a greedy fashion by estimating D(h) sequentially with hj ∈ B and setting D̂j(h) = 0 when hj < h*j, where h*j is the first bandwidth h in this sequence such that |Zj(h)| < λj(h) for some threshold λj. This greedy version, coupled with the hard threshold estimator, yields the estimate m̂(x) = m̂h*(x), a bandwidth selection procedure based on testing. A conceptual illustration of the idea is shown in Figure 1. This approach to bandwidth selection is similar to that of Lepski et al. (1997), which uses a more refined test that leads to estimators achieving good spatial adaptation over large function classes. Our approach is also similar to a method of Ruppert (1997) that uses a sequence of decreasing bandwidths and then estimates the optimal bandwidth by estimating the mean squared error as a function of bandwidth. Our greedy approach tests whether an infinitesimal change in the bandwidth from its current setting leads to a significant change in the estimate, and is more easily extended to a practical method in higher dimensions. Related work of Hristache et al. (2001) focuses on variable selection in multi-index models rather than on bandwidth estimation.\n\n3 Rodeo using Local Linear Regression\n\nWe now present the multivariate case in detail, using local linear smoothing as the basic method since it is known to have many good properties. Let x = (x(1), . . . , x(d)) be some point at which we want to estimate m. Let m̂H(x) denote the local linear estimator of m(x) using bandwidth matrix H. Thus,\n\nm̂H(x) = e1ᵀ (Xxᵀ Wx Xx)⁻¹ Xxᵀ Wx Y.   (5)\n\nWe assume that the covariates are random with sampling density f(x), and make the same assumptions as Ruppert and Wand (1994) in their analysis of the bias and variance of local linear regression. 
In particular, we assume that (i) the kernel K has compact support with zero odd moments and ∫ u uᵀ K(u) du = ν₂(K) I, and (ii) the sampling density f(x) is continuously differentiable and strictly positive. In the version of the algorithm that follows, we take K to be a product kernel and H to be diagonal with elements h = (h1, . . . , hd). In (5), e1 = (1, 0, . . . , 0)ᵀ, Wx is the n × n diagonal matrix with (i, i) element KH(Xi − x), where KH(u) = |H|⁻¹ K(H⁻¹u), and Xx is the n × (d + 1) design matrix whose i-th row is (1, (Xi − x)ᵀ). The estimator m̂H can be written as m̂H(x) = Σi G(Xi, x, h) Yi, where\n\nG(u, x, h) = e1ᵀ (Xxᵀ Wx Xx)⁻¹ (1, (u − x)ᵀ)ᵀ KH(u − x)   (6)\n\nis called the effective kernel. Our method is based on the statistic\n\nZj = ∂m̂h(x)/∂hj = Σi Gj(Xi, x, h) Yi,   (7)\n\nwhere Gj(u, x, h) = ∂G(u, x, h)/∂hj. Assuming a product kernel we have\n\nWx = diag( Πj K((X1j − xj)/hj), . . . , Πj K((Xnj − xj)/hj) ).   (8)\n\nNote that the factor |H|⁻¹ = Πi 1/hi in the kernel cancels in the expression for m̂, and therefore we can ignore it in our calculation of Zj. Straightforward calculations show that\n\n∂m̂h(x)/∂hj = e1ᵀ (Xxᵀ Wx Xx)⁻¹ Xxᵀ (∂Wx/∂hj) (Y − Xx β̂),   (9)\n\nwhere β̂ = (Xxᵀ Wx Xx)⁻¹ Xxᵀ Wx Y is the coefficient vector for the local linear fit, and ∂Wx/∂hj = Wx Dj with\n\nDj = diag( ∂ log K((X1j − xj)/hj)/∂hj, . . . , ∂ log K((Xnj − xj)/hj)/∂hj ),   (10)\n\nand thus Zj = e1ᵀ (Xxᵀ Wx Xx)⁻¹ Xxᵀ Wx Dj (Y − Xx β̂). For example, with the Gaussian kernel K(u) = exp(−u²/2) we have Dj = hj⁻³ diag((X1j − xj)², . . . , (Xnj − xj)²). Let\n\nμj(h) = E(Zj | X1, . . . , Xn) = Σi Gj(Xi, x, h) m(Xi),   (11)\n\nsj²(h) = V(Zj | X1, . . . , Xn) = σ² Σi Gj(Xi, x, h)².   (12)\n\nThen the hard thresholding version of the rodeo algorithm is given in Figure 2. The algorithm requires that we insert an estimate of σ in (12). One estimate of σ can be obtained by generalizing a method of Rice (1984). For i < ℓ, let diℓ = ||Xi − Xℓ||. 
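To make the local linear machinery concrete, here is a small sketch (our own illustration, not code from the paper) that computes the local linear fit m̂h(x) and the analytic derivatives Zj in closed form for the Gaussian product kernel, dropping the |H|⁻¹ factor since, as noted above, it does not affect the result; the function name, the synthetic data, and all parameter choices are illustrative.

```python
import numpy as np

def local_linear_Z(X, Y, x, h):
    """Local linear estimate m_hat(x) and Z_j = d m_hat / d h_j
    for a Gaussian product kernel (|H|^{-1} factor dropped)."""
    n, d = X.shape
    U = X - x                                        # rows (X_i - x)^T
    Xx = np.hstack([np.ones((n, 1)), U])             # design matrix
    w = np.exp(-0.5 * np.sum((U / h) ** 2, axis=1))  # product kernel weights
    A = Xx.T @ (w[:, None] * Xx)                     # X^T W X
    beta = np.linalg.solve(A, Xx.T @ (w * Y))        # local linear coefficients
    resid = Y - Xx @ beta                            # Y - X beta_hat
    Z = np.empty(d)
    for j in range(d):
        Dj = U[:, j] ** 2 / h[j] ** 3                # diagonal of D_j (Gaussian kernel)
        # Z_j = e1^T (X^T W X)^{-1} X^T W D_j (Y - X beta_hat)
        Z[j] = np.linalg.solve(A, Xx.T @ (w * Dj * resid))[0]
    return beta[0], Z

# illustrative synthetic data with one relevant covariate out of three
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
Y = 2.0 * (X[:, 0] + 1.0) ** 3 + 0.1 * rng.normal(size=200)
x0, h = np.full(3, 0.5), np.full(3, 0.7)
m_hat, Z = local_linear_Z(X, Y, x0, h)
```

Comparing each Zj against a central finite difference of m̂h(x) in hj confirms the closed-form derivative, and on data like the above the derivative for the relevant covariate dominates those of the irrelevant ones.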
Fix an integer J and let E denote the set of pairs (i, ℓ) corresponding to the J smallest values of diℓ. Now define σ̂² = (1/(2J)) Σ_{(i,ℓ)∈E} (Yi − Yℓ)². Then E(σ̂²) = σ² + bias, where the bias satisfies bias ≤ (D²/2) sup_x Σ_{j∈R} (∂m(x)/∂xj)² with D = max_{(i,ℓ)∈E} ||Xi − Xℓ||. There is a bias-variance tradeoff: large J makes σ̂² positively biased, and small J makes σ̂² highly variable. Note however that the bias is mitigated by sparsity (small r). This is the estimator used in our examples.\n\nRodeo: Hard thresholding version\n\n1. Select parameter 0 < β < 1 and an initial bandwidth h0 slowly decreasing to zero, with h0 = c0/log log n. Let cn = Ω(1) be a sequence satisfying dcn = Ω(log n).\n\n2. Initialize the bandwidths, and activate all covariates: (a) hj = h0, j = 1, 2, . . . , d. (b) A = {1, 2, . . . , d}.\n\n3. While A is nonempty, do for each j ∈ A: (a) Compute the estimated derivative expectation Zj (equation 7) and sj (equation 12). (b) Compute the threshold λj = sj √(2 log(d cn)). (c) If |Zj| > λj, then set hj ← βhj; otherwise remove j from A.\n\n4. Output bandwidths h* = (h1, . . . , hd) and estimator m̂(x) = m̂h*(x).\n\nFigure 2: The hard thresholding version of the rodeo, which can be applied using the derivatives Zj of any nonparametric smoother.\n\n4 Analysis\n\nIn this section we present some results on the properties of the resulting estimator. Formally, we use a triangular array approach so that f(x), m(x), d and r can all change as n changes. For convenience of notation we assume that the covariates are numbered such that the relevant variables xj correspond to 1 ≤ j ≤ r, and the irrelevant variables to j > r. To begin, we state the following technical lemmas on the mean and variance of Zj.\n\nLemma 4.1. Suppose that K is a product kernel with bandwidth vector h = (h1, . . . , hd). If the sampling density f is uniform, then μj = 0 for all j ∈ Rᶜ. 
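The following end-to-end sketch of the hard thresholding procedure is our own illustration, not the authors' code: it reuses the Gaussian-kernel closed form for Zj, plugs in the Rice-style estimate σ̂² described above, and reads the threshold as λj = sj √(2 log(d cn)); the choices of h0, β, cn, J, the bandwidth floor, and the step cap are all illustrative safety and tuning assumptions.

```python
import numpy as np

def rodeo_hard(X, Y, x, h0=0.8, beta=0.9, cn=None, max_steps=200):
    """Hard-thresholding rodeo at a point x with a Gaussian product
    kernel; returns the selected bandwidth vector (sketch of Figure 2)."""
    n, d = X.shape
    cn = cn if cn is not None else np.log(n)   # illustrative choice of c_n
    # Rice-style sigma^2 from the J closest pairs of covariate vectors
    J = min(n, 50)
    iu, ju = np.triu_indices(n, k=1)
    dist = np.linalg.norm(X[iu] - X[ju], axis=1)
    close = np.argsort(dist)[:J]
    sigma2 = np.sum((Y[iu[close]] - Y[ju[close]]) ** 2) / (2 * J)

    U = X - x
    Xx = np.hstack([np.ones((n, 1)), U])
    h = np.full(d, float(h0))
    active = set(range(d))
    for _ in range(max_steps):
        if not active:
            break
        # weights recomputed once per sweep over the active set
        w = np.exp(-0.5 * np.sum((U / h) ** 2, axis=1))
        A = Xx.T @ (w[:, None] * Xx) + 1e-10 * np.eye(d + 1)  # tiny ridge
        beta_hat = np.linalg.solve(A, Xx.T @ (w * Y))
        for j in list(active):
            Dj = U[:, j] ** 2 / h[j] ** 3
            # G_j weights g with Z_j = g @ Y, so s_j^2 = sigma^2 ||g||^2
            gp = (Xx @ np.linalg.solve(A, np.eye(d + 1)[0])) * w * Dj
            g = gp - w * (Xx @ np.linalg.solve(A, Xx.T @ gp))
            Zj = g @ Y
            sj = np.sqrt(sigma2 * np.sum(g ** 2))
            lam = sj * np.sqrt(2 * np.log(d * cn))   # threshold lambda_j
            # floor on h_j keeps the weighted design well conditioned
            if abs(Zj) > lam and h[j] * beta > 1e-3:
                h[j] *= beta
            else:
                active.discard(j)
    return h

# illustrative run: one relevant covariate out of four
rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 4))
Y = 2.0 * (X[:, 0] + 1.0) ** 3 + rng.normal(size=300)
h_sel = rodeo_hard(X, Y, x=np.full(4, 0.5))
```

In runs of this kind one expects the bandwidth of the relevant covariate to be shrunk well below h0 while the irrelevant bandwidths stay near h0, mirroring Figure 1.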
More generally, assuming that r is bounded, we have the following when hj → 0. If j ∈ Rᶜ, the derivative of the bias is\n\nμj = ∂E[m̂H(x) − m(x)]/∂hj = −ν₂ tr(HR ℋR) (∂j log f(x))² hj + oP(hj),   (13)\n\nwhere ℋ denotes the Hessian of m(x), ℋR its restriction to the relevant coordinates, and HR = diag(h1², . . . , hr²). For j ∈ R we have\n\nμj = ∂E[m̂H(x) − m(x)]/∂hj = ν₂ mjj(x) hj + oP(hj).   (14)\n\nLemma 4.2. Let R(K) = ∫ K(u)² du. Then, if hj = o(1),\n\nsj² = Var(Zj | X1, . . . , Xn) = (C/(n hj²)) Π_{k=1}^d (1/hk) (1 + oP(1)),   (15)\n\nwhere C is a constant depending on σ², the kernel through R(K), and the sampling density f(x).\n\nThese lemmas parallel the calculations of Ruppert and Wand (1994), except that the irrelevant variables have different leading terms in the expansions than the relevant variables. Our main theoretical result characterizes the asymptotic running time, selected bandwidths, and risk of the algorithm. In order to get a practical algorithm, we need to make assumptions on the functions m and f.\n\n(A1) For some constant k > 0, each j > r satisfies\n\n∂j log f(x) = O(logᵏ n / n^{1/4}).   (16)\n\n(A2) For each j ≤ r,\n\nmjj(x) ≠ 0.   (17)\n\nExplanation of the Assumptions. To give the intuition behind these assumptions, recall from Lemma 4.1 that\n\nμj = Aj hj + oP(hj) for j ≤ r,  and  μj = Bj hj + oP(hj) for j > r,   (18)\n\nwhere\n\nAj = ν₂ mjj(x),  Bj = −ν₂ tr(HR ℋR) (∂j log f(x))².   (19)\n\nMoreover, μj = 0 for j > r when the sampling density f is uniform or the data are on a regular grid. Consider assumption (A1). If f is uniform then this assumption is automatically satisfied, since then μj(h) = 0 for j > r. More generally, μj is approximately proportional to (∂j log f(x))² for j > r, which implies that |μj| ≈ 0 for irrelevant variables if f is sufficiently smooth in the variable xj. Hence, assumption (A1) can be interpreted as requiring that f is sufficiently smooth in the irrelevant dimensions. Now consider assumption (A2). Equation (18) ensures that μj is proportional to hj |mjj(x)| for small hj. 
Since we take the initial bandwidth h0 to decrease slowly with n, (A2) implies that |μj(h)| ≥ c hj |mjj(x)| for some constant c > 0, for sufficiently large n. In the following we write Yn = ÕP(an) to mean that Yn = OP(bn an) where bn is logarithmic in n; similarly, an = Ω̃(bn) if an = Ω(bn cn) where cn is logarithmic in n.\n\nTheorem 4.3. Suppose assumptions (A1) and (A2) hold. In addition, suppose that dmin = min_{j≤r} |mjj(x)| = Ω̃(1) and dmax = max_{j≤r} |mjj(x)| = Õ(1). Then the number of iterations Tn until the rodeo stops satisfies\n\nP( (1/(4+r)) log_{1/β}(n an) ≤ Tn ≤ (1/(4+r)) log_{1/β}(n bn) ) → 1,   (20)\n\nwhere an = Ω̃(1) and bn = Õ(1). Moreover, the algorithm outputs bandwidths h* that satisfy\n\nP( h*j = h0 for all j > r ) ≥ 1 − 1/logᵏ n   (21)\n\nand\n\nP( h0 (n bn)^{−1/(4+r)} ≤ h*j ≤ h0 (n an)^{−1/(4+r)} for all j ≤ r ) → 1.   (22)\n\nCorollary 4.4. Under the conditions of Theorem 4.3, the risk R(h*) of the rodeo estimator satisfies\n\nR(h*) = ÕP( n^{−4/(4+r)} ).   (23)\n\nIn the one-dimensional case, this result shows that the algorithm recovers the locally optimal bandwidth, giving an adaptive estimator, and in general attains the optimal (up to logarithmic factors) minimax rate of convergence. The proofs of these results are given in the full version of the paper.\n\n5 Some Examples and Extensions\n\nFigure 3 illustrates the rodeo on synthetic and real data. The left plot shows the bandwidths obtained on a synthetic dataset with n = 500 points of dimension d = 20. The covariates are generated as xi ~ Uniform(0, 1), the true function is m(x) = 2(x1 + 1)³ + 2 sin(10 x2), and σ = 1. The results are averaged over 50 randomly generated data sets; note that the displayed bandwidth paths are not monotonic because of this averaging. The plot shows how the bandwidths of the relevant variables shrink toward zero, while the bandwidths of the irrelevant variables remain large. 
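Dropping the logarithmic factors an and bn, Theorem 4.3 predicts a stopping time of roughly log_{1/β}(n)/(4 + r) and a final relevant bandwidth of order h0 n^{−1/(4+r)}. The short computation below (our own illustration) evaluates these for the synthetic experiment's n = 500, r = 2; β = 0.9 and h0 = 1 are illustrative values, since the paper does not report them here.

```python
import math

def predicted_steps(n, r, beta):
    """Stopping time T_n ~ log_{1/beta}(n) / (4 + r), log factors dropped."""
    return math.log(n) / (math.log(1.0 / beta) * (4 + r))

def predicted_bandwidth(n, r, h0=1.0):
    """Final bandwidth for a relevant variable, of order h0 * n^{-1/(4+r)}."""
    return h0 * n ** (-1.0 / (4 + r))

steps = predicted_steps(n=500, r=2, beta=0.9)
bw = predicted_bandwidth(n=500, r=2)
print(round(steps, 1), round(bw, 3))
```

With these values the rodeo is predicted to stop after about ten shrinkage steps, with a relevant bandwidth around a third of the initial one, broadly consistent with the left panel of Figure 3.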
Simulations on other synthetic data sets, not included here, are similar and indicate that the algorithm's performance is consistent with our theoretical analysis.\n\nThe framework introduced here has many possible generalizations. While we have focused on estimation of m locally at a point x, the idea can be extended to carry out global bandwidth and variable selection by averaging over multiple evaluation points x1, . . . , xk. These could be points of interest for estimation, could be randomly chosen, or could be taken to be identical to the observed Xi's. In addition, it is possible to consider more general paths, for example using soft thresholding or changing only the bandwidth corresponding to the largest |Zj|/λj. Such a version of the rodeo can be seen as a nonparametric counterpart to least angle regression (LARS) (Efron et al. 2004), a refinement of forward stagewise regression in which one adds the covariate most correlated with the residuals of the current fit, in small, incremental steps. Note first that Zj is essentially the correlation between the Yi's and the Gj(Xi, x, h)'s (the change in the effective kernel). Reducing the bandwidth is like adding in more of that variable. Suppose now that we make the following modifications to the rodeo: (i) change the bandwidths one at a time, based on the largest Z̃j = |Zj|/λj, and (ii) reduce the bandwidth continuously, rather than in discrete steps, until the largest Z̃j is equal to the next largest. Figure 3 (right) shows the result of running this greedy version of the rodeo on the diabetes dataset used to illustrate LARS. The algorithm averages Z̃j over a randomly chosen set of k = 100 data points. The resulting variable ordering is seen to be very similar to, but different from, the ordering obtained from the parametric LARS fit.\n\nAcknowledgments\nWe thank the reviewers for their helpful comments. 
Research supported in part by NSF grants IIS-0312814, IIS-0427206, and DMS-0104016, and NIH grants R01-CA54852-07 and MH57881.\n\nReferences\n\nL. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth Publishing Co Inc, 1984.\n\nP. Buhlmann and B. Yu. Boosting, model selection, lasso and nonnegative garrote. Technical report, Berkeley, 2005.\n\nFigure 3: Left: Average bandwidth output by the rodeo for a function with r = 2 relevant variables in d = 20 dimensions (n = 500, with 50 trials). Covariates are generated as xi ~ Uniform(0, 1), the true function is m(x) = 2(x1 + 1)³ + 2 sin(10 x2), and σ = 1, fit at the test point x = (1/2, . . . , 1/2). The variance is greater for large step sizes since the rodeo runs that long for fewer data sets. Right: Greedy rodeo on the diabetes data, used to illustrate LARS (Efron et al. 2004). A set of k = 100 of the total n = 442 points was sampled (d = 10), and the bandwidth for the variable with largest average |Zj|/λj was reduced in each step. The variables were selected in the order 3 (body mass index), 9 (serum), 7 (serum), 4 (blood pressure), 1 (age), 2 (sex), 8 (serum), 5 (serum), 10 (serum), 6 (serum). The parametric LARS algorithm adds variables in the order 3, 9, 4, 7, 2, 10, 5, 8, 6, 1. One notable difference is in the position of the age variable.\n\nD. Donoho. For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution. Technical report, Stanford, 2004.\n\nB. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32:407-499, 2004.\n\nJ. H. Friedman. 
Multivariate adaptive regression splines. The Annals of Statistics, 19:1-67, 1991.\n\nW. Fu and K. Knight. Asymptotics for lasso type estimators. The Annals of Statistics, 28:1356-1378, 2000.\n\nL. Gyorfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, 2002.\n\nT. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.\n\nM. Hristache, A. Juditsky, J. Polzehl, and V. Spokoiny. Structure adaptive approach for dimension reduction. The Annals of Statistics, 29:1537-1566, 2001.\n\nO. V. Lepski, E. Mammen, and V. G. Spokoiny. Optimal spatial adaptation to inhomogeneous smoothness: An approach based on kernel estimates with variable bandwidth selectors. The Annals of Statistics, 25:929-947, 1997.\n\nL. Li, R. D. Cook, and C. Nachtsheim. Model-free variable selection. Journal of the Royal Statistical Society, Series B, 67:285-299, 2005.\n\nJ. Rice. Bandwidth choice for nonparametric regression. The Annals of Statistics, 12:1215-1230, 1984.\n\nD. Ruppert. Empirical-bias bandwidths for local polynomial nonparametric regression and density estimation. Journal of the American Statistical Association, 92:1049-1062, 1997.\n\nD. Ruppert and M. P. Wand. Multivariate locally weighted least squares regression. The Annals of Statistics, 22:1346-1370, 1994.\n\nR. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.\n\nH. Zhang, G. Wahba, Y. Lin, M. Voelker, M. Ferris, R. Klein, and B. Klein. Variable selection and model building via likelihood basis pursuit. Journal of the American Statistical Association, 99(467):659-672, 2005.\n", "award": [], "sourceid": 2808, "authors": [{"given_name": "Larry", "family_name": "Wasserman", "institution": null}, {"given_name": "John", "family_name": "Lafferty", "institution": null}]}