{"title": "Principles of Risk Minimization for Learning Theory", "book": "Advances in Neural Information Processing Systems", "page_first": 831, "page_last": 838, "abstract": null, "full_text": "Principles of Risk Minimization \n\nfor Learning Theory \n\nV. Vapnik \n\nAT &T Bell Laboratories \nHolmdel, NJ 07733, USA \n\nAbstract \n\nLearning is posed as a problem of function estimation, for which two princi(cid:173)\nples of solution are considered: empirical risk minimization and structural \nrisk minimization. These two principles are applied to two different state(cid:173)\nments of the function estimation problem: global and local. Systematic \nimprovements in prediction power are illustrated in application to zip-code \nrecognition. \n\n1 \n\nINTRODUCTION \n\nThe structure of the theory of learning differs from that of most other theories for \napplied problems. The search for a solution to an applied problem usually requires \nthe three following steps: \n\n1. State the problem in mathematical terms. \n2. Formulate a general principle to look for a solution to the problem. \n3. Develop an algorithm based on such general principle. \n\nThe first two steps of this procedure offer in general no major difficulties; the \nthird step requires most efforts, in developing computational algorithms to solve \nthe problem at hand. \n\nIn the case of learning theory, however, many algorithms have been developed, but \nwe still lack a clear understanding of the mathematical statement needed to describe \nthe learning procedure, and of the general principle on which the search for solutions \n831 \n\n\f832 \n\nVapnik \n\nshould be based. This paper is devoted to these first two steps, the statement of \nthe problem and the general principle of solution. \n\nThe paper is organized as follows. 
First, the problem of function estimation is stated, and two principles of solution are discussed: the principle of empirical risk minimization and the principle of structural risk minimization. A new statement is then given: that of local estimation of a function, to which the same principles are applied. An application to zip-code recognition is used to illustrate these ideas. \n\n2 FUNCTION ESTIMATION MODEL \n\nThe learning process is described through three components: \n\n1. A generator of random vectors x, drawn independently from a fixed but unknown distribution P(x). \n2. A supervisor which returns an output vector y for every input vector x, according to a conditional distribution function P(y|x), also fixed but unknown. \n3. A learning machine capable of implementing a set of functions f(x, w), w ∈ W. \n\nThe problem of learning is that of choosing from the given set of functions the one which best approximates the supervisor's response. The selection is based on a training set of ℓ independent observations: \n\n(x_1, y_1), ..., (x_ℓ, y_ℓ). (1) \n\nThe formulation given above implies that learning corresponds to the problem of function approximation. \n\n3 PROBLEM OF RISK MINIMIZATION \n\nIn order to choose the best available approximation to the supervisor's response, we measure the loss or discrepancy L(y, f(x, w)) between the response y of the supervisor to a given input x and the response f(x, w) provided by the learning machine. Consider the expected value of the loss, given by the risk functional \n\nR(w) = ∫ L(y, f(x, w)) dP(x, y). (2) \n\nThe goal is to minimize the risk functional R(w) over the class of functions f(x, w), w ∈ W. But the joint probability distribution P(x, y) = P(y|x)P(x) is unknown, and the only available information is contained in the training set (1). 
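To make the risk functional (2) concrete, here is a small numerical sketch (Python with NumPy; the toy distribution, the squared loss, and the one-parameter linear class f(x, w) = wx are illustrative assumptions, not taken from the paper). When P(x, y) is known, the expectation in (2) can be approximated by averaging the loss over a large sample drawn from it:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Toy generator and supervisor (illustrative, not from the paper):
    # x ~ Uniform[0, 1], y = 2x + gaussian noise.
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 0.1, n)
    return x, y

def f(x, w):
    # A one-parameter learning machine: the linear functions f(x, w) = w x.
    return w * x

def risk(w, n=100_000):
    # Monte Carlo estimate of the risk functional (2) with squared loss:
    # R(w) = E[(y - f(x, w))^2], approximated by an average over n draws.
    x, y = sample(n)
    return np.mean((y - f(x, w)) ** 2)
```

In the learning setting P(x, y) is of course not available, which is exactly why an induction principle working from the training set (1) is needed.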
\n\n4 EMPIRICAL RISK MINIMIZATION \n\nIn order to solve this problem, the following induction principle is proposed: the risk functional R(w) is replaced by the empirical risk functional \n\nE(w) = (1/ℓ) Σ_{i=1}^{ℓ} L(y_i, f(x_i, w)) (3) \n\nconstructed on the basis of the training set (1). The induction principle of empirical risk minimization (ERM) assumes that the function f(x, w*), which minimizes E(w) over the set w ∈ W, results in a risk R(w*) which is close to its minimum. \n\nThis induction principle is quite general; many classical methods, such as least squares or maximum likelihood, are realizations of the ERM principle. \n\nThe evaluation of the soundness of the ERM principle requires answers to the following two questions: \n1. Is the principle consistent? (Does R(w*) converge to its minimum value on the set w ∈ W when ℓ → ∞?) \n2. How fast is the convergence as ℓ increases? \n\nThe answers to these two questions have been shown (Vapnik et al., 1989) to be equivalent to the answers to the following two questions: \n1. Does the empirical risk E(w) converge uniformly to the actual risk R(w) over the full set f(x, w), w ∈ W? Uniform convergence is defined as \n\nProb{ sup_{w ∈ W} |R(w) - E(w)| > ε } → 0 as ℓ → ∞. (4) \n\n2. What is the rate of convergence? \n\nIt is important to stress that uniform convergence (4) for the full set of functions is a necessary and sufficient condition for the consistency of the ERM principle. \n\n5 VC-DIMENSION OF THE SET OF FUNCTIONS \n\nThe theory of uniform convergence of empirical risk to actual risk, developed in the 70's and 80's, includes a description of necessary and sufficient conditions as well as bounds for the rate of convergence (Vapnik, 1982). 
These bounds, which are independent of the distribution function P(x, y), are based on a quantitative measure of the capacity of the set of functions implemented by the learning machine: the VC-dimension of the set. \n\nFor simplicity, these bounds will be discussed here only for the case of binary pattern recognition, for which y ∈ {0, 1} and f(x, w), w ∈ W is a class of indicator functions. The loss function takes only two values: L(y, f(x, w)) = 0 if y = f(x, w), and L(y, f(x, w)) = 1 otherwise. In this case, the risk functional (2) is the probability of error, denoted by P(w). The empirical risk functional (3), denoted by v(w), is the frequency of error in the training set. \n\nThe VC-dimension of a set of indicator functions is the maximum number h of vectors which can be shattered in all possible 2^h ways using functions in the set. For instance, h = n + 1 for linear decision rules in n-dimensional space, since they can shatter at most n + 1 points. \n\n6 RATES OF UNIFORM CONVERGENCE \n\nThe notion of VC-dimension provides a bound on the rate of uniform convergence. For a set of indicator functions with VC-dimension h, the following inequality holds: \n\nProb{ sup_{w ∈ W} |P(w) - v(w)| > ε } < (2ℓe/h)^h exp{-ε²ℓ}. (5) \n\nIt then follows that with probability 1 - η, simultaneously for all w ∈ W, \n\nP(w) < v(w) + C0(ℓ/h, η), (6) \n\nwith confidence interval \n\nC0(ℓ/h, η) = √( (h(ln 2ℓ/h + 1) - ln η) / ℓ ). (7) \n\nThis important result provides a bound on the actual risk P(w) for all w ∈ W, including the w* which minimizes the empirical risk v(w). \n\nThe deviation |P(w) - v(w)| in (5) is expected to be largest for P(w) close to 1/2, since it is this value of P(w) which maximizes the error variance σ(w) = √(P(w)(1 - P(w))). The worst-case bound for the confidence interval (7) is thus likely to be controlled by the worst decision rule. 
The bound (6) is achieved for the worst case P(w) = 1/2, but not for small P(w), which is the case of interest. A uniformly good approximation to P(w) follows from considering \n\nProb{ sup_{w ∈ W} (P(w) - v(w)) / σ(w) > ε }. (8) \n\nThe variance of the relative deviation (P(w) - v(w))/σ(w) is now independent of w. A bound for the probability (8), if available, would yield a uniformly good bound on the actual risk for all P(w). \n\nSuch a bound has not yet been established. But for P(w) << 1, the approximation σ(w) ≈ √P(w) holds, and the following inequality can be used: \n\nProb{ sup_{w ∈ W} (P(w) - v(w)) / √P(w) > ε } < (2ℓe/h)^h exp{-ε²ℓ/4}. (9) \n\nIt then follows that with probability 1 - η, simultaneously for all w ∈ W, \n\nP(w) < v(w) + C1(ℓ/h, v(w), η), (10) \n\nwith confidence interval \n\nC1(ℓ/h, v(w), η) = 2 ((h(ln 2ℓ/h + 1) - ln η) / ℓ) (1 + √( 1 + v(w)ℓ / (h(ln 2ℓ/h + 1) - ln η) )). (11) \n\nNote that the confidence interval now depends on v(w), and that for v(w) = 0 it reduces to \n\nC1(ℓ/h, 0, η) = 4 C0²(ℓ/h, η), \n\nwhich provides a more precise bound for real-case learning. \n\n7 STRUCTURAL RISK MINIMIZATION \n\nThe method of ERM can be theoretically justified by considering the inequalities (6) or (10). When ℓ/h is large, the confidence intervals C0 or C1 become small, and can be neglected. The actual risk is then bounded by the empirical risk alone, and the probability of error on the test set can be expected to be small when the frequency of error in the training set is small. \n\nHowever, if ℓ/h is small, the confidence interval cannot be neglected, and even v(w) = 0 does not guarantee a small probability of error. In this case the minimization of P(w) requires a new principle, based on the simultaneous minimization of v(w) and the confidence interval. 
It is then necessary to control the VC-dimension of the learning machine. \n\nTo do this, we introduce a nested structure of subsets S_p = {f(x, w), w ∈ W_p}, such that \n\nS_1 ⊂ S_2 ⊂ ... ⊂ S_n. \n\nThe corresponding VC-dimensions of the subsets satisfy \n\nh_1 < h_2 < ... < h_n. \n\nThe principle of structural risk minimization (SRM) requires a two-step process: the empirical risk has to be minimized for each element of the structure. The optimal element S* is then selected to minimize the guaranteed risk, defined as the sum of the empirical risk and the confidence interval. This process involves a trade-off: as h increases, the minimum empirical risk decreases, but the confidence interval increases. \n\n8 EXAMPLES OF STRUCTURES FOR NEURAL NETS \n\nThe general principle of SRM can be implemented in many different ways. Here we consider three different examples of structures built for the set of functions implemented by a neural network. \n\n1. Structure given by the architecture of the neural network. Consider an ensemble of fully connected neural networks in which the number of units in one of the hidden layers is monotonically increased. The sets of implementable functions define a structure as the number of hidden units is increased. \n\n2. Structure given by the learning procedure. Consider the set of functions S = {f(x, w), w ∈ W} implementable by a neural net of fixed architecture. The parameters {w} are the weights of the neural network. A structure is introduced through S_p = {f(x, w), ||w|| ≤ C_p} and C_1 < C_2 < ... < C_n. For a convex loss function, the minimization of the empirical risk within the element S_p of the structure is achieved through the minimization of \n\nE(w, γ_p) = (1/ℓ) Σ_{i=1}^{ℓ} L(y_i, f(x_i, w)) + γ_p ||w||², \n\nwith appropriately chosen Lagrange multipliers γ_1 > γ_2 > ... > γ_n. The well-known \"weight decay\" procedure refers to the minimization of this functional. \n\n3. Structure given by preprocessing. 
Consider a neural net with fixed architecture. The input representation is modified by a transformation z = K(x, β), where the parameter β controls the degree of degeneracy introduced by this transformation (for instance, β could be the width of a smoothing kernel). \n\nA structure is introduced in the set of functions S = {f(K(x, β), w), w ∈ W} through β ≥ C_p, with C_1 > C_2 > ... > C_n. \n\n9 PROBLEM OF LOCAL FUNCTION ESTIMATION \n\nThe problem of learning has been formulated as the problem of selecting from the class of functions f(x, w), w ∈ W the one which provides the best available approximation to the response of the supervisor. Such a statement of the learning problem implies that a unique function f(x, w*) will be used for prediction over the full input space X. This is not necessarily a good strategy: the set f(x, w), w ∈ W might not contain a good predictor for the full input space, but might contain functions capable of good prediction on specified regions of input space. \n\nIn order to formulate the learning problem as a problem of local function approximation, consider a kernel K(x - x_0, b) ≥ 0 which selects a region of input space of width b, centered at x_0. For example, consider the rectangular kernel \n\nK_r(x - x_0, b) = 1 if |x - x_0| < b, and 0 otherwise, \n\nand a more general continuous kernel, such as the gaussian \n\nK_g(x - x_0, b) = exp{-(x - x_0)²/b²}. \n\nThe goal is to minimize the local risk functional \n\nR(w, b, x_0) = ∫ L(y, f(x, w)) (K(x - x_0, b) / K(x_0, b)) dP(x, y). (12) \n\nThe normalization is defined by \n\nK(x_0, b) = ∫ K(x - x_0, b) dP(x). (13) \n\nThe local risk functional (12) is to be minimized over the class of functions f(x, w), w ∈ W and over all possible neighborhoods b ∈ (0, ∞) centered at x_0. 
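The two kernels above are straightforward to write down in code. The following sketch (Python with NumPy; the function names are illustrative) also estimates the normalization (13) empirically, replacing the integral over P(x) with an average over a sample of inputs drawn from it:

```python
import numpy as np

def K_rect(x, x0, b):
    # Rectangular kernel: 1 inside the window |x - x0| < b, 0 outside.
    return (np.abs(x - x0) < b).astype(float)

def K_gauss(x, x0, b):
    # Gaussian kernel: smooth weight exp{-(x - x0)^2 / b^2}.
    return np.exp(-((x - x0) ** 2) / b ** 2)

def K_norm(kernel, x_sample, x0, b):
    # Empirical version of the normalization (13): the integral of
    # K(x - x0, b) over dP(x) becomes a sample average over the inputs.
    return np.mean(kernel(x_sample, x0, b))
```

In practice the sample in K_norm would be the training inputs, since P(x) itself is unknown.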
\nAs before, the joint probability distribution P(x, y) is unknown, and the only available information is contained in the training set (1). \n\n10 EMPIRICAL RISK MINIMIZATION FOR LOCAL ESTIMATION \n\nIn order to solve this problem, the following induction principle is proposed: for fixed b, the local risk functional (12) is replaced by the empirical risk functional \n\nE(w, b, x_0) = (1/ℓ) Σ_{i=1}^{ℓ} L(y_i, f(x_i, w)) (K(x_i - x_0, b) / K(x_0, b)), (14) \n\nconstructed on the basis of the training set. The empirical risk functional (14) is to be minimized over w ∈ W. In the simplest case, the class of functions is that of constant functions, f(x, w) = C(w). Consider the following examples: \n\n1. K-Nearest Neighbors Method: For the case of binary pattern recognition, the class of constant indicator functions contains only two functions: either f(x, w) = 0 for all x, or f(x, w) = 1 for all x. The minimization of the empirical risk functional (14) with the rectangular kernel K_r(x - x_0, b) leads to the K-nearest neighbors algorithm. \n\n2. Watson-Nadaraya Method: For the case y ∈ R, the class of constant functions contains an infinite number of elements, f(x, w) = C(w), C(w) ∈ R. The minimization of the empirical risk functional (14) for a general kernel and a quadratic loss function L(y, f(x, w)) = (y - f(x, w))² leads to the estimator \n\nf(x_0) = Σ_{i=1}^{ℓ} y_i K(x_i - x_0, b) / Σ_{j=1}^{ℓ} K(x_j - x_0, b), \n\nwhich defines the Watson-Nadaraya algorithm. \n\nThese classical methods minimize (14) with a fixed b over the class of constant functions. The supervisor's response in the vicinity of x_0 is thus approximated by a constant, and the characteristic size b of the neighborhood is kept fixed, independent of x_0. 
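The Watson-Nadaraya estimator above amounts to a kernel-weighted average of the training outputs. A minimal sketch (Python with NumPy; the choice of the gaussian kernel is one of the options named in the text, everything else is illustrative):

```python
import numpy as np

def watson_nadaraya(x_train, y_train, x0, b):
    # Kernel-weighted average of the outputs y_i around x0:
    #   f(x0) = sum_i y_i K(x_i - x0, b) / sum_j K(x_j - x0, b),
    # here with the gaussian kernel K_g(x - x0, b) = exp{-(x - x0)^2 / b^2}.
    weights = np.exp(-((x_train - x0) ** 2) / b ** 2)
    return np.sum(weights * y_train) / np.sum(weights)
```

With the rectangular kernel instead, the same formula reduces to a plain average of the y_i falling inside the window around x_0, in line with the nearest-neighbors remark above.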
\nA truly local algorithm would adjust the parameter b to the characteristics of the region of input space centered at x_0. Further improvement is possible by allowing for a richer class of predictor functions f(x, w) within the selected neighborhood. The SRM principle for local estimation provides a tool for incorporating these two features. \n\n11 STRUCTURAL RISK MINIMIZATION FOR LOCAL ESTIMATION \n\nThe arguments that lead to the inequality (6) for the risk functional (2) can be extended to the local risk functional (12), to obtain the following result: with probability 1 - η, simultaneously for all w ∈ W and all b ∈ (0, ∞), \n\nR(w, b, x_0) < E(w, b, x_0) + C2(ℓ/h, b, η). (15) \n\nThe confidence interval C2(ℓ/h, b, η) reduces to C0(ℓ/h, η) in the b → ∞ limit. \n\nAs before, a nested structure is introduced in the class of functions, and the empirical risk (14) is minimized with respect to both w ∈ W and b ∈ (0, ∞) for each element of the structure. The optimal element is then selected to minimize the guaranteed risk, defined as the sum of the empirical risk and the confidence interval. For fixed b this process involves the already discussed trade-off: as h increases, the empirical risk decreases but the confidence interval increases. A new trade-off appears by varying b at fixed h: as b increases, the empirical risk increases, but the confidence interval decreases. The use of b as an additional free parameter allows us to find deeper minima of the guaranteed risk. \n\n12 APPLICATION TO ZIP-CODE RECOGNITION \n\nWe now discuss results for the recognition of the handwritten and printed digits in the US Postal database, containing 9709 training examples and 2007 testing examples. Human recognition of this task results in approximately 2.5% prediction error (Sackinger et al., 1991). 
\nThe learning machine considered here is a five-layer neural network with shared weights and limited receptive fields. When trained with a back-propagation algorithm for the minimization of the empirical risk, the network achieves 5.1% prediction error (Le Cun et al., 1990). \n\nFurther performance improvement with the same network architecture has required the introduction of a new induction principle. Methods based on SRM have achieved prediction errors of 4.1% (training based on a double-back-propagation algorithm which incorporates a special form of weight decay (Drucker, 1991)) and 3.95% (using a smoothing transformation in input space (Simard, 1991)). \n\nThe best result achieved so far, of 3.3% prediction error, is based on the use of SRM for local estimation of the predictor function (Bottou, 1991). \n\nIt is obvious from these results that dramatic gains cannot be achieved through minor algorithmic modifications, but require the introduction of new principles. \n\nAcknowledgements \n\nI thank the members of the Neural Networks research group at Bell Labs, Holmdel, for supportive and useful discussions. Sara Solla, Leon Bottou, and Larry Jackel provided invaluable help in rendering my presentation more clear and accessible to the neural networks community. \n\nReferences \n\nV. N. Vapnik (1982), Estimation of Dependences Based on Empirical Data, Springer-Verlag (New York). \n\nV. N. Vapnik and A. Ya. Chervonenkis (1989), 'Necessary and sufficient conditions for consistency of the method of empirical risk minimization' [in Russian], Yearbook of the Academy of Sciences of the USSR on Recognition, Classification, and Forecasting, 2, 217-249, Nauka (Moscow) (English translation in preparation). \n\nE. Sackinger and J. Bromley (1991), private communication. \n\nY. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. 
Jackel (1990), 'Handwritten digit recognition with a back-propagation network', Neural Information Processing Systems 2, 396-404, ed. by D. S. Touretzky, Morgan Kaufmann (California). \n\nH. Drucker (1991), private communication. \n\nP. Simard (1991), private communication. \n\nL. Bottou (1991), private communication. \n", "award": [], "sourceid": 506, "authors": [{"given_name": "V.", "family_name": "Vapnik", "institution": null}]}