{"title": "Structural Risk Minimization for Character Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 471, "page_last": 479, "abstract": null, "full_text": "Structural Risk Minimization for Character Recognition \n\nI. Guyon, V. Vapnik, B. Boser, L. Bottou, and S. A. Solla \n\nAT&T Bell Laboratories \nHolmdel, NJ 07733, USA \n\nAbstract \n\nThe method of Structural Risk Minimization refers to tuning the capacity of the classifier to the available amount of training data. This capacity is influenced by several factors, including: (1) properties of the input space, (2) nature and structure of the classifier, and (3) the learning algorithm. Actions based on these three factors are combined here to control the capacity of linear classifiers and improve generalization on the problem of handwritten digit recognition. \n\n1 RISK MINIMIZATION AND CAPACITY \n\n1.1 EMPIRICAL RISK MINIMIZATION \n\nA common way of training a given classifier is to adjust the parameters w in the classification function F(x, w) to minimize the training error Etrain, i.e. the frequency of errors on a set of p training examples. Etrain estimates the expected risk based on the empirical data provided by the p available examples. The method is thus called Empirical Risk Minimization. But the classification function F(x, w*) which minimizes the empirical risk does not necessarily minimize the generalization error, i.e. the expected value of the risk over the full distribution of possible inputs and their corresponding outputs. Such generalization error Egene cannot in general be computed, but it can be estimated on a separate test set (Etest). Other ways of estimating Egene include the leave-one-out or moving control method [Vap82] (for a review, see [Moo92]). 
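A minimal numerical sketch of the empirical risk (illustrative only, not part of the original paper; the toy classifier, data, and numpy usage are assumptions):

```python
import numpy as np

def empirical_risk(F, X, y, w):
    """Etrain: fraction of the p examples misclassified by F(x, w)."""
    predictions = np.array([F(x, w) for x in X])
    return np.mean(predictions != y)

# Toy linear threshold classifier; the constant first component carries the bias.
F = lambda x, w: 1 if w @ x > 0 else 0
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0]])
y = np.array([1, 0, 1, 0])
w = np.array([0.0, 1.0])
E_train = empirical_risk(F, X, y, w)  # 0.0 on this separable toy set
```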
\n\n1.2 CAPACITY AND GUARANTEED RISK \n\nAny family of classification functions {F(x, w)} can be characterized by its capacity. The Vapnik-Chervonenkis dimension (or VC-dimension) [Vap82] is such a capacity, defined as the maximum number h of training examples which can be learnt without error, for all possible binary labelings. The VC-dimension is in some cases simply given by the number of free parameters of the classifier, but in most practical cases it is quite difficult to determine it analytically. \nThe VC-theory provides bounds. Let {F(x, w)} be a set of classification functions of capacity h. With probability (1 - η), for a number of training examples p > h, simultaneously for all classification functions F(x, w), the generalization error Egene is lower than a guaranteed risk defined by: \n\nEguarant = Etrain + ε(p, h, Etrain, η) ,   (1) \n\nwhere ε(p, h, Etrain, η) is proportional to ε0 = [h(ln 2p/h + 1) - ln η]/p for small Etrain, and to √ε0 for Etrain close to one [Vap82,Vap92]. \nFor a fixed number of training examples p, the training error decreases monotonically as the capacity h increases, while both the guaranteed risk and the generalization error go through a minimum. Before the minimum, the problem is overdetermined: the capacity is too small for the amount of training data. Beyond the minimum the problem is underdetermined. The key issue is therefore to match the capacity of the classifier to the amount of training data in order to get the best generalization performance. The method of Structural Risk Minimization (SRM) [Vap82,Vap92] provides a way of achieving this goal. \n\n1.3 STRUCTURAL RISK MINIMIZATION \n\nLet us choose a family of classifiers {F(x, w)}, and define a structure consisting of nested subsets of elements of the family: S1 ⊂ S2 ⊂ ... ⊂ Sr ⊂ .... By defining such a structure, we ensure that the capacity hr of the subset of classifiers Sr is less than the capacity hr+1 of subset Sr+1. 
The method of SRM amounts to finding the subset Sopt for which the classifier F(x, w*) which minimizes the empirical risk within that subset yields the best overall generalization performance. \n\nTwo problems arise in implementing SRM: (I) How to select Sopt? (II) How to find a good structure? Problem (I) arises because we have no direct access to Egene. In our experiments, we will use the minimum of either Etest or Eguarant to select Sopt, and show that these two minima are very close. A good structure reflects the a priori knowledge of the designer, and only few guidelines can be provided by the theory to solve problem (II). The designer must find the best compromise between two competing terms: Etrain and ε. Reducing h causes ε to decrease, but Etrain to increase. A good structure should be such that decreasing the VC-dimension happens at the expense of the smallest possible increase in training error. We now examine several ways in which such a structure can be built. \n\n2 PRINCIPAL COMPONENT ANALYSIS, OPTIMAL BRAIN DAMAGE, AND WEIGHT DECAY \n\nConsider three apparently different methods of improving generalization performance: Principal Component Analysis (a preprocessing transformation of input space) [The89], Optimal Brain Damage (an architectural modification through weight pruning) [LDS90], and a regularization method, Weight Decay (a modification of the learning algorithm) [Vap82]. For the case of a linear classifier, these three approaches are shown here to control the capacity of the learning system through the same underlying mechanism: a reduction of the effective dimension of weight space, based on the curvature properties of the Mean Squared Error (MSE) cost function used for training. 
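The competition between training error and the capacity term of equation (1) can be sketched numerically (illustrative only; the toy decreasing training-error curve, the confidence level η, and the small-Etrain form of the bound are assumptions):

```python
import numpy as np

def eps0(p, h, eta=0.05):
    # Capacity term eps0 = [h (ln(2p/h) + 1) - ln(eta)] / p, small-Etrain regime.
    return (h * (np.log(2 * p / h) + 1) - np.log(eta)) / p

p = 600                              # training examples, as in the experiments
capacities = np.arange(1, 300)       # candidate VC-dimensions h
bound = eps0(p, capacities)          # grows monotonically with h at fixed p
e_train = 0.5 / capacities           # toy monotonically decreasing training error
e_guarant = e_train + bound          # guaranteed risk, equation (1)
h_opt = capacities[np.argmin(e_guarant)]  # interior minimum: the matched capacity
```

The guaranteed risk has an interior minimum in h even though each of its two terms is monotonic, which is the trade-off SRM exploits.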
\n\n2.1 LINEAR CLASSIFIER AND MSE TRAINING \n\nConsider a binary linear classifier F(x, w) = θ0(wT x), where wT is the transpose of w and the function θ0 takes two values, 0 and 1, indicating to which class x belongs. The VC-dimension of such a classifier is equal to the dimension of input space¹ (or the number of weights): h = dim(w) = dim(x) = n. \nThe empirical risk is given by: \n\nEtrain = (1/p) Σk=1..p (yk - θ0(wT xk))2 ,   (2) \n\nwhere xk is the kth example, and yk is the corresponding desired output. The problem of minimizing Etrain as a function of w can be approached in different ways [DH73], but it is often replaced by the problem of minimizing a Mean Square Error (MSE) cost function, which differs from (2) in that the nonlinear function θ0 has been removed. \n\n2.2 CURVATURE PROPERTIES OF THE MSE COST FUNCTION \n\nThe three structures that we investigate rely on curvature properties of the MSE cost function. Consider the dependence of the MSE on one of the parameters wi. Training leads to the optimal value wi* for this parameter. One way of reducing the capacity is to set wi to zero. For the linear classifier, this reduces the VC-dimension by one: h' = dim(w) - 1 = n - 1. The MSE increase resulting from setting wi = 0 is to lowest order proportional to the curvature of the MSE at wi*. Since the decrease in capacity should be achieved at the smallest possible expense in MSE increase, directions in weight space corresponding to small MSE curvature are good candidates for elimination. \nThe curvature of the MSE is specified by the Hessian matrix H of second derivatives of the MSE with respect to the weights. For a linear classifier, the Hessian matrix is given by twice the correlation matrix of the training inputs, H = (2/p) Σk=1..p xk xkT. 
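A minimal numpy sketch of MSE training and the Hessian of Section 2.2 (illustrative only; the synthetic data and labels are assumptions, not the paper's digit database):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 100, 5
X = rng.normal(size=(p, n))
X[:, 0] = 1.0                      # constant first component carries the bias
y = (X[:, 1] > 0).astype(float)    # toy binary targets

# Minimize MSE = (1/p) sum_k (y^k - w.x^k)^2 via least squares
# (the nonlinearity theta_0 is dropped, as in the MSE cost function).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Hessian of the MSE: twice the correlation matrix of the training inputs.
H = (2.0 / p) * X.T @ X
```

Because H is symmetric it can be diagonalized, which is what the curvature-based structures below rely on.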
\n¹ We assume, for simplicity, that the first component of vector x is constant and set to 1, so that the corresponding weight introduces the bias value. \n\nThe Hessian matrix is symmetric, and can be diagonalized to get rid of cross terms, to facilitate decisions about the simultaneous elimination of several directions in weight space. The elements of the Hessian matrix after diagonalization are the eigenvalues λi; the corresponding eigenvectors give the principal directions wi' of the MSE. In the rotated axes, the increase ΔMSE due to setting wi' = 0 takes a simple form: \n\nΔMSEi = (1/2) λi (wi'*)2 .   (3) \n\nThe quadratic approximation becomes an exact equality for the linear classifier. Principal directions wi' corresponding to small eigenvalues λi of H are good candidates for elimination. \n\n2.3 PRINCIPAL COMPONENT ANALYSIS \n\nOne common way of reducing the capacity of a classifier is to reduce the dimension of the input space and thereby reduce the number of necessary free parameters (or weights). Principal Component Analysis (PCA) is a feature extraction method based on eigenvalue analysis. Input vectors x of dimension n are approximated by a linear combination of m ≤ n vectors forming an orthonormal basis. The coefficients of this linear combination form a vector x' of dimension m. The optimal basis in the least squares sense is given by the m eigenvectors corresponding to the m largest eigenvalues of the correlation matrix of the training inputs (this matrix is 1/2 of H). A structure is obtained by ranking the classifiers according to m. The VC-dimension of the classifier is reduced to: h' = dim(x') = m. 
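The PCA structure can be sketched as follows (illustrative only; the anisotropic toy inputs and the choice m = 3 are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, m = 200, 8, 3
X = rng.normal(size=(p, n)) * np.arange(1, n + 1)  # anisotropic toy inputs

# Correlation matrix of the training inputs (equal to H/2 for the linear classifier).
C = (X.T @ X) / p
eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
basis = eigvecs[:, -m:]                  # m eigenvectors with the largest eigenvalues
X_reduced = X @ basis                    # coefficients x' of dimension m
# A linear classifier trained on x' has VC-dimension h' = m.
```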
\n\n2.4 OPTIMAL BRAIN DAMAGE \n\nFor a linear classifier, pruning can be implemented in two different but equivalent ways: (i) change input coordinates to a principal axis representation, prune the components corresponding to small eigenvalues according to PCA, and then train with the MSE cost function; (ii) change coordinates to a principal axis representation, train with MSE first, and then prune the weights, to get a weight vector w' of dimension m < n. Procedure (i) can be understood as a preprocessing, whereas procedure (ii) involves an a posteriori modification of the structure of the classifier (network architecture). The two procedures become identical if the weight elimination in (ii) is based on a 'smallest eigenvalue' criterion. \n\nProcedure (ii) is very reminiscent of Optimal Brain Damage (OBD), a weight pruning procedure applied after training. In OBD, the best candidates for pruning are those weights which minimize the increase ΔMSE defined in equation (3). The m weights that are kept do not necessarily correspond to the largest m eigenvalues, due to the extra factor of (wi'*)2 in equation (3). In either implementation, the VC-dimension is reduced to h' = dim(w') = dim(x') = m. \n\n2.5 WEIGHT DECAY \n\nCapacity can also be controlled through an additional term in the cost function, to be minimized simultaneously with the MSE. Linear classifiers can be ranked according to the norm ||w||2 = Σj=1..n wj2 of the weight vector. A structure is constructed by allowing within the subset Sr only those classifiers which satisfy ||w||2 < Cr. The positive bounds Cr form an increasing sequence: C1 < C2 < ... < Cr < ... This sequence can be matched with a monotonically decreasing sequence of positive Lagrange multipliers γ1 ≥ γ2 ≥ ... ≥ γr ≥ ...
, such that our training problem stated as the minimization of MSE within a specific subset Sr is implemented through the minimization of a new cost function: MSE + γr||w||2. This is equivalent to the Weight Decay procedure (WD). In a mechanical analogy, the term γr||w||2 is like the energy of a spring of tension γr which pulls the weights to zero. As it is easier to pull in the directions of small curvature of the MSE, WD pulls the weights to zero predominantly along the principal directions of the Hessian matrix H associated with small eigenvalues. \n\nIn the principal axis representation, the minimum wγ of the cost function MSE + γ||w||2 is a simple function of the minimum w0 of the MSE in the γ → 0+ limit: wiγ = wi0 λi/(λi + γ). The weight wi0 is attenuated by a factor λi/(λi + γ). Weights become negligible for γ ≫ λi, and remain unchanged for γ ≪ λi. The effect of this attenuation can be compared to that of weight pruning. Pruning all weights such that λi < γ reduces the capacity to: \n\nh' = Σi=1..n θγ(λi) ,   (4) \n\nwhere θγ(λ) = 1 if λ > γ and θγ(λ) = 0 otherwise. \nBy analogy, we introduce the Weight Decay capacity: \n\nh' = Σi=1..n λi/(λi + γ) .   (5) \n\nThis expression arises in various theoretical frameworks [Moo92,McK92], and is valid only for broad spectra of eigenvalues. \n\n3 SMOOTHING, HIGHER-ORDER UNITS, AND REGULARIZATION \n\nCombining several different structures achieves further performance improvements. The combination of exponential smoothing (a preprocessing transformation of input space) and regularization (a modification of the learning algorithm) is shown here to improve character recognition. The generalization ability is dramatically improved by the further introduction of second-order units (an architectural modification). 
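The hard-pruning capacity of equation (4) and the Weight Decay capacity of equation (5) can be compared numerically (illustrative only; the eigenvalue spectrum and the value of γ are assumptions):

```python
import numpy as np

eigvals = np.array([10.0, 5.0, 1.0, 0.1, 0.01])  # toy Hessian spectrum
gamma = 0.5

# Equation (4): hard count of directions with lambda_i > gamma.
h_pruned = int(np.sum(eigvals > gamma))
# Equation (5): soft count, each direction contributes lambda_i / (lambda_i + gamma).
h_wd = float(np.sum(eigvals / (eigvals + gamma)))
```

The soft count is always below the hard count for γ > 0, since every attenuation factor λi/(λi + γ) is strictly less than one.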
\n\n3.1 SMOOTHING \n\nSmoothing is a preprocessing which aims at reducing the effective dimension of input space by degrading the resolution: after smoothing, decimation of the inputs could be performed without further image degradation. Smoothing is achieved here through convolution with an exponential kernel: \n\nBLURRED_PIXEL(i,j) = [Σk Σl PIXEL(i+k, j+l) exp(-(1/β)√(k2+l2))] / [Σk Σl exp(-(1/β)√(k2+l2))] , \n\nwhere β is the smoothing parameter which determines the structure. \nConvolution with the chosen kernel is an invertible linear operation. Such preprocessing results in no capacity change for a MSE-trained linear classifier. Smoothing only modifies the spectrum of eigenvalues and must be combined with an eigenvalue-based regularization procedure such as OBD or WD to obtain performance improvement through capacity decrease. \n\n3.2 HIGHER-ORDER UNITS \n\nHigher-order (or sigma-pi) units can be substituted for the linear units to get polynomial classifiers: F(x, w) = θ0(wT ξ(x)), where ξ(x) is an m-dimensional vector (m > n) with components: x1, x2, ..., xn, (x1x1), (x1x2), ..., (xnxn), ..., (x1x2...xn). The structure is geared towards increasing the capacity, and is controlled by the order of the polynomial: S1 contains all the linear terms, S2 linear plus quadratic, etc. Computations are kept tractable with the method proposed in reference [Pog75]. \n\n4 EXPERIMENTAL RESULTS \n\nExperiments were performed on the benchmark problem of handwritten digit recognition described in reference [GPP+89]. The database consists of 1200 (16 x 16) binary pixel images, divided into 600 training examples and 600 test examples. \n\nIn figure 1, we compare the results obtained by pruning inputs or weights with PCA and the results obtained with WD. 
The overall appearance of the curves is very similar. In both cases, the capacity (computed from (4) and (5)) decreases as a function of γ, whereas the training error increases. For the optimum value γ*, the capacity is only 1/3 of the nominal capacity, computed solely on the basis of the network architecture. At the price of some error on the training set, the error rate on the test set is only half the error rate obtained with γ → 0+. \nThe competition between capacity and training error always results in a unique minimum of the guaranteed risk (1). It is remarkable that our experiments show the minimum of Eguarant coinciding with the minimum of Etest. Either of these two quantities can therefore be used to determine γ*. In principle, another independent test set should be used to get a reliable estimate of Egene (cross-validation). It seems therefore advantageous to determine γ* using the minimum of Eguarant and use the test set to predict the generalization performance. \nUsing Eguarant to determine γ* raises the problem of determining the capacity of the system. The capacity can be measured when analytic computation is not possible. Measurements performed with the method proposed by Vapnik, Levin, and Le Cun yield results in good agreement with those obtained using (5). The method yields an effective VC-dimension which accounts for the global capacity of the system, including the effects of input data, architecture, and learning algorithm². \n\n² Schematically, measurements of the effective VC-dimension consist of splitting the training data into two subsets. The difference between Etrain in these subsets is maximized. The value of h is extracted from the fit to a theoretical prediction for such maximal discrepancy. 
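The exponential smoothing of Section 3.1 can be sketched as a direct convolution (illustrative only; the exponent -(1/β)√(k²+l²) is a reading of the garbled original formula, and the truncation of the kernel to a finite radius is an assumption made for the example):

```python
import numpy as np

def blur(image, beta, radius=3):
    """Convolve with the normalized kernel exp(-sqrt(k^2 + l^2) / beta)."""
    ks = np.arange(-radius, radius + 1)
    K = np.exp(-np.sqrt(ks[:, None] ** 2 + ks[None, :] ** 2) / beta)
    K /= K.sum()                      # normalization, as in the formula of Section 3.1
    h, w = image.shape
    padded = np.pad(image, radius, mode="constant")
    out = np.zeros_like(image, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(K * padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1])
    return out

img = np.zeros((16, 16))              # 16 x 16, as in the digit database
img[8, 8] = 1.0                       # single-pixel impulse
blurred = blur(img, beta=2.0)         # the impulse is spread over its neighborhood
```

Larger β spreads each pixel further, which is why β = 10 in Table 1 corresponds to considerable blurring.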
\n\nFigure 1: Percent error and capacity h' as a function of log γ (linear classifier, no smoothing): (a) weight/input pruning via PCA (γ is a threshold), (b) WD (γ is the decay parameter). The guaranteed risk has been rescaled to fit in the figure. \n\nTable 1: Etest for Smoothing, WD, and Higher-Order Combined. \n\nβ     γ     1st order   2nd order \n0     γ*    6.3         1.5 \n1     γ*    5.0         0.8 \n2     γ*    4.5         1.2 \n10    γ*    4.3         1.3 \nany   0+    12.7        3.3 \n\nIn table 1 we report results obtained when several structures are combined. Weight decay with γ = γ* reduces Etest by a factor of 2. Input space smoothing used in conjunction with WD results in an additional reduction by a factor of 1.5. The best performance is achieved for the highest level of smoothing, β = 10, for which the blurring is considerable. As expected, smoothing has no effect in the absence of WD. \nThe use of second-order units provides an additional factor of 5 reduction in Etest. For second-order units, the number of weights scales like the square of the number of inputs: n2 = 66049. But the capacity (5) is found to be only 196, for the optimum values of γ and β. 
\n\n5 CONCLUSIONS AND EPILOGUE \n\nOur results indicate that the VC-dimension must measure the global capacity of the system. It is crucial to incorporate the effects of preprocessing of the input data and modifications of the learning algorithm. Capacities defined solely on the basis of the network architecture give overly pessimistic upper bounds. \n\nThe method of SRM provides a powerful tool for tuning the capacity. We have shown that structures acting at different levels (preprocessing, architecture, learning mechanism) can produce similar effects. We have then combined three different structures to improve generalization. These structures have interesting complementary properties. The introduction of higher-order units increases the capacity. Smoothing and weight decay act in conjunction to decrease it. \n\nElaborate neural networks for character recognition [LBD+90,GAL+91] also incorporate similar complementary structures. In multilayer sigmoid-unit networks, the capacity is increased through additional hidden units. Feature extracting neurons introduce smoothing, and regularization follows from prematurely stopping training before reaching the MSE minimum. When initial weights are chosen to be small, this stopping technique produces effects similar to those of weight decay. \n\nAcknowledgments \n\nWe wish to thank L. Jackel's group at Bell Labs for useful discussions, and are particularly grateful to E. Levin and Y. Le Cun for communicating to us the unpublished method of computing the effective VC-dimension. \n\nReferences \n\n[DH73] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, 1973. \n\n[GAL+91] I. Guyon, P. Albrecht, Y. Le Cun, J. Denker, and W. Hubbard. Design of a neural network character recognizer for a touch terminal. Pattern Recognition, 24(2), 1991. 
\n\n[GPP+89] I. Guyon, I. Poujaud, L. Personnaz, G. Dreyfus, J. Denker, and Y. Le \nCun. Comparing different neural network architectures for classifying \nhandwritten digits. In Proceedings of the International Joint Conference \non Neural Networks, volume II, pages 127-132. IEEE, 1989. \n\n[LBD+90] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hub(cid:173)\nbard, and L. D. Jackel. Back-propagation applied to handwritten zip(cid:173)\ncode recognition. Neural Computation, 1(4), 1990. \n\n[LDS90] Y. Le Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In D. S. \nTouretzky, editor, Advances in Neural Information Processing Systems \n2 (NIPS 89), pages 598-605. Morgan Kaufmann, 1990. \n\n[McK92] D. McKay. A practical bayesian framework for backprop networks. In \n\n[Mo092] \n\n[Pog75] \n\n[The89] \n\n[Vap82] \n\n[Vap92] \n\nthis volume, 1992. \nJ. Moody. Generalization, weight decay and architecture selection for \nnon-linear learning systems. In this volume, 1992. \nT. Poggio. On optimal nonlinear associative recall. Bioi. Cybern., \n(9)201, 1975. \nC. W . Therrien. Decision, Estimation and Classification: An Introduc(cid:173)\ntion to Pattem Recognition and Related Topics. Wiley, 1989. \nV. Vapnik. Estimation of Dependences Based on Empirical Data. \nSpringer-Verlag, 1982. \nV Vapnik. Principles of risk minimization for learning theory. In this \nvolume, 1992. \n\n\f", "award": [], "sourceid": 512, "authors": [{"given_name": "I.", "family_name": "Guyon", "institution": null}, {"given_name": "V.", "family_name": "Vapnik", "institution": null}, {"given_name": "B.", "family_name": "Boser", "institution": null}, {"given_name": "L.", "family_name": "Bottou", "institution": null}, {"given_name": "S. A.", "family_name": "Solla", "institution": null}]}