{"title": "Tangent Prop - A formalism for specifying selected invariances in an adaptive network", "book": "Advances in Neural Information Processing Systems", "page_first": 895, "page_last": 903, "abstract": null, "full_text": "Tangent Prop - A formalism for specifying \nselected invariances in an adaptive network \n\nPatrice Simard \n\nAT&T Bell Laboratories \n101 Crawford Corner Rd \n\nHolmdel, NJ 07733 \n\nBernard Victorri \nUniversite de Caen \nCaen 14032 Cedex \n\nFrance \n\nYann Le Cun \n\nAT&T Bell Laboratories \n101 Crawford Corner Rd \n\nHolmdel, NJ 07733 \n\nJohn Denker \n\nAT&T Bell Laboratories \n101 Crawford Corner Rd \n\nHolmdel, NJ 07733 \n\nAbstract \n\nIn many machine learning applications, one has access, not only to training \ndata, but also to some high-level a priori knowledge about the desired be(cid:173)\nhavior of the system. For example, it is known in advance that the output \nof a character recognizer should be invariant with respect to small spa(cid:173)\ntial distortions of the input images (translations, rotations, scale changes, \netcetera). \nWe have implemented a scheme that allows a network to learn the deriva(cid:173)\ntive of its outputs with respect to distortion operators of our choosing. \nThis not only reduces the learning time and the amount of training data, \nbut also provides a powerful language for specifying what generalizations \nwe wish the network to perform. \n\n1 \n\nINTRODUCTION \n\nIn machine learning, one very often knows more about the function to be learned \nthan just the training data. An interesting case is when certain directional deriva(cid:173)\ntives of the desired function are known at certain points. For example, an image \n895 \n\n\f896 \n\nSimard, Victorri, Le Cun, and Denker \n\nFigure 1: Top: Small rotations of an original digital image of the digit \"3\" (center). \nMiddle: Representation of the effect of the rotation in the input vector space space \n(assuming there are only 3 pixels). 
Bottom: Images obtained by moving along the tangent to the transformation curve for the same original digital image (middle). \n\nrecognition system might need to be invariant with respect to small distortions of the input image such as translations, rotations, scalings, etc.; a speech recognition system might need to be invariant to time distortions or pitch shifts. In other words, the derivative of the system's output should be equal to zero when the input is transformed in certain ways. \n\nGiven a large amount of training data and unlimited training time, the system could learn these invariances from the data alone, but this is often infeasible. The limitation on data can be overcome by training the system with additional data obtained by distorting (translating, rotating, etc.) the original patterns (Baird, 1990). The top of Fig. 1 shows artificial data generated by rotating a digital image of the digit \"3\" (with the original in the center). This procedure, called the \"distortion model\", has two drawbacks. First, the user must choose the magnitude of distortion and how many instances should be generated. Second, and more importantly, the distorted data is highly correlated with the original data. This makes traditional learning algorithms such as back propagation very inefficient. The distorted data carries only a very small incremental amount of information, since the distorted patterns are not very different from the original ones. It may not be possible to adjust the learning system so that learning the invariances proceeds at a reasonable rate while learning the original points remains non-divergent. 
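As a concrete illustration of the distortion model discussed above (a sketch, not the authors' code), generating rotated copies of a pattern is a few lines; `scipy.ndimage.rotate` and the particular angle grid used here are assumptions:

```python
import numpy as np
from scipy.ndimage import rotate

def distortion_model(image, angles_deg):
    """Generate rotated copies of an image, as in the 'distortion model'.

    Each copy is highly correlated with the original, which is the
    drawback discussed in the text."""
    return [rotate(image, a, reshape=False, order=1) for a in angles_deg]

# Hypothetical 16x16 "digit": augment with small rotations around zero.
original = np.zeros((16, 16))
original[4:12, 7:9] = 1.0
augmented = distortion_model(original, angles_deg=[-10, -5, 5, 10])
```

The user still has to choose the distortion magnitudes and the number of copies, which is exactly the first drawback mentioned in the text.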
\n\nThe key idea in this paper is that it is possible to directly learn the effect on the output of distorting the input, independently from learning the undistorted patterns. When a pattern P is transformed (e.g. rotated) with a transformation s that depends on one parameter α (e.g. the angle of the rotation), the set of all the transformed patterns S(P) = {s(α, P) ∀α} is a one-dimensional curve in the vector space of the inputs (see Fig. 1). In certain cases, such as rotations of digital images, this curve must be made continuous using smoothing techniques, as will be shown below. When the set of transformations is parameterized by n parameters α_i (rotation, translation, scaling, etc.), S(P) is a manifold of at most n dimensions. The patterns in S(P) that are obtained through small transformations of P, i.e. the part of S(P) that is close to P, can be approximated by a plane tangent to the manifold S(P) at point P. Small transformations of P can be obtained by adding to P a linear combination of vectors that span the tangent plane (tangent vectors). The images at the bottom of Fig. 1 were obtained by that procedure. More importantly, the tangent vectors can be used to specify high-order constraints on the function to be learned, as explained below. \n\nFigure 2: Learning a given function F(x) (solid line) from a limited set of examples (x1 to x4). The fitted curves are shown in dotted line. Top: The only constraint is that the fitted curve goes through the examples. Bottom: The fitted curve not only goes through each example, but its derivatives evaluated at the examples also agree with the derivatives of the given function. \n\nTo illustrate the method, consider the problem of learning a single-valued function F from a limited set of examples. 
Fig. 2 (top) represents a simple case where the desired function F (solid line) is to be approximated by a function G (dotted line) from four examples {(x_i, F(x_i))}, i = 1, 2, 3, 4. As exemplified in the picture, the fitted function G largely disagrees with the desired function F between the examples. If the functions F and G are assumed to be differentiable (which is generally the case), the approximation G can be greatly improved by requiring that G's derivatives evaluated at the points {x_i} are equal to the derivatives of F at the same points (Fig. 2, bottom). This result can be extended to multidimensional inputs. In this case, we can impose the equality of the derivatives of F and G in certain directions, not necessarily in all directions of the input space. \nSuch constraints find immediate use in traditional learning problems. It is often the case that a priori knowledge is available on how the desired function varies with respect to some transformations of the input. It is straightforward to derive the corresponding constraint on the directional derivatives of the fitted function G in the directions of the transformations (previously named tangent vectors). Typical examples can be found in pattern recognition, where the desired classification function is known to be invariant with respect to some transformations of the input such as translation, rotation, scaling, etc.; in other words, the directional derivatives of the classification function in the directions of these transformations are zero. \n\nFigure 3: How to compute a tangent vector for a given transformation (in this case a rotation): pattern P, pattern P rotated by α, and the resulting tangent vector. \n\n2 IMPLEMENTATION \n\nThe implementation can be divided into two parts. The first part consists in computing the tangent vectors. 
This part is independent from the learning algorithm used subsequently. The second part consists in modifying the learning algorithm (for instance backprop) to incorporate the information about the tangent vectors. \n\nPart I: Let x be an input pattern and s be a transformation operator acting on the input space and depending on a parameter α. If s is a rotation operator for instance, then s(α, x) denotes the input x rotated by the angle α. We will require that the transformation operator s be differentiable with respect to α and x, and that s(0, x) = x. The tangent vector is by definition ∂s(α, x)/∂α. It can be approximated by a finite difference, as shown in Fig. 3. In the figure, the input space is a 16 by 16 pixel image and the patterns are images of handwritten digits. The transformations considered are rotations of the digit images. The tangent vector is obtained in two steps. First the image is rotated by an infinitesimal amount α. This is done by computing the rotated coordinates of each pixel and interpolating the gray level values at the new coordinates. This operation can be advantageously combined with some smoothing using a convolution. A convolution with a Gaussian provides an efficient interpolation scheme in O(nm) multiply-adds, where n and m are the (Gaussian) kernel and image sizes respectively. The next step is to subtract (pixel by pixel) the rotated image from the original image and to divide the result by the scalar α (see Fig. 3). If k types of transformations are considered, there will be k different tangent vectors per pattern. For most algorithms, these do not require any storage space since they can be generated as needed from the original pattern at negligible cost. \n\nPart II: Tangent prop is an extension of the backpropagation algorithm, allowing it to learn directional derivatives. 
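The tangent-vector computation of Part I can be sketched in a few lines before turning to the learning rule. This is a hedged illustration: `gaussian_filter` and `rotate` from SciPy stand in for the paper's Gaussian-convolution interpolation, and the angle and smoothing values are arbitrary choices, not the authors' settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, rotate

def rotation_tangent_vector(image, alpha_deg=1.0, sigma=0.5):
    """Finite-difference approximation of the tangent vector
    t = ds(alpha, x)/dalpha at alpha = 0, for s = image rotation.

    Steps: smooth the image (so the transformation curve is
    differentiable), rotate by a small angle, subtract the unrotated
    smoothed image, and divide by the angle."""
    smoothed = gaussian_filter(image, sigma)
    rotated = rotate(smoothed, alpha_deg, reshape=False, order=1)
    return (rotated - smoothed) / np.deg2rad(alpha_deg)
```

Small transformations of a pattern x then lie near x + a·t on the tangent plane; with k transformation types, k such vectors are computed per pattern, as needed, from the original image.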
Other algorithms such as radial basis functions can be extended in a similar fashion. \n\nTo implement our idea, we will modify the usual weight-update rule: \n\nΔw = −η ∂E/∂w   is replaced with   Δw = −η ∂(E + μ E_r)/∂w   (1) \n\nwhere η is the learning rate, E the usual objective function, E_r an additional objective function (a regularizer) that measures the discrepancy between the actual and desired directional derivatives in the directions of some selected transformations, and μ is a weighting coefficient. \nLet x be an input pattern, y = G(x) be the input-output function of the network. The regularizer E_r is of the form \n\nE_r = Σ_{x ∈ training set} E_r(x)   (2) \n\nwhere E_r(x) is \n\nE_r(x) = Σ_i ‖ K_i(x) − ∂G(s_i(α, x))/∂α |_{α=0} ‖² \n\nHere, K_i(x) is the desired directional derivative of G in the direction induced by transformation s_i applied to pattern x. The second term in the norm symbol is the actual directional derivative, which can be rewritten as \n\n∂G(s_i(α, x))/∂α |_{α=0} = G'(x) · ∂s_i(α, x)/∂α |_{α=0} \n\nwhere G'(x) is the Jacobian of G for pattern x, and ∂s_i(α, x)/∂α is the tangent vector associated to transformation s_i as described in Part I. Multiplying the tangent vector by the Jacobian involves one forward propagation through a \"linearized\" version of the network. In the special case where local invariance with respect to the s_i's is desired, K_i(x) is simply set to 0. \nComposition of transformations: The theory of Lie groups (Gilmore, 1974) ensures that compositions of local (small) transformations s_i correspond to linear combinations of the corresponding tangent vectors (the local transformations s_i have a structure of Lie algebra). Consequently, if E_r(x) = 0 is verified, the network derivative in the direction of a linear combination of the tangent vectors is equal to the same linear combination of the desired derivatives. 
In other words, if the network is successfully trained to be locally invariant with respect to, say, horizontal translations and vertical translations, it will be invariant with respect to compositions thereof. \nWe have derived and implemented an efficient algorithm, \"tangent prop\", for performing the weight update (Eq. 1). It is analogous to ordinary backpropagation, but in addition to propagating neuron activations, it also propagates the tangent vectors. The equations can be easily derived from Fig. 4. \n\nFigure 4: Forward propagated variables (a, x, α, ξ) and backward propagated variables (b, y, β, ψ) in the regular network (roman symbols) and the Jacobian (linearized) network (greek symbols). \n\nForward propagation: \n\na^l_i = Σ_j w^l_ij x^{l−1}_j,   x^l_i = σ(a^l_i)   (3) \n\nTangent forward propagation: \n\nα^l_i = Σ_j w^l_ij ξ^{l−1}_j,   ξ^l_i = σ′(a^l_i) α^l_i   (4) \n\nTangent gradient backpropagation: \n\nβ^l_i = Σ_k w^{l+1}_ki ψ^{l+1}_k   (5) \n\nGradient backpropagation: \n\nb^l_i = Σ_k w^{l+1}_ki y^{l+1}_k   (6) \n\nWeight update: \n\n∂[E(W, x_p) + μ E_r(W, x_p, T_p)] / ∂w^l_ij = x^{l−1}_j y^l_i + μ ξ^{l−1}_j ψ^l_i   (7) \n\nFigure 5: Generalization performance (% error on the test set) as a function of the training set size for the tangent prop and the backprop algorithms. \n\nThe regularization parameter μ is tremendously important, because it determines the tradeoff between minimizing the usual objective function and minimizing the directional derivative error. 
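The update rule can be sketched as follows for a single-hidden-layer network with a linear output unit. This is a hedged illustration of the forward, tangent-forward, backward and weight-update passes described above, not the authors' implementation: the tangent backward pass here drops the second-derivative term of the exact gradient of E_r (a common simplification), and network sizes, learning rate and μ are arbitrary:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dsigmoid(a):
    s = sigmoid(a)
    return s * (1.0 - s)

def tangent_prop_step(W1, W2, x, t, target, K=0.0, eta=0.1, mu=1.0):
    """One tangent-prop update (in place) for y = W2 @ sigmoid(W1 @ x).

    x: input pattern, t: tangent vector, K: desired directional
    derivative (K = 0 enforces local invariance)."""
    # Forward propagation: activations and output.
    a1 = W1 @ x
    x1 = sigmoid(a1)
    out = W2 @ x1
    # Tangent forward propagation through the linearized (Jacobian)
    # network: xi = sigma'(a) * (W t); output is the directional derivative.
    alpha1 = W1 @ t
    xi1 = dsigmoid(a1) * alpha1
    dir_deriv = W2 @ xi1
    # Gradient backpropagation for E = 0.5 * (out - target)^2 ...
    y2 = out - target
    y1 = dsigmoid(a1) * (W2.T @ y2)
    # ... and tangent gradient backpropagation for
    # Er = 0.5 * (dir_deriv - K)^2 (second-derivative term dropped).
    psi2 = dir_deriv - K
    psi1 = dsigmoid(a1) * (W2.T @ psi2)
    # Weight update: gradient = x_prev * y + mu * xi_prev * psi.
    W2 -= eta * (np.outer(y2, x1) + mu * np.outer(psi2, xi1))
    W1 -= eta * (np.outer(y1, x) + mu * np.outer(psi1, t))
    return out, dir_deriv
```

The tangent forward pass costs about as much as a regular forward pass, which is why each tangent-prop sweep stays cheap compared to replaying many distorted copies of every pattern.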
\n\n3 RESULTS \n\nTwo experiments illustrate the advantages of tangent prop. The first experiment is a classification task, using a small (linearly separable) set of 480 binarized handwritten digits. The training sets consist of 10, 20, 40, 80, 160 or 320 patterns, and the test set contains the remaining 160 patterns. The patterns are smoothed using a Gaussian kernel with standard deviation of one half pixel. For each of the training set patterns, the tangent vectors for horizontal and vertical translation are computed. The network has two hidden layers with locally connected shared weights, and one output layer with 10 units (5194 connections, 1060 free parameters) (Le Cun, 1989). The generalization performance as a function of the training set size for traditional backprop and tangent prop are compared in Fig. 5. We have conducted additional experiments in which we implemented not only translations but also rotations, expansions and hyperbolic deformations. This set of 6 generators is a basis for all linear transformations of coordinates for two-dimensional images. It is straightforward to implement other generators including gray-level shifting, \"smooth\" segmentation, local continuous coordinate transformations and independent image segment transformations. \n\nThe next experiment is designed to show that in applications where data is highly correlated, tangent prop yields a large speed advantage. Since the distortion model implies adding lots of highly correlated data, the advantage of tangent prop over the distortion model becomes clear. \nThe task is to approximate a function that has plateaus at three locations. We want to enforce local invariance near each of the training points (Fig. 6, bottom). The network has one input unit, 20 hidden units and one output unit. Two strategies are possible: either generate a small set of training points covering each of the plateaus (open squares on Fig. 6, bottom), or generate one training point for each plateau (closed squares), and enforce local invariance around them (by setting the desired derivative to 0). The training set of the former method is used as a measure of the performance for both methods. All parameters were adjusted for approximately optimal performance in all cases. The learning curves for both models are shown in Fig. 6 (top). Each sweep through the training set for tangent prop is a little faster since it requires only 6 forward propagations, while it requires 9 in the distortion model. As can be seen, stable performance is achieved after 1300 sweeps for the tangent prop, versus 8000 for the distortion model. The overall speedup is therefore about 10. \n\nFigure 6: Comparison of the distortion model (left column) and tangent prop (right column). The top row gives the learning curves (error versus number of sweeps through the training set). The bottom row gives the final input-output function of the network; the dashed line is the result for unadorned backprop. 
\nTangent prop in this example can take advantage of a very large regularization term. The distortion model is at a disadvantage because the only parameter that effectively controls the amount of regularization is the magnitude of the distortions, and this cannot be increased to large values because the right answer is only invariant under small distortions. \n\n4 CONCLUSIONS \n\nWhen a priori information about invariances exists, this information must be made available to the adaptive system. There are several ways of doing this, including the distortion model and tangent prop. The latter may be much more efficient in some applications, and it permits separate control of the emphasis and learning rate for the invariances, relative to the original training data points. Training a system to have zero derivatives in some directions is a powerful tool to express invariances to transformations of our choosing. Tests of this procedure on large-scale applications (handwritten zipcode recognition) are in progress. \n\nReferences \n\nBaird, H. S. (1990). Document Image Defect Models. In IAPR 1990 Workshop on Syntactic and Structural Pattern Recognition, pages 38-46, Murray Hill, NJ. \n\nGilmore, R. (1974). Lie Groups, Lie Algebras and some of their Applications. Wiley, New York. \n\nLe Cun, Y. (1989). Generalization and Network Design Strategies. In Pfeifer, R., Schreter, Z., Fogelman, F., and Steels, L., editors, Connectionism in Perspective, Zurich, Switzerland. Elsevier. An extended version was published as a technical report of the University of Toronto. \n", "award": [], "sourceid": 536, "authors": [{"given_name": "Patrice", "family_name": "Simard", "institution": null}, {"given_name": "Bernard", "family_name": "Victorri", "institution": null}, {"given_name": "Yann", "family_name": "LeCun", "institution": null}, {"given_name": "John", "family_name": "Denker", "institution": null}]}