{"title": "Handwritten Digit Recognition with a Back-Propagation Network", "book": "Advances in Neural Information Processing Systems", "page_first": 396, "page_last": 404, "abstract": null, "full_text": "396 \n\nLe Cun, Boser, Denker, Henderson, Howard, Hubbard and Jackel \n\nHandwritten Digit Recognition with a Back-Propagation Network \n\nY. Le Cun, B. Boser, J. S. Denker, D. Henderson, \nR. E. Howard, W. Hubbard, and L. D. Jackel \nAT&T Bell Laboratories, Holmdel, N.J. 07733 \n\nABSTRACT \n\nWe present an application of back-propagation networks to handwritten \ndigit recognition. Minimal preprocessing of the data was \nrequired, but the architecture of the network was highly constrained \nand specifically designed for the task. The input of the network \nconsists of normalized images of isolated digits. The method has a \n1% error rate and about a 9% reject rate on zipcode digits provided \nby the U.S. Postal Service. \n\n1 INTRODUCTION \nThe main point of this paper is to show that large back-propagation (BP) networks \ncan be applied to real image-recognition problems without a large, complex \npreprocessing stage requiring detailed engineering. Unlike most previous work on \nthe subject (Denker et al., 1989), the learning network is directly fed with images, \nrather than feature vectors, thus demonstrating the ability of BP networks to deal \nwith large amounts of low level information. \n\nPrevious work performed on simple digit images (Le Cun, 1989) showed that the \narchitecture of the network strongly influences the network's generalization ability. \nGood generalization can only be obtained by designing a network architecture that \ncontains a certain amount of a priori knowledge about the problem. The basic design \nprinciple is to minimize the number of free parameters that must be determined \nby the learning algorithm, without overly reducing the computational power of the \nnetwork. 
This principle increases the probability of correct generalization because \nit results in a specialized network architecture that has a reduced entropy (Denker \net al., 1987; Patarnello and Carnevali, 1987; Tishby, Levin and Solla, 1989; Le Cun, \n1989). On the other hand, some effort must be devoted to designing appropriate \nconstraints into the architecture. \n\nFigure 1: Examples of original zip codes from the testing set. \n\n2 ZIPCODE RECOGNITION \nThe handwritten digit-recognition application was chosen because it is a relatively \nsimple machine vision task: the input consists of black or white pixels, the digits \nare usually well-separated from the background, and there are only ten output \ncategories. Yet the problem deals with objects in a real two-dimensional space and \nthe mapping from image space to category space has both considerable regularity \nand considerable complexity. The problem has added attraction because it is of \ngreat practical value. \n\nThe database used to train and test the network is a superset of the one used in \nthe work reported last year (Denker et al., 1989). We emphasize that the method \nof solution reported here relies more heavily on automatic learning, and much less \non hand-designed preprocessing. \n\nThe database consists of 9298 segmented numerals digitized from handwritten zipcodes \nthat appeared on real U.S. Mail passing through the Buffalo, N.Y. post office. \nExamples of such images are shown in figure 1. The digits were written by many \ndifferent people, using a great variety of sizes, writing styles and instruments, with \nwidely varying levels of care. This was supplemented by a set of 3349 printed digits \ncoming from 35 different fonts. The training set consisted of 7291 handwritten \ndigits plus 2549 printed digits. 
The remaining 2007 handwritten and 700 printed \ndigits were used as the test set. The printed fonts in the test set were different from \nthe printed fonts in the training set. One important feature of this database, which \nis a feature common to all real-world databases, is that both the training set and \nthe testing set contain numerous examples that are ambiguous, unclassifiable, or \neven misclassified. \n\nFigure 2: Examples of normalized digits from the testing set. \n\n3 PREPROCESSING \nAcquisition, binarization, location of the zip code, and preliminary segmentation \nwere performed by Postal Service contractors (Wang and Srihari, 1988). Some of \nthese steps constitute very hard tasks in themselves. The segmentation (separating \neach digit from its neighbors) would be a relatively simple task if we could assume \nthat a character is contiguous and is disconnected from its neighbors, but neither \nof these assumptions holds in practice. Many ambiguous characters in the database \nare the result of mis-segmentation (especially broken 5's) as can be seen in figure 2. \n\nAt this point, the size of a digit varies but is typically around 40 by 60 pixels. Since \nthe input of a back-propagation network has a fixed size, it is necessary to normalize \nthe size of the characters. This was performed using a linear transformation to \nmake the characters fit in a 16 by 16 pixel image. This transformation preserves \nthe aspect ratio of the character, and is performed after extraneous marks in the \nimage have been removed. Because of the linear transformation, the resulting image \nis not binary but has multiple gray levels, since a variable number of pixels in the \noriginal image can fall into a given pixel in the target image. The gray levels of \neach image are scaled and translated to fall within the range -1 to 1. 
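The size normalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the reported experiments: the function name normalize_digit, the nearest-cell mapping of source pixel centers, the centering of the scaled character, and the example bitmap are all hypothetical choices. The sketch maps every source pixel into a 16 by 16 target with a single scale factor (which preserves the aspect ratio), averages the source pixels that fall into each target pixel (which yields multiple gray levels), and then scales and translates the gray levels into the range -1 to 1, with ink taken as +1.

```python
def normalize_digit(bitmap, size=16):
    # bitmap: list of rows, each a list of 0 (background) or 1 (ink).
    height, width = len(bitmap), len(bitmap[0])
    # A single scale factor for both axes preserves the aspect ratio.
    scale = size / max(height, width)
    # Offsets that center the scaled character (an assumption of this sketch).
    off_r = (size - height * scale) / 2.0
    off_c = (size - width * scale) / 2.0

    total = [[0.0] * size for _ in range(size)]
    count = [[0] * size for _ in range(size)]
    for r in range(height):
        for c in range(width):
            # Linear map from the source pixel center to a target pixel.
            tr = min(size - 1, int((r + 0.5) * scale + off_r))
            tc = min(size - 1, int((c + 0.5) * scale + off_c))
            total[tr][tc] += bitmap[r][c]
            count[tr][tc] += 1

    # Mean ink value per target pixel; pixels that receive nothing stay background.
    gray = [[total[r][c] / count[r][c] if count[r][c] else 0.0
             for c in range(size)] for r in range(size)]

    # Scale and translate the gray levels into the range -1 to 1.
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    span = (hi - lo) or 1.0
    return [[2.0 * (v - lo) / span - 1.0 for v in row] for row in gray]


# Hypothetical 60-row by 40-column bitmap containing a solid vertical bar of ink.
bar = [[1 if 15 <= c < 25 else 0 for c in range(40)] for r in range(60)]
image = normalize_digit(bar)
```

In this example the interior of the bar comes out at +1, the background at -1, and the target pixels that straddle the edges of the bar take intermediate gray levels, which is the effect the text attributes to a variable number of source pixels falling into one target pixel.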
\n\n4 THE NETWORK \nThe remainder of the recognition is entirely performed by a multi-layer network. All \nof the connections in the network are adaptive, although heavily constrained, and \nare trained using back-propagation. This is in contrast with earlier work (Denker \net al., 1989) where the first few layers of connections were hand-chosen constants. \nThe input of the network is a 16 by 16 normalized image and the output is composed \nof 10 units: one per class. When a pattern belonging to class i is presented, the \ndesired output is +1 for the ith output unit, and -1 for the other output units. \n\nFigure 3: Input image (left), weight vector (center), and resulting feature map \n(right). The feature map is obtained by scanning the input image with a single \nneuron that has a local receptive field, as indicated. White represents -1, black \nrepresents +1. \n\nA fully connected network with enough discriminative power for the task would have \nfar too many parameters to be able to generalize correctly. Therefore a restricted \nconnection-scheme must be devised, guided by our prior knowledge about shape \nrecognition. There are well-known advantages to performing shape recognition by \ndetecting and combining local features. We have required our network to do this \nby constraining the connections in the first few layers to be local. In addition, if \na feature detector is useful on one part of the image, it is likely to be useful on \nother parts of the image as well. One reason for this is that the salient features of a \ndistorted character might be displaced slightly from their position in a typical character. \nOne solution to this problem is to scan the input image with a single neuron \nthat has a local receptive field, and store the states of this neuron in corresponding \nlocations in a layer called a feature map (see figure 3). 
This operation is equivalent \nto a convolution with a small size kernel, followed by a squashing function. The \nprocess can be performed in parallel by implementing the feature map as a plane \nof neurons whose weight vectors are constrained to be equal. That is, units in a \nfeature map are constrained to perform the same operation on different parts of the \nimage. An interesting side-effect of this weight sharing technique, already described \nin (Rumelhart, Hinton and Williams, 1986), is to reduce the number of free parameters \nby a large amount, since a large number of units share the same weights. In \naddition, a certain level of shift invariance is present in the system: shifting the \ninput will shift the result on the feature map, but will leave it unchanged otherwise. \nIn practice, it will be necessary to have multiple feature maps, extracting different \nfeatures from the same image. \n\n     1  2  3  4  5  6  7  8  9 10 11 12 \n  1  X  X  X     X  X \n  2     X  X  X  X  X \n  3                    X  X  X     X  X \n  4                       X  X  X  X  X \n\nTable 1: Connections between H2 and H3. \n\nThe idea of local, convolutional feature maps can be applied to subsequent hidden \nlayers as well, to extract features of increasing complexity and abstraction. Interestingly, \nhigher level features require less precise coding of their location. Reduced \nprecision is actually advantageous, since a slight distortion or translation of the input \nwill have reduced effect on the representation. Thus, each feature extraction in \nour network is followed by an additional layer which performs a local averaging and \na subsampling, reducing the resolution of the feature map. This layer introduces \na certain level of invariance to distortions and translations. A functional module \nof our network consists of a layer of shared-weight feature maps followed by an \naveraging/subsampling layer. 
This is reminiscent of the Neocognitron architecture \n(Fukushima and Miyake, 1982), with the notable difference that we use backprop \n(rather than unsupervised learning) which we feel is more appropriate to this sort \nof classification problem. \n\nThe network architecture, represented in figure 4, is a direct extension of the ones \ndescribed in (Le Cun, 1989; Le Cun et al., 1990a). The network has four hidden \nlayers, named H1, H2, H3, and H4 respectively. Layers H1 and H3 are shared-weights \nfeature extractors, while H2 and H4 are averaging/subsampling layers. \n\nAlthough the size of the active part of the input is 16 by 16, the actual input is a 28 \nby 28 plane to avoid problems when a kernel overlaps a boundary. H1 is composed \nof 4 groups of 576 units arranged as 4 independent 24 by 24 feature maps. These \nfour feature maps will be designated by H1.1, H1.2, H1.3 and H1.4. Each unit in \na feature map takes its input from a 5 by 5 neighborhood on the input plane. As \ndescribed above, corresponding connections on each unit in a given feature map are \nconstrained to have the same weight. In other words, all of the 576 units in H1.1 \nuse the same set of 26 weights (including the bias). Of course, units in another \nmap (say H1.4) share another set of 26 weights. \n\nLayer H2 is an averaging/subsampling layer. It is composed of 4 planes of size 12 \nby 12. Each unit in one of these planes takes inputs from 4 units on the corresponding \nplane in H1. Receptive fields do not overlap. All the weights are constrained to be \nequal, even within a single unit. Therefore, H2 performs a local averaging and a 2 \nto 1 subsampling of H1 in each direction. \n\nLayer H3 is composed of 12 feature maps. Each feature map contains 64 units \narranged in an 8 by 8 plane. As before, these feature maps will be designated as \nH3.1, H3.2 ... H3.12. 
The connection scheme between H2 and H3 is quite similar \nto the one between the input and H1, but slightly more complicated because H3 \nhas multiple 2-D maps. Each unit's receptive field is composed of one or two 5 by \n5 neighborhoods centered around units that are at identical positions within each \nH2 map. Of course, all units in a given map are constrained to have identical \nweight vectors. The maps in H2 on which a map in H3 takes its inputs are chosen \naccording to a scheme described in table 1. According to this scheme, the network \nis composed of two almost independent modules. Layer H4 plays the same role as \nlayer H2; it is composed of 12 groups of 16 units arranged in 4 by 4 planes. \n\nFigure 4: Network Architecture with 5 layers of fully-adaptive connections. \n\nThe output layer has 10 units and is fully connected to H4. In summary, the \nnetwork has 4635 units, 98442 connections, and 2578 independent parameters. This \narchitecture was derived using the Optimal Brain Damage technique (Le Cun et al., \n1990b) starting from a previous architecture (Le Cun et al., 1990a) that had 4 times \nmore free parameters. \n\n5 RESULTS \nAfter 30 training passes the error rate on the training set (7291 handwritten plus 2549 \nprinted digits) was 1.1% and the MSE was 0.017. On the whole test set (2007 \nhandwritten plus 700 printed characters) the error rate was 3.4% and the MSE was \n0.024. All the classification errors occurred on handwritten characters. \n\nIn a realistic application, the user is not so much interested in the raw error rate \nas in the number of rejections necessary to reach a given level of accuracy. In our \ncase, we measured the percentage of test patterns that must be rejected in order \nto get 1% error rate. 
Our rejection criterion was based on three conditions: the \nactivity level of the most-active output unit should be larger than a given threshold \nt1, the activity level of the second most-active unit should be smaller than a given \nthreshold t2, and finally, the difference between the activity levels of these two units \nshould be larger than a given threshold td. The best percentage of rejections on \nthe complete test set was 5.7% for 1% error. On the handwritten set only, the \nresult was 9% rejections for 1% error. It should be emphasized that the rejection \nthresholds were obtained using performance measures on the test set. About half \nthe substitution errors in the testing set were due to faulty segmentation, and an \nadditional quarter were due to erroneous assignment of the desired category. Some \nof the remaining images were ambiguous even to humans, and in a few cases the \nnetwork misclassified the image for no discernible reason. \n\nEven though a second-order version of back-propagation was used, it is interesting \nto note that the learning takes only 30 passes through the training set. We think \nthis can be attributed to the large amount of redundancy present in real data. A \ncomplete training session (30 passes through the training set plus test) takes about \n3 days on a SUN SPARCstation 1 using the SN2 connectionist simulator (Bottou \nand Le Cun, 1989). \n\nAfter successful training, the network was implemented on a commercial Digital \nSignal Processor board containing an AT&T DSP-32C general purpose DSP chip \nwith a peak performance of 12.5 million multiply-add operations per second on 32 bit \nfloating point numbers. The DSP operates as a coprocessor in a PC connected to \na video camera. The PC performs the digitization, binarization and segmentation \nof the image, while the DSP performs the size-normalization and the classification. \nThe overall throughput of the digit recognizer including image acquisition is 10 to \n12 classifications per second and is limited mainly by the normalization step. On \nnormalized digits, the DSP performs more than 30 classifications per second. \n\nFigure 5: Atypical data. The network classifies these correctly, even though they \nare quite unlike anything in the training set. \n\n6 CONCLUSION \nBack-propagation learning was successfully applied to a large, real-world task. Our \nresults appear to be at the state of the art in handwritten digit recognition. The \nnetwork had many connections but relatively few free parameters. The network \narchitecture and the constraints on the weights were designed to incorporate geometric \nknowledge about the task into the system. Because of its architecture, the \nnetwork could be trained on a low-level representation of data that had minimal \npreprocessing (as opposed to elaborate feature extraction). Because of the redundant \nnature of the data and because of the constraints imposed on the network, the \nlearning time was relatively short considering the size of the training set. Scaling \nproperties were far better than one would expect just from extrapolating results of \nback-propagation on smaller, artificial problems. Preliminary results on alphanumeric \ncharacters show that the method can be directly extended to larger tasks. \nThe final network of connections and weights obtained by back-propagation learning \nwas readily implementable on commercial digital signal processing hardware. \nThroughput rates, from camera to classified image, of more than ten digits per \nsecond were obtained. \n\nAcknowledgments \n\nWe thank the US Postal Service and its contractors for providing us with the zipcode \ndatabase. We thank Henry Baird for useful discussions and for providing the \nprinted-font database. 
\n\nReferences \n\nBottou, L.-Y. and Le Cun, Y. (1989). SN2: A Simulator for Connectionist Models. \nNeuristique SA, Paris, France. \n\nDenker, J., Schwartz, D., Wittner, B., Solla, S. A., Howard, R., Jackel, L., and \nHopfield, J. (1987). Large Automatic Learning, Rule Extraction and Generalization. \nComplex Systems, 1:877-922. \n\nDenker, J. S., Gardner, W. R., Graf, H. P., Henderson, D., Howard, R. E., Hubbard, W., \nJackel, L. D., Baird, H. S., and Guyon, I. (1989). Neural Network \nRecognizer for Hand-Written Zip Code Digits. In Touretzky, D., editor, Neural \nInformation Processing Systems, volume 1, pages 323-331, Denver, 1988. \nMorgan Kaufmann. \n\nFukushima, K. and Miyake, S. (1982). Neocognitron: A new algorithm for pattern \nrecognition tolerant of deformations and shifts in position. Pattern Recognition, \n15:455-469. \n\nLe Cun, Y. (1989). Generalization and Network Design Strategies. In Pfeifer, R., \nSchreter, Z., Fogelman, F., and Steels, L., editors, Connectionism in Perspective, \nZurich, Switzerland. Elsevier. \n\nLe Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., \nand Jackel, L. D. (1990a). Back-Propagation Applied to Handwritten Zipcode \nRecognition. Neural Computation, 1(4). \n\nLe Cun, Y., Denker, J. S., Solla, S., Howard, R. E., and Jackel, L. D. (1990b). Optimal \nBrain Damage. In Touretzky, D., editor, Neural Information Processing \nSystems, volume 2, Denver, 1989. Morgan Kaufmann. \n\nPatarnello, S. and Carnevali, P. (1987). Learning Networks of Neurons with Boolean \nLogic. Europhysics Letters, 4(4):503-508. \n\nRumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal \nrepresentations by error propagation. In Parallel distributed processing: Explorations \nin the microstructure of cognition, volume I, pages 318-362. 
Bradford \nBooks, Cambridge, MA. \n\nTishby, N., Levin, E., and Solla, S. A. (1989). Consistent Inference of Probabilities \nin Layered Networks: Predictions and Generalization. In Proceedings of the \nInternational Joint Conference on Neural Networks, Washington DC. \n\nWang, C. H. and Srihari, S. N. (1988). A Framework for Object Recognition in a \nVisually Complex Environment and its Application to Locating Address Blocks \non Mail Pieces. International Journal of Computer Vision, 2:125. \n\n\f", "award": [], "sourceid": 293, "authors": [{"given_name": "Yann", "family_name": "LeCun", "institution": null}, {"given_name": "Bernhard", "family_name": "Boser", "institution": null}, {"given_name": "John", "family_name": "Denker", "institution": null}, {"given_name": "Donnie", "family_name": "Henderson", "institution": null}, {"given_name": "R.", "family_name": "Howard", "institution": null}, {"given_name": "Wayne", "family_name": "Hubbard", "institution": null}, {"given_name": "Lawrence", "family_name": "Jackel", "institution": null}]}