{"title": "A Self-Organizing Integrated Segmentation and Recognition Neural Net", "book": "Advances in Neural Information Processing Systems", "page_first": 496, "page_last": 503, "abstract": null, "full_text": "A Self-Organizing Integrated Segmentation and Recognition Neural Net \n\nJim Keeler * \n\nMCC \n\n3500 West Balcones Center Drive \n\nAustin, TX 78729 \n\n*keeler@mcc.com Reprint requests: coila@mcc.com or at the above address. \n\nDavid E. Rumelhart \nPsychology Department \n\nStanford University \nStanford, CA 94305 \n\nAbstract \n\nWe present a neural network algorithm that simultaneously performs segmentation \nand recognition of input patterns and that self-organizes to detect \ninput pattern locations and pattern boundaries. We demonstrate this neural \nnetwork architecture on character recognition using the NIST database \nand report the results here. The resulting system simultaneously segments \nand recognizes touching or overlapping characters, broken characters, \nand noisy images with high accuracy. \n\n1 INTRODUCTION \n\nStandard pattern recognition systems usually involve a segmentation step prior to \nthe recognition step. For example, it is very common in character recognition to \nsegment characters in a pre-processing step, then normalize the individual characters \nand pass them to a recognition engine such as a neural network, as in the work of \nLeCun et al. (1988) and Martin and Pittman (1988). \n\nThis separation between segmentation and recognition becomes unreliable if the \ncharacters are touching each other, touching bounding boxes, broken, or noisy. \nOther applications such as scene analysis or continuous speech recognition pose \nsimilar and more severe segmentation problems. The difficulties encountered in \nthese applications present an apparent dilemma: one cannot recognize the patterns \nuntil they are segmented, yet in many cases one cannot segment the patterns until \nthey are recognized. 
\n\n[Figure 1 diagram. Labels: Outputs $p_z = \\frac{S_z}{1 + S_z}$; Summing Units $S_z = \\sum_{xy} X_{xyz}$; Grey-scale input image $I(x,y)$.] \n\nFigure 1: The ISR network architecture. The input image may contain several \ncharacters and is presented to the network as a two-dimensional grey-scale image. \nThe units in the first block, $h_{ijk}$, have linked-local receptive field connections to the \ninput image. Block 2, $H_{x'y'z'}$, has a three-dimensional linked-local receptive field \nto block 1, and the exponential unit block, block 3, has three-dimensional linked-local \nreceptive field connections to block 2. These linked fields ensure translational \ninvariance (except for edge effects at the boundary). The exponential unit block \nhas one layer for each output category. These units are the output units in the test \nmode, but hidden units during training: the exponential unit activity is summed \nover position ($S_z$) to project out the positional information, then converted to a probability \n$p_z$. Once trained, the exponential unit layers serve as \"smart histograms\" giving \nsharp peaks of activity directly above the corresponding characters in the input \nimage, as shown to the left. \n\nA solution to this apparent dilemma is to simultaneously segment and recognize \nthe patterns. Integration of the segmentation and recognition steps is essential for \nfurther progress in these difficult pattern recognition tasks, and much effort has been \ndevoted to this topic in speech recognition. For example, Hidden Markov models \nintegrate the task of segmentation and recognition as a part of the word-recognition \nmodule. 
Nevertheless, little neural network research in pattern recognition has \nfocused on the integrated segmentation and recognition (ISR) problem. \n\nThere are several ways to achieve ISR in a neural network. The first use of backpropagation \nISR neural networks for character recognition was reported by Keeler, \nRumelhart and Leow (1991a). The ISR neural network architecture is similar to \nthe time-delayed neural network architecture for speech recognition used by Lang, \nHinton, and Waibel (1990). \n\nThe following section outlines the neural network algorithm and architecture. Details \nand rationale for the exact structure and assumptions of the network can be \nfound in Keeler et al. (1991a,b). \n\n2 NETWORK ARCHITECTURE AND ALGORITHM \n\nThe basic organization of the network is illustrated in Figure 1. The input consists \nof a two-dimensional grey-scale image representing the pattern to be processed. We \ndesignate this input pattern by the two-dimensional field $I(x, y)$. In general, we \nassume that any pattern can be presented at any location and that the characters \nmay touch, overlap or be broken or noisy. The input then projects to a linked-local-receptive-field \nblock of sigmoidal hidden units (to enforce translational invariance). \nWe designate the activation of the sigmoidal units in this block by $h_{ijk}$. \n\nThe second block of hidden units, $H_{x'y'z'}$, is a linked-local receptive field block of \nsigmoidal units that receives input from a three-dimensional receptive field in the \n$h_{ijk}$ block. In a standard neural network architecture we would normally connect \nblock H to the output units. However, we instead connect block H to a block of exponential \nunits $X_{xyz}$. The X block serves as the outputs after the network has been trained; \nthere is a sheet of exponential units for each output category. These units are \nconnected to block H via a linked-local receptive field structure. 
Each exponential unit has activation $X_{xyz} = e^{\\eta_{xyz}}$, \nwhere the net input to the unit is \n\n$$\\eta_{xyz} = \\sum_{x'y'} W^{xyz}_{x'y'z'} H_{x'y'z'} + \\beta_z, \\qquad (1)$$ \n\nand $W^{xyz}_{x'y'z'}$ is the weight from hidden unit $H_{x'y'z'}$ to the exponential unit $X_{xyz}$. \nSince we use linked weights in each block, the entire structure is translationally \ninvariant. We make use of this property in our training algorithm and project out \nthe positional information by summing over the entire layer, $S_z = \\sum_{xy} X_{xyz}$. This \nallows us to give non-specific target information in the form of \"the input contains \na 5 and a 3, but I will not say where.\" We do this by converting the summed \ninformation into an output probability, $p_z = \\frac{S_z}{1 + S_z}$. \n\n2.1 The Learning Rule \n\nThere are two objective functions that we have used to train ISR networks: cross \nentropy and total-sum-square error. The cross-entropy objective is $I = \\sum_z [t_z \\ln p_z + (1 - t_z) \\ln(1 - p_z)]$, where $t_z$ \nequals 1 if pattern z is presented and 0 otherwise. Computing the gradient with \nrespect to the net input to a particular exponential unit yields the following term \nin our learning rule: \n\n$$\\frac{\\partial I}{\\partial \\eta_{xyz}} = (t_z - p_z) \\frac{X_{xyz}}{\\sum_{xy} X_{xyz}}. \\qquad (2)$$ \n\nIt should be noted that this is a kind of competitive rule in which the learning is \nproportional to the ratio of the activation of the unit at a particular \nlocation in the X layer to the total activation in the entire layer. For example, \nsuppose that $X_{2,3,5} = 1000$ and $X_{5,3,5} = 100$. Given the above rule, $X_{2,3,5}$ would \nreceive about 10 times more of the output error than the unit $X_{5,3,5}$. Thus the units \ncompete with each other for the credit or blame of the output, and the \"rich get \nricher\" until the proper target is achieved. 
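The forward summation and the cross-entropy term of equation (2) can be sketched in a few lines of Python. This is a minimal illustration for a single category sheet, not the authors' implementation: the function names and the list-of-lists layout are assumptions, and the two activations reuse the 1000 versus 100 example above.

```python
import math

def isr_output(eta):
    # eta[x][y]: net inputs of one exponential sheet (one category z).
    # Layout and names are illustrative assumptions.
    X = [[math.exp(v) for v in row] for row in eta]  # exponential units
    S = sum(sum(row) for row in X)                   # summing unit S_z
    p = S / (1.0 + S)                                # output p_z
    return X, S, p

def cross_entropy_grad(X, S, p, t):
    # Equation (2): each unit receives (t - p) times its share X / S.
    return [[(t - p) * x / S for x in row] for row in X]

# Two units with activations 1000 and 100, target t = 1.
eta = [[math.log(1000.0), math.log(100.0)]]
X, S, p = isr_output(eta)
grads = cross_entropy_grad(X, S, p, t=1.0)
ratio = grads[0][0] / grads[0][1]
print(ratio)  # about 10
```

Because equation (2) scales the shared error (t - p) by each unit's share of the summed activity, the ratio of the two gradient terms equals the ratio of the activations.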
This competition favors self-organization of highly \nlocalized spikes of activity in the exponential layers directly above the particular \ncharacter that the exponential layer detects (\"smart histograms\" as shown in Figure 1). \nNote that we never give positional information to the network but that the \nnetwork self-organizes the exponential unit activity to discern the positional information. \nThe second function is the total-sum-square error, $E = \\sum_z (t_z - p_z)^2$. For \nthe total-sum-square error measure, the gradient term becomes \n\n$$-\\frac{\\partial E}{\\partial \\eta_{xyz}} = (t_z - p_z) \\frac{X_{xyz}}{(1 + \\sum_{xy} X_{xyz})^2}. \\qquad (3)$$ \n\nAgain this has a competitive term, but the competition is only important for $X_{xyz}$ \nlarge; otherwise, for small $\\sum_{xy} X_{xyz}$, the denominator is dominated by 1. We used \nthe quadratic error function for the networks reported in the next section. \n\n3 NIST DATABASE RECOGNITION \n\n3.1 Data \n\nWe tested this neural network algorithm on the problem of segmenting and recognizing \nhandwritten numerals from the NIST database. This database contains \napproximately 273,000 samples of handwritten numerals collected from the Bureau \nof Census field staff. There were 50 different forms used in the study, each with \n33 fields, 28 of which contain handwritten numerals ranging in length from 2 to 10 \ndigits per field. We only used fields of length 2 to 6 (field numbers 6 to 30). We \nused two test sets: a small test set, Test Set A, of approximately 4,000 digits in 1,000 \nfields from forms labeled f1800 to f1840, and a larger test set, Test Set B, containing \n20,000 numerals in 5,000 fields and 200 forms from f1800 to f1899 and f2000 to f2199. \n\nWe used two different training sets: a hand-segmented training set containing approximately \n33,000 digits from forms f0000 to f0636 (the Segmented Training Set) \nand another training set that was never hand-segmented from forms f0000 to f1800 \n(the Unsegmented Training Set). 
We pre-processed the fields with a simple box-removal \nand size-normalization program before they were input to the ISR net. \n\nThe hand segmentation was conventional in the sense that boxes were drawn around \neach of the characters, but the boxes included any other portions of characters \nthat may be nearby or touching in the natural context. Note that precise labeling of \nthe characters is not essential at all. We have trained systems where only the center \ninformation of the characters was used and found no degradation in performance. This \nis due to the fact that the system self-organizes the positional information, so it is \nonly required that we know whether a character is in a field, not precisely where. \n\n3.2 Training \n\nWe trained several nets on the NIST database. The best training procedure was \nas follows. Step 1): train the network to an intermediate level of accuracy (96% \nor so on single characters, about 12 epochs of training set 1). Note that when we \ntrain on single characters, we do not need isolated characters - there are often \nportions of other nearby characters within the input field. Indeed, it helps the ISR \nperformance to use this natural context. There are two reasons for this step: the \nfirst is speed - training goes much faster with single characters because we can use a \nsmall network. The second is that we found a slight generalization accuracy benefit from including \nthis training step. Step 2): copy the weights of this small network into a larger \nnetwork and start training on 2 and 3 digit fields from the database without hand \nsegmentation. These are fields numbered 6, 7, 11, 15, 19, 20, 23, 24, 27, and 28. The \nreason that we use these fields is that we do not have to hand-segment them - we \npresent the fields to the net with the answer that the person was supposed to write \nin the field. (There were several cases where the person wrote the wrong numbers \nor didn't write anything. 
These cases were NOT screened from the training set.) \nTaking these fields from forms f0000 to f1800 gives us another 45,000 characters to \ntrain on without ever segmenting them. \n\nThere were several reasons that we used fields of length 2 and 3 and not fields of \n4, 5, or 6 for training (even though we used these in testing). First, 3 characters \ncovers the most general case: a character either has no characters on either side, \none to the left, one to the right or one on both sides (3 characters total). If we train \non 3 characters and duplicate the weights, we have covered the most general case for \nany number of characters, and it is clearly faster to train on shorter fields. Second, \ntraining with more characters confuses the net. As pointed out in our previous \nwork (Keeler et al., 1991a), the learning algorithm that we use is only valid for one or no \ncharacters of a given type presented in the input field. Thus, the field '39541' is OK \nto train on, but the field '288' violates one of the assumptions of the training rule. \nIn this case the two 8's would be competing with each other for the answer and \nthe rule favors only one winner. Even though this problem occurs 1/10th of the \ntime for two digit fields, it is not serious enough to prevent the net from learning. \n(Clearly it would not learn fields of length 10 where all of the target units are \nturned on and there would be no chance for discrimination.) This problem could \nbe avoided by incorporating order information into training. We have proposed \nseveral mechanisms for doing so, but do not use \nthem in the present system. Note that training only on 2 and 3 digit fields biases the training toward the a-priori \ndistribution of characters in those fields, which is a different distribution \nfrom that of the testing set. 
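The 1/10 figure for two digit fields can be checked with a short calculation. The sketch below assumes digits are drawn uniformly and independently from 0 to 9, which is an idealization and not a measured property of the NIST fields; the function name is hypothetical.

```python
from fractions import Fraction

def repeated_digit_fraction(n):
    # Fraction of n-digit fields in which some digit occurs more than once,
    # ASSUMING uniform, independent digits (an idealization).
    distinct = 1
    for i in range(n):
        distinct *= (10 - i)  # ways to choose n different digits in order
    return 1 - Fraction(distinct, 10 ** n)

print(repeated_digit_fraction(2))  # 1/10
print(repeated_digit_fraction(3))  # 7/25
```

Under the same assumption, roughly 28% of three digit fields contain a repeated digit, so the one-winner assumption of the training rule is violated more often there than in two digit fields.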
\n\nThe two networks that we used had the following architectures: \n\nNet1: Input: 28x24, receptive fields 6x6, shift 2x2. Hidden 1: 12x11x12, receptive fields 4x4x12, shift \n2x2x12. Hidden 2: 5x4x18, receptive fields 3x3x18, shift 1x1x18. Exponentials (block \n3): 3x2x10; 10 summing units, 10 outputs. \n\nNet2: Input: 28x26, receptive fields 6x6, shift 2x4. Hidden 1: 12x6x12, receptive \nfields 5x4x12, shift 1x2x12. Hidden 2: 8x2x18, receptive fields 5x2x18, shift 1x1x18. \nExponentials (block 3): 4x1x10; 10 summing units, 10 outputs. \n\n[Figure 2 plots, panels A and B: % correct versus % rejected; curves n1, n2, and n1&2.] \n\nFigure 2: Average combined network performance on the NIST database. Figure \n2A shows the generalization performance of two neural networks on the NIST Test \nSet A. The individual nets Net1 and Net2 (n1 and n2 respectively) and the combined \nperformance of nets 1 and 2 are shown, where fields are rejected when the nets differ. \nThe curves show results for fields ranging in length from 2 to 6, averaged over all fields, for \n1,000 total fields and 4,000 characters. Note that Net2 is not nearly as accurate as Net1 \non fields, but that the combination of the two is significantly better than either. \nFor this test set the rejection rate is 17% (83% acceptance) with an accuracy rate of \n99.3% (error rate 0.7%) overall on fields of average length 4. Figure 2B shows the \nper-field performance for Test Set B (5,000 fields, 20,000 digits). Again both nets are \nused for the rejection criterion. For comparison, 99% accuracy on fields of length 4 \nis achieved at 23% rejection. 
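The rejection rule behind the combined curves can be sketched as follows: an answer is kept only when both nets agree and each net's margin (the difference of its two highest activations) is large enough. This is a simplified illustration applied to one output vector per net, whereas the paper rejects whole fields; the function names and the threshold value are hypothetical.

```python
def argmax(activations):
    # Index of the largest activation.
    return max(range(len(activations)), key=lambda i: activations[i])

def top_two_margin(activations):
    # Confidence measure: difference of the two highest activations.
    ordered = sorted(activations, reverse=True)
    return ordered[0] - ordered[1]

def combined_decision(out1, out2, min_margin=0.5):
    # Returns the agreed class index, or None to reject.
    # min_margin is a hypothetical threshold, not a value from the paper.
    c1, c2 = argmax(out1), argmax(out2)
    if c1 != c2:
        return None  # the nets disagree
    if top_two_margin(out1) < min_margin or top_two_margin(out2) < min_margin:
        return None  # at least one net is not confident enough
    return c1

print(combined_decision([0.1, 0.9, 0.2], [0.0, 0.8, 0.1]))  # 1
print(combined_decision([0.1, 0.9, 0.2], [0.9, 0.1, 0.0]))  # None
```

Raising the threshold trades a higher rejection rate for higher accuracy on the accepted answers, which is the trade-off plotted in Figure 2.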
\n\nFigure 2 shows the generalization performance on the NIST database for Net1, Net2 \nand their combination. For the combination, we accepted the answer only when the \nnetworks agreed and rejected further based on a simple confidence measure (the \ndifference of the two highest activations) of each individual net. \n\n[Figure 3 image: sample handwritten fields.] \n\nFigure 3: Examples of correctly recognized fields in the NIST database. This figure \nshows examples of fields that were correctly recognized by the ISR network. Note \nthe cases of touching characters, multiple touching characters, characters touching \nin multiple places, fields with extrinsic noise, broken characters and touching, broken \ncharacters with noise. Because of the integrated nature of the segmentation and \nrecognition, the same system is able to handle all of these cases. \n\n4 DISCUSSION AND CONCLUSIONS \n\nThis investigation has demonstrated that the ISR algorithm can be used for integrated \nsegmentation and recognition and can achieve high-accuracy results on a large \ndatabase of hand-printed numerals. The overall result of 83% acceptance \nwith 99.3% accuracy on fields of average length 4 is competitive with accuracy reported \nin commercial products. One should be careful making such comparisons, however: \nwe found a variance of 7% or more in rejection performance on different test sets \nwith more than 1,000 fields (a good statistical sample). Perhaps more important \nthan the high accuracy, we have demonstrated that the ISR system is able to deal \nwith touching, broken and noisy characters. In other investigations we have demonstrated \nthe ISR system on alphabetic characters with good results, and on speech \nrecognition (Keeler, Rumelhart, and Zand-Biglari, 1991b) where the results are slightly \nbetter than Hidden Markov Model results. 
\n\nThere are several attractive aspects of the ISR algorithm: 1) Labeling can be \n\"sloppy\" in the sense that the borders of the characters do not have to be defined. \nThis reduces the labor burden of getting a system running. 2) The final weights can \nbe duplicated so that the whole system can run in parallel. Even with both networks \nrunning, the number of weights and activations that need to be stored in memory is \nquite small - about 30,000 floating point numbers, and the system is quite fast \nin the feed-forward mode: peak performance is about 2.5 characters/sec on a DEC \n5000 (including everything: both networks running, input pre-processing, parsing \nthe answers, printing results, etc.). This structure is ideal for VLSI implementation \nsince it contains a very small number of weights (about 5,000). This is one possible \nway around the computational bottleneck encountered in processing complex \nscenes - the ISR net can do very fast first-cut scene analysis with good discrimination \nof similar objects - an extremely difficult task. 3) The ISR algorithm and \narchitecture present a new and powerful approach of using forward models to convert \nposition-independent training information into position-specific error signals. \n4) There is no restriction to one dimension; the same ISR structure has been used \nfor two-dimensional parsing. \n\nNevertheless, there are several aspects of the ISR net that require improvement for \nfuture progress. First, the algorithmic assumption of having one pattern of a given \ntype in the input field is too restrictive and can cause confusion in some training \nexamples. Second, we are throwing some information away when we project out \nall of the positional information; order information could be incorporated into the \ntraining information. 
This extra information should improve training performance \ndue to the more-specific error signals. Finally, normalization is still a problem. \nWe do a crude normalization, and the networks are able to segment and recognize \ncharacters as long as the difference in size is not too large. A factor of two in \nsize difference is easily handled with the ISR system, but a factor of four decreases \nthe character recognition accuracy by about 3-5%. Handling larger size differences requires \na tighter coupling between the segmentation/recognition and normalization. \nJust as one must segment and recognize simultaneously, in many cases one can't \nproperly normalize until segmentation/recognition has occurred. Fortunately, in \nmost document processing applications, crude normalization to within a factor of \ntwo is simple to achieve, allowing high accuracy networks. \n\nAcknowledgements \n\nWe thank Wee-Kheng Leow, Steve O'Hara, and John Canfield for useful discussions \nand coding. \n\nReferences \n\n[1] J.D. Keeler, D.E. Rumelhart, and W.K. Leow (1991a) \"Integrated Segmentation \nand Recognition of Hand-printed Numerals\". In: Lippmann, Moody and \nTouretzky, Editors, Neural Information Processing Systems 3, 557-563. \n\n[2] J.D. Keeler, D.E. Rumelhart, and S. Zand-Biglari (1991b) \"A Neural Network \nFor Integrated Segmentation and Recognition of Continuous Speech\". MCC \nTechnical Report ACT-NN-359-91. \n\n[3] K. Lang, A. Waibel, and G. Hinton (1990) \"A Time Delay Neural Network Architecture \nfor Isolated Word Recognition\". Neural Networks 3, 23-44. \n\n[4] Y. Le Cun, B. Boser, J.S. Denker, S. Solla, R. Howard, and L. Jackel \n(1990) \"Back-Propagation Applied to Handwritten Zipcode Recognition\". Neural \nComputation 1(4):541-551. \n\n[5] G. Martin and J. Pittman (1990) \"Recognizing Hand-printed Letters and Digits\". \nIn D. Touretzky (Ed.), Neural Information Processing Systems 2, 405-414, \nMorgan Kaufmann Publishers, San Mateo, CA. 
\n\n[6] The NIST database can be obtained by writing to: Standard Reference Data, \nNational Institute of Standards and Technology, 221/A323, Gaithersburg, MD \n20899 USA, and asking for NIST Special Database 1 (HWDB). ", "award": [], "sourceid": 570, "authors": [{"given_name": "Jim", "family_name": "Keeler", "institution": null}, {"given_name": "David", "family_name": "Rumelhart", "institution": null}]}