{"title": "Integrated Segmentation and Recognition of Hand-Printed Numerals", "book": "Advances in Neural Information Processing Systems", "page_first": 557, "page_last": 563, "abstract": null, "full_text": "Integrated Segmentation and Recognition of Hand-Printed Numerals \n\nJames D. Keeler\u00b7 \nMCC \n3500 W. Balcones Ctr. Dr. \nAustin, TX 78759 \n\nDavid E. Rumelhart \nPsychology Department \nStanford University \nStanford, CA 94305 \n\nWee-Kheng Leow \nMCC and \nUniversity of Texas \nAustin, TX 78759 \n\nAbstract \n\nNeural network algorithms have proven useful for recognition of individual, segmented characters. However, their recognition accuracy has been limited by the accuracy of the underlying segmentation algorithm. Conventional, rule-based segmentation algorithms encounter difficulty if the characters are touching, broken, or noisy. The problem in these situations is that often one cannot properly segment a character until it is recognized, yet one cannot properly recognize a character until it is segmented. We present here a neural network algorithm that simultaneously segments and recognizes in an integrated system. This algorithm has several novel features: it uses a supervised learning algorithm (backpropagation), but is able to take position-independent information as targets and self-organize the activities of the units in a competitive fashion to infer the positional information. We demonstrate this ability with overlapping hand-printed numerals. \n\n1 INTRODUCTION \n\nA major problem with standard backpropagation algorithms for pattern recognition is that they seem to require carefully segmented and localized input patterns for training. 
This is a problem for two reasons: first, it is often a labor-intensive task to provide this information and second, the decision as to how to segment often depends on prior recognition. However, we describe below a neural network design and corresponding backpropagation learning algorithm that learns to simultaneously segment and identify a pattern.1 \n\n\u00b7Reprint requests: Jim Keeler, keeler@mcc.com or coila@mcc.com \n\n557 \n\nThere are two important aspects to many pattern recognition problems that we have built directly into our network and learning algorithm. The first is that the exact location of the pattern, in space or time, is irrelevant to the classification of the pattern; it should be recognized as a member of the same class wherever or whenever it occurs. This suggests that we build translation independence directly into our network. The second aspect is that feedback about whether or not a pattern of a particular class is present is all that should be required for training; information about the exact location and relationship to other patterns should not be required. The target information, thus, does not include information about where the patterns occur, only about whether a particular pattern occurs. \n\nWe have incorporated two design principles into our network to deal with these problems. The first is to build translation independence into the network by using linked local receptive fields. The second is to build a fixed \"forward model\" (cf. Jordan and Rumelhart, 1990) which translates a location-specific recognition process into a location-independent output value. 
This output gives rise to a nonspecific error signal which is propagated back through this fixed network to train the underlying location-specific network. \n\n2 NETWORK ARCHITECTURE AND ALGORITHM \n\nThe basic organization of the network is illustrated in Figure 1. In the case of character recognition, the input consists of a set of pixels over which the stimulus patterns are displayed. We designate the stimulus pattern by the vector X. In general, we assume that any character can be presented in any position and that characters may overlap. The input image then projects to a set of hidden units which learn to abstract features from the input field. These feature abstraction units are organized into sheets, one for each feature type. Each unit within a sheet is constrained to have the same weights as every other unit in the sheet (to enforce translational invariance). This is the same method used by Rumelhart, Hinton and Williams (1986) in solving the so-called T-C problem and the one used by LeCun et al. (1990) in their work on ZIP-code recognition. \nWe let the activation value of the hidden unit of type k at location j be a logistic sigmoidal function of its net input and designate it h_kj. We interpret h_kj as the probability that feature f_k is present in the input at position j. The hidden units then project onto a set of sheets of position-specific character recognition units, one sheet for each character type. These units have exponential activation functions and each unit in the sheet receives inputs from a local receptive field block of feature detection units as shown in Figure 1. As with the hidden units, the weights in each exponential unit sheet are linked, enforcing translational invariance. 
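The linked-weight, sheet-per-feature arrangement described above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the paper's configuration: the kernel size, number of feature types, and bias values are all assumed toy choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_sheet(image, kernel, bias):
    """One sheet of hidden units: every unit applies the SAME weights
    (kernel, bias) at its own location; linking the weights this way is
    what enforces translational invariance across the sheet."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = sigmoid(np.sum(kernel * image[y:y + kh, x:x + kw]) + bias)
    return out

rng = np.random.default_rng(0)
img = rng.random((12, 12))            # toy grey-scale input image
kernels = rng.normal(size=(4, 3, 3))  # 4 feature types -> 4 hidden sheets
sheets = [feature_sheet(img, k, 0.0) for k in kernels]
```

Because the weights are shared, shifting the input simply shifts the pattern of activity on each sheet, which is the translation independence the text calls for.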
We designate as x_ij the activation of the unit for detecting character i at location j, and define \n\n1The algorithm and network design presented here was first proposed by Rumelhart in a presentation entitled \"Learning and Generalization in Multilayer Networks\" given at the NATO Advanced Research Workshop on Neuro Computing: Algorithms, Architectures and Applications held in Les Arcs, France in February, 1989. The algorithm can be viewed as a generalization and refinement of the TDNN of Lang, Hinton, & Waibel, 1990. \n\n[Figure 1 diagram, bottom to top: a grey-scale input image; sigmoidal hidden units h^k_x'y' (feature detectors: sheets of sigmoidal units with linked local receptive fields); sheets of exponential units (character detectors) with net input eta_ixy = sum_k sum_x'y' w^ik_x'y' h^k_x'y' and activation x_ixy = e^(eta_ixy); linear units S_i = sum_xy x_ixy; and outputs p_i = S_i / (1 + S_i), compared against the targets.] \n\nFigure 1: The Integrated Segmentation and Recognition (ISR) network. The input image may contain several characters and is presented to the network in a two-dimensional array of grey-scale values. Units in the first block h^k_x'y' have linked-local receptive fields to the input image, and detect features of type k. The exponential units in the next block receive inputs from a local receptive field of hidden sigmoidal units. The weights w^ik_x'y' connect the hidden unit h^k_x'y' to the exponential unit x_ixy. The architecture enforces translational invariance across the sheets of units by linking the weights and shifting the receptive fields in each dimension. 
Finally, the activity in each individual sheet of exponential units is summed by the linear units S_i and converted to a probability p_i. The two-dimensional input image can be thought of as a one-dimensional vector X as discussed in the text. For notational convenience we use one-dimensional indices (j) in the text rather than the two-dimensional indices (xy) shown in the figure. All of the mathematics goes through if one replaces j <-> xy. \n\nx_ij = e^(eta_ij), where the net input to the unit is \n\neta_ij = sum_k w_ik h_kj + beta_i  (1) \n\nand w_ik is the weight from hidden unit h_kj to the detector x_ij. As we argue in Keeler, Rumelhart and Leow (1991), eta_ij can usefully be interpreted as the logarithm of the likelihood ratio favoring the hypothesis that a character of type i is at location j of the input field. Since x_ij is the exponential of eta_ij, the x units are to be interpreted as representing the likelihood ratios themselves. Thus, we can interpret the output of the x units directly as the evidence favoring the assumption that there is a character of a particular type at a particular location. If we were willing and able to carefully segment the input and tell the network the exact location of each character, we could use a standard training technique to train the network to recognize characters at any location with any degree of overlap. However, we are interested in a training algorithm in which we don't have to provide the network with such specific training information. We are interested in simply telling the network which characters are present in the input - not where each character is. This approach saves tremendous time and effort in data preparation and labeling. 
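In the one-dimensional index notation of Eq. (1), the detector layer is a few lines of NumPy. This is a simplified sketch: the layer sizes are arbitrary toy choices, and the local receptive field of each detector is collapsed to a single location for brevity.

```python
import numpy as np

# Sketch of Eq. (1): eta_ij = sum_k w_ik h_kj + beta_i, with x_ij = e^(eta_ij).
# K feature types, J locations, C character classes are assumed toy sizes.
rng = np.random.default_rng(1)
K, J, C = 4, 10, 3
h = rng.random((K, J))         # h[k, j]: hidden activations (feature probabilities)
w = rng.normal(size=(C, K))    # w[i, k]: linked weights, identical at every location j
beta = rng.normal(size=C)      # beta[i]: bias of character detector i

eta = w @ h + beta[:, None]    # net input eta[i, j]
x = np.exp(eta)                # x[i, j]: likelihood ratio for character i at location j
```

Since x is the elementwise exponential of eta, the log of each detector's activation recovers the log-likelihood-ratio interpretation given in the text.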
To implement this idea, we have built an additional network which takes the output of the x units and computes, through a fixed output network, the probability that a given character is present anywhere in the input field. We do this by adding two additional layers of units. The first layer of units, the S units, simply sum the activity of each sheet of the x units. The activity of unit S_i can, under certain assumptions, be interpreted as the likelihood ratio that a character of type i occurred anywhere in the input field. Finally, in the output layer, we convert the likelihood ratio into a probability by the formula \n\np_i = S_i / (1 + S_i)  (2) \n\nThus, p_i is interpreted as representing directly the probability that character i occurred in the input field. \n\n2.1 The Learning Rule \n\nHaving set up our network, it is straightforward to compute the derivative of the objective function with respect to eta_ij. We get a particularly simple learning rule if we let the objective function be the cross-entropy function, \n\nl = sum_i [ t_i ln p_i + (1 - t_i) ln(1 - p_i) ]  (3) \n\nwhere t_i equals 1 if character i is presented and zero otherwise. In this case, we get the following rule: \n\ndl/d(eta_ij) = (t_i - p_i) x_ij / sum_k x_ik  (4) \n\nIt should be noted that this is a kind of competitive rule in which the learning is proportional to the relative strength of the activation of the unit at a particular location in the x layer to the strength of activation in the entire layer. This is valid if we assume that either the character appears exactly once or not at all. This ratio is the conditional probability that the target was at position j under the assumption that the target was, in fact, presented. 
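A minimal numerical sketch of Eqs. (2)-(4), with assumed toy sizes and random net inputs, confirms that the competitive rule in Eq. (4) really is the derivative of the cross-entropy objective in Eq. (3):

```python
import numpy as np

rng = np.random.default_rng(2)
C, J = 3, 10                       # toy sizes: character classes, locations
eta = rng.normal(size=(C, J))      # net inputs to the exponential units
t = np.array([1.0, 0.0, 1.0])      # targets: which characters are present

def forward(eta):
    x = np.exp(eta)                # x_ij: likelihood ratios
    S = x.sum(axis=1)              # S_i: summed evidence per sheet
    p = S / (1.0 + S)              # Eq. (2): likelihood ratio -> probability
    return x, S, p

def objective(eta):
    _, _, p = forward(eta)         # Eq. (3): cross-entropy objective
    return np.sum(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

x, S, p = forward(eta)
grad = (t - p)[:, None] * x / S[:, None]   # Eq. (4): dl/d(eta_ij)

# Finite-difference check of one component of the gradient:
eps = 1e-6
bumped = eta.copy()
bumped[0, 0] += eps
fd = (objective(bumped) - objective(eta)) / eps
```

The finite-difference value agrees with grad[0, 0], and the ratio x / S that appears in the rule is exactly the relative-strength term the text describes.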
It is also possible to derive a learning rule for the case where more than one of the same character is present [3]. \n\n3 EXPERIMENTAL RESULTS \n\nTo investigate the ability of this network to simultaneously segment and recognize characters in an integrated system, we trained the network outlined in section 2 on a database of hand-printed numerals taken from financial documents. We used a training and test set of about 9,000 and 1,800 characters respectively. We placed pairs of these grey-scaled characters on the input plane at positions determined by a distance parameter which tells how far apart to place the centers of the characters. We used a distance parameter of 1.2, which indicates that the centers were about 1.2 characters apart, with an added random displacement in the x and y dimensions of ±0.25 and ±0.15 of the leftmost character size respectively. With these parameters, the characters touch or overlap about 15% of the time. The network had 10 output units and the target was to turn on the units of the two characters in the input window, regardless of what order or position they occurred in. Thus the pair (3,5) has the same target as (5,3): target = (0001010000). \n\n[Figure 2 panels: the input image (left) and the ten sheets of exponential units, labeled EXP 0 through EXP 9.] \n\nFigure 2: The ISR network's performance. This figure shows two touching characters (06, shown at left) in the input image and the corresponding activation of the sheets of exponential units. 
The network was never trained on these particular characters individually or as a pair, but gets the correct activation of greater than 0.9 on the 6 and 0.8 on the 0, with near 0.0 activation for all other outputs. Note the sharp peaks of activity in the 0 and 6 layers approximately above the centers of the characters, even though they are touching. In this case the maximum activity of the 6-sheet was about 14,000 and had to be scaled by a factor of about 70 to fit in the graph space. The maximum activity in the 0-sheet was approximately 196. \n\nAfter training on several hundred thousand of the randomly sampled pairs of numbers from the 9,000, the network generalized correctly on about 81% of the pairs. This pair accuracy corresponds to a single-character recognition accuracy of about 90%. The network recognizes isolated single characters at an accuracy of about 95%. Note that this is an artificially generated data set, and by changing the distance parameter we can make the problem as simple or as difficult as we desire, up to the point where the characters overlap so much that a human cannot recognize them. Most conventional segmentation algorithms do not deal with touching characters, and so would presumably miss the vast majority of these characters. To see how overlap affects performance, we tested generalization in the same network on 100 pairs with the distance parameter lowered to 1.0 and 0.95. With a distance parameter of 1.0, the characters touch or overlap about 50% of the time. Of those, the network correctly identified 80%. Of the 20% that were missed, about half were unrecognizable by a human. 
With a distance parameter of 0.95, causing about 66% of the characters to touch, about 74% are correctly identified. As one expects, performance drops for smaller distance parameters. \n\nThe qualitative behavior of this system is quite interesting. As described in section 2, the learning rule for the exponential units contains a term that is competitive in nature. This term favors \"winner-take-all\" behavior for the units in that sheet in the sense that nearly equal activations are unstable under the learning rule: if one presents the same pattern again and again, the learning rule will cause one activation to grow or shrink away from the other at an exponential rate. This causes self-organization to occur in the exponential sheets, and we would expect the exponential units to organize into highly localized activations or \"spikes\" of activity on the appropriate exponential layers directly above the input characters. This is exactly the behavior that is observed in the trained network, as exemplified in Figure 2. In this figure we see two overlapping characters in the input image (06). The network generalized properly with output activity of about 0.8 for the 0, 0.99 for the 6, and about 0.0 for everything else. Note that in the exponential layer, there are very sharp spikes of activity directly above the 0 and the 6 in the appropriate layers. Indeed, it has been our experience that even with quite noisy input images, the representation in the exponential layer is very localized, and we could presumably recover the positional information by examining the activity of the exponential units. 
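This winner-take-all tendency can be seen in a toy simulation of Eq. (4) on a single sheet with two candidate locations and the target present. The learning rate and step count below are arbitrary assumptions, chosen only to make the effect visible.

```python
import numpy as np

eta = np.array([1.00, 0.99])   # two nearly equal net inputs on one sheet
t, lr = 1.0, 0.5               # target present; assumed learning rate
gaps = [eta[0] - eta[1]]
for _ in range(200):           # repeatedly present the same pattern
    x = np.exp(eta)
    S = x.sum()
    p = S / (1.0 + S)
    eta += lr * (t - p) * x / S    # Eq. (4) restricted to this sheet
    gaps.append(eta[0] - eta[1])
```

Because each location's update is proportional to its relative strength x_j / S, the initially stronger location pulls further ahead on every step, so the gap between the two net inputs grows monotonically.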
We can thus think of these spikes in the exponential layer as \"smart histograms\": the exponential units in each sheet learn to look for specific combinations of features in the input layer and reject other combinations of inputs. This allows them to respond correctly even if there is a significant amount of noise in the input image, or if the characters happen to be touching or broken. \n\n4 DISCUSSION \n\nThe system presented here demonstrates that neural networks can, in fact, be used for segmentation as well as recognition. We have by no means demonstrated that this method is better than conventional segmentation/recognition systems in overall performance. However, most conventional systems cannot deal with touching, broken, or noisy characters very well at all, whereas the present system handles all of these cases and recognition in a single, integrated fashion. This approach not only offers an integrated solution to the problems at hand, it also has the properties of being translation invariant and trainable with minimal information, and it could be implemented in hardware for extremely fast feed-forward performance. \n\nNote that the architecture discussed here is similar in some respects to the neocognitron model of Fukushima (1980). However, the system is different in several important aspects. First of all, the features here are learned through backpropagation rather than hand-coded as in the neocognitron. Second, the neural network self-organizes positional information via localized activation in the exponential layers. Third, the network is all feed-forward in its run-time dynamics. 
\n\nFinally, it is worth pointing out that there are other aspects of the problem that we have not dealt with: our network was trained on approximately the same size characters - to within 40% in height and with no normalization in the x-dimension. We have not dealt here with the aspects of normalization, attentional focusing, or recovery of positional information, all of which would be needed in a functioning system. \n\nAcknowledgements \n\nWe thank Peter Robinson from NCR Waterloo for providing the training data and Eric Hartman, Carsten Peterson, Richard Durbin, and Charles Rosenburg for useful discussions. \n\nReferences \n\n[1] K. Fukushima. (1980) Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193-202. \n\n[2] M. I. Jordan and D. E. Rumelhart. (1990) Forward Models: Supervised Learning with a Distal Teacher. MIT Center for Cognitive Science, Occasional Paper #40. \n\n[3] J. Keeler, D. E. Rumelhart and W. K. Leow. (1991) Integrated Segmentation and Recognition of Hand-Printed Numerals. MCC Technical Report ACT-NN-10.91. \n\n[4] K. Lang, A. Waibel and G. Hinton. (1990) A Time Delay Neural Network Architecture for Isolated Word Recognition. Neural Networks, 3, 23-44. \n\n[5] Y. Le Cun, B. Boser, J. S. Denker, S. Solla, R. Howard, and L. Jackel. (1990) Back-Propagation Applied to Handwritten Zipcode Recognition. Neural Computation, 1(4), 541-551. \n\n[6] D. E. Rumelhart, G. E. Hinton and R. J. Williams. (1986) \"Learning Internal Representations by Error Propagation,\" in D. E. Rumelhart, J. L. McClelland and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. Cambridge, MA: MIT Press/Bradford. 
", "award": [], "sourceid": 397, "authors": [{"given_name": "James", "family_name": "Keeler", "institution": null}, {"given_name": "David", "family_name": "Rumelhart", "institution": null}, {"given_name": "Wee", "family_name": "Leow", "institution": null}]}