{"title": "KODAK IMAGELINK\u2122 OCR Alphanumeric Handprint Module", "book": "Advances in Neural Information Processing Systems", "page_first": 778, "page_last": 784, "abstract": "", "full_text": "KODAK IMAGELINK\u2122 OCR \nAlphanumeric Handprint Module \n\nAlexander Shustorovich and Christopher W. Thrasher \n\nBusiness Imaging Systems, Eastman Kodak Company, Rochester, NY 14653-5424 \n\nABSTRACT \n\nThis paper describes the Kodak Imagelink\u2122 OCR alphanumeric \nhandprint module. There are two neural network algorithms at its \ncore: the first network is trained to find individual characters in an \nalphanumeric field, while the second one performs the classification. \nBoth networks were trained on Gabor projections of the original \npixel images, which resulted in higher recognition rates and greater \nnoise immunity. Compared to its purely numeric counterpart \n(Shustorovich and Thrasher, 1995), this version of the system has a \nsignificant application-specific postprocessing module. The system \nhas been implemented in specialized parallel hardware, which allows \nit to run at 80 char/sec/board. It has been installed at the Driver and \nVehicle Licensing Agency (DVLA) in the United Kingdom, and its \noverall success rate exceeds 96% (character level without rejects), \nwhich translates into an 85% field rate. If approximately 20% of the \nfields are rejected, the system achieves a 99.8% character and a 99.5% \nfield success rate. \n\n1  INTRODUCTION \nThe system we describe below was designed to process alphanumeric fields extracted \nfrom forms. 
The major assumptions were that (1) the form layout and definition allow \nthe system to capture the field image with a single line of characters, (2) the \ncharacters are handprinted capital letters and numerals, with the possible addition of \nseveral special characters, and (3) the characters may occasionally touch, but generally \nthey do not overlap. We also assume that some additional information about the \ncontents of the field is available to assist in the process of disambiguation. Otherwise, \nit is virtually impossible to distinguish not only between \"O\" and zero, but also \"I\" \nand one, \"Z\" and two, \"S\" and five, etc. \nA good example of such an application is the processing of vehicle registration forms \nat the Driver and Vehicle Licensing Agency (DVLA) in the United Kingdom. The \nalphanumeric field in question contains a license plate. There are 29 allowed patterns \nof character combinations, from two to seven characters long. For example, \n\"A999AAA\" is a valid license, whereas \"A9A9A9\" is not (here \"A\" stands for any \nalpha character and \"9\" for any numeric character). In addition, every field has a \ncontrol character box on the right. This control character is computed as the remainder \nof the integer division by 37 of a linear combination of the numeric values of the \ncharacters in the main field. Ambiguous characters, namely \"O\", \"I\", and \"S\", are \nnot allowed in the role of the control character, so they are replaced here by \"-\", \"+\", \nand \"/\" (not a very good choice), and the 37th character used is the \"%\". To \nmake things more complicated, sometimes the control character is not available at the \nmoment of filing the form (at a local post office), 
and this lack of knowledge is \nindicated by putting an asterisk instead. Later we will discuss possible ways to use this \nadditional information in an application-specific postprocessing module. \n\n2  SEGMENTATION AND ALTERNATIVE APPROACHES \nThe most challenging problem for handprint OCR is finding individual characters in a \nfield. A number of approaches to this problem can be found in the literature, the two \nmost common being (1) segmentation (Gupta et al., 1993, as an example of a recent \npublication), and (2) combined segmentation and recognition (Keeler and Rumelhart, \n1992). \nThe segmentation approach has difficulty separating touching characters, and recently \nthe consensus of practitioners in the field started shifting towards combined \nsegmentation and recognition. In this scheme, the algorithm moves a window of a \ncertain width along the field, and confidence values of competing classification \nhypotheses are used (sometimes with a separate centered/noncentered node) to decide \nif the window is positioned on top of a character. In the Saccade system (Martin et al., \n1993), for example, the neural network was trained not only to recognize characters in \nthe center of the moving window (and whether there is a character centered in the \nwindow), but also to make corrective jumps (saccades) to the nearest character and, \nafter classification, to the next character. \nStill another variation on the theme is an arrangement in which the classification window \nis duplicated with one- or several-pixel shifts along the field (Bengio et al., 1994). \nThen the outputs of the classifiers serve as input for a postprocessing module (in this \npaper, a Hidden Markov Model) used to decide which of the multitude of processing \nwindows actually have centered characters in them. 
\nAll these approaches have deficiencies. As we mentioned earlier, touching characters \nare difficult for autonomous segmenters. The moving (and jumping) window with a \nsingle centered/noncentered node tends to miss narrow characters and sometimes to \nduplicate wide ones. The replication of a classifier together with postprocessing tends \nto be quite expensive computationally. \n\n3  POSITIONING NETWORK \nTo do the positioning, we decided to introduce an array of output units corresponding \nto successive pixels in the middle portion of the window. These nodes signal if a \ncenter (\"heart\") of a character lies at the corresponding positions. Because the \nprecision with which a human operator can mark the character heart is low (usually \nwithin one or two pixels at best), the target activations of three consecutive nodes are \nset to one if there is a character heart at a pixel position corresponding to the middle \nnode. The rest of the target activations are set to zero. \nThe network is then trained to produce bumps of activation indicating the character \nhearts. Two buffer regions on the left and on the right of the window (pixels without \ncorresponding output nodes) are necessary to allow all or most of the character \ncentered at each of the output node positions to fit inside the window. The \nreplacement of a single centered/noncentered node by an array allows us to average \noutput activations that are generated by different window shifts but correspond to the \nsame position. This additional procedure allows us to slide the window several pixels \nat a time: the appropriate step is a trade-off between the processing speed and the \nrequired level of robustness. 
The final procedure involves thresholding of the \nactivation-wave and the estimation of the predicted character position as the center of \nmass of the activation-bubble. The resulting algorithm is very effective: touching \ncharacters do not present significant problems, and only abnormally wide characters \nsometimes fool the system into false alarms. \nThe system works with preprocessed images. Each field is divided into subfields of \ndisconnected groups of characters. These subfields are size-normalized to a height of \n20 pixels. After that they are reassembled into a single field again, with 6-pixel gaps \nbetween them. Two blank rows are added along both the top and the bottom of the \nrecombined field, as preferred by the Gabor projection technique (Shustorovich, 1994). \nIn our current system, the input nodes of a sliding window are organized in a 24 x 36 \narray. The first, intermediary, layer of the network implements the Gabor projections. \nIt has 12 x 12 local receptive fields (LRFs) with fixed precomputed weights. The step \nbetween LRFs is 6 pixels in both directions. We work with 16 Gabor basis functions \nwith circular Gaussian envelopes centered within each LRF; they are both sine and \ncosine wavelets in four orientations and two sizes. All 16 projections from each LRF \nconstitute the input to a column of 20 hidden units; thus the second (first trainable) \nhidden layer is organized in a three-dimensional array 3 x 5 x 20. The third hidden \nlayer of the network also has local receptive fields; they are three-dimensional, 2 x 2 x \n20 with the step 1 x 1 x 0. The units in the third hidden layer are also duplicated 20 \ntimes; thus this layer is organized in a three-dimensional array 2 x 4 x 20. The fourth \nhidden layer has 60 units fully connected to the third layer. 
Finally, the output layer \nhas 12 units, also fully connected to the fourth layer. \nThe network was trained using a variant of the Back-Propagation algorithm. Both \ntraining and testing sets were drawn from the field data collected at DVLA. The \ntraining set contained approximately 60,000 characters from 8,000 fields, and about \n5,000 characters from 650 fields were used for testing. On this test set, more than \n92% of all character hearts were found within 1-pixel precision, and only 0.4% were \nmissed by more than 4 pixels. \n\n4  CLASSIFICATION NETWORK \nThe structure of the classification network resembles that of the positioning network. \nThe Gabor projection layer works in exactly the same way, but the window size is \nsmaller, only 24 x 24 pixels. We chose this size because after height normalization to \n20 pixels, the characters are only occasionally wider than 24 pixels. Widening the \nwindow complicates training: it increases the dimensionality of the input while \nproviding information mostly about irrelevant pieces of adjacent characters. As a \nresult, the second layer is organized as a 3 x 3 x 20 array of units with LRFs and \nshared weights, the third is a 2 x 2 x 20 array of units with LRFs, and there are 37 \noutput units fully connected to the 80 units in the third layer. The number of output \nunits in this variant of our system has been determined by the intended application. It \nwas necessary to recognize uppercase letters, numerals, and also five special \ncharacters, namely plus (+), minus (-), slash (/), percent (%), and asterisk (*). Since \nadditional information was available for the purposes of disambiguation, we combined \n\"O\" and zero, \"I\" and one, \"Z\" and two, \"S\" and five, 
and so the number of \noutput classes became 26 (alpha) + 6 (numerals 3, 4, 6, 7, 8, 9) + 5 (special characters) = \n37. \nBecause we did not expect any positioning module to provide precision higher than 1 \nor 2 pixels, the classifier network was trained and tested on five copies of all centered \ncharacters in the database, with shifts of 0, 1, and 2 pixels, both left and right. On the \nsame test set mentioned in the previous section, the corresponding character \nrecognition rates averaged 93.0%, 95.5%, and 96.0% for characters normalized to the \nheight of 18 to 20 pixels and placed in the middle of the window with shifts of 0 and \n1 pixel up and down. \n\n5  POSTPROCESSING MODULE \nThe postprocessing module is a rule-based algorithm. First, it monitors the width of \neach subfield and rejects it if the number of predicted character hearts is inconsistent \nwith the width. For example, if the positioning system cannot find a single character in \na subfield, the output of the system becomes a question mark. Second, the \npostprocessing module organizes competition between predicted character hearts if \nthey are too close to each other. For example, it will kill a predicted center with a \nlower activation value if its distance from a competitor is less than ten pixels, but it \nmay allow both to survive if one of the two labels is \"one\". It is especially sensitive to \nclosely positioned centers with identical labels, and will remove the weaker one for \nwide characters such as \"W\" or \"M\". \nThe rest of the postprocessing had to rely on the application knowledge. Since the \nalphanumeric fields on DVLA forms contain license plates, we could use the fact that \nthere are exactly 29 allowed patterns of symbol combinations, 
and that correct strings \nshould match control characters from the box on the right. \nBecause in this application rejection of individual characters is meaningless, we \ndecided to keep and analyze all possible candidates for each detected position, that is, \ncharacters with output activations above a certain threshold (currently, 0.1). Of course, \nspecial characters are not allowed in the main field. The field as a whole is rejected if \nfor any one position there is not even a single candidate character. All possible \ncombinations of candidate characters are analyzed. A candidate string is rejected if it \ndoes not conform to any of the allowed patterns, or if it does not match any of the \ncandidate control characters. All remaining candidate strings are assigned confidences. \nSince a chain is no stronger than its weakest link, in the case of an asterisk (no control \ncharacter information), the string confidence equals that of its least confident character. \nIf there is a valid control character, then we can tolerate one low-confidence character, \nand so the string confidence equals that of its character with the second lowest \nindividual confidence. If there are two or more candidate strings, the difference in \nconfidence between the best and the second best is compared to another threshold \n(currently, 0.7) in order to pass the final round of rejects. \n\n6  CONCLUSIONS \nThe Kodak Imagelink\u2122 OCR alphanumeric handprint module described in this paper uses \none neural network to find individual characters in a field, and then the second \nnetwork performs the classification. The outputs of both networks are interpreted by a \npostprocessing module that generates the final label string (Figure 1, Figure 2). \nThe algorithms were designed within the constraints of the planned hardware \nimplementation. At the same time, 
they provide a high level of positioning accuracy \nas well as classification ability. One new feature of our approach is the use of an \narray of centered/noncentered nodes to significantly improve speed and robustness of \nthe positioning scheme. The overall robustness of the system is further improved by \nnoise resistance provided by a layer of Gabor projection units. The positioning module \nand the classification module are unified by the postprocessing module. \nSystem-level testing was performed on the test set mentioned above. The image quality \nwas generally very good, but the data included some fields with touching characters. \nThe character-level success rate (without rejects) achieved on this test exceeded 96%, \nwhich corresponded to a field rate above 85%. With approximately 20% of the fields \nrejected, the system achieved a 99.8% character and a 99.5% field success rate. \nIn the testing mode, the preprocessing module would separate characters if it could \nreliably do so, normalize them individually, and place them with gaps of ten blank \npixels, in order to simplify the job of both the positioning and the classification \nmodules. When it is impossible to segment individual characters, our system is still \nable to perform on the level of approximately 94% (since it has been trained on such \ndata). The robustness of our system is an important factor in its success. Most other \nsystems have substantial difficulties trying to recover from errors in segmentation. \n\nReferences \nBengio, Y., Le Cun, Y., and Henderson, D. (1994) Globally Trained Handwritten Word \nRecognizer Using Spatial Representation, Space Displacement Neural Networks and \nHidden Markov Models. In Cowan, J.D., Tesauro, G., and Alspector, J. 
(eds.), \nAdvances in Neural Information Processing Systems 6, pp. 937-944. San Mateo, CA: \nMorgan Kaufmann Publishers. \nGupta, A., Nagendraprasad, M.V., Lin, A., Wang, P.S.P., and Ayyadurai, S. (1993) An \nIntegrated Architecture for Recognition of Totally Unconstrained Handwritten \nNumerals. International Journal of Pattern Recognition and Artificial Intelligence 7 \n(4), pp. 757-773. \nKeeler, J. and Rumelhart, D.E. (1992) A Self-Organizing Integrated Segmentation and \nRecognition Neural Net. In Moody, J.E., Hanson, S.J., and Lippmann, R.P. (eds.), \nAdvances in Neural Information Processing Systems 4, pp. 496-503. San Mateo, CA: \nMorgan Kaufmann Publishers. \nMartin, G., Mosfeq, R., Chapman, D., and Pittman, J. (1993) Learning to See Where \nand What: Training a Net to Make Saccades and Recognize Handwritten Characters. \nIn Hanson, S.J., Cowan, J.D., and Giles, C.L. (eds.), Advances in Neural Information \nProcessing Systems 5, pp. 441-447. San Mateo, CA: Morgan Kaufmann Publishers. \nShustorovich, A. (1994) A Subspace Projection Approach to Feature Extraction: the \nTwo-Dimensional Gabor Transform for Character Recognition. Neural Networks 7 \n(8), pp. 1295-1301. \nShustorovich, A. and Thrasher, C.W. (1995) KODAK IMAGELINK\u2122 OCR Numeric \nHandprint Module: Neural Network Positioning and Classification. Proceedings of \nSession 11 (Document Processing) of the industrial conference of ICANN-95 Paris, \nOctober 9-13, 1995. \n\nFigure 1: An Example of a Field Processed by the System. Panels: Original Image with \nDetected Subimages; Scaled Subimages; Character Heart Index Waveform; Detected \nCharacter Hearts; Best Guess Characters; Final Character String (After Post-Processing). \nOutline characters indicate low confidence. 
\n\nFigure 2: Another Example of a Field Processed by the System. Panels: Original Image with \nDetected Subimages; Scaled Subimages; Character Heart Index Waveform; Detected \nCharacter Hearts; Best Guess Characters; Final Character String (After Post-Processing). \n", "award": [], "sourceid": 1104, "authors": [{"given_name": "Alexander", "family_name": "Shustorovich", "institution": null}, {"given_name": "Christopher", "family_name": "Thrasher", "institution": null}]}