{"title": "Human Reading and the Curse of Dimensionality", "book": "Advances in Neural Information Processing Systems", "page_first": 17, "page_last": 23, "abstract": null, "full_text": "Human Reading and the Curse of Dimensionality\n\nGale L. Martin\n\nMCC Austin, TX 78613 galem@mcc.com\n\nAbstract\n\nWhereas optical character recognition (OCR) systems learn to classify\nsingle characters, people learn to classify long character strings\nin parallel, within a single fixation. This difference is surprising\nbecause high dimensionality is associated with poor classification\nlearning. This paper suggests that the human reading system\navoids these problems because the number of to-be-classified images\nis reduced by consistent and optimal eye fixation positions,\nand by character sequence regularities.\n\nAn interesting difference exists between human reading and optical character recognition\n(OCR) systems. The input/output dimensionality of character classification\nin human reading is much greater than that for OCR systems (see Figure 1). OCR\nsystems classify one character at a time, while the human reading system classifies\nas many as 8-13 characters per eye fixation (Rayner, 1979), and within a fixation,\ncharacter category and sequence information is extracted in parallel (Blanchard,\nMcConkie, Zola, and Wolverton, 1984; Reicher, 1969).\n\n[Figure 1 panels: OCR (Low Dimensionality) maps single characters of \"Dorothy lived in the ...\" to \"D\", \"O\", \"R\"; Human Reading (High Dimensionality) maps successive windows of \"Dorothy lived in the midst of the ...\" to \"DOROTHY LI\", \"LIVED IN THE\", \"MIDST OF THE\".]\n\nFigure 1: Character classification versus character sequence classification.\n\nThis is an interesting difference because high dimensionality is associated with poor\nclassification learning, the so-called curse of dimensionality (Denker, et al., 1987;\nGeman, Bienenstock, & Doursat, 1992). OCR systems are designed to classify\nsingle characters to minimize such problems. The fact that most people learn to read\nquite well even with high dimensional inputs and outputs implies that variance\nis somehow lowered in this domain, thereby making accurate classification learning\npossible. The present paper reports on simulations of parallel character classification\nwhich suggest that variance is lowered through regularities in eye fixation positions\nand in the character sequences making up valid words.\n\n1 Training and Testing Materials\n\nTraining and testing materials were drawn from the story The Wonderful Wizard\nof Oz by L. Frank Baum. Images of text lines were created from 120 pages of text\n(about 160,000 characters, 33,000 total words, or 2,600 different words), which were\ndivided into 6 different font and case conditions of 20 pages each. Three different\nfonts (variable and constant-width fonts) and two different cases (all upper-case or\nmixed-case characters) were used. Text line images were normalized with respect\nto height, but not width. All training and test sets contained an equal mix of the\nsix font/case conditions. Two generalization sets were used, for test and cross-validation,\nand each consisted of about 14,000 characters.\n\nDorothy lived in the midst of the great Kansas Prairies.\nDOROTHY LIVED IN THE MIDST OF THE GREAT KANSAS PRAIRIES.\nDorothy lived in the midst of the great Kansas Prairies.\nDOROTHY LIVED IN THE MIDST OF THE GREAT KANSAS PRAIRIES.
\nDorothy lived in the midst of the great Kansas Prairies.\nDOROTHY LIVED IN THE MIDST OF THE GREAT KANSAS PRAIRIES.\n\nFigure 2: Samples of the type font and case conditions used in the simulations.\n\n2 Network Architectures\n\nThe simulations used backpropagation networks (Rumelhart, Hinton & Williams,\n1986) that extended the local receptive field, shared-weight architecture used in\nmany character-based OCR neural networks (LeCun, et al., 1989; Martin & Pittman,\n1991). In the previous single character-based approach, the input to the net is an\nimage of a single character. The output is a vector representing the category of the\ncharacter. Hidden nodes have local receptive fields that receive input from a spatially\nlocal region (e.g., a 6x6 area) in the preceding layer. Groups of hidden nodes\nshare their weights. Corresponding weights in each receptive field are initialized to\nthe same value and updated by the same value. Different hidden nodes within a\ngroup learn to detect the same feature at different locations. A group is depicted\nas hidden nodes within a single plane of a cube that corresponds to a hidden layer.\nDifferent groups occupy different planes in the cube, and learn to detect different\nfeatures. This architecture biases learning by reducing the number of free parameters\navailable for representing a function. The fact that these nets usually train and\ngeneralize well in this domain, and that the local feature detectors that emerge are\nsimilar to the oriented-edge and -line detectors found in mammalian visual cortex\n(Hubel & Wiesel, 1979), suggests that the bias is at least roughly appropriate.
\n\nThe extension of this character network to a character-sequence network is illustrated\nin Figure 3, where n (number of to-be-classified characters) is equal to 4.\nEach output node represents a character category (e.g., \"D\") in one of the n ordinal\npositions (e.g., \"First character on the left\"). The size of the input window\nis expanded horizontally to cover at least the n widest characters (\"WWWW\").\nWhen the character string is made up of relatively narrow characters, more than n\ncharacters will appear in the input window and the network must learn to ignore\nthem. Increasing input/output dimensionality is accomplished by expanding the\nnumber of hidden nodes horizontally. Network capacity is described by the depth\nof each hidden layer (the number of different features detected), as well as by the\nwidth of each hidden layer (the spatial coverage of the network).\n\nThe network is potentially sensitive to both local and global visual information.\nLocal receptive fields build in a sensitivity to letter features. Shared weights make\nlearning transfer possible across representations of the same character at different\npositions. Output nodes are globally connected to all the nodes in the second hidden\nlayer, but not with one another or with any word-level representations. Networks\nwere trained until the training set accuracy failed to improve by at least 0.1% over 5\nepochs, or overfitting became evident from periodic testing with the generalization\ntest set.\n\n[Figure 3 diagram: rows of output nodes labeled A-Z; local, shared-weight receptive fields.]\n\nFigure 3: Net architecture for parallel character sequence classification, n=4 chars.
\n\n3 Effects of Dimensionality on Training Difficulty and Generalization\n\nExperiment 1 provides a baseline measure of the impact of dimensionality. Increases\nin dimensionality result in exponential increases in the number of input and output\npatterns and the number of mapping functions. As a result, training problems arise\ndue to limitations in network capacity or search scope. Generalization problems\narise because it becomes impractical to use training sets large enough to obtain a\ngood estimate of the underlying function. Four different levels of dimensionality\nwere used (see Figure 4), from an input window of 20x20 pixels, with 1 to-be-classified\ncharacter, to an 80x20 window, with 4 to-be-classified characters. Input\npatterns were generated by starting the window at the left edge of the text line such\nthat the first character was centered 10 pixels from the left of the window, and then\nsuccessively scanning across the text line at each character position. Five training\nset sizes were used (about 700 samples to 50,000). Two relative network capacities\nwere used (15 and 18 different feature detectors per hidden layer).\n\n[Figure 4 panels, from low to high dimensionality: 20x20 - 1 Character (\"D\", \"O\"); 40x20 - 2 Characters (\"DO\", \"OR\"); 60x20 - 3 Characters (\"DOR\", \"ORO\"); 80x20 - 4 Characters (\"DORO\", \"OROT\").]\n\nFigure 4: Four levels of input/output dimensionality used in the experiment.\n\nForty different\nnetworks were trained, one for each combination of dimensionality, training set size\nand relative network capacity (4x5x2). 
Training difficulty is described by asymptotic\naccuracy achieved on the training set and by the amount of training required to reach\nthe asymptote. Generalization is reported for both the test set (used to check for\noverfitting) and the cross-validation set. The results (see Figure 5) are consistent\nwith expectations. Increasing dimensionality results in increased training difficulty\nand lower generalization. Since the problems associated with high dimensionality\noccur in both training and test sets, and seem to be alleviated somewhat in the\nhigh capacity nets, they are presumably due to both capacity/search limitations\nand insufficient sample size.\n\n[Figure 5 plots, for Lower Capacity Nets and Higher Capacity Nets, one curve per dimensionality level (1 Ch to 4 Ch) against Training Set Size; recoverable panel titles: Amount of Training Required to Reach Asymptote, Generalization Accuracy on Test Set, Generalization Accuracy on Validation Set.]\n\nFigure 5: Impact of dimensionality on training and generalization.\n\n4 Regularities in Window Positioning\n\nOne way human reading might reduce the problems associated with high dimensionality\nis to constrain eye fixation positions during reading, thereby reducing the\nnumber of different input images the system must learn to classify. 
Eye movement\nstudies suggest that, although fixation positions within words do vary, there are\nconsistencies (Rayner, 1979). Moreover, the particular locations fixated, slightly to\nthe left of the middle of words, appear to be optimal. People are most efficient at\nrecognizing words at these locations (O'Regan & Jacobs, 1992). These fixation positions\nreduce variance by reducing the average variability in the positions of ordered\ncharacters within a word. Position variability increases as a function of distance\nfrom the fixated character. The average distance of the characters in a word from\nthe fixated character is minimized when the fixation position is toward the center\nof a word, as compared to when it is at the beginning or end of a word.\n\nExperiment 2 simulated consistent and optimal positioning with an 80x20 input\nwindow fixated on the 3rd character. Only words of 3 or more characters were\nfixated (see Figure 6). The network learned to classify the first 4 characters in the\nword. This condition was compared to a consistent positioning only condition, in\nwhich the input window was fixated on the first character of a word. Two control\nconditions were also examined. They were replications of the 20x20 - 1 Character and\nthe 80x20 - 4 Character conditions of Experiment 1, except that in the first case, the\nnetwork was trained and tested only on the first 4 characters in each word, and in\nthe second case, the network was trained as before but was tested with the window\nfixated on the first character of the word. Four levels of training set size were used\nand three replications of each training set size x window condition were run (4 x 4\nx 3 = 48 networks trained and tested). 
All networks employed 18 different feature\ndetectors for each hidden layer. The results (see Figure 7) support the idea that\nconsistent and optimal positioning reduces variance, as indicated by reductions\nin training difficulties and improved generalization. The consistent and optimal\npositioning networks achieved training and generalization results superior to the\nhigh dimensionality control condition, and equivalent to, or better than, those for\nthe low dimensionality control. They were also slightly better than the consistent\npositioning only nets.\n\n[Figure 6 panels: Consistent & Optimal (80x20 - 4 Chars); Consistent Only (80x20 - 4 Chars); High Dim. Control (80x20 - 4 Chars); Low Dim. Control (20x20 - 1 Char). Example window outputs include \"DORO\" and \"LIVE\".]\n\nFigure 6: Window positioning and dimensionality manipulations in Experiment 2.\n\n5 Character Sequence Regularities\n\nSince only certain character sequences are allowed in words, character sequence\nregularities in words may also reduce the number of distinct images the system\nmust learn to classify. The system may also reduce variance by optimizing accuracy\non the highest frequency words. These hypotheses were tested by determining\nwhether or not the three consistent and optimal positioning networks trained on\nthe largest training set in Experiment 2 were more accurate in classifying high frequency\nwords as compared to low frequency words, and more accurate in classifying\nwords as compared to pronounceable non-words or random character strings. The\ncontrol condition used the networks trained in the low dimensional control (20x20\n- 1 Character) condition from Experiment 2. 
Human reading exhibits increased efficiency/accuracy\nin classifying high frequency as compared to low frequency words\n(Howes & Solomon, 1951; Solomon & Postman, 1952), and in classifying characters\nin words as compared to pronounceable non-words or random character strings\n(Baron & Thurston, 1973; Reicher, 1969). Experiment 3 involved creating a list\nof 30 4-letter words drawn from the Oz text, of which 15 occurred very frequently\nin the text (e.g., SAID), and 15 occurred infrequently (e.g., PAID), and creating\na list of 30 4-letter pronounceable non-words (e.g., TOID) and a list of 30 4-letter\nrandom strings (e.g., SDIA). Each string was reproduced in each of the 6 font/case\nconditions and labeled to create a test set. One further condition involved creating a\nversion of the word list in which the cases of the characters aLtErNaTeD. Psychologists\nused this manipulation to demonstrate that the advantages in processing words\ncannot simply be due to the use of word-shape feature detectors, since the word\nadvantage carries over to the alternating case condition, which destroys word-level\nfeatures (McClelland, 1976).\n\n[Figure 7 plots, against Training Set Size, for the Consist. & Opt., Consist., Cntrl-20x20, and Cntrl-80x20 conditions; panel titles: Training Difficulty (Amount of Training to Reach Asymptote; Asymptotic Accuracy on Training Set) and Generalization Accuracy (Test Set; Validation Set).]\n\nFigure 7: Impact of consistent & optimal window positions.\n\nConsistent with human reading (see Figure 8), the character-sequence-based networks\nwere most accurate on high frequency words and least accurate on low frequency\nwords. The character-sequence-based networks also showed a progressive\ndecline in accuracy as the character string became less word-like. The advantage\nfor word-like strings cannot be due to the use of word shape feature detectors\nbecause accuracy on aLtErNaTiNg case words, where word shape is unfamiliar,\nremains quite high.\n\n[Figure 8 panels: Word Frequency Effect (High Freq, Low Freq) and Word Superiority Effect (Words, Pron NonWords, Random, aLtErNaTiNg); legend: character-sequence-based nets with consistent & optimal positioning versus control condition, 20x20 single character.]\n\nFigure 8: Sensitivity to word frequency and character sequence regularities.\n\nThe present results raise questions about the role played by high dimensionality\nin determining reading disabilities and difficulties. Reading difficulties have been\nassociated with reduced perceptual spans (Rayner, 1986; Rayner, et al., 1989), and\nwith irregular eye fixation patterns (Rayner & Pollatsek, 1989). This suggests that\nsome reading difficulties and disorders may be related to problems in generating the\nprecise eye movements necessary to maintain consistent and optimal eye fixations. 
\nMore generally, these results highlight the importance of considering the role of\ncharacter classification in learning to read, particularly since content factors, such\nas word frequency, appear to influence even low-level classification operations.\n\nReferences\n\nBlanchard, H., McConkie, G., Zola, D., & Wolverton, G. (1984) Time course of\nvisual information utilization during fixations in reading. Jour. of Exp. Psych.:\nHuman Perc. & Perf., 10, 75-89.\n\nDenker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., Jackel, L., & Hopfield,\nJ. (1987) Large automatic learning, rule extraction and generalization. Complex\nSystems, 1, 877-933.\n\nGeman, S., Bienenstock, E., & Doursat, R. (1992) Neural networks and the\nbias/variance dilemma. Neural Computation, 4, 1-58.\n\nHowes, D. & Solomon, R. L. (1951) Visual duration threshold as a function of\nword probability. Jour. of Exp. Psych., 41, 401-410.\n\nHubel, D. & Wiesel, T. (1979) Brain mechanisms of vision. Sci. Amer., 241,\n150-162.\n\nLeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., &\nJackel, L. (1990) Handwritten digit recognition with a backpropagation network.\nIn Adv. in Neural Inf. Proc. Sys. 2, D. Touretzky (Ed.), Morgan Kaufmann.\n\nMartin, G. L. & Pittman, J. A. (1991) Recognizing hand-printed letters and digits\nusing backpropagation learning. Neural Computation, 3, 258-267.\n\nMcClelland, J. L. (1976) Preliminary letter identification in the perception of words\nand nonwords. Jour. of Exp. Psych.: Human Perc. & Perf., 2, 80-91.\n\nO'Regan, J. & Jacobs, A. (1992) Optimal viewing position effect in word recognition.\nJour. of Exp. Psych.: Human Perc. & Perf., 18, 185-197.\n\nRayner, K. (1986) Eye movements and the perceptual span in beginning and skilled\nreaders. Jour. of Exp. 
Child Psych., 41, 211-236.\n\nRayner, K. (1979) Eye guidance in reading. Perception, 8, 21-30.\n\nRayner, K., Murphy, L., Henderson, J. & Pollatsek, A. (1989) Selective attentional\ndyslexia. Cognitive Neuropsych., 6, 357-378.\n\nRayner, K. & Pollatsek, A. (1989) The Psychology of Reading. Prentice Hall.\n\nReicher, G. (1969) Perceptual recognition as a function of meaningfulness of stimulus\nmaterial. Jour. of Exp. Psych., 81, 274-280.\n\nRumelhart, D., Hinton, G., & Williams, R. (1986) Learning internal representations\nby error propagation. In D. Rumelhart and J. McClelland (Eds.) Parallel\nDistributed Processing, 1. MIT Press.\n\nSolomon, R. & Postman, L. (1952) Frequency of usage as a determinant of recognition\nthresholds for words. Jour. of Exp. Psych., 43, 195-210.\n", "award": [], "sourceid": 1107, "authors": [{"given_name": "Gale", "family_name": "Martin", "institution": null}]}