{"title": "Handwritten Word Recognition using Contextual Hybrid Radial Basis Function Network/Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 764, "page_last": 770, "abstract": null, "full_text": "Handwritten Word Recognition using Contextual Hybrid Radial Basis Function Network/Hidden Markov Models \n\nBernard Lemarie \nLa Poste/SRTP \n10, Rue de l'Ile-Mabon \nF-44063 Nantes Cedex France \nlemarie@srtp.srt-poste.fr \n\nMichel Gilloux \nLa Poste/SRTP \n10, Rue de l'Ile-Mabon \nF-44063 Nantes Cedex France \ngilloux@srtp.srt-poste.fr \n\nManuel Leroux \nLa Poste/SRTP \n10, Rue de l'Ile-Mabon \nF-44063 Nantes Cedex France \nleroux@srtp.srt-poste.fr \n\nAbstract \n\nA hybrid and contextual radial basis function network/hidden Markov model off-line handwritten word recognition system is presented. The task assigned to the radial basis function networks is the estimation of emission probabilities associated with Markov states. The model is contextual because the estimation of emission probabilities takes into account the left context of the current image segment as represented by its predecessor in the sequence. The new system does not outperform the previous system without context but acts differently. \n\n1  INTRODUCTION \n\nHidden Markov models (HMMs) are now commonly used in off-line recognition of handwritten words (Chen et al., 1994) (Gilloux et al., 1993) (Gilloux et al., 1995a). In some of these approaches (Gilloux et al., 1993), word images are transformed into sequences of image segments through some explicit segmentation procedure. These segments are passed on to a module which is in charge of estimating the probability for each segment to appear when the corresponding hidden state is some state s (state emission probabilities). 
Model probabilities are generally optimized for the Maximum Likelihood Estimation (MLE) criterion. \n\nMLE training is known to be sub-optimal with respect to discrimination ability when the underlying model is not the true model for the data. Moreover, estimating the emission probabilities in regions where examples are sparse is difficult and estimations may not be accurate. To reduce the risk of overtraining, image segments consisting of bitmaps are often replaced by feature vectors of reasonable length (Chen et al., 1994) or even discrete symbols (Gilloux et al., 1993). \n\nIn a previous paper (Gilloux et al., 1995b) we described a hybrid HMM/radial basis function system in which emission probabilities are computed from full-fledged bitmaps through the use of a radial basis function (RBF) neural network. This system demonstrated better recognition rates than a previous one based on symbolic features (Gilloux et al., 1995b). Yet, many misclassification examples showed that some of the simplifying assumptions made in HMMs were responsible for a significant part of these errors. In particular, we observed that considering each segment independently from its neighbours would hurt the accuracy of the model. For example, figure 1 shows examples of the letter a when it is segmented in two parts. The two parts are obviously correlated. \n\nFigure 1: Examples of segmented a. \n\nWe propose a new variant of the hybrid HMM/RBF model in which emission probabilities are estimated by taking into account the context of the current segment. The context will be represented by the preceding image segment in the sequence. 
\n\nThe RBF model was chosen because it has proven to be an efficient model for recognizing isolated digits or letters (Poggio & Girosi, 1990) (Lemarie, 1993). Interestingly enough, RBFs bear close relationships with gaussian mixtures often used to model emission probabilities in markovian contexts. Their advantage lies in the fact that they do not directly estimate emission probabilities and thus are less prone to errors in this estimation in sparse regions. They are also trained through the Mean Square Error (MSE) criterion, which makes them more discriminant. \n\nThe idea of using a neural net, and in particular an RBF, in conjunction with an HMM is not new. In (Singer & Lippmann, 1992) it was applied to a speech recognition task. The use of context to improve emission probabilities was proposed in (Bourlard & Morgan, 1993) with the use of a discrete set of context events. Several neural networks are used there to estimate various relations between states, context events and the current segment. Our point is to propose a different method without discrete context, based on an adapted decomposition of the HMM likelihood estimation. This model is then applied to off-line handwritten word recognition. \n\nThe organization of this paper is as follows. Section 2 is an overview of the architecture of our HMM. Section 3 describes the justification for using RBF outputs in a contextual hidden Markov model. Section 4 describes the radial basis function network recognizer. Section 5 reports on an experiment in which the contextual model is applied to the recognition of handwritten words found on French bank or postal cheques. \n\n2  OVERVIEW OF THE HIDDEN MARKOV MODEL \n\nIn an HMM (Bahl et al., 1983), the recognition scores associated with words $w$ are likelihoods \n\n$$L(w \\mid i_1 \\ldots i_n) = P(i_1 \\ldots i_n \\mid w) \\times P(w)$$ \n\nin which the first term in the product encodes the probability with which the model of each word $w$ generates some image (some sequence of image segments) $i_1 \\ldots i_n$. In the HMM paradigm, this term is decomposed into a sum over all paths (i.e. sequences of hidden states) of products of the probability of the hidden path by the probability that the path generated the image sequence: \n\n$$P(i_1 \\ldots i_n \\mid w) = \\sum_{s_1 \\ldots s_n} P(s_1 \\ldots s_n) \\times p(i_1 \\ldots i_n \\mid s_1 \\ldots s_n)$$ \n\nIt is often assumed that only one path contributes significantly to this term so that \n\n$$P(i_1 \\ldots i_n \\mid w) \\approx \\max_{s_1 \\ldots s_n} P(s_1 \\ldots s_n) \\times p(i_1 \\ldots i_n \\mid s_1 \\ldots s_n)$$ \n\nIn HMMs, each sequence element is assumed to depend only on its corresponding state: \n\n$$p(i_1 \\ldots i_n \\mid s_1 \\ldots s_n) = \\prod_{j=1}^{n} p(i_j \\mid s_j)$$ \n\nMoreover, first-order Markov models assume that paths are generated by a first-order Markov chain so that \n\n$$P(s_1 \\ldots s_n) = \\prod_{j=1}^{n} p(s_j \\mid s_{j-1})$$ \n\nWe have reported in previous papers (Gilloux et al., 1993) (Gilloux et al., 1995a) on several handwriting recognition systems based on this assumption. The hidden Markov model architecture used in all systems has been extensively presented in (Gilloux et al., 1995a). In that model, word letters are associated with three-state models which are designed to account for the situations where a letter is realized as 0, 1 or 2 segments. Word models are the result of assembling the corresponding letter models. This architecture is depicted in figure 2. We used here transition emission rather than state emission. However, this does not change the previous formulas if we replace states by transitions, i.e. pairs of states. \n\nFigure 2: Outline of the model for \"laval\". 
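
The single-path approximation and the two factorizations above can be illustrated with a minimal Python sketch of the Viterbi recursion in the log domain. This is an illustration under stated assumptions, not the authors' implementation: the function name viterbi_log_score, the dictionary layout of start, trans and emit, and the explicit start distribution are hypothetical, and emission is attached to states here for simplicity whereas the paper attaches it to transitions.

```python
import math

def viterbi_log_score(states, start, trans, emit):
    # states: list of hidden state names (hypothetical layout)
    # start[s]: probability of starting a path in state s
    # trans[(r, s)]: transition probability p(s | r)
    # emit[j][s]: emission probability p(i_j | s) of the j-th segment
    def log(x):
        # Missing or zero probabilities become -inf so that impossible
        # paths never win the maximisation.
        return math.log(x) if x > 0 else float('-inf')

    # Best log score of any path ending in each state after the first segment.
    best = {s: log(start.get(s, 0.0)) + log(emit[0].get(s, 0.0)) for s in states}
    for obs in emit[1:]:
        best = {
            s: max(best[r] + log(trans.get((r, s), 0.0)) for r in states)
            + log(obs.get(s, 0.0))
            for s in states
        }
    # Log of max over paths of P(s_1 ... s_n) * p(i_1 ... i_n | s_1 ... s_n).
    return max(best.values())
```

Word hypotheses would then be ranked by adding the log of the word prior P(w) to the score returned for each word model.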
\n\nOne of these systems was a hybrid RBF/HMM model in which a radial basis function network was used to estimate emission probabilities $p(i_j \\mid s_j)$. The RBF outputs are introduced by applying Bayes rule in the expression of $p(i_1 \\ldots i_n \\mid s_1 \\ldots s_n)$: \n\n$$p(i_1 \\ldots i_n \\mid s_1 \\ldots s_n) = \\prod_{j=1}^{n} \\frac{p(s_j \\mid i_j) \\times p(i_j)}{p(s_j)}$$ \n\nSince the product of a priori image segment probabilities $p(i_j)$ does not depend on the word hypothesis $w$, we may write: \n\n$$p(i_1 \\ldots i_n \\mid s_1 \\ldots s_n) \\propto \\prod_{j=1}^{n} \\frac{p(s_j \\mid i_j)}{p(s_j)}$$ \n\nIn the resulting word likelihood, terms of form $p(s_j \\mid s_{j-1})$ are transition probabilities which may be estimated through the Baum-Welch re-estimation algorithm. Terms of form $p(s_j)$ are a priori probabilities of states. Note that for Bayes rule to apply, these probabilities have, and only have, to be estimated consistently with terms of form $p(s_j \\mid i_j)$, since $p(i_j \\mid s_j)$ is independent of the statistical distribution of states. \n\nIt has been proven elsewhere (Richard & Lippmann, 1991) that systems trained through the MSE criterion tend to approximate Bayes probabilities in the sense that Bayes probabilities are optimal for the MSE criterion. In practice, the way in which a given system comes close to the Bayes optimum is not easily predictable due to various biases of the trained system (initial parameters, local optimum, architecture of the net, etc.). Thus real output scores are generally not equal to Bayes probabilities. However, there exist different procedures which act as a post-processor for outputs of a system trained with the MSE and make them closer to Bayes probabilities (Singer & Lippmann, 1992). Provided that such a post-processor is used, we will assume that terms $p(s_j \\mid i_j)$ are well estimated by the post-processed outputs of the recognition system. Then, terms $p(s_j)$ are just the a priori probabilities of states on the set used to train the system or post-process the system outputs. \n\nThis hybrid handwritten word recognition system demonstrated better performance than previous systems in which word images were represented through sequences of symbolic features instead of full-fledged bitmaps (Gilloux et al., 1995b). However, some recognition errors remained, many of which could be explained by the simplifying assumptions made in the model. In particular, the fact that emission probabilities depend only on the state corresponding to the current bitmap appeared to be a poor choice. For example, in figure 3 the third and fourth segments are classified as two halves of the letter i. \n\nFigure 3: An image of trente classified as mille. \n\nFor letters segmented in two parts, the second half is naturally correlated to the first (see figure 1). Yet, our Markov model architecture is designed so that both halves are assumed uncorrelated. This has two effects. Two consecutive bitmaps which cannot be the two parts of a unique letter are sometimes recognized as such, as in figure 3. Also, the emission probability of the second part of a segmented letter is lower than if the first part had been considered for estimating this probability. The contextual model described in the next section is designed so as to make a different assumption on emission probabilities. \n\n3  THE HYBRID CONTEXTUAL RBF/HMM MODEL \n\nThe exact decomposition of the emission part of word likelihoods is the following: \n\n$$p(i_1 \\ldots i_n \\mid s_1 \\ldots s_n) = p(i_1 \\mid s_1 \\ldots s_n) \\times \\prod_{j=2}^{n} p(i_j \\mid s_1 \\ldots s_n, i_1 \\ldots i_{j-1})$$ \n\nWe assume now that bitmaps are conditioned by their state and the previous image in the sequence: \n\n$$p(i_1 \\ldots i_n \\mid s_1 \\ldots s_n) \\approx p(i_1 \\mid s_1) \\times \\prod_{j=2}^{n} p(i_j \\mid s_j, i_{j-1})$$ \n\nThe RBF is again introduced by applying Bayes rule in the following way: \n\n$$p(i_1 \\ldots i_n \\mid s_1 \\ldots s_n) \\approx \\frac{p(s_1 \\mid i_1) \\times p(i_1)}{p(s_1)} \\times \\prod_{j=2}^{n} \\frac{p(s_j \\mid i_j, i_{j-1}) \\times p(i_j \\mid i_{j-1})}{p(s_j \\mid i_{j-1})}$$ \n\nSince terms of form $p(i_j \\mid i_{j-1})$ do not contribute to the discrimination of word hypotheses, we may write: \n\n$$p(i_1 \\ldots i_n \\mid s_1 \\ldots s_n) \\propto \\frac{p(s_1 \\mid i_1)}{p(s_1)} \\times \\prod_{j=2}^{n} \\frac{p(s_j \\mid i_j, i_{j-1})}{p(s_j \\mid i_{j-1})}$$ \n\nThe RBF has now to estimate not only terms of form $p(s_j \\mid i_j, i_{j-1})$ but also terms like $p(s_j \\mid i_{j-1})$ which are no longer computed by mere counting. Two radial basis function networks are then used to estimate these probabilities. Their common architecture is described in the next section. \n\n4  THE RADIAL BASIS FUNCTION MODEL \n\nThe radial basis function model has been described in (Lemarie, 1993). RBF networks are inspired by the theory of regularization (Poggio & Girosi, 1990). This theory studies how multivariate real functions known on a finite set of points may be approximated at these points in a family of parametric functions under some bias of regularity. It has been shown that when this bias tends to select smooth functions in the sense that some linear combination of their derivatives is minimum, there exists an analytical solution which is a linear combination of gaussians centred on the points where the function is known (Poggio & Girosi, 1990). 
It is straightforward to transpose this paradigm to the problem of learning probability distributions given a set of examples. \n\nIn practice, the theory is not tractable since it requires one gaussian per example in the training set. Empirical methods (Lemarie, 1993) have been developed which reduce the number of gaussian centres. Since the theory is no longer applicable when the number of centres is reduced, the parameters of the model (centres and covariance matrices for gaussians, weights for the linear combination) have to be trained by another method, in this case gradient descent on the MSE criterion. Finally, the resulting RBF model may be viewed as a particular neural network with three layers. The first is the input layer. The second layer is completely connected to the input layer through connections with unit weights. The transfer functions of cells in the second layer are gaussians applied to the weighted distance between the corresponding centres and the weighted input to the cell. The weights of the distance are analogous to the parameters of a diagonal covariance matrix. Finally, the last layer is completely connected to the second one through weighted connections. Cells in this layer just output the sum of their inputs. \n\nIn our experiments, inputs to the RBF are feature vectors of length 138 computed from the bitmaps of a word segment (Lemarie, 1993). The RBF that estimates terms of form $p(s_j \\mid i_j, i_{j-1})$ uses two such vectors as input whereas the second RBF (terms $p(s_j \\mid i_{j-1})$) is only fed with the vector associated with $i_{j-1}$. These vectors are inspired by \"characteristic loci\" methods (Gluksman, 1967) and encode the proportion of white pixels from which a bitmap border can be reached without meeting any black pixel in various directions. 
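
The three-layer network described in this section can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the function names, the exp(-d2/2) form of the gaussian, and the log-ratio helper are assumptions made here for illustration. Only the layer structure (unit-weight input connections, gaussian cells with a per-dimension weighted distance, a linear output layer) and the use of two networks come from the text.

```python
import math

def rbf_forward(x, centres, dist_weights, out_weights):
    # Hidden layer: one gaussian cell per centre. The squared distance is
    # weighted per dimension, which plays the role of a diagonal covariance.
    hidden = []
    for centre, weights in zip(centres, dist_weights):
        d2 = sum((w * (xk - ck)) ** 2 for xk, ck, w in zip(x, centre, weights))
        hidden.append(math.exp(-0.5 * d2))  # assumed form of the gaussian
    # Output layer: each cell outputs the weighted sum of the hidden cells.
    return [sum(w * h for w, h in zip(row, hidden)) for row in out_weights]

def contextual_log_ratio(p_state_given_both, p_state_given_prev):
    # Log of p(s_j | i_j, i_{j-1}) / p(s_j | i_{j-1}) for one segment and
    # one state; both network outputs are assumed to be strictly positive.
    return math.log(p_state_given_both) - math.log(p_state_given_prev)
```

In the setting of the paper, the first network would receive the two 138-dimensional vectors of the current and previous segments (for instance concatenated as x_cur + x_prev, which is an assumption), and the second network only the vector of the previous segment. The per-segment log ratios would then be summed along a path.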
\n\n5  EXPERIMENTS \n\nThe model has been assessed by applying it to the recognition of words appearing in legal amounts of French postal or bank cheques. The size of the vocabulary is 30 and its perplexity is only 14.3 (Bahl et al., 1983). The training and test bases are made of images of amount words written by unknown writers on real cheques. We used 7 191 images during training and 2 879 different images for test. The image resolution was 300 dpi. The amounts were manually segmented into words and an automatic procedure was used to separate the words from the preprinted lines of the cheque form. \n\nThe training was conducted by using the results of the former hybrid system. The segmentation module was kept unchanged. There are 48 140 segments in the training set and 19 577 in the test set. We assumed that the base system is almost always correct when aligning segments onto letter models. We thus used this alignment to label all the segments in the training set and took these labels as the desired outputs for the RBF. We used a set of 63 different labels since 21 letters appear in the amount vocabulary and 3 types of segments are possible for each letter. The outputs of the RBF are directly interpreted as Bayes probabilities without further post-processing. \n\nFirst of all, we assessed the quality of the system by evaluating its ability to recognize the class of a segment through the value of $p(s_j \\mid i_j, i_{j-1})$ and compared it with that of the previous hybrid system. The results of this experiment are reported in table 1 for the test set. 
They demonstrate the importance of the context and thus its potential interest for a word recognition system. \n\nTable 1: Recognition and confusion rates for segment classifiers \n\nSystem | Recognition rate | Confusion rate | Mean square error \nRBF system without context | 32.6% | 67.4% | 0.828 \nRBF system with context | 41.7% | 58.3% | 0.739 \n\nWe next compare the performance on word recognition on the data base of 2878 images of words. Results are shown in table 2. \n\nTable 2: Recognition and confusion rates for the word recognition systems \n\nSystem | Recognition rate | Confusion rate | # Confusions \nRBF system without context | 81.3% | 16.7% | 536 \nRBF system with context | 76.3% | 23.7% | 683 \n\nThe first remark is that the system without context presents better results than the contextual system. Some of the differences between the systems with and without context are shown in figures 4 and 5 and may explain why the contextual system remains at a lower level of performance. The words \"huit\" and \"deux\" of figure 4 are well recognized by the system without context but wrongly identified by the contextual system as \"trente\" and \"franc\" respectively. The image of the word \"huit\", for example, is segmented into eight segments and each of the four letters of the word is thus necessarily considered as separated in two parts. The fifth and sixth segments are thus recognized as two halves of the letter \"i\" by the standard system while the contextual system avoids this decomposition of the letter \"i\". On the next image, the contextual system proposes \"ra\" for the second and third segments mainly because of the absence of information on the relative position of these segments. On the other hand, figure 5 shows examples where the contextual system outperforms the system without context. 
In the first case the latter proposed the class \"trois\" with two halves of the letter \"i\" on the fifth and sixth segments. In the second case the context is clearly useful for the recognition of the first letter of the word. Forthcoming experiments will try to combine the two systems so as to benefit from their respective characteristics. \n\nFigure 4: Some new confusions produced by the contextual system. \n\nExperiments have also revealed that the contextual system remains very sensitive to the numerical output values of the network which estimates $p(s_j \\mid i_{j-1})$. Several approaches for solving this problem are currently under investigation. First results have already been obtained by trying to approximate the network which estimates $p(s_j \\mid i_{j-1})$ from the network which estimates $p(s_j \\mid i_j, i_{j-1})$. \n\n6  CONCLUSION \n\nWe have described a new application of a hybrid radial basis function/hidden Markov model architecture to the recognition of off-line handwritten words. In this architecture, the estimation of emission probabilities is assigned to a discriminant classifier. The estimation of emission probabilities is enhanced by taking into account the context as represented by the previous bitmap in the sequence to be classified. A formula has been derived introducing this context in the estimation of the likelihood of word scores. The ratio of the output values of two networks is now used to estimate the likelihood. \n\nThe reported experiments reveal that the use of context, if profitable at the segment recognition level, is not yet useful at the word recognition level. Nevertheless, the new system acts differently from the previous system without context and future applications will try to exploit this difference. 
The ratio of the networks' output values is also numerically very unstable, and some solutions to stabilize it will be thoroughly tested in forthcoming experiments. \n\nReferences \n\nBahl L, Jelinek F, Mercer R, (1983). A maximum likelihood approach to speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 5(2):179-190. \nBahl LR, Brown PF, de Souza PV, Mercer RL, (1986). Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In: Proc of the Int Conf on Acoustics, Speech, and Signal Processing (ICASSP'86):49-52. \nBourlard, H., Morgan, N., (1993). Continuous speech recognition by connectionist statistical methods, IEEE Trans. on Neural Networks, vol. 4, no. 6, pp. 893-909. \nChen, M.-Y., Kundu, A., Zhou, J., (1994). Off-line handwritten word recognition using a hidden Markov model type stochastic network, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 16, no. 5:481-496. \nGilloux, M., Leroux, M., Bertille, J.-M., (1993). Strategies for handwritten words recognition using hidden Markov models, Proc. of the 2nd Int. Conf. on Document Analysis and Recognition:299-304. \nGilloux, M., Leroux, M., Bertille, J.-M., (1995a). \"Strategies for Cursive Script Recognition Using Hidden Markov Models\", Machine Vision & Applications, Special issue on Handwriting recognition, R. Plamondon ed., accepted for publication. \nGilloux, M., Lemarie, B., Leroux, M., (1995b). \"A Hybrid Radial Basis Function Network/Hidden Markov Model Handwritten Word Recognition System\", Proc. of the 3rd Int. Conf. on Document Analysis and Recognition:394-397. \nGluksman, H.A., (1967). Classification of mixed font alphabetics by characteristic loci, 1st Annual IEEE Computer Conf.:138-141. \nLemarie, B., (1993). 
Practical implementation of a radial basis function network for handwritten digit recognition, Proc. of the 2nd Int. Conf. on Document Analysis and Recognition:412-415. \nPoggio, T., Girosi, F., (1990). Networks for approximation and learning, Proc. of the IEEE, vol. 78, no. 9. \nRichard, M.D., Lippmann, R.P., (1991). \"Neural network classifiers estimate bayesian a posteriori probabilities\", Neural Computation, 3:461-483. \nSinger, E., Lippmann, R.P., (1992). A speech recognizer using radial basis function networks in an HMM framework, Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing. \n", "award": [], "sourceid": 1106, "authors": [{"given_name": "Bernard", "family_name": "Lemari\u00e9", "institution": null}, {"given_name": "Michel", "family_name": "Gilloux", "institution": null}, {"given_name": "Manuel", "family_name": "Leroux", "institution": null}]}