{"title": "The Use of Dynamic Writing Information in a Connectionist On-Line Cursive Handwriting Recognition System", "book": "Advances in Neural Information Processing Systems", "page_first": 1093, "page_last": 1100, "abstract": null, "full_text": "The Use of Dynamic Writing Information in a Connectionist On-Line Cursive Handwriting Recognition System\n\nStefan Manke, Michael Finke, Alex Waibel\n\nUniversity of Karlsruhe\nComputer Science Department\nD-76128 Karlsruhe, Germany\nmanke@ira.uka.de, finkem@ira.uka.de\n\nCarnegie Mellon University\nSchool of Computer Science\nPittsburgh, PA 15213-3890, U.S.A.\nwaibel@cs.cmu.edu\n\nAbstract\n\nIn this paper we present NPen++, a connectionist system for writer independent, large vocabulary on-line cursive handwriting recognition. This system combines a robust input representation, which preserves the dynamic writing information, with a neural network architecture, a so-called Multi-State Time Delay Neural Network (MS-TDNN), which integrates recognition and segmentation in a single framework. Our preprocessing transforms the original coordinate sequence into a (still temporal) sequence of feature vectors, which combine strictly local features, like curvature or writing direction, with a bitmap-like representation of the coordinate's proximity. The MS-TDNN architecture is well suited for handling temporal sequences as provided by this input representation. Our system is tested both on writer dependent and writer independent tasks with vocabulary sizes ranging from 400 up to 20,000 words. For example, on a 20,000 word vocabulary we achieve word recognition rates up to 88.9% (writer dependent) and 84.1% (writer independent) without using any language models. 
\n\n1 INTRODUCTION\n\nSeveral preprocessing and recognition approaches for on-line handwriting recognition have been developed during the past years. The main advantage of on-line handwriting recognition in comparison to optical character recognition (OCR) is the temporal information of handwriting, which can be recorded and used for recognition. In general this dynamic writing information (i.e. the time-ordered sequence of coordinates) is not available in OCR, where input consists of scanned text. In this paper we present the NPen++ system, which is designed to preserve the dynamic writing information as long as possible in the preprocessing and recognition process.\n\nDuring preprocessing a temporal sequence of N-dimensional feature vectors is computed from the original coordinate sequence, which is recorded on the digitizer. These feature vectors combine strictly local features, like curvature and writing direction [4], with so-called context bitmaps, which are bitmap-like representations of a coordinate's proximity.\n\nThe recognition component of NPen++ is well suited for handling temporal sequences of patterns, as provided by this kind of input representation. The recognizer, a so-called Multi-State Time Delay Neural Network (MS-TDNN), integrates recognition and segmentation of words into a single network architecture. The MS-TDNN, which was originally proposed for continuous speech recognition tasks [6, 7], combines shift-invariant, high accuracy pattern recognition capabilities of a TDNN [8, 4] with a non-linear alignment procedure for aligning strokes into character sequences.\n\nOur system is applied to both writer dependent and writer independent large vocabulary handwriting recognition tasks with vocabulary sizes up to 20,000 words. 
Writer independent word recognition rates range from 92.9% with a 400 word vocabulary to 84.1% with a 20,000 word vocabulary. For the writer dependent system, word recognition rates for the same tasks range from 98.6% to 88.9% [1].\n\nIn the following section we describe the preprocessing performed on the raw coordinate sequence provided by the digitizer. In section 3 the architecture and training of the recognizer are presented. A description of the experiments to evaluate the system and the results we have achieved on different tasks can be found in section 4. Conclusions and future work are described in section 5.\n\n2 PREPROCESSING\n\nThe dynamic writing information, i.e. the temporal order of the data points, is preserved throughout all preprocessing steps. The original coordinate sequence $\\{(x(t), y(t))\\}_{t \\in \\{0 \\ldots T'\\}}$ recorded on the digitizer is transformed into a new temporal sequence $\\mathbf{x}_0^T = \\mathbf{x}_0 \\ldots \\mathbf{x}_T$, where each frame $\\mathbf{x}_t$ consists of an N-dimensional real-valued feature vector $(f_1(t), \\ldots, f_N(t)) \\in [-1, 1]^N$.\n\nSeveral normalization methods are applied to remove undesired variability from the original coordinate sequence. To compensate for different sampling rates and varying writing speeds, the coordinates, originally sampled to be equidistant in time, are resampled, yielding a new sequence $\\{(x(t), y(t))\\}_{t \\in \\{0 \\ldots T\\}}$ which is equidistant in space. This resampled trajectory is smoothed using a moving average window in order to remove sampling noise. In a final normalization step the goal is to find a representation of the trajectory that is reasonably invariant against rotation and scaling of the input. The idea is to determine the word's baseline using an EM approach similar to that described in [5] and rescale the word such that the center region of the word is assigned to a fixed size.\n\nFigure 1: Feature extraction for the normalized word \"able\". The final input representation is derived by calculating a 15-dimensional feature vector for each data point, which consists of a context bitmap (a) and information about the curvature and writing direction (b).\n\nFrom the normalized coordinate sequence $\\{(x(t), y(t))\\}_{t \\in \\{0 \\ldots T\\}}$ the temporal sequence $\\mathbf{x}_0^T$ of N-dimensional feature vectors $\\mathbf{x}_t = (f_1(t), \\ldots, f_N(t))$ is computed (Figure 1). Currently the system uses N = 15 features for each data point. The first two features $f_1(t) = x(t) - x(t-1)$ and $f_2(t) = y(t) - b$ describe the relative X movement and the Y position relative to the baseline b. The features $f_3(t)$ to $f_6(t)$ are used to describe the curvature and writing direction in the trajectory [4] (Figure 1(b)). Since all these features are strictly local, in the sense that they are local both in time and in space, they were shown to be inadequate for modeling the temporal long range context dependencies typically observed in pen trajectories [2]. Therefore, nine additional features $f_7(t)$ to $f_{15}(t)$ representing 3 x 3 bitmaps were included in each feature vector (Figure 1(a)). These so-called context bitmaps are basically low resolution, bitmap-like descriptions of the coordinate's proximity, which were originally described in [2]. 
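
The 15-dimensional feature layout described above can be illustrated with a minimal Python sketch. It is not the NPen++ implementation: the function names, the use of points t-2 and t+2 for the curvature angle, the cosine and sine encoding of both angles, the cell size of the 3 x 3 grid and the normalization of the bitmap counts are all assumptions made for illustration, and the final scaling of the features to [-1, 1] is omitted.

```python
import math


def local_features(points, t, baseline):
    # f1, f2: relative x movement and y position relative to the baseline.
    n = len(points)
    x, y = points[t]
    px, py = points[max(t - 1, 0)]
    nx, ny = points[min(t + 1, n - 1)]
    f1 = x - px
    f2 = y - baseline
    # Writing direction: unit vector from point t-1 to point t+1 (assumed encoding).
    dx, dy = nx - px, ny - py
    norm = math.hypot(dx, dy) or 1.0
    direction = [dx / norm, dy / norm]
    # Curvature: turning angle between the segments (t-2 -> t) and (t -> t+2).
    ax, ay = points[max(t - 2, 0)]
    bx, by = points[min(t + 2, n - 1)]
    turn = math.atan2(by - y, bx - x) - math.atan2(y - ay, x - ax)
    return [f1, f2] + direction + [math.cos(turn), math.sin(turn)]


def context_bitmap(points, t, cell=1.0):
    # 3 x 3 grid centred on point t; every trajectory point that falls into
    # the grid is counted, regardless of when it was written.
    cx, cy = points[t]
    grid = [0.0] * 9
    for x, y in points:
        col = math.floor((x - cx) / cell + 1.5)
        row = math.floor((y - cy) / cell + 1.5)
        if 0 <= col < 3 and 0 <= row < 3:
            grid[row * 3 + col] += 1.0
    total = sum(grid)  # at least 1, since point t itself lies in the centre cell
    return [g / total for g in grid]


def feature_vectors(points, baseline, cell=1.0):
    # One 15-dimensional frame per data point: 6 local features + 9 bitmap cells.
    return [local_features(points, t, baseline) + context_bitmap(points, t, cell)
            for t in range(len(points))]
```

Because the bitmap is computed over the whole trajectory, a point written much earlier or later still contributes to the frame at time t if it lies nearby, which is the sense in which these features are local in space but not in time.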
\n\nThus, the input representation as shown in Figure 1 combines strictly local features like writing direction and curvature with the context bitmaps, which are still local in space but global in time. That means, each point of the trajectory is visible from each other point of the trajectory in a small neighbourhood. By using these context bitmaps in addition to the local features, important information about other parts of the trajectory, which are in a limited neighbourhood of a coordinate, is encoded.\n\n3 THE NPen++ RECOGNIZER\n\nThe NPen++ recognizer integrates recognition and segmentation of words into a single network architecture, the Multi-State Time Delay Neural Network (MS-TDNN). The MS-TDNN, which was originally proposed for continuous speech recognition tasks [6, 7], combines the high accuracy single character recognition capabilities of a TDNN [8, 4] with a non-linear time alignment algorithm (dynamic time warping) for finding stroke and character boundaries in isolated handwritten words.\n\n3.1 MODELING ASSUMPTIONS\n\nLet $W = \\{w_1, \\ldots, w_K\\}$ be a vocabulary consisting of K words. Each of these words $w_i$ is represented as a sequence of characters $w_i \\equiv c_{i_1} c_{i_2} \\ldots c_{i_k}$ where each character $c_j$ itself is modelled by a three state hidden Markov model $c_j \\equiv q_j^0 q_j^1 q_j^2$. The idea of using three states per character is to model explicitly the initial, middle and final section of the characters. Thus, $w_i$ is modelled by a sequence of states $w_i \\equiv q_{i_0} q_{i_1} \\ldots q_{i_{3k}}$. In these word HMMs the self-loop probabilities $p(q_{i_j} \\mid q_{i_j})$ and the transition probabilities $p(q_{i_j} \\mid q_{i_{j-1}})$ are both defined to be $\\frac{1}{2}$ while all other transition probabilities are set to zero.\n\nDuring recognition of an unknown sequence of feature vectors $\\mathbf{x}_0^T = \\mathbf{x}_0 \\ldots \\mathbf{x}_T$ we have to find the word $w_i \\in W$ in the dictionary that maximizes the a-posteriori probability $p(w_i \\mid \\mathbf{x}_0^T, \\theta)$ given a fixed set of parameters $\\theta$ and the observed coordinate sequence. That means, a written word will be recognized such that\n\n$$w_j = \\operatorname{argmax}_{w_i \\in W} \\, p(w_i \\mid \\mathbf{x}_0^T, \\theta).$$\n\nIn our Multi-State Time Delay Neural Network approach the problem of modeling the word posterior probability $p(w_i \\mid \\mathbf{x}_0^T, \\theta)$ is simplified by using Bayes' rule which expresses that probability as\n\n$$p(w_i \\mid \\mathbf{x}_0^T, \\theta) = \\frac{p(\\mathbf{x}_0^T \\mid w_i, \\theta) \\, P(w_i \\mid \\theta)}{p(\\mathbf{x}_0^T \\mid \\theta)}.$$\n\nInstead of approximating $p(w_i \\mid \\mathbf{x}_0^T, \\theta)$ directly we define in the following section a network that is supposed to model the likelihood of the feature vector sequence $p(\\mathbf{x}_0^T \\mid w_i, \\theta)$.\n\n3.2 THE MS-TDNN ARCHITECTURE\n\nIn Figure 2 the basic MS-TDNN architecture for handwriting recognition is shown. The first three layers constitute a standard TDNN with sliding input windows in each layer. In the current implementation of the system, a TDNN with 15 input units, 40 units in the hidden layer, and 78 state output units is used. There are 7 time delays both in the input and hidden layer.\n\nFigure 2: The Multi-State TDNN architecture, consisting of a 3-layer TDNN to estimate the a posteriori probabilities of the character states combined with word units, whose scores are derived from the word models by a Viterbi approximation of the likelihoods $p(\\mathbf{x}_0^T \\mid w_i)$.\n\nThe softmax normalized output of the states layer is interpreted as an estimate of the probabilities of the states $q_j$ given the input window $\\mathbf{x}_{t-d}^{t+d} = \\mathbf{x}_{t-d} \\ldots \\mathbf{x}_{t+d}$ for each time frame t, i.e.\n\n$$p(q_j \\mid \\mathbf{x}_{t-d}^{t+d}) \\approx \\frac{\\exp(\\eta_j(t))}{\\sum_k \\exp(\\eta_k(t))} \\qquad (1)$$\n\nwhere $\\eta_j(t)$ represents the weighted sum of inputs to state unit j at time t. Based on these estimates, the output of the word units is defined to be a Viterbi approximation of the log likelihoods of the feature vector sequence given the word model\n\n$$\\log p(\\mathbf{x}_0^T \\mid w_i) \\approx \\max_{q_0^T} \\sum_{t=1}^{T} \\left[ \\log p(\\mathbf{x}_{t-d}^{t+d} \\mid q_t, w_i) + \\log p(q_t \\mid q_{t-1}, w_i) \\right] \\approx \\max_{q_0^T} \\sum_{t=1}^{T} \\left[ \\log p(q_t \\mid \\mathbf{x}_{t-d}^{t+d}) - \\log p(q_t) + \\log p(q_t \\mid q_{t-1}, w_i) \\right]. \\qquad (2)$$\n\nHere, the maximum is over all possible sequences of states $q_0^T = q_0 \\ldots q_T$ given a word model, $p(q_t \\mid \\mathbf{x}_{t-d}^{t+d})$ refers to the output of the states layer as defined in (1) and $p(q_t)$ is the prior probability of observing a state $q_t$ estimated on the training data.\n\n3.3 TRAINING OF THE RECOGNIZER\n\nDuring training the goal is to determine a set of parameters $\\theta$ that will maximize the posterior probability $p(w \\mid \\mathbf{x}_0^T, \\theta)$ for all training input sequences. But in order to make that maximization computationally feasible even for a large vocabulary system we had to simplify that maximum a posteriori approach to a maximum likelihood training procedure that maximizes $p(\\mathbf{x}_0^T \\mid w, \\theta)$ for all words instead.
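
The Viterbi approximation in equation (2) can be made concrete with a small dynamic programming sketch in Python. It is an illustration under stated assumptions, not the NPen++ implementation: the function names and the lexicon layout are hypothetical, self-loops and forward transitions are both given probability 1/2, every path is forced to start in the first state of the word and end in its last state, and the first frame contributes its scaled likelihood without a transition term.

```python
import math

NEG_INF = float('-inf')


def word_score(log_post, log_prior, state_seq):
    # log_post[t][s]: log of the states-layer output for state unit s at frame t.
    # log_prior[s]:   log prior of state unit s.
    # state_seq:      state unit indices of one word model, in left-to-right order.
    log_half = math.log(0.5)
    n = len(state_seq)
    score = [NEG_INF] * n
    first = state_seq[0]
    score[0] = log_post[0][first] - log_prior[first]
    for t in range(1, len(log_post)):
        new = [NEG_INF] * n
        for j, s in enumerate(state_seq):
            # Best predecessor: stay in state j or advance from state j-1.
            best_prev = score[j]
            if j > 0 and score[j - 1] > best_prev:
                best_prev = score[j - 1]
            if best_prev > NEG_INF:
                # log p(q_t | x) - log p(q_t) + log p(q_t | q_{t-1})
                new[j] = best_prev + log_post[t][s] - log_prior[s] + log_half
        score = new
    return score[-1]


def recognize(log_post, log_prior, lexicon):
    # lexicon maps each word to the state sequence of its word model.
    return max(lexicon, key=lambda w: word_score(log_post, log_prior, lexicon[w]))
```

Dividing each posterior by the state prior turns the network output into a scaled likelihood, which is why the prior enters with a negative sign. A word model with more states than there are frames receives a score of minus infinity, since no admissible path reaches its last state.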
\n\nThe first step of our maximum likelihood training is to bootstrap the recognizer using a subset of approximately 2,000 words of the training set that were labeled manually with the character boundaries to adjust the paths in the word layer correctly. After training on this hand-labeled data, the recognizer is used to label another larger set of unlabeled training data. Each pattern in this training set is processed by the recognizer. The boundaries determined automatically by the Viterbi alignment in the target word unit serve as new labels for this pattern. Then, in the second phase, the recognizer is retrained on both data sets to achieve the final performance of the recognizer.\n\n4 EXPERIMENTS AND RESULTS\n\nWe have tested our system both on writer dependent and writer independent tasks with vocabulary sizes ranging from 400 up to 20,000 words. The word recognition results are shown in Table 1. The scaling of the recognition rates with respect to the vocabulary size is plotted in Figure 3b.\n\nTable 1: Writer dependent and writer independent recognition results\n\nTask         Vocabulary   Writer Dependent          Writer Independent\n             Size         Test Patterns   Rate      Test Patterns   Rate\ncrt_400      400          800             98.6%     800             92.9%\nwsj_1,000    1,000        800             97.8%     -               -\nwsj_7,000    7,000        -               -         2,500           89.3%\nwsj_10,000   10,000       1,600           92.1%     2,500           87.7%\nwsj_20,000   20,000       1,600           88.9%     2,500           84.1%\n\nFigure 3: (a) Different writing styles in the database: cursive (top), hand-printed (middle) and a mixture of both (bottom). (b) Recognition results of the writer dependent and writer independent systems with respect to the size of the vocabulary.\n\nFor the writer dependent evaluation, the system was trained on 2,000 patterns from a 400 word vocabulary, written by a single writer, and tested on a disjoint set of patterns from the same writer. In the writer independent case, the training set consisted of 4,000 patterns from a 7,000 word vocabulary, written by approximately 60 different writers. The test was performed on data from an independent set of 40 writers.\n\nAll data used in these experiments was collected at the University of Karlsruhe, Germany. Only minimal instructions were given to the writers. The writers were asked to write as naturally as they would normally do on paper, without any restrictions in writing style. As a consequence, the database is characterized by a high variety of different writing styles, ranging from hand-printed to strictly cursive patterns or a mixture of both writing styles (for example see Figure 3a). Additionally, the native language of the writers was German, but the language of the dictionary is English. Therefore, frequent hesitations and corrections can be observed in the patterns of the database. But since this sort of input is typical for real world applications, a robust recognizer should be able to process these distorted patterns, too. From each of the writers a set of 50-100 isolated words, chosen randomly from the 7,000 word vocabulary, was collected.\n\nThe vocabularies used, CRT (Conference Registration Task) and WSJ (ARPA Wall Street Journal Task), were originally defined for speech recognition evaluations. 
These vocabularies were chosen to take advantage of the synergy effects between handwriting recognition and speech recognition, since in our case the final goal is to integrate our speech recognizer JANUS [10] and the proposed NPen++ system into a multi-modal system.\n\n5 CONCLUSIONS\n\nIn this paper we have presented the NPen++ system, a neural recognizer for writer dependent and writer independent on-line cursive handwriting recognition. This system combines a robust input representation, which preserves the dynamic writing information, with a neural network integrating recognition and segmentation in a single framework. This architecture has been shown to be well suited for handling temporal sequences as provided by this kind of input.\n\nEvaluation of the system on different tasks with vocabulary sizes ranging from 400 to 20,000 words has shown recognition rates from 92.9% to 84.1% in the writer independent case and from 98.6% to 88.9% in the writer dependent case. These results are especially promising because they were achieved with a small training set compared to other systems (e.g. [3]). As can be seen in Table 1, the system has proved to be virtually independent of the vocabulary. Though the system was trained on rather small vocabularies (e.g. 400 words in the writer dependent system), it generalizes well to completely different and much larger vocabularies.\n\nReferences\n\n[1] S. Manke and U. Bodenhausen, \"A Connectionist Recognizer for Cursive Handwriting Recognition\", Proceedings of the ICASSP-94, Adelaide, April 1994.\n\n[2] S. Manke, M. Finke, and A. Waibel, \"Combining Bitmaps with Dynamic Writing Information for On-Line Handwriting Recognition\", Proceedings of the ICPR-94, Jerusalem, October 1994.\n\n[3] M. Schenkel, I. Guyon, and D. Henderson, \"On-Line Cursive Script Recognition Using Time Delay Neural Networks and Hidden Markov Models\", Proceedings of the ICASSP-94, Adelaide, April 1994.\n\n[4] I. Guyon, P. Albrecht, Y. Le Cun, W. Denker, and W. Hubbard, \"Design of a Neural Network Character Recognizer for a Touch Terminal\", Pattern Recognition, 24(2), 1991.\n\n[5] Y. Bengio and Y. LeCun, \"Word Normalization for On-Line Handwritten Word Recognition\", Proceedings of the ICPR-94, Jerusalem, October 1994.\n\n[6] P. Haffner and A. Waibel, \"Multi-State Time Delay Neural Networks for Continuous Speech Recognition\", Advances in Neural Information Processing Systems (NIPS-4), Morgan Kaufmann, 1992.\n\n[7] C. Bregler, H. Hild, S. Manke, and A. Waibel, \"Improving Connected Letter Recognition by Lipreading\", Proceedings of the ICASSP-93, Minneapolis, April 1993.\n\n[8] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang, \"Phoneme Recognition using Time-Delay Neural Networks\", IEEE Transactions on Acoustics, Speech and Signal Processing, March 1989.\n\n[9] W. Guerfali and R. Plamondon, \"Normalizing and Restoring On-Line Handwriting\", Pattern Recognition, 16(5), 1993.\n\n[10] M. Woszczyna et al., \"Janus 94: Towards Spontaneous Speech Translation\", Proceedings of the ICASSP-94, Adelaide, April 1994.", "award": [], "sourceid": 908, "authors": [{"given_name": "Stefan", "family_name": "Manke", "institution": null}, {"given_name": "Michael", "family_name": "Finke", "institution": null}, {"given_name": "Alex", "family_name": "Waibel", "institution": null}]}