{"title": "Subgrouping Reduces Complexity and Speeds Up Learning in Recurrent Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 638, "page_last": 641, "abstract": null, "full_text": "Subgrouping Reduces Complexity and Speeds Up Learning in Recurrent Networks \n\nDavid Zipser \n\nDepartment of Cognitive Science \nUniversity of California, San Diego \nLa Jolla, CA 92093 \n\n1 INTRODUCTION \n\nRecurrent nets are more powerful than feedforward nets because they allow simulation of dynamical systems. Everything from sine wave generators through computers to the brain is a potential candidate, but to use recurrent nets to emulate dynamical systems we need learning algorithms to program them. Here I describe a new twist on an old algorithm for recurrent nets and compare it to its predecessors. \n\n2 BPTT \n\nIn the beginning there was BACKPROPAGATION THROUGH TIME (BPTT), which was described by Rumelhart, Hinton, and Williams (1986). The idea is to add a copy of the whole recurrent net to the top of a growing feedforward network on each update cycle. Backpropagating through this stack corrects for past mistakes by adding up all the weight changes from past times. A difficulty with this method is that the feedforward net gets very big. The obvious solution is to truncate it at a fixed number of copies by killing an old copy every time a new copy is added. The truncated-BPTT algorithm is illustrated in Figure 1. It works well; more about this later. \n\n3 RTRL \n\nIt turns out that it is not necessary to keep an ever-growing stack of copies of the recurrent net as BPTT does. A fixed number of parameters can record all of past time. This is done in the REAL TIME RECURRENT LEARNING (RTRL) algorithm of Williams and Zipser (1989). 
The derivation is given elsewhere (Rumelhart, Hinton, & Williams, 1986), but a simple rationale comes from the fact that error backpropagation is linear, which makes it possible to collapse the whole feedforward stack of BPTT into a few fixed-size data structures. The biggest and most time-consuming of these to update is the matrix of p values, whose update rule is \n\n$$p_{ij}^{k}(t+1) = f'(s_k(t)) \\left[ \\sum_{l \\in U} w_{kl}\\, p_{ij}^{l}(t) + \\delta_{ik}\\, z_j(t) \\right], \\qquad i \\in U,\\ j \\in U \\cup I,\\ k \\in U$$ \n\nwhere $z_k(t)$ represents the value of a signal, either an input or a recurrent activity; the sets of subscripts are defined so that if $z_k$ is an input then $k \\in I$, and if $z_k$ is a signal from a recurrently connected unit then $k \\in U$; $s_k$ are net values; $\\delta_{ik}$ is the Kronecker delta; and $w_{kl}$ is the recurrent weight matrix. For a network with n units and w weights there are nw of these p values, and it takes O(wn^2) operations to update them. As n gets big this gets very big and is computationally unpleasant. This unpleasantness is cured to some degree by the new variant of RTRL described below. \n\nFigure 1: BPTT. \n\n4 SUBGROUPED RTRL \n\nThe value of n in the factor wn^2, which causes all the trouble for RTRL, can be reduced by viewing a recurrent network as consisting of a set of subnetworks all connected together. A fully recurrent network with n units and m inputs can be divided into g fully recurrent subnets, each with n/g units (assuming g is a factor of n). Each unit in a subnet will receive as input the original m inputs and the activities of the n - n/g units in the other subnets. The effect of subgrouping is to reduce the number of p values per weight to n/g and the number of operations to update the p values to O(wn^2/g^2). 
If g is increased in proportion to n, which keeps the size of the subnets constant, n^2/g^2 is a constant and the complexity is reduced to O(w). If all this is confusing, try Figure 2. \n\n5 TESTING THESE ALGORITHMS \n\nTo see if the subgrouped algorithm works, I compared its performance to RTRL and BPTT on the problem of training a network to emulate a Turing machine that balances parentheses. The network \"sees\" the same tape as the Turing machine, and is trained to produce the same outputs. A fully recurrent network with 12 units was the smallest that learned this task. All three algorithms learned the task in about the same number of learning cycles. RTRL and subgrouped RTRL succeeded 50% of the time, and BPTT succeeded 80% of the time. Subgrouped RTRL was 10 times faster than RTRL, whereas BPTT was 28 times faster. \n\nReferences \n\nRumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations. Cambridge, MA: MIT Press. \n\nWilliams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270-280. \n\nFigure 2: Subgrouped RTRL. Panels: Fully Recurrent; Subgrouped. Legend: activity and error; activity only. \n\n", "award": [], "sourceid": 253, "authors": [{"given_name": "David", "family_name": "Zipser", "institution": null}]}