{"title": "Operators and curried functions: Training and analysis of simple recurrent networks", "book": "Advances in Neural Information Processing Systems", "page_first": 325, "page_last": 332, "abstract": null, "full_text": "Operators and curried functions: \n\nTraining and analysis of simple recurrent networks \n\nJanet Wiles \nDepts of Psychology and Computer Science, \nUniversity of Queensland \nQLD 4072 Australia. \njanetw@CS.uq.oz.au \n\nAnthony Bloesch, \nDept of Computer Science, \nUniversity of Queensland, \nQLD 4072 Australia \nanthonyb@cs.uq.oz.au \n\nAbstract \n\nWe present a framework for programming the hidden unit representations of \nsimple recurrent networks based on the use of hint units (additional targets at \nthe output layer). We present two ways of analysing a network trained within \nthis framework: input patterns act as operators on the information encoded by \nthe context units; symmetrically, patterns of activation over the context units \nact as curried functions of the input sequences. Simulations demonstrate that a \nnetwork can learn to represent three different functions simultaneously, and \ncanonical discriminant analysis is used to investigate how operators and curried \nfunctions are represented in the space of hidden unit activations. \n\n1 INTRODUCTION \n\nMany recent papers have contributed to the understanding of recurrent networks and their \npotential for modelling sequential phenomena (see, for example, Giles, Sun, Chen, Lee, & \nChen, 1990; Elman, 1989; 1990; Jordan, 1986; Cleeremans, Servan-Schreiber & \nMcClelland, 1989; Williams & Zipser, 1988). Of particular interest in these papers is \nthe development of recurrent architectures and learning algorithms able to solve complex \nproblems. 
The perspective of the work we present here has many similarities with these \nstudies; however, we focus on programming a recurrent network for a specific task, and \nhence provide appropriate sequences of inputs to learn the temporal component. \n\nThe function computed by a neural network is conventionally represented by its weights. \nDuring training, the task of a network is to learn a set of weights that causes the \nappropriate action (or set of context-specific actions) for each input pattern. However, in \n\n325 \n\n\f326 Wiles and Bloesch \n\na network with recurrent connections, patterns of activation are also part of the function \ncomputed by a network. After training (when the weights have been fixed), each input \npattern has a specific effect on the pattern of activation across the hidden and output units, \nwhich is modulated by the current state of those units. That is, each input pattern is a \ncontext-sensitive operator on the state of the system. \n\nTo illustrate this idea, we present a task in which many sequences of the form {F, arg1, \n..., argn} are input to a network, which is required to output the value of each function, \nF(arg1, ..., argn). The task is interesting since it illustrates how more than one function \ncan be computed by the same network and how the function selected can be specified by \nthe inputs. Viewing all the inputs (both function patterns, F, and argument patterns, \nargi) as operators allows us to analyse the effect of each input on the state of the network \n(the pattern of activation in the hidden and context units). From this perspective, the \nweights in the network can be viewed as an interpreter which has been programmed to \ncarry out the operations specified by each input pattern. \n\nWe use the term programming intentionally, to convey the idea that the actions of each \ninput pattern play a specific role in the processing of a sequence. 
In the simulations \ndescribed in this paper, we use the simple recurrent network (SRN) proposed by Elman \n(1990). The art of programming enters the simulations in the use of extra target units, \ncalled hints, that are provided at the output layer. At each step in learning a sequence, \nhints specify all the information that the network must preserve in the hidden unit \nrepresentation (the state of the system) in order to calculate outputs later in the sequence \n(for a discussion of the use of hints in training a recurrent network see Rumelhart, \nHinton & Williams, 1986). \n\n2 SIMULATIONS \n\nThree different boolean functions and their arguments were specified as sub-sequences of \npatterns over the inputs to an SRN. The network was required to apply the function \nspecified by the first pattern in each sequence to each of the subsequent arguments in \nturn. The functions provided were boolean functions of the current input and previous \noutput, AND, OR and XOR (i.e., exclusive-or), and the arguments were arbitrary-length \nstrings of 0's and 1's. The context units were not reset between sub-sequences. An SRN \nwith 3 input, 5 hidden, 5 context, 1 output and 5 hint units was trained using \nbackpropagation with a momentum term. The 5 hint units at the output layer provided \ninformation about the boolean functions during training (via the backpropagation of \nerrors), but not during testing. The network was trained on three data sets, each \ncontaining 700 (ten times the number of weights in the network) randomly generated \npatterns, forming function and argument sequences of average length 0.5, 2 and 4 \narguments respectively. The network was trained for one thousand iterations on each \ntraining set. 
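The construction of the training sub-sequences can be sketched as follows. This is our reconstruction, not the authors' code: the input patterns follow Box 1 (AND = 011, OR = 110, XOR = 101, 1 = 111, 0 = 000), the target at each argument step is the chosen function of the previous output and the current bit, and the initial previous-output value is taken to be the identity element of the chosen function (consistent with Figure 3, where the AND, OR and XOR inputs map to states A1, R0 and X0).

```python
# Sketch (our reconstruction) of one three-function-task sub-sequence.
PATTERNS = {'AND': (0, 1, 1), 'OR': (1, 1, 0), 'XOR': (1, 0, 1),
            '1': (1, 1, 1), '0': (0, 0, 0)}
FUNCS = {'AND': lambda prev, b: prev & b,
         'OR': lambda prev, b: prev | b,
         'XOR': lambda prev, b: prev ^ b}

def make_sequence(fname, bits):
    # Previous-output register starts at the identity element of the
    # chosen function (cf. Figure 3: AND, OR, XOR map to A1, R0, X0).
    prev = {'AND': 1, 'OR': 0, 'XOR': 0}[fname]
    seq = [(PATTERNS[fname], None)]  # function step: no output target
    for b in bits:
        prev = FUNCS[fname](prev, b)
        seq.append((PATTERNS[str(b)], prev))
    return seq

# Targets for XOR applied cumulatively to the bits 1, 0, 1:
targets = [t for _, t in make_sequence('XOR', [1, 0, 1])[1:]]  # [1, 1, 0]
```

A full training set would concatenate many such sub-sequences without resetting the context units, as described above.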
\n\n2.1 RESULTS AND GENERALISATION \n\nAfter training, the network correctly computed every pattern in the three training sets \n(using a closest-match criterion for scoring the output) and also in a test set of sequences \ngenerated using the same statistics. Generalisation test data, consisting of all possible \nsequences composed of each function and eight arguments, and long sequences each of 50 \narguments, also produced the correct output for every pattern in every sequence. To test \n\nFigure 1a. The hidden unit patterns for the training data, projected onto the first two \ncanonical components. These components separate the patterns into 3 distinct regions \ncorresponding to the initial pattern (AND, OR or XOR) in each sequence. 1b. The first \nand third canonical components further separate the hidden unit patterns into 6 regions, \nwhich have been marked in the diagrams above by the corresponding output classes A1, \nA0, R1, R0, X1 and X0. These regions are effectively the computational states of the \nnetwork. \n\nFigure 2. Finite state machine to compute the three-function task. \n\nAnother way of considering sub-sequences in the input stream is to describe all the \ninputs as functions, not over the other inputs, as above, but as functions of the state (for \nwhich we use the term operators). Using this terminology, a sub-sequence is a \ncomposition of operators which act on the current state, \n\nG(S(t)) = argt \u00b0 ... \u00b0 arg2 \u00b0 arg1 \u00b0 S(0), \n\nwhere (f \u00b0 g)(x) = f(g(x)), and S(0) is the initial state of the network. 
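The six-state machine of Figure 2, with the transition behaviour described for Figure 3 and Box 1, can be sketched directly. This is a reconstruction, not the authors' code; state names follow the text, with the letter encoding the current function (A = AND, R = OR, X = XOR) and the digit the last output.

```python
# Sketch of the minimal finite-state machine for the three-function task.
# Function inputs collapse every state to a single region (A1, R0, X0);
# the 0 and 1 inputs are context-sensitive, as in Figures 3d-e.
START = {'AND': 'A1', 'OR': 'R0', 'XOR': 'X0'}  # identity of each function

def step(state, inp):
    if inp in START:  # function inputs: context-insensitive operators
        return START[inp]
    fn, x, b = state[0], int(state[1]), int(inp)
    x = {'A': x & b, 'R': x | b, 'X': x ^ b}[fn]
    return fn + str(x)

def run(inputs, state='A1'):
    outputs = []
    for inp in inputs:
        state = step(state, inp)
        if inp not in START:  # output only on argument steps
            outputs.append(int(state[1]))
    return outputs

print(run(['XOR', '1', '0', '1']))  # [1, 1, 0]
```

Note that `step('A1', '0')` yields `A0` while a 0 input leaves the other five states fixed, matching the transitions described for Figure 3d.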
A consequence of \ndescribing the input patterns as operators is that even the 0 and 1 data bits can be seen as \noperators that transform the internal state (see Box 1). \n\nFigure 3. State transitions caused by each input pattern, projected onto the first and third \ncanonical components of the hidden unit patterns (generated by the training data as in \nFigure 1). 3a-c. Transitions caused by the AND, OR and XOR input patterns \nrespectively. From every point in the hidden unit space, the input patterns for AND, OR \nand XOR transform the hidden units to values corresponding to a point in the regions \nmarked A1, R0 and X0 respectively. 3d-e. Transitions caused by the 0 and 1 input \npatterns respectively. The 0 and 1 inputs are context-sensitive operators. The 0 input \ncauses changes in the hidden unit patterns corresponding to transitions from the state A1 \nto A0, but does not cause transitions from the other 5 regions. Conversely, a 1 input \ndoes not cause the hidden unit patterns to change from the regions A1, A0 or R1, but \ncauses transitions from the regions R0, X1 and X0. \n\nInput operator    Pattern on the input units    Effect on information encoded in the state \nAND               011                           cf -> AND \nOR                110                           cf -> OR \nXOR               101                           cf -> XOR \n1                 111                           x(t) -> x(t-1) if cf = AND; 1 if cf = OR; NOT(x(t-1)) if cf = XOR \n0                 000                           x(t) -> 0 if cf = AND; x(t-1) if cf = OR; x(t-1) if cf = XOR \n\nBox 1. Operators for the 5 input patterns. 
The operation performed by each input \npattern is described in terms of the effect it has on information encoded by the hidden \nunit patterns. The first and second columns specify the input operators and their \ncorresponding input patterns. The third column specifies the effect that each input in a \nsub-sequence has on information encoded in the state, represented as cf, for current \nfunction, and x(t) for the last output. \n\nFor each input pattern, we plotted all the transitions in hidden unit space resulting from \nthat input, projected onto the canonical components used in Figure 1. Figures 3a to 3e \nshow transitions for each of the five input operators. For the three \"function\" inputs, \nOR, AND, and XOR, the effect is to collapse the hidden unit patterns to a single region \n- a particular state. These are relatively context-insensitive operations. For the two \n\"argument\" inputs, 0 and 1, the effect is sensitive to the context in which the input \noccurs (i.e., the previous state of the hidden units). A similar analysis of the states \nthemselves focuses on the hidden unit patterns and the information that they must encode \nin order to compute the three-function task. At each timestep the weights in the network \nconstruct a pattern of activation over the hidden units that replaces a complex function of \nseveral structured arguments by a simpler function of one fewer argument. This can be \nrepresented as follows: \n\nG(F, arg1, ..., argn) -> F(arg1, ..., argn) \n -> F_arg1(arg2, ..., argn) \n -> F_arg1,arg2(arg3, ..., argn). \n\nThis process of replacing structured arguments by a corresponding sequence of simple \nones is known as currying the input sequence (for a review of curried functions, see Bird \nand Wadler, 1988). Using this terminology, the pattern of activation in the hidden units \nis a curried function of the entire input sequence up to that time step. 
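The currying view above can be made concrete with a small illustration (ours, not from the paper): each input bit is folded into the function, leaving a function of one fewer argument, mirroring F(arg1, ..., argn) -> F_arg1(arg2, ..., argn) -> ...

```python
from functools import partial

def xor_chain(*bits):
    # cumulative XOR over the argument bits, as in the network's XOR mode
    out = 0
    for b in bits:
        out ^= b
    return out

f = xor_chain
for b in (1, 0, 1):
    f = partial(f, b)  # the state after each input is a curried function
print(f())             # 0, i.e. XOR over the whole sequence 1, 0, 1
```

Each intermediate value of `f` plays the role of a hidden unit pattern: a function of the remaining inputs that has absorbed the sequence seen so far.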
The network \ncombines the previous hidden unit patterns (preserved in the context units) with the \ncurrent input patterns to compute the next curried function in the sequence. Since there \nare 6 states required by the network, there are 6 classes of equivalent curried functions. \nFigure 4 shows the transition diagrams for each of the 6 equivalence classes of curried \nfunctions from the same simulation shown in Figures 1 and 3. \n\nFigure 4. State transitions for each hidden unit pattern, grouped into classes of curried \nfunctions, projected onto the first and third canonical components. 4a-f. Transitions from \nA1, R1, X1, A0, R0 and X0 respectively. Each pattern of activation corresponds to a \ncurried function of the input sequence up to that item in the sequence. \n\nhow often the network finds a good solution, five simulations were completed with the \nabove parameters, all started with different sets of random weights and randomly \ngenerated training patterns. Three simulations learnt the training set perfectly (the other \ntwo simulations appeared to be converging, but slowly: worst-case error less than 1%). \nOn the test data, the results were also good (worst-case error 7%). \n\n2.2 ANALYSIS \n\nThe hidden unit patterns generated by the training data in the simulations described above \nwere analysed using canonical discriminant analysis (CDA; Kotz & Johnson, 1982). Six \noutput classes were specified, corresponding to one class for each output for each \nfunction. 
The output classes were used to compute the first three canonical components \nof the hidden unit patterns (which are 5-dimensional patterns corresponding to the 5 \nhidden units). The graph of the first two canonical components (see Figure 1a) shows \nthe hidden unit patterns separated into three tight clusters, corresponding to the sequence \ntype (OR, AND and XOR). The first and third canonical components (see Figure 1b) \nreveal more of the structure within each class. The six classes of hidden unit patterns \nare spread across six distinct regions (these correspond to the 6 states of the minimal \nfinite state machine, as shown in Figure 2). The first canonical component separates the \nhidden unit patterns by sequence type (OR, AND, or XOR, separated across the page). \nWithin each region, the third canonical component separates the outputs into 0's and 1's \n(separated down the page). Cluster analysis followed by CDA on the clusters gave \nsimilar results. \n\n3 DISCUSSION \n\nIn a network that is dedicated to computing a boolean function such as XOR, it seems \nobvious that the information for computing the function is in the weights. The \nsimulations described in this paper show that this intuition does not necessarily \ngeneralise to other networks. The three-function task requires that the network use the \nfirst input in a sequence to select a function which is then applied to subsequent \narguments. In general, for any given network, the function that is computed over a given \nsub-sequence will be specified by the interaction between the weights and the activation \npattern. \n\nThe function computed by the networks in these simulations can be described in terms of \nthe output of the global function, O(t) = G(arg1, ..., argt), computed by the weights of \nthe network, which is a function of the whole input sequence. 
An equivalent description \ncan be given in terms of sub-sequences of the input stream, which specify a boolean \nfunction over subsequent arguments, G(F, arg1, ..., argt) = F(arg1, ..., argt). Both these \nlevels of description follow the traditional approach of separating functions and data, \nwhere the patterns of activity can be described as either one or the other. \n\nIt appears to us that descriptions based on operators and curried functions provide a \npromising approach for the integration of representation and process within recurrent \nnetworks. For example, in the simulations described by Elman (1990), words can be \nunderstood as denoting operators which act on the state of the recurrent network, rather \nthan denoting objects as they do in traditional linguistic theory. The idea of currying can \nalso be applied to feedback from the output layer, for example in the networks developed \nby Jordan (1986), or to the product units used by Giles et al. (1990). \n\nAcknowledgements \n\nWe thank Jeff Elman, Ian Hayes, Julie Stewart and Bill Wilson for many discussions on \nthese ideas, and Simon Dennis and Steven Phillips for developing the canonical \ndiscriminant program. This work was supported by grants from the Australian Research \nCouncil, and A. Bloesch was supported by an Australian Postgraduate Research Award. \n\nReferences \n\nBird, R., and Wadler, P. (1988). Introduction to Functional Programming. Prentice Hall, \nNY. \n\nCleeremans, A., Servan-Schreiber, D., and McClelland, J.L. (1989). Finite state \nautomata and simple recurrent networks. Neural Computation, 1, 372-381. \n\nElman, J. (1989). Representation and structure in connectionist models. UCSD CRL \nTechnical Report 8903, August 1989. \n\nElman, J. (1990). Finding structure in time. Cognitive Science, 14, 179-211. \n\nGiles, C.L., Sun, G.Z., Chen, H.H., Lee, Y.C., and Chen, D. (1990). Higher order \nrecurrent networks. In D.S. Touretzky (ed.), 
Advances in Neural Information Processing \nSystems 2, Morgan Kaufmann, San Mateo, CA, 380-387. \n\nJordan, M.I. (1986). Serial order: A parallel distributed processing approach. Institute \nfor Cognitive Science, Technical Report 8604. UCSD. \n\nKotz, S., and Johnson, N.L. (1982). Encyclopedia of Statistical Sciences. John Wiley \nand Sons, NY. \n\nRumelhart, D.E., Hinton, G.E., and Williams, R.J. (1986). Learning internal \nrepresentations by error propagation. In D.E. Rumelhart & J.L. McClelland (eds.), \nParallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, \npp. 318-362). Cambridge, MA: MIT Press. \n\nWilliams, R.J., and Zipser, D. (1988). A learning algorithm for continually running \nfully recurrent neural networks. Institute for Cognitive Science, Technical Report \n8805. UCSD. \n", "award": [], "sourceid": 468, "authors": [{"given_name": "Janet", "family_name": "Wiles", "institution": null}, {"given_name": "Anthony", "family_name": "Bloesch", "institution": null}]}