{"title": "Repeat Until Bored: A Pattern Selection Strategy", "book": "Advances in Neural Information Processing Systems", "page_first": 1001, "page_last": 1008, "abstract": null, "full_text": "Repeat Until Bored: A Pattern Selection Strategy\n\nPaul W. Munro\n\nDepartment of Information Science\nUniversity of Pittsburgh\nPittsburgh, PA 15260\n\nABSTRACT\n\nAn alternative to the typical technique of selecting training examples independently from a fixed distribution is formulated and analyzed, in which the current example is presented repeatedly until the error for that item is reduced to some criterion value, \u03b2; then, another item is randomly selected. The convergence time can be dramatically increased or decreased by this heuristic, depending on the task, and is very sensitive to the value of \u03b2.\n\n1  INTRODUCTION\n\nIn order to implement the back propagation learning procedure (Werbos, 1974; Parker, 1985; Rumelhart, Hinton and Williams, 1986), several issues must be addressed. In addition to designing an appropriate network architecture and determining appropriate values for the learning parameters, the batch size and a scheme for selecting training examples must be chosen. The batch size is the number of patterns presented for which the corresponding weight changes are computed before they are actually implemented; immediate update is equivalent to a batch size of one. The principal pattern selection schemes are independent selections from a stationary distribution (independent identically distributed, or i.i.d.) and epochal, in which the training set is presented cyclically (here, each cycle through the training set is called an epoch). Under i.i.d. pattern selection, the learning performance is sensitive to the sequence of training examples. This observation suggests that there may exist selection strategies that facilitate learning. 
Several studies have shown the benefit of strategic pattern selection (e.g., Mozer and Bachrach, 1990; Atlas, Cohn, and Ladner, 1990; Baum and Lang, 1991).\n\nTypically, online learning is implemented by independent identically distributed pattern selection, which cannot (by definition) take advantage of a useful sequencing strategy. It seems likely, or certainly plausible, that the success of learning depends to some extent on the order in which stimuli are presented. An extreme, though negative, example would be to restrict learning to a portion of the available training set, i.e., to reduce the effective training set. Let sampling functions that depend on the state of the learner in a constructive way be termed pedagogical.\n\nDetermination of a particular input may require information exogenous to the learner; that is, just as training algorithms have been classified as supervised and unsupervised, so can pedagogical pattern selection techniques. For example, selection may depend on the network's performance relative to a desired schedule. The intent of this study is to explore an unsupervised selection procedure (even though a supervised learning rule, backpropagation, is used). The initial selection heuristic investigated was to evaluate the errors across the entire pattern set for each iteration and to present the pattern with the highest error. Of course, this technique has a large computational overhead, but the question was whether it would reduce the number of learning trials. The results were quite to the contrary: preliminary trials on small tasks (two and three bit parity) show that this scheme performs very poorly, with all patterns maintaining high error.\n\nA new unsupervised selection technique is introduced here. 
The \"Repeat-Until-Bored\" heuristic is easily implemented and simply stated: if the current training example generates a high error (i.e., greater than a fixed criterion value), it is repeated; otherwise, another one is randomly selected. This approach was motivated by casual observations of behavior in small children; they seem to repeat seemingly arbitrary tasks several times, and then abruptly stop and move to some seemingly arbitrary alternative (Piaget, 1952). For the following discussion, IID and RUB will denote the two selection procedures to be compared.\n\n2  METHODOLOGY\n\nRUB can be implemented by adding a condition to the IID statement; in C, this is simply\n\nold (IID):  patno = random() % numpats;\nnew (RUB):  if (paterror < beta) patno = random() % numpats;\n\nwhere patno identifies the selected pattern, numpats is the number of patterns in the training set, and paterror is the sum squared error on a particular pattern. Thus, an example is presented and repeated until it has been learned by the network to some criterion level, i.e., until the squared error summed across the output units is less than a \"boredom\" criterion \u03b2; then, another pattern is randomly selected.\n\nThe action of RUB in weight space is illustrated in Figure 1, for a two dimensional environment consisting of just two patterns. Corresponding to each pattern, there is an isocline (or equilibrium surface), defined by the locus of weight vectors that yield the desired response to that pattern (here, a or b). Since the delta rule drives the weight vector parallel to the presented pattern, trajectories in weight space are perpendicular to the pattern's isocline. Here, RUB is compared with alternate pattern selection.\n\n[Figure 1 panels: an IID trajectory; a RUB trajectory]\n\nFigure 1. 
Effect of pattern selection on weight state trajectory. A linear unit can be trained to give arbitrary responses (A and B) to given stimuli (a and b). The isoclines (bold lines) are defined to be the set of weights that satisfy each stimulus-response pair. Thus, the intersection is the weight state that satisfies both constraints. The delta rule drives the weights toward the isocline that corresponds to the presented pattern. The RUB procedure repeats a pattern until the state approaches the isocline.\n\nThe RUB procedure was tested for a broad range of \u03b2 across several tasks. Two performance measures were used; in both cases, performance was averaged across several (20-100) trials with different initial random weights. For the parity tasks, performance was measured as the fraction of trials for which the squared error summed over the training set reached a sufficiently low value (usually 0.1) within a specified number of training examples. Since the encoder tasks always converged for sufficiently large \u03b2, performance there was measured as the number of iterations required to reduce the total squared error summed across the pattern set to a low value (typically, 0.1). Note that each iteration of weight modification during a set of repeated examples was explicitly counted in the performance measure, so the comparison between IID and RUB is fair. Also, for each task, the learning rate and momentum were fixed (usually 0.1 and 0.9, respectively).\n\nConsideration of RUB (see the above C implementation, for example) indicates that, for very small values of \u03b2, the first example will be repeated indefinitely, and the task can therefore not be learned. At the other extreme, for \u03b2 greater than or equal to the maximum possible squared error (2.0, in this case), performance should match IID.\n\n3  RESULTS\n\n3.1  
PARITY\n\nWhile the expected behavior for RUB on the two and three bit parity tasks (Figure 2) is observed for low and high values of \u03b2, there are some surprises in the intermediate range. Rather than proceeding monotonically from zero to its IID value, the performance curve exhibits an \"up-down-up\" behavior; it reaches a maximum in the range 0.2 < \u03b2 < 0.25, then plummets to zero at \u03b2 = 0.25, remains there for an interval, then partially recovers to its final (IID) level. This \"dead zone\" phenomenon is not as pronounced when the momentum parameter is set to zero (Figure 3).\n\nFigure 2. Performance profiles for the parity task. Each point is the average number of successful simulations out of 100 trials. A log scale is used so that the behavior for very low values of the error criterion is evident. Note the critical falloff at \u03b2 \u2248 0.25 for both the XOR task (left) and three-bit parity (right).\n\nFigure 3. Performance profiles with zero momentum. For these two tasks, the up-down-up phenomenon is still evident, but there is no \"dead zone\". Left: XOR. Right: three-bit parity.\n\n3.2  ENCODERS\n\nRUB shows no significant improvement over IID on the 4-2-4 encoder for any value of \u03b2. Here, performance was measured both in terms of success rate and average number of iterations to success. Even though all simulations converge for \u03b2 > 0.001 (i.e., there is no dead zone), the effect of \u03b2 is reflected in another performance measure: average number of iterations to convergence (Figure 4). 
However, experiments with the 5-2-5 encoder task show an effect. While backprop converges for all values of \u03b2 (except very small values), the performance, as measured by number of pattern presentations, does show a pronounced decrement. The 8-3-8 encoder shows a significant, but less dramatic, effect.\n\n[Figure 4 plot: curves for the 5-2-5, 4-2-4, and 8-3-8 encoders plotted against \u03b2 on a log scale; annotation \"1st data value: 8691.0\"]\n\nFigure 4. Encoder performance profiles. See text.\n\n3.3  THE MESH\n\nThe mesh (Figure 5, left) is a 2-D classification task that can be solved by a strictly layered net with five hidden units. Like the encoder and unlike parity, IID is found to converge on 100% of trials; however, there is a critical value of \u03b2 and a well-defined dead zone (Figure 5, right). Note that the curve depicting average number of iterations to convergence decreases monotonically, interrupted at the dead zone but continuing its apparent trend for higher values of \u03b2.\n\nFigure 5. The mesh task. Left: the task. Right: performance profile. The number of simulations that converge is plotted as the bold line (left vertical axis). The average number of iterations is plotted as squares (right vertical axis).\n\n3.4  NONCONVERGENCE\n\nNonconvergence was examined in detail for three values of \u03b2, corresponding to high performance, poor performance (the dead zone), and IID, for the three bit parity task. 
The error for each of the eight patterns is plotted over time. For trials that do not converge (Figure 6), the patterns interact differently, depending on the value of \u03b2. At \u03b2 = 0.05 (a \"good\" value of \u03b2 for this task), the error traces for the four odd-parity patterns are strongly correlated in an irregular oscillatory mode, as are the four even-parity traces, but the two groups are strongly anticorrelated. In the odd parity group, the error remains low for three of the patterns (001, 010, and 100), but ranges from less than 0.1 to values greater than 0.95 for the fourth (111). Traces for the even parity patterns correspond almost identically; i.e., not only are they correlated, but all four maintain virtually the same value.\n\nAt this point, the dead zone phenomenon has only been observed in tasks with a single output unit. This property hints at the following explanation. Note first that each input/output pair in the training set divides the weight space into two halves, characterized by the sign of the linear activation into the output unit; that is, whether the output is above or below 0.5, and hence whether the magnitude of the difference between the actual and desired responses is above or below 0.5. Since \u03b2 is a threshold on the squared error, learning is repeated for \u03b2 = 0.25 only for examples for which the state is on the wrong half of weight space. Just when it is about to cross the category boundary, which would bring the absolute value of the error below 0.5, RUB switches to another example, and the state is not pushed to the other side of the boundary. This conjecture suggests that for tasks with multiple output units, this effect might be reduced or eliminated, as has been demonstrated in the encoder examples. 
\n\n[Figure 6 panels: error (0.0 to 1.0) versus iteration for \u03b2 = 0.05, \u03b2 = 0.3, and \u03b2 = 2.0]\n\nFigure 6. Error traces for individual patterns. For each of three values of the error criterion, the variation of the error for each pattern is plotted for 100 iterations of the three-bit parity task that did not converge. Note the large amplitude swings for low values (upper left), and the small amplitude oscillations in the \"dead zone\" (upper right).\n\n4  DISCUSSION\n\nActive learning and boredom. The sequence of training examples has an undeniable effect on learning, both in the real world and in simulated learning systems. While the RUB procedure influences this sequence such that the learning performance is either positively or negatively affected, it is just a minimal instance of active learning; more elaborate learning systems have explored similar notions of \"boredom\" (e.g., Scott and Markovitch, 1989).\n\nNonconvergence. From Figure 6 it can be seen, for both RUB and IID, that nonconvergence does not correspond to a local minimum in weight space. In situations where the overall error is \"stuck\" at a non-zero value, the error on the individual patterns continues to change. The weight trajectory is thus \"trapped\" in a nonoptimal orbit, rather than a nonoptimal equilibrium point. 
\n\nAcknowledgements\n\nThis research was supported in part by NSF grant 00-8910368 and by Siemens Corporate Research, which kindly provided the author with financial support and a stimulating research environment during the summer of 1990. David Cohn and Rik Belew were helpful in bringing relevant work to my attention.\n\nReferences\n\nBaum, E. and Lang, K. (1991) Constructing multi-layer neural networks by searching input space rather than weight space. In: Advances in Neural Information Processing Systems 3. D. S. Touretzky, ed. Morgan Kaufmann.\n\nCohn, D., Atlas, L., and Ladner, R. (1990) Training connectionist networks with queries and selective sampling. In: Advances in Neural Information Processing Systems 2. D. S. Touretzky, ed. Morgan Kaufmann.\n\nMozer, M. and Bachrach, J. (1990) Discovering the structure of a reactive environment by exploration. In: Advances in Neural Information Processing Systems 2. D. S. Touretzky, ed. Morgan Kaufmann.\n\nParker, D. (1985) Learning logic. TR-47. MIT Center for Computational Economics and Statistics. Cambridge MA.\n\nPiaget, J. (1952) The Origins of Intelligence in Children. Norton.\n\nRumelhart, D., Hinton, G., and Williams, R. (1986) Learning representations by back-propagating errors. Nature 323:533-536.\n\nScott, P. D. and Markovitch, S. (1989) Uncertainty based selection of learning experiences. Sixth International Workshop on Machine Learning. pp. 358-361.\n\nWerbos, P. (1974) Beyond regression: new tools for prediction and analysis in the behavioral sciences. Unpublished doctoral dissertation, Harvard University.\n", "award": [], "sourceid": 470, "authors": [{"given_name": "Paul", "family_name": "Munro", "institution": null}]}