{"title": "Constructing Hidden Units using Examples and Queries", "book": "Advances in Neural Information Processing Systems", "page_first": 904, "page_last": 910, "abstract": null, "full_text": "Constructing Hidden Units using Examples and Queries\n\nEric B. Baum\n\nKevin J. Lang\n\nNEC Research Institute\n\n4 Independence Way\nPrinceton, NJ 08540\n\nABSTRACT\n\nWhile the network loading problem for 2-layer threshold nets is NP-hard when learning from examples alone (as with backpropagation), (Baum, 91) has now proved that a learner can employ queries to evade the hidden unit credit assignment problem and PAC-load nets with up to four hidden units in polynomial time. Empirical tests show that the method can also learn far more complicated functions such as randomly generated networks with 200 hidden units. The algorithm easily approximates Wieland's 2-spirals function using a single layer of 50 hidden units, and requires only 30 minutes of CPU time to learn 200-bit parity to 99.7% accuracy.\n\n1 Introduction\nRecent theoretical results (Baum & Haussler, 89) promise good generalization from multi-layer feedforward nets that are consistent with sufficiently large training sets. Unfortunately, the problem of finding such a net has been proved intractable due to the hidden unit credit assignment problem, even for nets containing only 2 hidden units (Blum & Rivest, 88). While backpropagation works well enough on simple problems, its luck runs out on tasks requiring more than a handful of hidden units. Consider, for example, Alexis Wieland's \"2-spirals\" mapping from R^2 to {0, 1}. There are many sets of weights that would cause a 2-50-1 network to be consistent with the training set of figure 3a, but backpropagation seems unable to find any of them starting from random initial weights. Instead, the procedure drives the net into a suboptimal configuration like the one pictured in figure 2b.
\n\nFigure 1: The geometry of query learning.\n\nIn 1984, Valiant proposed a query learning model in which the learner can ask an oracle for the output values associated with arbitrary points in the input space. In the next section we shall see how this additional source of information can be exploited to locate and pin down a network's hidden units one at a time, thus avoiding the combinatorial explosion of possible hidden unit configurations which can arise when one attempts to learn from examples alone.\n\n2 How to find a hidden unit using queries\nFor now, assume that our task is to build a 2-layer network of binary threshold units which computes the same function as an existing \"target\" network. Our first step will be to draw a positive example x+ and a negative example x- from our training set. Because the target net maps these points to different output values, its hidden layer representations for the points must also be different, so the hyperplane through input space corresponding to one of the net's hidden units must intersect the line segment bounded by the two points (see figure 1). We can reduce our uncertainty about the location of this intersection point by a factor of 2 by asking the oracle for the target net's output at m, the line segment's midpoint. If, for example, m is mapped to the same output as x+, then we know that the hidden plane must intersect the line segment between x- and m, and we can then further reduce our uncertainty by querying the midpoint of this segment. By performing b queries of this sort, we can determine to within b bits of accuracy the location of a point p_0 that lies on the hidden plane. Assuming that our input space has n dimensions, after finding n - 1 more points on this hyperplane we can solve n equations in n unknowns to find the weights of the corresponding hidden unit. 
1\n\n1 The additional points p_i are obtained by perturbing p_0 with various small vectors π_i and then diving back to the plane via a search that is slightly more complicated than the bisection method by which we found p_0. (Baum, 91) describes this search procedure in detail, as well as a technique for verifying that all the points p_i lie on the same hidden plane.\n\nFigure 2: A backprop net before and after being trained on the 2-spirals task. In these plots over input space, the net's hidden units are shown by lines while its output is indicated by grey-level shading.\n\n3 Can we find all of a network's hidden units?\nHere is the crucial question: now that we have a procedure for finding one hidden unit whose hyperplane passes between a given pair of positive and negative examples,2 can we discover all of the net's hidden units by invoking this procedure on a sequence of such example pairs? If the answer is yes, then we have a viable learning method, because the net's output weights can be efficiently computed via the linear programming problem that arises from forward-propagating the training set through the net's first layer of weights. (Baum, 91) proves that for target nets with four or fewer hidden units we can always find enough of them to compute the required function. This result is a direct counterpoint to the theorem of (Blum & Rivest, 88): by using queries, we can PAC learn in polynomial time small threshold nets that would be NP-hard to learn from examples alone.\n\nHowever, it is possible for an adversary to construct a larger target net and an input distribution such that we may not find enough hidden units to compute the target function even by searching between every pair of examples in our training set. 
The problem is that more than one hidden plane can pass between a given pair of points, so we could repeatedly encounter some of the hidden units while never seeing others.\n\n2 This \"positive\" and \"negative\" terminology suggests that the target net possesses a single output unit, but the method is not actually restricted to this case.\n\nFigure 3: 2-spirals oracle, and net built by query learning.\n\nFortunately, the experiments described in the next section suggest that one can find most of a net's hidden units in the average case. In fact, we may not even need to find all of a network's hidden units in order to achieve good generalization. Suppose that one of a network's hidden units is hard to find due to the rarity of nearby training points. As long as our test set is drawn from the same distribution as the training set, examples that would be misclassified due to the absence of this plane will also be rare. Our experiment on learning 200-bit parity illustrates this point: only 1/4 of the possible hidden units were needed to achieve 99.7% generalization.\n\n4 Learning random target nets\nAlthough query learning might fail to discover hidden units in the worst case, the following empirical study suggests that the method has good behavior in the average case. In each of these learning experiments the target function was computed by a 2-layer threshold net whose k hidden units were each chosen by passing a hyperplane through a set of n points selected from the uniform distribution on the unit n-sphere. The output weights of each target net corresponded to a random hyperplane through the origin of the unit k-sphere. Our training examples were drawn from the uniform distribution on the corners of the unit n-cube and then classified according to the target net.
\n\nTo establish a performance baseline, we attempted to learn several of these functions using backpropagation. For (n = 20, k = 20) we succeeded in training a net to 97% accuracy in less than a day, but when we increased the size of the problem to (n = 100, k = 50) or (n = 200, k = 30), 150 hours of CPU time dumped our backprop nets into local minima that accounted for only 90% of the training data.\n\nIn contrast, query learning required only 1.5 hours to learn either of the latter two functions to 99% accuracy. The method continued to function well when we increased the problem size to (n = 200, k = 200). In each of five trials at this scale, a check of 10^4 training pairs revealed 197 or more hidden planes. Because the networks were missing a couple of hidden units, their hidden-to-output mappings were not quite linearly separable. Nevertheless, by running the perceptron algorithm on 100 x k random examples, in each trial we obtained approximate output weights whose generalization was 98% or better.\n\n5 Learning 200-bit parity\nBecause the learning method described above needs to make real-valued queries in order to localize a hidden plane, it cannot be used to learn a function that is only defined on boolean inputs. Thus, we defined the parity of a real-valued vector to be the function computed by the 2-layer parity net of (Rumelhart, Hinton & Williams, 1986), which has input weights of 1, hidden unit thresholds of 1/2, 3/2, ..., n - 1/2, and output weights alternating between 1 and -1. The n parallel hidden planes of this net carve the input space into n + 1 diagonal slabs, each of which contains all of the binary patterns with a particular number of 1's.\n\nAfter adopting this definition of parity (which agrees with the standard definition on boolean inputs), we applied the query learning algorithm to 200-dimensional input patterns. 
A search of 30,000 pairs of examples drawn randomly and uniformly from the corners of the unit cube revealed 46 of the 200 decision planes of the target function. Using approximate output weights computed by the perceptron algorithm, we found the net's generalization rate to be 99.7%. If it seems surprising that the net could perform so well while lacking so many hidden planes, consider the following. The target planes that we did find were the middle ones with thresholds near 100, and these are the relevant ones for classifying inputs that contain about the same number of 1's and 0's. Because vectors of uniform random bits are unlikely to contain many more 1's than 0's or vice versa, we had little chance of stumbling across hidden planes with high or low thresholds while learning, but we were also unlikely to need them for classifying any given test case.\n\n6 Function approximation using queries\nSuppose now that our goal in building a threshold net is to approximate an arbitrary function rather than to duplicate an existing threshold net. Earlier, we were worried about whether we could locate all of a target net's hidden units, but at least we knew how many of them there were, and we knew that we had made real progress when we found one of them. Now, the hidden units constructed by our algorithm are merely tangents to the true decision boundaries of the target function, and we do not know ahead of time how many such units will be required to construct a decent approximation to the function.
\n\nWhile one could keep adding hidden units to a net until the hidden layer representation of the training set becomes linearly separable, the fact that there are infinitely many tangents to a given curve can result in the creation of an oversized net which generalizes poorly. This problem can be addressed heuristically by rejecting new hidden units that are too similar to existing ones. For example, the top two rows of Table 1 summarize the results of 10 learning trials on the two-spirals problem with and without such a heuristic.3 By imposing a floor on the difference between two hidden units,4 we reduced the size of our nets and the rate of generalization errors by 40%.\n\nTable 1: 2-spirals performance summary.\n\nlearning algorithm           additional heuristics   train errors  hidden units (min / max)  test errors (min / max)\nqueries                      none                    0             90 / 160                  70 / 136\nqueries                      reject redundant units  0             65 / 80                   47 / 72\nqueries                      two-stage construction  0             49 / 59                   15 / 45\nconjugate gradient backprop  -                       avg=9         60                        80 / 125\n\nThe following two-stage heuristic training method resulted in even better networks. During the first stage of learning we attempted to create a minimally necessary set of hidden units by searching only between training examples that were not yet divided by an existing hidden unit. During the second stage of learning we tried to increase the separability of our hidden codes by repeatedly computing an approximate set of output weights and then searching for hidden units between misclassified examples and nearby counterexamples. This heuristic was motivated by the observation that examples tend to be misclassified when a nearby piece of the target function's decision boundary has not been discovered. 
Ten trials of this method resulted in networks containing an average of just 54 hidden units, and the nets committed an average of only 29 mistakes on the test set. An example of a network generated by this method is shown in figure 3b.\n\nFor comparison, we made 10 attempts to train a 60-hidden-unit backprop net on the 2-spirals problem starting from uniform random weights and using conjugate gradient code provided by Steve Nowlan. While these nets had more than enough hidden units to compute the required function, not one of them succeeded in learning the complete training set.5\n\n3 To employ query learning, we defined the oracle function indicated by shading in figure 3a. The 194 training points are shown by dots in the figure. Our 576-element test set consisted of 3 points between each pair of adjacent same-class training points.\n\n4 Specifically, we required a minimum Euclidean distance of 0.3 between the weights of two hidden units (after first normalizing the weight vectors so that the length of the non-threshold part of each vector was 1).\n\n5 Interestingly, a 2-50-1 backprop net whose initial weights were drawn from a handcrafted distribution (hidden units with uniform random positions together with the appropriate output weights) came much closer to success than 2-50-1 nets with uniform random initial weights (compare figures 4 and 2). We can sometimes address tough problems with backprop when our prior knowledge gives us a head start.\n\nFigure 4: Backprop works better when started near a solution.\n\nThese results illustrate the main point of this paper: the currently prevalent training methodology (local optimization of random initial weights) is too weak to solve the NP-hard problem of hidden unit deployment. We believe that methods such as query learning which avoid the credit assignment problem are essential to the future of connectionism.\n\nReferences\nE. 
Baum & D. Haussler. (1989) What size net gives valid generalization? Neural Computation 1(1): 151-160.\nE. Baum. (1991) Neural Net Algorithms that Learn in Polynomial Time from Examples and Queries. IEEE Transactions on Neural Networks 2(1), January, 1991.\nA. Blum & R. L. Rivest. (1988) Training a 3-node neural network is NP-complete. In D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 1, 494-501. San Mateo, CA: Morgan Kaufmann.\nK. Lang & M. Witbrock. (1988) Learning to Tell Two Spirals Apart. Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann.\nD. Rumelhart, G. Hinton, & R. Williams. (1986) Learning internal representations by error propagation. In D. Rumelhart & J. McClelland (eds.) Parallel Distributed Processing, MIT Press.\nL. G. Valiant. (1984) A theory of the learnable. Comm. ACM 27(11): 1134-1142.\n", "award": [], "sourceid": 301, "authors": [{"given_name": "Eric", "family_name": "Baum", "institution": null}, {"given_name": "Kevin", "family_name": "Lang", "institution": null}]}