{"title": "Constant-Time Loading of Shallow 1-Dimensional Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 863, "page_last": 870, "abstract": null, "full_text": "Constant-Time Loading of Shallow 1-Dimensional \n\nNetworks \n\nStephen Judd \n\nSiemens Corporate Research, \n\n755 College Rd. E., \nPrinceton, NJ 08540 \n\njudd@learning.siemens.com \n\nAbstract \n\nThe complexity of learning in shallow 1-Dimensional neural networks has \nbeen shown elsewhere to be linear in the size of the network. However, \nwhen the network has a huge number of units (as cortex has) even linear \ntime might be unacceptable. Furthermore, the algorithm that was given to \nachieve this time was based on a single serial processor and was biologically \nimplausible. \nIn this work we consider the more natural parallel model of processing \nand demonstrate an expected-time complexity that is constant (i.e. \nindependent of the size of the network). This holds even when inter-node \ncommunication channels are short and local, thus adhering to more \nbiological and VLSI constraints. \n\n1 Introduction \n\nShallow neural networks are defined in [Jud90]; the definition effectively limits the \ndepth of networks while allowing the width to grow arbitrarily, and it is used as a \nmodel of neurological tissue like cortex where neurons are arranged in arrays tens \nof millions of neurons wide but only tens of neurons deep. Figure 1 exemplifies \na family of networks which are not only shallow but \"1-dimensional\" as well: we \nallow the network to be extended as far as one likes in width (i.e. to the right) by \nrepeating the design segments shown. The question we address is how learning time \nscales with the width. In [Jud88], it was proved that the worst case time complexity \nof training this family is linear in the width. 
But the proof involved an algorithm \nthat was biologically very implausible and it is this objection that will be somewhat \nredressed in this paper. \n\nThe problem with the given algorithm is that it operates only on a monolithic serial \ncomputer; the single-CPU model of computing has no overt constraints on \ncommunication capacities and therefore is too liberal a model to be relevant to our neural \nmachinery. Furthermore, the algorithm reveals very little about how to do the \nprocessing in a parallel and distributed fashion. In this paper we alter the model \nof computing to attain a degree of biological plausibility. We allow a linear number \nof processors and put explicit constraints on the time required to communicate \nbetween processors. Both of these changes make the model much more biological \n(and also closer to the connectionist style of processing). \n\nThis change alone, however, does not alter the time complexity: the worst case \ntraining time is still linear. But when we change the complexity question being \nasked, a different answer is obtained. We define a class of tasks (viz. training data) \nthat are drawn at random and then ask for the expected time to load these tasks, \nrather than the worst-case time. This alteration makes the question much more \nenvironmentally relevant. It also leads us into a different domain of algorithms and \nyields fast loading times. \n\n2 Shallow 1-D Loading \n\n2.1 Loading \n\nA family of the example shallow 1-dimensional architectures that we shall examine \nis characterized solely by an integer, d, which defines the depth of each architecture \nin the family. An example is shown in Figure 1 for d = 3. The example also happens \nto have a fixed fan-in of 2 and a very regular structure, but this is not essential. A \nmember of the family is specified by giving the width n, which we will take to be \nthe number of output nodes. 
 \n\nA task is a set of pairs of binary vectors, each specifying a stimulus to a net and \nits desired response. A random task of size t is a set of t pairs of independently \ndrawn random strings; there is no guarantee it is a function. \n\nOur primary question has to do with the following problem, which is parameterized \nby some fixed depth d, and by a node function set (which is the collection of different \ntransfer functions that a node can be tuned to perform): \n\nShallow 1-D Loading: \n\nInstance: An integer n, and a task. \nObjective: Find a function (from the node function set) for each node in the \nnetwork in the shallow 1-D architecture defined by d and n such that the \nresulting circuit maps all the stimuli in the task to their associated responses. \n\nFigure 1: An Example Shallow 1-D Architecture \n\n2.2 Model of Computation \n\nOur machine model for solving this question is the following: For an instance of \nshallow 1-D loading of width n, we allow n processors. Each one has access to \na piece of the task, namely processor i has access to bits i through i + d of each \nstimulus, and to bit i of each response. Each processor i has a communication link \nonly to its two neighbours, namely processors i - 1 and i + 1. (The first and nth \nprocessors have only one neighbour.) It takes one time step to communicate a fixed \namount of data between neighbours. There is no charge for computation, but this is \nnot an unreasonable cheat because we can show that a matrix multiply is sufficient \nfor this problem, and the size of the matrix is a function only of d (which is fixed). \n\nThis definition accepts the usual connectionist ideal of having the processor closely \nidentified with the network nodes for which it is \"finding the weights\", and data \navailable at the processor is restricted to the same \"local\" data that connectionist \nmachines have. 
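The data-access rule of this machine model can be illustrated with a minimal sketch. It is not taken from the paper: the function names local_view and neighbours, the list-of-bits encoding of a task, and the 0-based indexing are all assumptions made for illustration only.

```python
def local_view(task, i, d):
    '''Return the part of the task that processor i may read (hypothetical helper).

    task: list of (stimulus, response) pairs, each a list of 0/1 bits.
    Processor i sees stimulus bits i..i+d (inclusive) and response bit i.
    Indices are 0-based here, unlike the 1-based numbering in the text.
    '''
    return [(stimulus[i:i + d + 1], response[i]) for stimulus, response in task]


def neighbours(i, n):
    '''Processors linked to processor i in a line of n processors.'''
    return [j for j in (i - 1, i + 1) if 0 <= j < n]
```

For example, with d = 3 and a single pair whose stimulus is [0, 1, 1, 0, 1, 0], processor 1 would see stimulus bits [1, 1, 0, 1] together with its own response bit, and the two end processors each have a single neighbour.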
 \n\nThis sort of computation sets the stage for a complexity question. \n\n2.3 Question and Approach \n\nWe wish to demonstrate that \n\nClaim 1 This parallel machine solves shallow 1-D loading where each processor is \nfinished in constant expected time. The constant is dependent on the depth of the \narchitecture and on the size of the task, but not on the width. The expectation is \nover the tasks. \n\nFor simplicity we shall focus on one particular processor, the one at the leftmost \nend, and we shall further restrict our attention to finding a node function for one \nparticular node. \n\nTo operate in parallel, it is necessary and sufficient for each processor to make its \nlocal decisions in a \"safe\" manner; that is, it must make choices for its nodes in \nsuch a way as to facilitate a global solution. Constant-time loading precludes being \nable to see all the data; and if only local data is accessible to a processor, then \nits plight is essentially to find an assignment that is compatible with all nonlocal \nsatisfying assignments. \n\nTheorem 2 The expected communication complexity of finding a \"safe\" node function \nassignment for a particular node in a shallow 1-D architecture is a constant \ndependent on d and t, but not on n. \n\nIf decisions about assignments to single nodes can be made easily and essentially \nwithout having to communicate with most of the network, then the induced partitioning \nof the problem admits of fast parallel computation. There are some complications \nto the details because all these decisions must be made in a coordinated \nfashion, but we omit these details here and claim they are secondary issues that do \nnot affect the gross complexity measurements. \n\nThe proof of the theorem comes in two pieces. First, we define a computational \nproblem called path finding and the graph-theoretic notion of domination which \nis its fundamental core. 
Then we argue that the loading problem can be reduced \nto path finding in constant parallel time and give an upper bound for determining \ndomination. \n\n3 Path Finding \n\nThe following problem is parameterized by an integer K, which is fixed. \n\nPath finding: \n\nInstance: An integer n defining the number of parts in a partite graph, and a \nseries of K x K adjacency matrices, M_1, M_2, ..., M_{n-1}. M_i indicates connections \nbetween the K nodes of part i and the K nodes of part i + 1. \nObjective: Find a path of n nodes, one from each part of the n-partite graph. \n\nDefine X_h to be the binary matrix representing connectivity between the first part of \nthe graph and part h + 1: X_1 = M_1 and, for h > 1, X_h(j, k) = 1 iff \u2203m such that \nX_{h-1}(j, m) = 1 and M_h(m, k) = 1. We say \"i includes j at h\" if every bit in the ith row of X_h is 1 \nwhenever the corresponding bit in the jth row of X_h is 1. We say \"i dominates at \nh\" or \"i is a dominator\" if for all rows j, i includes j at h. \n\nLemma 3 Before an algorithm can select a node i from the first part of the graph \nto be on the path, it is necessary and sufficient for i to have been proven to be a \ndominator at some h. \u25a1 \n\nThe minimum h required to prove domination stands as our measure of \n\"communication complexity\". \n\nLemma 4 Shallow 1-D Loading can be reduced to path finding in constant parallel \ntime. \n\nProof: Each output node in a shallow architecture has a set of nodes leading into it \ncalled a support cone (or \"receptive field\"), and the collection of functions assigned \nto those nodes will determine whether or not the output bit is correct in each \nresponse. Nodes A, B, C, D, E, G in Figure 1 are the support cone for the first output \nnode (node C), and D, E, F, G, H, J are the cone for the second. 
Construct each part \nof the graph as a set of points each corresponding to an assignment over the whole \nsupport cone that makes its output bit always correct. This can be done for each \ncone in parallel, and since the depth (and the fan-in) is fixed, the set of all possible \nassignments for the support cone can be enumerated in constant time. Now insert \nedges between adjacent parts wherever two points correspond to assignments that \nare mutually compatible. (Note that since the support cones overlap one another, \nwe need to ensure that assignments are consistent with each other.) This also can \nbe done in constant parallel time. We call this construction a compatibility graph. \n\nA solution to the loading problem corresponds exactly to a path in the compatibility \ngraph. \u25a1 \n\nA dominator in this path-finding graph is exactly what was meant above by a \"safe\" \nassignment in the loading problem. \n\n4 Proof of Theorem \n\nSince it is possible that there are no assignments to certain cones that correctly \nmap the stimuli, it is trivial to prove the theorem, but as a practical matter we are \ninterested in the case where the architecture is actually capable of performing the \ntask. We will prove the theorem using a somewhat more satisfying event. \n\nProof of theorem 2: For each support cone there is 1 output bit per response and \nthere are t such responses. Given the way they are generated, these responses could \nall be the same with probability 0.5^{t-1}. The probability of two adjacent cones both \nhaving to perform such a constant mapping is 0.5^{2(t-1)}. \n\nImagine the labelling in Figure 1 to be such that there were many support cones \nto the left (and right) of the piece shown. Any path through the left side of the \ncompatibility graph that arrived at some point in the part for the cone to the left \nof C would imply an assignment for nodes A, B, and D. 
Any path through the \nright side of the compatibility graph that arrived at some point in the part for the \ncone of I would imply an assignment for nodes G, H, and J. If cones C and F were \nboth required to merely perform constant mappings, then any and all assignments \nto A, B, and D would be compatible with any and all assignments to G, H, and J \n(because nodes C and F could be assigned constant functions themselves, thereby \nmaking the others irrelevant). This ensures that any point on a path to the left will \ndominate at the part for I. \n\nThus 2^{2(t-1)} (the inverse of the probability of this happening) is an upper bound \non the domination distance, i.e. the communication complexity, i.e. the loading \ntime. \u25a1 \n\nMore accurately, the complexity is min(c(d, t), f(t), n), where c and f are some \nunknown functions. But the operative term here is usually c because d is unlikely \nto get so large as to bring f into play (and of course n is unbounded). \nThe analysis in the proof is sufficient, but it is a far cry from complete. The actual \nMarkovian process in the sequence of X's is much richer; there are so many events \nin the compatibility graph that cause domination to occur that it takes a lot of \ncareful effort to construct a task that will avoid it! \n\n5 Measuring the Constants \n\nUnfortunately, the very complications that give rise to the pleasant robustness of \nthe domination event also make it fiendishly difficult to analyze quantitatively. So \nto get estimates for the actual constants involved we ran Monte Carlo experiments. \n\nWe ran experiments for 4 different cases. The first experiment was to measure \nthe distance one would have to explore before finding a dominating assignment for \nthe node labeled A in Figure 1. The node function set used was the set of linearly \nseparable functions. 
In all experiments, if domination occurred for the degenerate \nreason that there were no solutions (paths) at all, then that datum was thrown out \nand the run was restarted with a different seed. \n\nFigure 2 reports the constants for the four cases. There is one curve for each \nexperiment. The abscissa represents t, the size of the task. The ordinate is the \nnumber of support cones that must be consulted before domination can be expected \nto occur. All points given are the average of at least 500 trials. Since t is an integer \nthe data should not have been interpolated between points, but they are easier to \nsee as connected lines. The solid line (labeled LSA) is for the case just described. \nIt has a bell shape, reflecting three facts: \n\n\u2022 when the task is very small, almost every choice of node function for one node \nis compatible with choices for the neighbouring nodes. \n\n\u2022 when the task is very large, there are so many constraints on what a node must \ncompute that it is easy to resolve what that should be without going far afield. \n\n\u2022 when the task is intermediate-sized, the problem is harder. \n\nNote the very low distances involved: even the peak of the curve is well below 2, \nso nowhere would you expect to have to pass data more than 2 support cones away. \nThis worst-expected-case would surely be larger for deeper nets; current \nwork is attempting to see how badly this would scale with depth (larger d). \nThe curve labeled LUA is for the case where all Boolean functions are used as the \nnode function set. Note that it is significantly higher in the region 6 < t < 12. The \nimplication is that although the node function set being used here is a superset of \nthe linearly separable functions, it takes more computation at loading time to be \nable to exploit that extra power. 
 \n\n[Figure 2 plot: domination distance (ordinate, ticks from 1.2 to 2.6) against task size t (abscissa, ticks from 0 to 16), with one curve each for LSA, LSB, LUA and LUB.] \n\nFigure 2: Measured Domination Distances. \n\nThe curve labeled LSB shows the expected distance one has to explore before finding \na dominating assignment for the node labelled B in Figure 1. The node function \nset used was the set of linearly separable functions. Note that it is everywhere \nhigher than the LSA curve, indicating that the difficulty of settling on a correct \nnode function for a second-layer node is somewhat higher than finding one for a \nfirst-layer node. \n\nFinally, there is a curve for node B when all Boolean functions are used (LUB). It \nis generally higher than when just linearly separable functions are used, but not so \nmarkedly as in the case of node A. 
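The domination test of Section 3, which is the quantity these experiments measure, can be sketched as follows. This is an illustrative sketch and not the code used for the experiments: the function names are hypothetical, matrices are plain lists of 0/1 rows, and the recursion is assumed to be X_1 = M_1 with X_h formed as the Boolean product of X_{h-1} and M_h.

```python
def bool_matmul(x, m):
    '''Boolean product: result[j][k] = 1 iff some mid has x[j][mid] = m[mid][k] = 1.'''
    return [[int(any(row[mid] and m[mid][k] for mid in range(len(m))))
             for k in range(len(m[0]))] for row in x]


def dominators(x):
    '''Rows i of x whose 1-bits cover the 1-bits of every row j (i includes j).'''
    return [i for i, ri in enumerate(x)
            if all(all(a >= b for a, b in zip(ri, rj)) for rj in x)]


def domination_distance(matrices):
    '''Smallest h (1-based) at which some first-part node dominates.

    matrices: [M_1, M_2, ...], each a K x K list of 0/1 rows.
    Returns (h, dominating_rows), or None if no h up to len(matrices) works.
    '''
    x = matrices[0]
    for h in range(1, len(matrices) + 1):
        if h > 1:
            x = bool_matmul(x, matrices[h - 1])
        found = dominators(x)
        if found:
            return h, found
    return None
```

Note that an all-zero X_h makes every row a dominator trivially; this corresponds to the degenerate no-solution case that the experiments discard, so a caller following that protocol would need to check for it separately.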
 \n\n6 Conclusions \n\nThe model of computation used here is much more biologically relevant than the \nones previously used for complexity results, but the algorithm used here runs in an \noff-line \"batch mode\" (i.e. it has all the data before it starts processing). This has \nan unbiological nature, but no more so than the customary connectionist habit of \nrepeating the data many times. \n\nA weakness of our analysis is that (as formulated here) it is only for discrete node \nfunctions, exact answers, and noise-free data. Extensions for any of these additional \ndifficulties may be possible, and the bell shape of the curves should survive. \n\nThe peculiarities of the regular 3-layer network examined here may appear restrictive, \nbut it was taken as an example only; what is really implied by the term \"1-D\" \nis only that the bandwidth of the SCI graph for the architecture be bounded (see \n[Jud90] for definitions). This constraint allows several degrees of freedom in choosing \nthe architecture, but domination is such a robust combinatoric event that the \nessential observation about bell-shaped curves made in this paper will persist even \nin the face of large changes from these examples. \n\nWe suggest that whatever architectures and node function sets a designer cares to \nuse, the notion of domination distance will help reveal important computational \ncharacteristics of the design. \n\nAcknowledgements \n\nThanks go to Siemens and CalTech for wads of computer time. \n\nReferences \n\n[Jud88] J. S. Judd. On the complexity of loading shallow neural networks. Journal \nof Complexity, September 1988. Special issue on Neural Computation, in \npress. \n\n[Jud90] J. Stephen Judd. Neural Network Design and the Complexity of Learning. \nMIT Press, Cambridge, Massachusetts, 1990. \n", "award": [], "sourceid": 457, "authors": [{"given_name": "Stephen", "family_name": "Judd", "institution": null}]}