{"title": "Counting function theorem for multi-layer networks", "book": "Advances in Neural Information Processing Systems", "page_first": 375, "page_last": 382, "abstract": null, "full_text": "Counting function theorem for multi-layer networks \n\nAdam Kowalczyk \n\nTelecom Australia, Research Laboratories \n\n770 Blackburn Road, Clayton, Vic. 3168, Australia \n\n(a.kowalczyk@trl.oz.au) \n\nAbstract \n\nWe show that a randomly selected N-tuple $x$ of points of $R^n$ is, with probability > 0, such that any multi-layer perceptron with the first hidden layer composed of $h_1$ threshold logic units can implement exactly $2 \\sum_{i=0}^{n h_1} \\binom{N-1}{i}$ different dichotomies of $x$. If $N > h_1 n$ then such a perceptron must have all units of the first hidden layer fully connected to inputs. This implies the maximal capacities (in the sense of Cover) of $2n$ input patterns per hidden unit and 2 input patterns per synaptic weight of such networks (both capacities are achieved by networks with a single hidden layer and are the same as for a single neuron). Comparing these results with recent estimates of VC-dimension we find that, in contrast to the single neuron case, for sufficiently large $n$ and $h_1$ the VC-dimension exceeds Cover's capacity. \n\n1 Introduction \n\nIn the course of theoretical justification of many of the claims made about neural networks regarding their ability to learn a set of patterns and their ability to generalise, various concepts of maximal storage capacity were developed. In particular Cover's capacity [4] and VC-dimension [12] are two expressions of this notion and are of special interest here. We should stress that both capacities are not easy to compute and are presently known in a few particular cases of feedforward networks only. 
VC-dimension, in spite of being introduced much later, has been far more researched, perhaps due to its significance expressed by a well known relation between generalisation and learning errors [12, 3]. Another reason why Cover's capacity gains less attention, perhaps, is that for the single neuron case it is twice as high as the VC-dimension. Thus if one were to hypothesise a similar relation to be true for other feedforward networks, one would judge Cover's capacity to be quite an unattractive parameter for generalisation estimates, where VC-dimension is believed to be unrealistically big. One of the aims of this paper is to show that this last hypothesis is not true, at least for some feedforward networks with a sufficiently large number of hidden units. In the following we will always consider multilayer perceptrons with $n$ continuously-valued inputs, a single binary output, and one or more hidden layers, the first of which is made up of threshold logic units only. \n\nThe derivation of Cover's capacity for a single neuron in [4] is based on the so-called Function Counting Theorem, proved for the linear function in the sixties (cf. [4]), which states that for an N-tuple $x$ of points in general position one can implement $C(N, n) \\stackrel{\\mathrm{def}}{=} 2 \\sum_{i=0}^{n} \\binom{N-1}{i}$ different dichotomies of $x$. Extension of this result to the multilayer case is still an open problem (cf. T. Cover's address at NIPS'92). One of the complications arising there is that, in contrast to the single neuron case, even for perceptrons with two hidden units the number of implementable dichotomies may be different for different N-tuples in general position [8]. 
Our first main result states that this dependence on $x$ is relatively weak: for a multilayer perceptron the number of implementable dichotomies (the counting function) is constant on each of a finite number of connected components into which the space of N-tuples in general position can be decomposed. Then we show that for one of these components $C(N, n h_1)$ different dichotomies can be implemented, where $h_1$ is the number of hidden units in the first hidden layer (all assumed to be linear threshold logic units). This leads to an upper bound on Cover's capacity of $2n$ input patterns per (hidden) neuron and 2 patterns per adjustable synaptic weight, the same as for a single neuron. Comparing this result with a recent lower bound on the VC-dimension of multilayer perceptrons [10] we find that for sufficiently large $n$ and $h_1$ the VC-dimension is higher than Cover's capacity (by a factor $\\log_2(h_1)$). \n\nThe paper extends some results announced in [5] and is an abbreviated version of a forthcoming paper [6]. \n\n2 Results \n\n2.1 Standing assumptions and basic notation \n\nWe recall that in this paper a multilayer perceptron means a layered feedforward network with one or more hidden layers, and the first hidden layer built exclusively from threshold logic units. \n\nA dichotomy of an N-tuple $x = (x_1, \\ldots, x_N) \\in (R^n)^N$ is a function $\\delta : \\{x_1, \\ldots, x_N\\} \\to \\{0, 1\\}$. For a multilayer perceptron $F : R^n \\to \\{0, 1\\}$ let $x \\mapsto C_F(x)$ denote the number of different dichotomies of $x$ which can be implemented for all possible selections of synaptic weights and biases. We shall call $C_F(x)$ a counting function, following the terminology used in [4]. \n\nExample 1. $C_\\phi(x) = C(N, n) \\stackrel{\\mathrm{def}}{=} 2 \\sum_{i=0}^{n} \\binom{N-1}{i}$ for a single threshold logic unit $\\phi : R^n \\to \\{0, 1\\}$ [4]. 
\u25a1 \n\nPoints of an N-tuple $x \\in (R^n)^N$ are said to be in general position if there does not exist an $l \\stackrel{\\mathrm{def}}{=} \\min(N, n-1)$-dimensional affine hyperplane in $R^n$ containing $(l + 2)$ of them. We use the symbol $GP(n, N) \\subset (R^n)^N$ to denote the set of all N-tuples $x$ in general position. \n\nThroughout this paper we assume to be given a probability measure $d\\mu \\stackrel{\\mathrm{def}}{=} f\\,dx$ on $R^n$ such that the density $f : R^n \\to R$ is a continuous function. \n\n2.2 Counting function is locally constant \n\nWe start with a basic characterisation of the subset $GP(n, N) \\subset (R^n)^N$. \n\nTheorem 1 (i) $GP(n, N)$ is an open and dense subset of $(R^n)^N$ with a finite number of connected components. \n(ii) Any of these components is unbounded, has an infinite Lebesgue measure and has a positive probability measure. \n\nProof outline. (i) The key point to observe is that $GP(n, N) = \\{x : p(x) \\neq 0\\}$, where $p : (R^n)^N \\to R$ is a polynomial on $(R^n)^N$. This implies immediately that $GP(n, N)$ is open and dense in $(R^n)^N$. The finite number of connected components follows from the results of Milnor [7] (cf. [2]). \n(ii) This follows from an observation that each of the connected components $C_i$ has the property that if $(x_1, \\ldots, x_N) \\in C_i$ and $a > 0$, then $(a x_1, \\ldots, a x_N) \\in C_i$. \u25a1 \n\nAs Example 1 shows, for a single neuron the counting function is constant on $GP(n, N)$. However, this may not be the case even for perceptrons with two hidden units and two inputs (cf. [8, 6] for such examples and Corollary 8). Our first main result states that this dependence on $x$ is relatively weak. \n\nTheorem 2 $C_F(x)$ is constant on connected components of $GP(n, N)$. \n\nProof outline. The basic heuristic behind the proof of this theorem is quite simple. If we have an N-tuple $x \\in (R^n)^N$ which is split into two parts by a hyperplane, then this split is preserved for any sufficiently small perturbation $y \\in (R^n)^N$ of $x$, and vice versa, any split of $y$ corresponds to a split of $x$. 
The crux is to show that if $x$ is in general position, then a minute perturbation $y$ of $x$ cannot allow a bigger number of splits than is possible for $x$. We refer to [6] for details. \u25a1 \n\nThe following corollary outlines the main impact of Theorem 2 on the rest of the paper. It reduces the problem of investigating the function $C_F(x)$ on $GP(n, N)$ to a consideration of a set of individual, special cases of N-tuples which, in particular, are amenable to being solved analytically. \n\nCorollary 3 If $x \\in GP(n, N)$, then $C_F(x) = C_F(y)$ for a randomly selected N-tuple $y \\in (R^n)^N$ with a probability > 0. \n\n2.3 The case of a special component of $GP(n, N)$ \n\nThe following theorem is the crux of the paper. \n\nTheorem 4 There exists a connected component $CC \\subset GP(n, N) \\subset (R^n)^N$ such that \n\n$C_F(x) = C(N, n h_1) = 2 \\sum_{i=0}^{h_1 n} \\binom{N-1}{i}$ (for $x \\in CC$) \n\nwith equality iff the input and first hidden layer are fully connected. The synaptic weights to units not in the first hidden layer can be constant. \n\nUsing Corollary 3 we now obtain: \n\nCorollary 5 $C_F(x) = C(N, n h_1)$ for $x \\in (R^n)^N$ with a probability > 0. \n\nThe component $CC \\subset GP(n, N)$ in Theorem 4 is defined as the connected component containing \n\n$P_N \\stackrel{\\mathrm{def}}{=} (c(t_1), c(t_2), \\ldots, c(t_N))$, (1) \n\nwhere $c : R \\to R^n$ is the curve defined as $c(t) \\stackrel{\\mathrm{def}}{=} (t, t^2, \\ldots, t^n)$ for $t \\in R$ and $0 < t_1 < t_2 < \\ldots < t_N$ are some numbers (this example has been considered previously in [11]). The essential part of the proof of Theorem 4 is showing the basic properties of the N-tuple $P_N$, which are described by the Lemma below. \n\nAny dichotomy $\\delta$ of the N-tuple $P_N$ (cf. (1)) is uniquely defined by its value at $c(t_1)$ (2 options) and the set of indices $1 \\leq i_1 < i_2 < \\ldots < i_k < N$ of all transitional pairs $(c(t_{i_j}), c(t_{i_j+1}))$, i.e. all indices $i_j$ such that $\\delta(c(t_{i_j})) \\neq \\delta(c(t_{i_j+1}))$, where $j = 1, \\ldots, k$ (an additional $\\binom{N-1}{k}$ options). 
Thus it is easily seen that there exist altogether $2 \\binom{N-1}{k}$ different dichotomies of $P_N$ for any given number $k$ of transitional pairs, where $0 \\leq k < N$. \n\nLemma 6 Let integers $n, N, h > 0$, $k \\geq 0$ and a dichotomy $\\delta$ of $P_N$ with $k$ transitional pairs be given. \n(i) If $k \\leq nh$, then there exist hyperplanes $H(w_i, b_i)$, $(w_i, b_i) \\in R^n \\times R$, such that \n\n$\\delta(p_j) = \\theta\\left(b_0 + \\sum_{i=1}^{h} v_i \\theta(w_i \\cdot p_j + b_i)\\right)$, (2) \n\n(3) \n\nfor $i = 1, \\ldots, h$ and $j = 1, \\ldots, N$; here $v_i \\stackrel{\\mathrm{def}}{=} 1$ if $n$ is even and $v_i \\stackrel{\\mathrm{def}}{=} (-1)^i$ if $n$ is odd, $b_0 \\stackrel{\\mathrm{def}}{=} -0.5$ if $n$ is odd, $h$ is even and $\\delta(p_0) = 1$, and $b_0 \\stackrel{\\mathrm{def}}{=} 0.5$ otherwise. \n(ii) If $k = nh$, then $w_{ij} \\neq 0$ for $j = 1, \\ldots, n$ and $i = 1, \\ldots, h$, where $w_i = (w_{i1}, w_{i2}, \\ldots, w_{in})$. \n(iii) If $k > nh$, then (2) and (3) cannot be satisfied. \n\nThe proof of Lemma 6 relies on the Vandermonde determinant and its derivatives. It is quite technical and thus not included here (cf. [6] for details). \n\n[Figure 1 plot omitted; horizontal axis: number of hidden units ($h_1$), logarithmic scale from 1 to $10^4$; curves labelled Theorem 7, Mitchison & Durbin, Huang & Huang, Baum, and Sakurai.] \n\nFigure 1: Some estimates of capacity. \n\n3 Discussion \n\n3.1 An upper bound on Cover's capacity \n\nThe Cover's capacity (or just capacity) of a neural network $F : R^n \\to \\{0, 1\\}$, $Cap(F)$, is defined as the maximal $N$ such that for a randomly selected N-tuple $x = (x_1, \\ldots, x_N) \\in (R^n)^N$ of points of $R^n$, the network can implement 1/2 of all dichotomies of $x$ with probability 1 [4, 8]. \n\nCorollary 5 implies that $Cap(F)$ is not greater than the maximal $N$ such that \n\n$C_F(P_N)/2^N = C(N, n h_1)/2^N \\geq 1/2$, (4) \n\nsince any property which holds with probability 1 on $(R^n)^N$ must also hold with probability 1 on $CC$ (cf. Theorem 4). 
The left-hand side of the above equation is just the sum of the binomial expansion of $(1/2 + 1/2)^{N-1}$ up to the $h_1 n$-th term, so, using the symmetry argument, we find that it is $\\geq 1/2$ if and only if it contains at least half of all the terms, i.e. when $(N - 1) + 1 \\leq 2(h_1 n + 1)$. Thus $2(h_1 n + 1)$ is the maximal value of $N$ satisfying (4).^1 Now let us recall that a multilayer perceptron as in this paper can implement any dichotomy of any N-tuple $x$ in general position if $N \\leq n h_1 + 1$ [1, 11]. This leads to the following result: \n\nTheorem 7 \n\n$n h_1 + 1 \\leq Cap(F) \\leq 2(n h_1 + 1)$ \n\nfor any multilayer perceptron $F : R^n \\to \\{0, 1\\}$ with the first hidden layer built from $h_1$ threshold logic units. \n\n^1 Note that for large $N$ the choice of cutoff value 1/2 is not critical, since the probability of a dichotomy being implementable drops rapidly as $h_1 n$ approaches $N/2$. \n\n[Figure 2 plot omitted; curves: $d_{VC}(F)/\\#w$ (Sakurai [11]) and $Cap(F)/\\#w$ (Theorem 7).] \n\nFigure 2: Comparison of estimates of the ratios of Cover's capacity per synaptic weight ($Cap(F)/\\#w$) and VC-dimension per synaptic weight ($d_{VC}(F)/\\#w$). (Note that the upper bound for VC-dimension has so far been proved only for a low number of hidden layers [9, 10].) \n\nFor the most efficient networks in this class, with a single hidden layer, we thus obtain the following result: \n\n$1 - O(1/n h_1) \\leq Cap(F)/\\#w \\leq 2$, \n\nwhere $\\#w$ denotes the number of synaptic weights and biases. \n\n3.2 A relation to VC-dimension \n\nThe VC-dimension, $d_{VC}(F)$, is defined as the largest $N$ such that there exists an N-tuple $x = (x_1, \\ldots, x_N) \\in (R^n)^N$ for which the network can implement all possible $2^N$ dichotomies. 
Recent results of Sakurai [10] imply \n\n(5) \n\nFor sufficiently large $n$ and $h_1$ this estimate exceeds $2(n h_1 + 1)$, which is an upper bound on $Cap(F)$. Thus, in contrast to the single threshold logic unit case, we have the following (cf. Fig. 2): \n\nCorollary 8 $Cap(F) < d_{VC}(F)$ if $h_1 \\gg 1$. \n\n3.3 Memorisation ability of multilayer perceptron \n\nCorollary 8 combined with Theorem 7 and Figure 2 implies that for some cases of patterns in general position a multilayer perceptron can memorise and reliably retrieve (even with 100% accuracy) much more ($\\sim \\log_2(h_1)$ times more) than 2 patterns per connection, as is the case for a single neuron [4]. This proves that co-operation between hidden units can significantly improve the storage efficiency of neural networks. \n\n3.4 A relation to PAC learning \n\nVapnik's estimate of generalisation error [12] (an error rate on an independent test set) \n\n$E_G(F) \\leq E_L(F) + D(N, d_{VC}(F), E_L, \\eta)$ (6) \n\nholds for $N > d_{VC}(F)$ with probability larger than $(1 - \\eta)$. It contains two terms: (i) learning error $E_L(F)$ and (ii) confidence interval \n\n$D(p, d_{VC}, E_L, \\eta) \\stackrel{\\mathrm{def}}{=} 2 W(p, d_{VC}, \\eta) \\left[1 + \\sqrt{1 + E_L / W(p, d_{VC}, \\eta)}\\right]$, \n\nwhere \n\n$W(N, d_{VC}, \\eta) = \\left(\\ln \\frac{2N}{d_{VC}} + 1\\right) \\frac{d_{VC}}{2N} - \\frac{\\ln \\eta}{N}$. \n\nThe ability to obtain a small learning error $E_L(F)$ is, in a sense, controlled by $Cap(F)$, while the size of the confidence interval $D$ is controlled by both $d_{VC}(F)$ and $Cap(F)$ (through $E_L(F)$). For a multilayer perceptron as in Theorem 7, when $d_{VC}(F) \\gg Cap(F)$ (Fig. 2), it can turn out that the capacity rather than the VC-dimension is actually the most critical factor in obtaining low generalisation error $E_G(F)$. This obviously warrants further research into the relation between capacity and generalisation. \n\nThe theoretical estimates of generalisation error based on VC-dimension are believed to be too pessimistic in comparison with some experiments. 
One may hypothesise that this is caused by too high values of $d_{VC}(F)$ used in estimates such as (6). Since Cover's capacity in the case of a multilayer perceptron with $h_1 \\gg 1$ turned out to be much lower than the VC-dimension, one may hope that more realistic estimates could be achieved with generalisation estimates linked directly to capacity. This subject will obviously require further research. Note that some results along these lines can be found in Cover's paper [4]. \n\n3.5 Some open problems \n\nTheorem 7 gives estimates of capacity per variable connection for a network with the minimal number of neurons in the first hidden layer, showing that these neurons have to be fully connected. The natural question arises at this point as to whether a network with a bigger number of neurons in the first hidden layer, but not fully connected, can achieve a better capacity (per adjustable synaptic weight). \n\nThe values of the counting function $x \\mapsto C_F(x)$ are provided in this paper for the particular class of points in general position, for $x \\in CC \\subset (R^n)^N$. The natural question is whether they may be by chance a lower or upper bound for the counting function for the general case of $x \\in (R^n)^N$. The results of Sakurai [11] seem to point to the former case: in his case, the sequences $P_N = (p_1, \\ldots, p_N)$ turned out to be \"the hardest\" in terms of hidden units required to implement 100% of dichotomies. Corollary 8 and Figure 1 also support this lower bound hypothesis. They imply in particular that there exists an $N'$-tuple $y = (y_1, y_2, \\ldots, y_{N'}) \\in (R^n)^{N'}$, where $N' \\stackrel{\\mathrm{def}}{=}$ VC-dimension $> N$, such that $C_F(y) = 2^{N'} \\gg 2^N > C_F(P_N)$ for sufficiently large $n$ and $h$. \n\n4 Acknowledgement \n\nThe permission of the Managing Director, Research and Information Technology, Telecom Australia, to publish this paper is gratefully acknowledged. \n\nReferences \n\n[1] E. Baum. On the capabilities of multilayer perceptrons. 
Journal of Complexity, 4:193-215, 1988. \n\n[2] S. Ben-David and M. Lindenbaum. Localization vs. identification of semi-algebraic sets. In Proceedings of the Sixth Annual Workshop on Computational Learning Theory (to appear), 1993. \n\n[3] A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36:929-965, Oct. 1989. \n\n[4] T.M. Cover. Geometrical and statistical properties of linear inequalities with applications to pattern recognition. IEEE Trans. Elec. Comp., EC-14:326-334, 1965. \n\n[5] A. Kowalczyk. Some estimates of necessary number of connections and hidden units for feed-forward networks. In S.J. Hanson et al., editor, Advances in Neural Information Processing Systems, volume 5. Morgan Kaufmann Publishers, Inc., 1992. \n\n[6] A. Kowalczyk. Estimates of storage capacity of multi-layer perceptron with threshold logic hidden units. In preparation, 1994. \n\n[7] J. Milnor. On Betti numbers of real varieties. Proceedings of AMS, 15:275-280, 1964. \n\n[8] G.J. Mitchison and R.M. Durbin. Bounds on the learning capacity of some multi-layer networks. Biological Cybernetics, 60:345-356, 1989. \n\n[9] A. Sakurai. On the VC-dimension of depth four threshold circuits and the complexity of boolean-valued functions. Manuscript, Advanced Research Laboratory, Hitachi Ltd., 1993. \n\n[10] A. Sakurai. Tighter bounds of the VC-dimension of three-layer networks. In WCNN93, 1993. \n\n[11] A. Sakurai. n-h-1 networks store no less n\u00b7h + 1 examples but sometimes no more. In Proceedings of IJCNN92, pages III-936-III-941. IEEE, June 1992. \n\n[12] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982. \n", "award": [], "sourceid": 739, "authors": [{"given_name": "Adam", "family_name": "Kowalczyk", "institution": null}]}