{"title": "Perturbing Hebbian Rules", "book": "Advances in Neural Information Processing Systems", "page_first": 19, "page_last": 26, "abstract": null, "full_text": "Networks with Learned Unit Response Functions \n\nJohn Moody and Norman Yarvin \nYale Computer Science, 51 Prospect St. \n\nP.O. Box 2158 Yale Station, New Haven, CT 06520-2158 \n\nAbstract \n\nFeedforward networks composed of units which compute a sigmoidal func(cid:173)\ntion of a weighted sum of their inputs have been much investigated. We \ntested the approximation and estimation capabilities of networks using \nfunctions more complex than sigmoids. Three classes of functions were \ntested: polynomials, rational functions, and flexible Fourier series. Un(cid:173)\nlike sigmoids, these classes can fit non-monotonic functions. They were \ncompared on three problems: prediction of Boston housing prices, the \nsunspot count, and robot arm inverse dynamics. The complex units at(cid:173)\ntained clearly superior performance on the robot arm problem, which is \na highly non-monotonic, pure approximation problem. On the noisy and \nonly mildly nonlinear Boston housing and sunspot problems, differences \namong the complex units were revealed; polynomials did poorly, whereas \nrationals and flexible Fourier series were comparable to sigmoids. \n\n1 \n\nIntroduction \n\nA commonly studied neural architecture is the feedforward network in which each \nunit of the network computes a nonlinear function g( x) of a weighted sum of its \ninputs x = wtu. Generally this function is a sigmoid, such as g( x) = tanh x or \ng(x) = 1/(1 + e(x-9\u00bb). To these we compared units of a substantially different \ntype: they also compute a nonlinear function of a weighted sum of their inputs, \nbut the unit response function is able to fit a much higher degree of nonlinearity \nthan can a sigmoid. 
The nonlinearities we considered were polynomials, rational functions (ratios of polynomials), and flexible Fourier series (sums of cosines). Our comparisons were done in the context of two-layer networks consisting of one hidden layer of complex units and an output layer of a single linear unit. \n\nThis network architecture is similar to that built by projection pursuit regression (PPR) [1, 2], another technique for function approximation. The one difference is that in PPR the nonlinear function of the units of the hidden layer is a nonparametric smooth. This nonparametric smooth has two disadvantages for neural modeling: it has many parameters, and, as a smooth, it is easily trained only if desired output values are available for that particular unit. The latter property makes the use of smooths in multilayer networks inconvenient. If a parametrized function of a type suitable for one-dimensional function approximation is used instead of the nonparametric smooth, then these disadvantages do not apply. The functions we used are all suitable for one-dimensional function approximation. \n\n2 Representation \n\nA few details of the representation of the unit response functions are worth noting. \n\nPolynomials: Each polynomial unit computed the function \n\ng(x) = a_1 x + a_2 x^2 + ... + a_n x^n \n\nwith x = w^T u being the weighted sum of the input. A zeroth order term was not included in the above formula, since it would have been redundant among all the units. The zeroth order term was dealt with separately and only stored in one location. \n\nRationals: A rational function representation was adopted which could not have zeros in the denominator. This representation used a sum of squares of polynomials, as follows: \n\ng(x) = (a_0 + a_1 x + ... + a_n x^n) / (1 + (b_0 + b_1 x)^2 + (b_2 x + b_3 x^2)^2 + (b_4 x + b_5 x^2 + b_6 x^3 + b_7 x^4)^2 + ...) 
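As a concrete illustration of the rational representation above, here is a minimal sketch (our own, not the authors' code; the function name and coefficient layout are assumptions) of a unit whose denominator uses the first three squared-polynomial blocks:

```python
def rational_unit(x, a, b):
    """Rational unit response function g(x), following the formula above.

    a: numerator coefficients a_0 .. a_n.
    b: eight denominator coefficients b_0 .. b_7, grouped into squared
       polynomials of top degree 1, 2 and 4 (the powers-of-two sequence).
    The denominator is 1 plus a sum of squares, so it is never less than 1
    and g(x) has no poles."""
    num = sum(a_k * x**k for k, a_k in enumerate(a))
    den = (1.0
           + (b[0] + b[1] * x) ** 2
           + (b[2] * x + b[3] * x ** 2) ** 2
           + (b[4] * x + b[5] * x ** 2 + b[6] * x ** 3 + b[7] * x ** 4) ** 2)
    return num / den
```

With all b coefficients set to zero the unit reduces to a plain polynomial in the projection x = w^T u.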
\n\nThis representation has the qualities that the denominator is never less than 1, and that n parameters are used to produce a denominator of degree n. If the above formula were continued, the next terms in the denominator would be of degrees eight, sixteen, and thirty-two. This powers-of-two sequence was used for the following reason: of the 2(n - m) terms in the square of a polynomial p = a_m x^m + ... + a_n x^n, it is possible by manipulating a_m ... a_n to determine the n - m highest coefficients, with the exception that the very highest coefficient must be non-negative. Thus if we consider the coefficients of the polynomial that results from squaring and adding together the terms of the denominator of the above formula, the highest degree squared polynomial may be regarded as determining the highest half of the coefficients, the second highest degree polynomial may be regarded as determining the highest half of the rest of the coefficients, and so forth. This process cannot set all the coefficients arbitrarily; some must be non-negative. \n\nFlexible Fourier series: The flexible Fourier series units computed \n\ng(x) = Σ_{i=0}^{n} a_i cos(b_i x + c_i) \n\nwhere the amplitudes a_i, frequencies b_i and phases c_i were unconstrained and could assume any value. \n\nSigmoids: We used the standard logistic function: \n\ng(x) = 1/(1 + e^{-(x-θ)}) \n\n3 Training Method \n\nAll the results presented here were trained with the Levenberg-Marquardt modification of the Gauss-Newton nonlinear least squares algorithm. Stochastic gradient descent was also tried at first, but on the problems where the two were compared, Levenberg-Marquardt was much superior both in convergence time and in quality of result. Levenberg-Marquardt required substantially fewer iterations than stochastic gradient descent to converge. 
However, it needs O(p^2) space and O(p^2 n) time per iteration in a network with p parameters and n input examples, as compared to O(p) space and O(pn) time per epoch for stochastic gradient descent. Further details of the training method will be discussed in a longer paper. \n\nWith some data sets, a weight decay term was added to the energy function to be optimized. The added term was of the form λ Σ_{i=1}^{p} w_i^2. When weight decay was used, a range of values of λ was tried for every network trained. \n\nBefore training, all the data was normalized: each input variable was scaled so that its range was (-1, 1), then scaled so that the maximum sum of squares of input variables for any example was 1. The output variable was scaled to have mean zero and mean absolute value 1. This helped the training algorithm, especially in the case of stochastic gradient descent. \n\n4 Results \n\nWe present results of training our networks on three data sets: robot arm inverse dynamics, Boston housing data, and sunspot count prediction. The Boston and sunspot data sets are noisy, but have only mild nonlinearity. The robot arm inverse dynamics data has no noise, but a high degree of nonlinearity. Noise-free problems have low estimation error. Models for linear or mildly nonlinear problems typically have low approximation error. The robot arm inverse dynamics problem is thus a pure approximation problem, while performance on the noisy Boston and sunspots problems is limited more by estimation error than by approximation error. \n\nFigure 1a is a graph, like those used in PPR, of the unit response function of a one-unit network trained on the Boston housing data. The x axis is a projection (a weighted sum of inputs w^T u) of the 13-dimensional input space onto 1 dimension, using those weights chosen by the unit in training. The y axis is the fit to data. The response function of the unit is a sum of three cosines. 
Figure 1b is the superposition of five graphs of the five unit response functions used in a five-unit rational function solution (RMS error less than 2%) of the robot arm inverse dynamics problem. The domain for each curve lies along a different direction in the six-dimensional input space. Four of the five fits along the projection directions are non-monotonic, and thus can be fit only poorly by a sigmoid. \n\nTwo different error measures are used in the following. The first is the RMS error, normalized so that error of 1 corresponds to no training. The second measure is the square of the normalized RMS error, otherwise known as the fraction of explained variance. We used whichever error measure was used in earlier work on that data set. \n\n[Figure 1: (a) unit response function of a one-unit network trained on the Boston housing data; (b) the five unit response functions of the five-unit rational solution of the robot arm problem.] \n\n4.1 Robot arm inverse dynamics \n\nThis problem is the determination of the torque necessary at the joints of a two-joint robot arm required to achieve a given acceleration of each segment of the arm, given each segment's velocity and position. There are six input variables to the network, and two output variables. This problem was treated as two separate estimation problems, one for the shoulder torque and one for the elbow torque. The shoulder torque was a slightly more difficult problem, for almost all networks. The 1000 points in the training set covered the input space relatively thoroughly. This, together with the fact that the problem had no noise, meant that there was little difference between training set error and test set error. 
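The two error measures just defined can be sketched as follows (a sketch under our reading of the definitions; the function names are ours):

```python
import numpy as np

def normalized_rms(y_true, y_pred):
    """RMS error normalized so that 1 corresponds to 'no training',
    i.e. to a model that always predicts the mean target value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)
                         / np.mean((y_true - y_true.mean()) ** 2)))

def variance_fraction(y_true, y_pred):
    """Square of the normalized RMS error (the measure the text refers
    to as the fraction of explained variance)."""
    return normalized_rms(y_true, y_pred) ** 2
```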
\n\nPolynomial networks of limited degree are not universal approximators, and that is quite evident on this data set; polynomial networks of low degree reached their minimum error after a few units. Figure 2a shows this. If polynomial, cosine, rational, and sigmoid networks are compared as in Figure 2b, leaving out low degree polynomials, the sigmoids have relatively high approximation error even for networks with 20 units. As shown in the following table, the complex units have more parameters each, but still get better performance with fewer parameters total. \n\nType                  Units  Parameters  Error \ndegree 7 polynomial     5        65      .024 \ndegree 6 rational       5        95      .027 \n2 term cosine           6        73      .020 \nsigmoid                10        81      .139 \nsigmoid                20       161      .119 \n\nSince the training set is noise-free, these errors represent pure approximation error. \n\n[Figure 2: robot arm error against number of units; (a) low degree polynomial networks; (b) degree 7 polynomial, degree 6 rational, cosine, and sigmoid networks.] \n\nThe superior performance of the complex units on this problem is probably due to their ability to approximate non-monotonic functions. \n\n4.2 Boston housing \n\nThe second data set is a benchmark for statistical algorithms: the prediction of Boston housing prices from 13 factors [3]. This data set contains 506 exemplars and is relatively simple; it can be approximated well with only a single unit. Networks of between one and six units were trained on this problem. Figure 3a is a graph of training set performance from networks trained on the entire data set; the error measure used was the fraction of explained variance. From this graph it is apparent that training set performance does not vary greatly between different types of units, though networks with more units do better. \n\n[Figure 3: Boston housing; (a) training set error; (b) test set error.] \n\nOn the test set there is a large difference. This is shown in Figure 3b. Each point on the graph is the average performance of ten networks of that type. Each network was trained using a different permutation of the data into test and training sets, the test set being 1/3 of the examples and the training set 2/3. It can be seen that the cosine nets perform the best, the sigmoid nets a close second, the rationals third, and the polynomials worst (with the error increasing quite a bit with increasing polynomial degree). \n\nIt should be noted that the distribution of errors is far from a normal distribution, and that the training set error gives little clue as to the test set error. The following table of errors, for nine networks of four units using a degree 5 polynomial, is somewhat typical: \n\nSet       Error \ntraining  0.091 \ntest      0.395 \n\nOur speculation on the cause of these extremely high errors is that polynomial approximations do not extrapolate well; if the prediction of some data point results in a polynomial being evaluated slightly outside the region on which the polynomial was trained, the error may be extremely high. Rational functions where the numerator and denominator have equal degree have less of a problem with this, since asymptotically they are constant. However, over small intervals they can have the extrapolation characteristics of polynomials. 
Cosines are bounded, and so, though they may not extrapolate well if the function is not somewhat periodic, at least do not reach large values like polynomials. \n\n4.3 Sunspots \n\nThe third problem was the prediction of the average monthly sunspot count in a given year from the values of the previous twelve years. We followed previous work in using as our error measure the fraction of variance explained, and in using as the training set the years 1700 through 1920 and as the test set the years 1921 through 1955. This was a relatively easy test set - every network of one unit which we trained (whether sigmoid, polynomial, rational, or cosine) had, in each of ten runs, a training set error between .147 and .153 and a test set error between .105 and .111. For comparison, the best test set error achieved by us or previous testers was about .085. A set of runs similar to those for the Boston housing data was done, but using at most four units; similar results were obtained. Figure 4a shows training set error and Figure 4b shows test set error on this problem. \n\n[Figure 4: sunspots; (a) training set error; (b) test set error.] \n\n4.4 Weight Decay \n\nThe performance of almost all networks was improved by some amount of weight decay. Figure 5 contains graphs of test set error for sigmoidal and polynomial units, using various values of the weight decay parameter λ. 
For the sigmoids, very little weight decay seems to be needed to give good results, and there is an order of magnitude range (between .001 and .01) which produces close to optimal results. For polynomials of degree 5, more weight decay seems to be necessary for good results; in fact, the highest value of weight decay is the best. Since very high values of weight decay are needed, and at those values there is little improvement over using a single unit, it may be supposed that using those values of weight decay restricts the multiple units to producing a very similar solution to the one-unit solution. Figure 6 contains the corresponding graphs for sunspots. Weight decay seems to help less here for the sigmoids, but for the polynomials, moderate amounts of weight decay produce an improvement over the one-unit solution. \n\nAcknowledgements \n\nThe authors would like to acknowledge support from ONR grant N00014-89-J-1228, AFOSR grant 89-0478, and a fellowship from the John and Fannie Hertz Foundation. The robot arm data set was provided by Chris Atkeson. \n\nReferences \n\n[1] J. H. Friedman, W. Stuetzle, \"Projection Pursuit Regression\", Journal of the American Statistical Association, December 1981, Volume 76, Number 376, 817-823 \n\n[2] P. J. Huber, \"Projection Pursuit\", The Annals of Statistics, 1985, Vol. 13, No. 2, 435-475 \n\n[3] L. 
Breiman et al., Classification and Regression Trees, Wadsworth and Brooks, 1984, pp. 217-220 \n\nFigure 5: Boston housing test error with various amounts of weight decay \n\nFigure 6: Sunspot test error with various amounts of weight decay \n\nPerturbing Hebbian Rules \n\nPeter Dayan \nCNL, The Salk Institute \nPO Box 85800 \nSan Diego CA 92186-5800, USA \ndayan@helmholtz.sdsc.edu \n\nGeoffrey Goodhill \nCOGS \nUniversity of Sussex, Falmer \nBrighton BN1 9QN, UK \ngeoffg@cogs.susx.ac.uk \n\nAbstract \n\nRecently Linsker [2] and MacKay and Miller [3, 4] have analysed Hebbian correlational rules for synaptic development in the visual system, and Miller [5, 8] has studied such rules in the case of two populations of fibres (particularly two eyes). Miller's analysis has so far assumed that each of the two populations has exactly the same correlational structure. Relaxing this constraint by considering the effects of small perturbative correlations within and between eyes permits study of the stability of the solutions. We predict circumstances in which qualitative changes are seen, including the production of binocularly rather than monocularly driven units. 
\n\n1 INTRODUCTION \n\nLinsker [2] studied how a Hebbian correlational rule could predict the development of certain receptive field structures seen in the visual system. MacKay and Miller [3, 4] pointed out that the form of this learning rule meant that it could be analysed in terms of the eigenvectors of the matrix of time-averaged presynaptic correlations. Miller [5, 7, 8] independently studied a similar correlational rule for the case of two eyes (or more generally two populations), explaining how cells develop in V1 that are ultimately responsive to only one eye, despite starting off as responsive to both. This process is again driven by the eigenvectors and eigenvalues of the developmental equation, and Miller [7] relates Linsker's model to the two population case. \n\nMiller's analysis so far assumes that the correlations of activity within each population are identical. This special case simplifies the analysis, enabling the projections from the two eyes to be separated out into sum and difference variables. In general, one would expect the correlations to differ slightly, and the correlations between the eyes to be not exactly zero. We analyse how such perturbations affect the eigenvectors and eigenvalues of the developmental equation, and are able to explain some of the results found empirically by Miller [6]. \n\nFurther details on this analysis and on the relationship between Hebbian and non-Hebbian models of the development of ocular dominance and orientation selectivity can be found in Goodhill (1991). 
\n\n2 THE EQUATION \n\nMacKay and Miller [3, 4] study Linsker's [2] developmental equation in the form: \n\ndW/dt = (Q + k_2 J)W + k_1 n \n\nwhere W = [w_i], i ∈ [1, n], are the weights from the units in one layer R to a particular unit in the next layer S, Q is the covariance matrix of the activities of the units in layer R, J is the matrix J_ij = 1 ∀i, j, and n is the 'DC' vector n_i = 1 ∀i. The equivalent for two populations of cells is: \n\n(dW_1/dt; dW_2/dt) = [Q_1 + k_2 J, Q_c + k_2 J; Q_c + k_2 J, Q_2 + k_2 J] (W_1; W_2) + k_1 (n; n) \n\nwhere Q_1 gives the covariance between cells within the first population, Q_2 gives that between cells within the second, and Q_c (assumed symmetric) gives the covariance between cells in the two populations. Define Q_* as this full, two population, development matrix. \n\nMiller studies the case in which Q_1 = Q_2 = Q and Q_c is generally zero or slightly negative. Then the development of W_1 - W_2 (which Miller calls S^D) and W_1 + W_2 (S^S) separate; for Q_c = 0, these go like: \n\ndS^D/dt = Q S^D and dS^S/dt = (Q + 2 k_2 J) S^S + 2 k_1 n \n\nand, up to various forms of normalisation and/or weight saturation, the patterns of dominance between the two populations are determined by the initial value and the fastest growing components of S^D. If upper and lower weight saturation limits are reached at roughly the same time (Berns, personal communication), the conventional assumption that the fastest growing eigenvectors of S^D dominate the terminal state is borne out. \n\nThe starting condition Miller adopts has W_1 - W_2 = ε'a and W_1 + W_2 = b, where ε' is small, and a and b are O(1). Weights are constrained to be positive, and saturate at some upper limit. Also, additive normalisation is applied throughout development, which affects the growth of the S^S (but not the S^D) modes. As discussed by MacKay and Miller [3, 4], this is approximately accommodated in the k_2 J component. 
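A numerical sketch of this separation (our illustration, not the authors' code): iterating the two-population equation with Q_1 = Q_2 = Q and Q_c = 0, the difference and sum variables evolve independently under Q and Q + 2k_2 J respectively:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k1, k2, dt = 5, 0.1, -0.2, 0.01
A = rng.normal(size=(n, n))
Q = A @ A.T                         # a symmetric, covariance-like matrix
J = np.ones((n, n))                 # J_ij = 1
dc = np.ones(n)                     # the 'DC' vector n_i = 1

w1, w2 = rng.normal(size=n), rng.normal(size=n)
sd, ss = w1 - w2, w1 + w2           # S^D and S^S
for _ in range(50):
    # full two-population update, Q_1 = Q_2 = Q and Q_c = 0
    dw1 = (Q + k2 * J) @ w1 + (k2 * J) @ w2 + k1 * dc
    dw2 = (k2 * J) @ w1 + (Q + k2 * J) @ w2 + k1 * dc
    w1, w2 = w1 + dt * dw1, w2 + dt * dw2
    # separated updates for the difference and sum variables
    sd = sd + dt * (Q @ sd)
    ss = ss + dt * ((Q + 2 * k2 * J) @ ss + 2 * k1 * dc)

assert np.allclose(w1 - w2, sd)     # S^D sees only Q
assert np.allclose(w1 + w2, ss)     # S^S sees Q + 2 k_2 J and the k_1 term
```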
\nMacKay and Miller analyse the eigendecomposition of Q + k_2 J for general and radially symmetric covariance matrices Q and all values of k_2. It turns out that the eigendecomposition of Q_* for the case Q_1 = Q_2 = Q and Q_c = 0 (that studied by Miller) is given in table form by: \n\nE-vector      E-value   Conditions \n(x_i, x_i)    λ_i       Q x_i = λ_i x_i,  n·x_i = 0 \n(x_i, -x_i)   λ_i       Q x_i = λ_i x_i,  n·x_i = 0 \n(y_i, -y_i)   μ_i       Q y_i = μ_i y_i,  n·y_i ≠ 0 \n(z_i, z_i)    ν_i       (Q + 2 k_2 J) z_i = ν_i z_i,  n·z_i ≠ 0 \n\nFigure 1 shows the matrix and the two key (y, -y) and (x, -x) eigenvectors. The details of the decomposition of Q_* in this table are slightly obscured by degeneracy in the eigendecomposition of Q + k_2 J. Also, for clarity, we write (x_i, x_i) for (x_i, x_i)^T. A consequence of the first two rows in the table is that (η x_i, α x_i) is an eigenvector for any η and α; this becomes important later. \n\nThat the development of S^D and S^S separates can be seen in the (u, u) and (u, -u) forms of the eigenvectors. In Miller's terms the onset of dominance of one of the two populations is seen in the (u, -u) eigenvectors - dominance requires that μ_j for the eigenvector whose elements are all of the same sign (one such exists for Miller's Q) is larger than the μ_i and the λ_i for all the other such eigenvectors. In particular, on pages 296-300 of [6], he shows various cases for which this does and one in which it does not happen. To understand how this comes about, we can treat the latter as a perturbed version of the former. \n\n3 PERTURBATIONS \n\nConsider the case in which there are small correlations between the projections and/or small differences between the correlations within each projection. For instance, one of Miller's examples indicates that small within-eye anti-correlations can prevent the onset of dominance. This can be perturbatively analysed by setting Q_1 = Q + εE_1, Q_2 = Q + εE_2 and Q_c = εE_c. 
Call the resulting matrix Q_*^ε. \n\nTwo questions are relevant. Firstly, are the eigenvectors stable to this perturbation, ie are there vectors a_1 and a_2 such that (u_1 + εa_1, u_2 + εa_2) is an eigenvector of Q_*^ε if (u_1, u_2) is an eigenvector of Q_* with eigenvalue φ? Secondly, how do the eigenvalues change? \n\nOne way to calculate this is to consider the equation the perturbed eigenvector must satisfy (a standard method for such linear systems, eg in quantum mechanics): \n\nQ_*^ε (u_1 + εa_1, u_2 + εa_2) = (φ + εψ)(u_1 + εa_1, u_2 + εa_2) \n\nand look for conditions on u_1 and u_2 and the values of a_1, a_2 and ψ by equating the O(ε) terms. We now consider a specific example. Using the notation of the table above, (y_i + εa_1, -y_i + εa_2) is an eigenvector with eigenvalue μ_i + εψ_i if \n\n(Q - μ_i I) a_1 + k_2 J(a_1 + a_2) = -(E_1 - E_c - ψ_i I) y_i, and \n(Q - μ_i I) a_2 + k_2 J(a_1 + a_2) = -(E_c - E_2 + ψ_i I) y_i. \n\nSubtracting these two implies that \n\n(Q - μ_i I)(a_1 - a_2) = -(E_1 - 2E_c + E_2 - 2ψ_i I) y_i. \n\nHowever, y_i^T (Q - μ_i I) = 0, since Q is symmetric and y_i is an eigenvector with eigenvalue μ_i, so multiplying on the left by y_i^T, we require that \n\n2 ψ_i y_i^T y_i = y_i^T (E_1 - 2E_c + E_2) y_i \n\nwhich sets the value of ψ_i. Therefore (y_i, -y_i) is stable in the required manner. Similarly (z_i, z_i) is stable too, with an equivalent perturbation to its eigenvalue. However the pair (x_i, x_i) and (x_i, -x_i) are not stable - the degeneracy from their having the same eigenvalue is broken, and two specific eigenvectors, (α_i x_i, β_i x_i) and (-β_i x_i, α_i x_i), are stable, for particular values α_i and β_i. This means that to first order, S^D and S^S no longer separate, and the full, two-population, matrix must be solved. \n\nTo model Miller's results, call Q_*^{ε,m} the special case of Q_*^ε for which E_1 = E_2 = E and E_c = 0. 
Also, assume that the x_i, y_i and z_i are normalised, let e_1(u) = u^T E_1 u, etc, and define γ(u) = (e_1(u) - e_2(u))/2e_c(u), for e_c(u) ≠ 0, and γ_i = γ(x_i). Then we have \n\nβ_i/α_i = sqrt(γ_i^2 + 1) - γ_i    (1) \n\nand the eigenvalues are: \n\nE-vector             Q_*   Q_*^{ε,m}      Q_*^ε \n(α_i x_i, β_i x_i)   λ_i   λ_i + ε e(x_i)   λ_i + ε[e_1(x_i) + e_2(x_i) + Ξ_i]/2 \n(-β_i x_i, α_i x_i)  λ_i   λ_i + ε e(x_i)   λ_i + ε[e_1(x_i) + e_2(x_i) - Ξ_i]/2 \n(y_i, -y_i)          μ_i   μ_i + ε e(y_i)   μ_i + ε[e_1(y_i) + e_2(y_i) - 2e_c(y_i)]/2 \n(z_i, z_i)           ν_i   ν_i + ε e(z_i)   ν_i + ε[e_1(z_i) + e_2(z_i) + 2e_c(z_i)]/2 \n\nwhere Ξ_i = sqrt([e_1(x_i) - e_2(x_i)]^2 + 4e_c(x_i)^2). For the case Miller treats, since E_1 = E_2, the degeneracy in the original solution is preserved, ie the perturbed versions of (x_i, x_i) and (x_i, -x_i) have the same eigenvalues. Therefore the S^D and S^S modes still separate. \n\nThis perturbed eigendecomposition suffices to show how small additional correlations affect the solutions. We will give three examples. The case mentioned above on page 299 of [6] shows how small same-eye anti-correlations within the radius of the arbor function cause a particular (y_i, -y_i) eigenvector (i.e. one for which all the components of y_i have the same sign) to change from growing faster than a (x_i, -x_i) (for which some components of x_i are positive and some negative to ensure that n·x_i = 0) to growing slower than it, converting a monocular solution to a binocular one. \n\nIn our terms, this is the Q_*^{ε,m} case, with E_1 a negative matrix. Given the conditions on the signs of their components, e_1(y_i) is more negative than e_1(x_i), and so the eigenvalue for the perturbed (y_i, -y_i) would be expected to decrease more than that for the perturbed (x_i, -x_i). This is exactly what is found. Different binocular eigensolutions are affected by different amounts, and it is typically a delicate issue as to which will ultimately prevail. 
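The first-order shift of the (y, -y) mode can be checked numerically; the following sketch (ours, using an assumed random test matrix) compares the prediction μ + εψ, with 2ψ = y^T(E_1 - 2E_c + E_2)y, against the exact eigenvalues of the perturbed two-population matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k2, eps = 6, 0.3, 1e-5

A = rng.normal(size=(n, n))
Q = A @ A.T                                  # symmetric covariance-like Q
J = np.ones((n, n))
sym = lambda M: (M + M.T) / 2
E1, E2, Ec = (sym(rng.normal(size=(n, n))) for _ in range(3))

mu_all, vecs = np.linalg.eigh(Q)
y, mu = vecs[:, -1], mu_all[-1]              # (y, -y) is an eigenvector of Q_*

# perturbed two-population matrix Q_*^eps
Qstar = np.block([[Q + eps * E1 + k2 * J, eps * Ec + k2 * J],
                  [eps * Ec + k2 * J, Q + eps * E2 + k2 * J]])
exact = np.linalg.eigvalsh(Qstar)

psi = y @ (E1 - 2 * Ec + E2) @ y / 2         # first-order shift (y normalised)
predicted = mu + eps * psi
assert np.min(np.abs(exact - predicted)) < 1e-6   # agrees to O(eps^2)
```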
Figure 2 shows a sample perturbed matrix for which dominance will not develop. If the change in the correlations is large (O(1)), then the eigenfunctions can change shape (eg 1s becomes 2s in the notation of [4]). We do not address this here, since we are considering only changes of O(ε). \n\nFigure 1: Unperturbed two-eye correlation matrix and (y, -y), (x, -x) eigenvectors. Eigenvalues are 7.1 and 6.4 respectively. \n\nFigure 2: Same-eye anti-correlation matrix and eigenvectors. (y, -y), (x, -x) eigenvalues are 4.8 and 5.4 respectively, and so the order has swapped. \n\nPositive opposite-eye correlations can have exactly the same effect. This time e_c(y_i) is greater than e_c(x_i), and so, again, the eigenvalue for the perturbed (y_i, -y_i) would be expected to decrease more than that for the perturbed (x_i, -x_i). Figure 3 shows an example which is infelicitous for dominance. \n\nThe third case is for general perturbations in Q_*^ε. Now the mere signs of the components of the eigenvectors are not enough to predict which will be affected more. Figure 4 gives an example for which ocular dominance will still occur. Note that the (x_i, -x_i) eigenvector is no longer stable, and has been replaced by one of the form (α_i x_i, β_i x_i). \n\nIf general perturbations of the same order of magnitude as the difference between W_1 and W_2 (ie ε' ≈ ε) are applied, the α_i and β_i terms complicate Miller's S^D analysis to first order. Let w_1(0) - w_2(0) = εa and apply Q_*^ε as an iteration matrix. w_1(n) - w_2(n), the difference between the projections after n iterations, has no O(1) component, but two sets of O(ε) components: {2 ε μ_i^n (a·y_i) y_i}, and \n\n{ λ_i^n [1 + ε(T_i + Ξ_i)/2λ_i]^n (α_i x_i·w_1(0) + β_i x_i·w_2(0)) (α_i - β_i) x_i \n- λ_i^n [1 + ε(T_i - Ξ_i)/2λ_i]^n (α_i x_i·w_2(0) - β_i x_i·w_1(0)) (α_i + β_i) x_i } \n\nwhere T_i = e_1(x_i) + e_2(x_i). 
Collecting the terms in this expression, and using equation 1, we derive \n\n{ λ_i^n [ (α_i^2 + β_i^2) x_i·a + 2n (ε γ_i Ξ_i / 2λ_i) α_i β_i x_i·b ] x_i } \n\nwhere b = w_1(0) + w_2(0). The second part of this expression depends on n, and is substantial because w_1(0) + w_2(0) is O(1). Such a term does not appear in the unperturbed system, and can bias the competition between the y_i and the x_i eigenvectors, in particular towards the binocular solutions. Again, its precise effects will be sensitive to the unperturbed eigenvalues. \n\n4 CONCLUSIONS \n\nPerturbation analysis applied to simple Hebbian correlational learning rules reveals the following: \n\n• Introducing small anti-correlations within each eye causes a tendency toward binocularity. This agrees with the results of Miller. \n\n• Introducing small positive correlations between the eyes (as will inevitably occur once they experience a natural environment) has the same effect. \n\n• The overall eigensolution is not stable to small perturbations that make the correlational structure of the two eyes unequal. This also produces interesting effects on the growth rates of the eigenvectors concerned, given the initial conditions of approximately equivalent projections from both eyes. \n\nAcknowledgements \n\nWe are very grateful to Ken Miller for helpful discussions, and to Christopher Longuet-Higgins for pointing us in the direction of perturbation analysis. Support was from the SERC and a Nuffield Foundation Science travel grant to GG. \n\nFigure 3: Opposite-eye positive correlation matrix and eigenvectors. Eigenvalues of (y, -y), (x, -x) are 4.8 and 5.4, so ocular dominance is again inhibited. \n\nFigure 4: The effect of random perturbations to the matrix. Although the order is restored (eigenvalues are 7.1 and 6.4), note the (αx, βx) eigenvector. \n\n
GG is grateful to David Willshaw and the Centre for Cognitive Science for their hospitality. GG's current address is The Centre for Cognitive Science, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, Scotland, and correspondence should be directed to him there. \n\nReferences \n\n[1] Goodhill, GJ (1991). Correlations, Competition and Optimality: Modelling the Development of Topography and Ocular Dominance. PhD Thesis, Sussex University. \n\n[2] Linsker, R (1986). From basic network principles to neural architecture (series). Proc. Nat. Acad. Sci., USA, 83, pp 7508-7512, 8390-8394, 8779-8783. \n\n[3] MacKay, DJC & Miller, KD (1990). Analysis of Linsker's simulations of Hebbian rules. Neural Computation, 2, pp 169-182. \n\n[4] MacKay, DJC & Miller, KD (1990). Analysis of Linsker's application of Hebbian rules to linear networks. Network, 1, pp 257-297. \n\n[5] Miller, KD (1989). Correlation-based Mechanisms in Visual Cortex: Theoretical and Empirical Studies. PhD Thesis, Stanford University Medical School. \n\n[6] Miller, KD (1990). Correlation-based mechanisms of neural development. In MA Gluck & DE Rumelhart, editors, Neuroscience and Connectionist Theory. Hillsborough, NJ: Lawrence Erlbaum. \n\n[7] Miller, KD (1990). Derivation of linear Hebbian equations from a nonlinear Hebbian model of synaptic plasticity. Neural Computation, 2, pp 321-333. \n\n[8] Miller, KD, Keller, JB & Stryker, MP (1989). Ocular dominance column development: Analysis and simulation. Science, 245, pp 605-615. \n", "award": [], "sourceid": 575, "authors": [{"given_name": "Peter", "family_name": "Dayan", "institution": null}, {"given_name": "Geoffrey", "family_name": "Goodhill", "institution": null}]}