{"title": "Learning in Feedforward Networks with Nonsmooth Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 1056, "page_last": 1063, "abstract": null, "full_text": "Learning in Feedforward Networks with Nonsmooth Functions \n\nNicholas J. Redding\u00b7 \nInformation Technology Division \nDefence Science and Tech. Org. \nP.O. Box 1600 Salisbury \nAdelaide SA 5108 Australia \n\nT. Downs \nIntelligent Machines Laboratory \nDept of Electrical Engineering \nUniversity of Queensland \nBrisbane Q 4072 Australia \n\nAbstract \n\nThis paper is concerned with the problem of learning in networks where some or all of the functions involved are not smooth. Examples of such networks are those whose neural transfer functions are piecewise-linear and those whose error function is defined in terms of the l\u221e norm. \nUp to now, networks whose neural transfer functions are piecewise-linear have received very little consideration in the literature, but the possibility of using an error function defined in terms of the l\u221e norm has received some attention. In this latter work, however, the problems that can occur when gradient methods are used for nonsmooth error functions have not been addressed. \nIn this paper we draw upon some recent results from the field of nonsmooth optimization (NSO) to present an algorithm for the nonsmooth case. Our motivation for this work arose out of the fact that we have been able to show that, in backpropagation, an error function based upon the l\u221e norm overcomes the difficulties which can occur when using the l2 norm. \n\n1 INTRODUCTION \n\nThis paper is concerned with the problem of learning in networks where some or all of the functions involved are not smooth. Examples of such networks are those whose neural transfer functions are piecewise-linear and those whose error function is defined in terms of the l\u221e norm. 
\n\n\u00b7The author can be contacted via email at internet address redding@itd.dsto.oz.au. \n\nUp to now, networks whose neural transfer functions are piecewise-linear have received very little consideration in the literature, but the possibility of using an error function defined in terms of the l\u221e norm has received some attention [1]. In the work described in [1], however, the problems that can occur when gradient methods are used for nonsmooth error functions have not been addressed. \nIn this paper we draw upon some recent results from the field of nonsmooth optimization (NSO) to present an algorithm for the nonsmooth case. Our motivation for this work arose out of the fact that we have been able to show [2]1 that an error function based upon the l\u221e norm overcomes the difficulties which can occur when using backpropagation's l2 norm [4]. \nThe framework for NSO is the class of locally Lipschitzian functions [5]. Locally Lipschitzian functions are a broad class of functions that include, but are not limited to, \"smooth\" (completely differentiable) functions. (Note, however, that this framework does not include step-functions.) We here present a method for training feedforward networks (FFNs) whose behaviour can be described by a locally Lipschitzian function y = f(w, x), where the input vector x = (x1, ..., xn) is an element of the set of patterns X \u2282 Rn, w \u2208 RN is the weight vector, and y \u2208 Rm is the m-dimensional output. \nThe possible networks that fit within the locally Lipschitzian framework include any network that has a continuous, piecewise differentiable description, i.e., continuous functions with nondifferentiable points (\"nonsmooth functions\"). \nTraining a network involves the selection of a weight vector w* which minimizes an error function E(w). 
As long as the error function E is locally Lipschitzian, it can be trained by the procedure that we will outline, which is based upon a new technique for NSO [6]. \nIn Section 2, a description of the difficulties that can occur when gradient methods are applied to nonsmooth problems is presented. In Section 3, a short overview of the Bundle-Trust algorithm [6] for NSO is presented. And in Section 4, details of applying an NSO procedure to training networks with an l\u221e based error function are presented, along with simulation results that demonstrate the viability of the technique. \n\n2 FAILURE OF GRADIENT METHODS \n\nTwo difficulties which arise when gradient methods are applied to nonsmooth problems will be discussed here. The first is that gradient descent sometimes fails to converge to a local minimum, and the second relates to the lack of a stopping criterion for gradient methods. \n\n2.1 THE \"JAMMING\" EFFECT \n\nWe will now show that gradient methods can fail to converge to a local minimum (the \"jamming\" effect [7,8]). The particular example used here is taken from [9]. \nConsider the following function, which has a minimum at the point w* = (0,0): \n\nf1(w) = 3(w1^2 + 2 w2^2). (1) \n\nIf we start at the point w0 = (2,1), it is easily shown that a steepest descent algorithm2 would generate the sequence w1 = (2,-1)/3, w2 = (2,1)/9, ..., so that the sequence {wk} oscillates between points on the two half-lines w1 = 2 w2 and w1 = -2 w2 for w1 \u2265 0, converging to the optimal point w* = (0,0). \n\n1This is quite simple, using a theorem due to Krishnan [3]. \n2This is achieved by repeatedly performing a line search along the steepest descent direction. \n\nFigure 1: A contour plot of the function f3 (the nondifferentiable half-line is w1 \u2264 0, w2 = 0). \n\nNext, from the function f1, 
create a new function f2 in the following manner: \n\nf2(w) = (3(w1^2 + 2 w2^2))^(1/2). (2) \n\nThe gradient at any point of f2 is proportional to the gradient at the same point on f1, so the sequence of points generated by a gradient descent algorithm starting from (2,1) on f2 will be the same as in the case of f1, and will again converge3 to the optimal point, again w* = (0,0). \nLastly, we shift the optimal point away from (0,0), but keep a region including the sequence {wk} unchanged, to create a new function f3(w): \n\nf3(w) = (3(w1^2 + 2 w2^2))^(1/2) if 0 \u2264 |w2| \u2264 2 w1, and f3(w) = (\u221a3/3)(w1 + 4 |w2|) elsewhere. (3) \n\nThe new function f3, depicted in fig. 1, is continuous, has a discontinuous derivative only on the half-line w1 \u2264 0, w2 = 0, and is convex with a \"minimum\" as w1 \u2192 -\u221e. In spite of this, the steepest descent algorithm still converges to the now nonoptimal \"jamming\" point (0,0). A multitude of possible variations to f1 exist that will achieve a similar result, but the point is clear: gradient methods can lead to trouble when applied to nonsmooth problems. \nThis lesson is important, because the backpropagation learning algorithm is a smooth gradient descent technique, and as such will have the difficulties described when it, or an extension (e.g., [1]), is applied to a nonsmooth problem. \n\n2.2 STOPPING CRITERION \n\nThe second significant problem associated with smooth descent techniques in a nonsmooth context occurs with the stopping criterion. In normal smooth circumstances, a stopping criterion is determined using \n\n||\u2207f|| \u2264 \u03b5, (4) \n\nwhere \u03b5 is a small positive quantity determined by the required accuracy. \n\n3Note that for this new sequence of points, the gradient no longer converges to 0 at (0,0), but oscillates between the values \u221a2(1, \u00b11). 
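Both failures can be reproduced in a few lines. The sketch below is ours, not from the paper: it assumes the \u221a3/3 scaling in the linear region of f3 (chosen so that f3 is continuous on the cone boundary) and implements the exact line search by ternary section, which is valid because f3 is convex. Steepest descent from (2,1) collapses onto the nonoptimal jamming point (0,0), while the gradient norm stays near 2, so a criterion like (4) never fires.

```python
import math

def f3(w1, w2):
    # Piecewise function of eq. (3): smooth inside the cone 0 <= |w2| <= 2*w1,
    # linear elsewhere; continuous, convex, decreasing as w1 -> -infinity.
    if 0 <= abs(w2) <= 2 * w1:
        return math.sqrt(3 * (w1 * w1 + 2 * w2 * w2))
    return (math.sqrt(3) / 3) * (w1 + 4 * abs(w2))

def grad_f3(w1, w2):
    # Gradient at differentiable points (the iterates below never hit the kink).
    if 0 < abs(w2) <= 2 * w1:
        r = math.sqrt(3 * (w1 * w1 + 2 * w2 * w2))
        return (3 * w1 / r, 6 * w2 / r)
    s = math.sqrt(3) / 3
    return (s, 4 * s * (1.0 if w2 > 0 else -1.0))

def exact_line_search(w, d, hi):
    # Ternary section over [0, hi]: f3 restricted to a ray is convex,
    # hence unimodal, so this locates the one-dimensional minimizer.
    lo = 0.0
    for _ in range(200):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f3(w[0] + m1 * d[0], w[1] + m1 * d[1]) < f3(w[0] + m2 * d[0], w[1] + m2 * d[1]):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

w = (2.0, 1.0)
for _ in range(25):
    g = grad_f3(*w)
    gn = math.hypot(g[0], g[1])
    d = (-g[0] / gn, -g[1] / gn)                     # steepest descent direction
    step = exact_line_search(w, d, 3 * math.hypot(w[0], w[1]))
    w = (w[0] + step * d[0], w[1] + step * d[1])

jammed_norm = math.hypot(w[0], w[1])                 # ~0: stuck at (0, 0)
grad_norm = math.hypot(*grad_f3(w[0], w[1]))         # ~2: criterion (4) never fires
better_exists = f3(-5.0, 0.0) < f3(w[0], w[1])       # (0, 0) is not the minimum
```

Each exact-line-search step maps (2,\u00b11)/3^k to (2,\u22131)/3^(k+1), exactly the oscillating sequence described above, so the iterates jam at (0,0) even though f3 decreases without bound along the negative w1 axis.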
However, it is frequently the case that the minimum of a nonsmooth function occurs at a nondifferentiable point or \"kink\", and the gradient is of little value around these points. For example, the gradient of f(w) = |w| has a magnitude of 1 no matter how close w is to the optimum at w = 0. \n\n3 NONSMOOTH OPTIMIZATION \n\nFor any locally Lipschitzian function f, the generalized directional derivative always exists, and can be used to define a generalized gradient or subdifferential, denoted by \u2202f, which is a compact convex set4 [5]. A particular element g \u2208 \u2202f(w) is termed a subgradient of f at w [5,10]. In situations where f is strictly differentiable at w, the generalized gradient of f at w is equal to the gradient, i.e., \u2202f(w) = {\u2207f(w)}. \nWe will now discuss the basic aspects of NSO and in particular the Bundle-Trust (BT) algorithm [6]. \nQuite naturally, subgradients in NSO provide a substitute for the gradients in standard smooth optimization using gradient descent. Accordingly, in an NSO procedure, we require the following to be satisfied: \n\nAt every w, we can compute f(w) and any g \u2208 \u2202f(w). (5) \n\nTo overcome the jamming effect, however, it is not sufficient to replace the gradient with a subgradient in a gradient descent algorithm: the strictly local information that this provides about the function's behaviour can be misleading. For example, an approach like this will not change the descent path taken from the starting point (2,1) on the function f3 (see fig. 1). \nThe solution to this problem is to provide some \"smearing\" of the gradient information by enriching the information at w with knowledge of its surroundings. This can be achieved by replacing the strictly local subgradients g \u2208 \u2202f(w) by the collection of all g \u2208 \u2202f(v) for v \u2208 B, where B is a suitable neighbourhood of w, and then defining the \u03b5-generalized gradient \u2202\u03b5 f(w) as \n\n\u2202\u03b5 f(w) \u225c co { \u222a v\u2208B(w,\u03b5) \u2202f(v) }, (6) \n\nwhere \u03b5 > 0 and small, and co denotes a convex hull. 
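A toy illustration of (6), our own sketch rather than anything from the paper: for f(w) = |w| in one dimension, the convex hull of subgradients collected over the ball B(w, \u03b5) is just an interval, so the \u03b5-optimality test 0 \u2208 \u2202\u03b5 f(w) reduces to a sign check.

```python
def subgrad_abs(w):
    # A subgradient of f(w) = |w|; sgn(w) is a valid choice everywhere.
    return 1.0 if w > 0 else (-1.0 if w < 0 else 0.0)

def eps_optimal(w, eps, samples=101):
    # Sample-based stand-in for the eps-generalized gradient of eq. (6):
    # in one dimension co{...} is the interval [min g, max g], so the
    # stopping condition "0 in the eps-generalized gradient" becomes
    # min(g) <= 0 <= max(g) over the sampled neighbourhood.
    gs = [subgrad_abs(w + eps * (2.0 * i / (samples - 1) - 1.0))
          for i in range(samples)]
    return min(gs) <= 0.0 <= max(gs)

far = eps_optimal(0.5, 0.1)    # every nearby subgradient is +1: not optimal
near = eps_optimal(0.05, 0.1)  # the ball straddles the kink at 0: test fires
```

Note how this repairs the stopping-criterion failure above: the plain gradient of |w| never shrinks, but once w is within \u03b5 of the kink the smeared set contains 0.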
These ideas were first used by [7] to overcome the lack of continuity in minimax problems, and have become the basis for extensive work in NSO. \nIn an optimization procedure, points in a sequence {wk, k = 0,1,...} are visited until a point is reached at which a stopping criterion is satisfied. In an NSO procedure, this occurs when a point wk is reached that satisfies the condition 0 \u2208 \u2202\u03b5 f(wk), and the point is said to be \u03b5-optimal. That is, in the case of convex f, the point wk is \u03b5-optimal if \n\nf(wk) \u2264 f(w) + \u03b5 ||w - wk|| + \u03b5 for all w, (7) \n\nand in the case of nonconvex f, \n\nf(wk) \u2264 f(w) + \u03b5 ||w - wk|| + \u03b5 for all w \u2208 B, (8) \n\nwhere B is some neighbourhood of wk of nonzero dimension. Obviously, as \u03b5 \u2192 0, then wk \u2192 w* at which 0 \u2208 \u2202f(w*), i.e., wk is \"within \u03b5\" of the local minimum w*. \nUsually the \u03b5-generalized gradient is not available, and this is why the bundle concept is introduced. The basic idea of a bundle concept in NSO is to replace the \u03b5-generalized gradient by some inner approximating polytope P which will then be used to compute a descent direction. If the polytope P is a sufficiently good approximation to f, then we will find a direction along which to descend (a so-called serious step). In the case where P is not a sufficiently good approximation to f to yield a descent direction, we perform a null step, staying at our current position w, and trying to improve P by adding another subgradient from \u2202f(v) at some point v near to our current position w. \nA natural way of approximating f is by using a cutting plane (CP) approximation. The CP approximation of f(w) at the point wk is given by the expression [6] \n\nmax over 1 \u2264 i \u2264 k of { gi^T (w - wi) + f(wi) }, (9) \n\nwhere gi is a subgradient of f at the point wi. \n\n4In other words, a set of vectors will define the generalized gradient of a nonsmooth function at a single point, rather than a single vector as in the case of smooth functions. 
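A quick sketch (ours) of the cutting-plane model (9) for the convex function f(w) = |w|: each visited point contributes the affine minorant gi (w - wi) + f(wi), and the model is their pointwise maximum, exact at the bundle points and never above f.

```python
def f(w):
    return abs(w)

def subgrad(w):
    # sgn(w) is a subgradient of |w| everywhere (0 at the kink).
    return 1.0 if w > 0 else (-1.0 if w < 0 else 0.0)

bundle = [2.0, -1.0, 0.5]                 # points w_i visited so far
# each cut g_i*(w - w_i) + f(w_i), stored as (slope, intercept)
cuts = [(subgrad(wi), f(wi) - subgrad(wi) * wi) for wi in bundle]

def f_cp(w):
    # eq. (9): the upper envelope of the affine minorants
    return max(g * w + c for g, c in cuts)

grid = [k / 10.0 for k in range(-50, 51)]
never_above = all(f_cp(w) <= f(w) + 1e-12 for w in grid)
exact_at_bundle = all(abs(f_cp(wi) - f(wi)) < 1e-12 for wi in bundle)
```

For |w| three cuts already recover the function exactly; for a general convex f the model is only a lower envelope, which is why the stabilized step below is needed.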
We see then that (9) provides a piecewise linear approximation of convex5 f from below, which will coincide with f at all points wi. For convenience, we redefine the CP approximation in terms of d = w - wk, d \u2208 RN, the vector difference between the point of approximation, w, and the current point in the optimization sequence, wk, giving the CP approximation fCP of f: \n\nfCP(wk, d) = max over 1 \u2264 i \u2264 k of { gi^T d + gi^T (wk - wi) + f(wi) }. (10) \n\nNow, when the CP approximation is minimized to find a descent direction, there is no reason to trust the approximation far away from wk. So, to discourage a large step size, a stabilizing term (1/(2 tk)) d^T d, where tk is positive, is added to the CP approximation. \nIf the CP approximation at wk of f is good enough, then the dk given by \n\ndk = arg min over d of { fCP(wk, d) + (1/(2 tk)) d^T d } (11) \n\nwill produce a descent direction such that a line search along wk + \u03bb dk will find a new point wk+1 at which f(wk+1) < f(wk) (a serious step). It may happen that fCP is such a poor approximation of f that a line search along dk does not yield a descent direction, or yields only a marginal improvement in f. If this occurs, a null step is taken, and one enriches the bundle of subgradients from which the CP approximation is computed by adding a subgradient from \u2202f(wk + \u03bb dk) for small \u03bb > 0. Each serious step guarantees a decrease in f, and a stopping criterion is provided by terminating the algorithm as soon as dk in (11) satisfies the \u03b5-optimality criterion, at which point wk is \u03b5-optimal. These details are the basis of bundle methods in NSO [9,10]. \nThe bundle method described suffers from a weak point: its success depends on the delicate selection of the parameter tk in (11) [6]. This weakness has led to the incorporation of a \"trust region\" concept [11] into the bundle method to obtain the BT (bundle-trust) algorithm [6]. 
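To see (10)-(11) in motion, here is a deliberately simplified proximal cutting-plane loop of our own: one dimension, f(w) = |w|, a fixed tk = 1, every step treated as serious, and the quadratic subproblem (11) solved by brute-force grid search instead of a QP solver, so none of the BT machinery for choosing tk is modelled.

```python
def f(w):
    return abs(w)

def subgrad(w):
    return 1.0 if w > 0 else (-1.0 if w < 0 else 0.0)

w, t = 5.0, 1.0
bundle = [(w, f(w), subgrad(w))]           # triples (w_i, f(w_i), g_i)

for _ in range(10):
    def f_cp(d):
        # eq. (10), written in terms of the trial step d = w_new - w
        return max(g * d + g * (w - wi) + fi for wi, fi, g in bundle)
    # eq. (11) with t_k = 1, minimized over a fine grid on [-4, 4]
    d = min((k / 500.0 for k in range(-2000, 2001)),
            key=lambda d: f_cp(d) + d * d / (2.0 * t))
    w = w + d                               # every step taken as serious
    bundle.append((w, f(w), subgrad(w)))
```

Starting from w = 5, the stabilized steps move down by 1 each iteration and then rest at the minimizer 0, where the model's minimizing step is d = 0, which is precisely the signal that the stopping criterion described above watches for.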
\n\n5In the nonconvex f case, (9) is not an approximation to f from below, and additional tolerance parameters must be considered to accommodate this situation [6]. \n\nTo incorporate a trust region, we define a \"radius\" of a ball in which we can \"trust\" that fCP is a good approximation of f. In the BT algorithm, by following trust region concepts, the choice of tk is not made a priori but is determined during the algorithm by varying tk in a systematic way (trust part) and improving the CP approximation by null steps (bundle part) until a satisfactory CP approximation fCP is obtained along with a ball (in terms of tk) on which we can trust the approximation. Then the dk in (11) will lead to a substantial decrease in f. \nThe full details of the BT algorithm can be found in [6], along with convergence proofs. \n\n4 EXAMPLES \n\n4.1 A SMOOTH NETWORK WITH NONSMOOTH ERROR FUNCTION \n\nThe particular network example we consider here is a two-layer FFN (i.e., one with a single layer of hidden units) where each output unit's value yi is computed from its discriminant function aoi = wi0 + \u03a3 j=1..h wij zj by the transfer function yi = tanh(aoi), where zj is the output of the j-th hidden unit. The j-th hidden unit's output zj is given by zj = tanh(ahj), where ahj = vj0 + \u03a3 k=1..n vjk xk is its discriminant function. The l\u221e error function (which is locally Lipschitzian) is defined to be \n\nE(w) = max over x \u2208 X, 1 \u2264 i \u2264 m of |aoi(x) - ti(x)|, (12) \n\nwhere ti(x) is the desired output of output unit i for the input pattern x \u2208 X. \nTo make use of the BT algorithm described in the previous section, it is necessary to obtain an expression from which a subgradient at w for E(w) in (12) can be computed. 
Using the generalized gradient calculus in [5, Proposition 2.3.12], a subgradient g \u2208 \u2202E(w) is given by the expression6 \n\ng = sgn(aoi'(x') - ti'(x')) \u2207w aoi'(x') for some i', x' \u2208 J, (14) \n\nwhere J is the set of patterns and output indices for which E(w) in (12) attains its maximum value, and the gradient \u2207w aoi'(x') has components \n\n1 w.r.t. wi'0; zj w.r.t. wi'j; (1 - zj^2) wi'j w.r.t. vj0; xk (1 - zj^2) wi'j w.r.t. vjk; and 0 elsewhere. (15) \n\n(Note that here j = 1,2,...,h and k = 1,...,n.) \nThe BT technique outlined in the previous section was applied to the standard XOR and 838 encoder problems using the l\u221e error function in (12) and subgradients from (14,15). \n\n6Note that for a function f(w) = |w| = max{w, -w}, the generalized gradient is given by the expression \n\n\u2202f(w) = 1 if w > 0; co{1, -1} if w = 0; -1 if w < 0, (13) \n\nand a suitable subgradient g \u2208 \u2202f(w) can be obtained by choosing g = sgn(w). \n\nIn all test runs, the BT algorithm was run until convergence to a local minimum of the l\u221e error function occurred, with \u03b5 set at 10^-4. On the XOR problem, over 20 test runs using a randomly initialized 2-2-1 network, an average of 52 function and subgradient evaluations were required. The minimum number of function and subgradient evaluations required in the test runs was 23 and the maximum was 126. On the 838 encoder problem, over 20 test runs using a randomly initialized 8-3-8 network, an average of 334 function and subgradient evaluations were required. For this problem, the minimum number of function and subgradient evaluations required in the test runs was 221 and the maximum was 512. \n\n4.2 A NONSMOOTH NETWORK AND NONSMOOTH ERROR FUNCTION \n\nIn this section we will consider a particular example that employs a network function that is nonsmooth as well as a nonsmooth error function (the l\u221e error function of the previous example). 
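Subgradient expressions like (14)-(15) can be sanity-checked numerically: at weights where the maximum in (12) is attained by a single pattern, E is differentiable there and the subgradient must agree with a finite-difference gradient. A sketch for the 2-2-1 sigmoidal network of Section 4.1, with arbitrary fixed weights chosen by us (not the paper's):

```python
import math

# arbitrary fixed weights for a 2-2-1 network: V[j] = [v_j0, v_j1, v_j2]
V = [[0.3, 0.7, -0.2], [-0.5, 0.4, 0.9]]
W = [0.1, 0.8, -0.6]                       # (w_0, w_1, w_2), one output unit
patterns = [((0.0, 0.0), -1.0), ((0.0, 1.0), 1.0),
            ((1.0, 0.0), 1.0), ((1.0, 1.0), -1.0)]   # XOR, +/-1 targets

def hidden(x, V):
    return [math.tanh(v[0] + v[1] * x[0] + v[2] * x[1]) for v in V]

def discriminant(x, W, V):
    z = hidden(x, V)
    return W[0] + W[1] * z[0] + W[2] * z[1]

def error(W, V):
    # the l-infinity error of eq. (12), taken on the discriminant
    return max(abs(discriminant(x, W, V) - t) for x, t in patterns)

def subgradient(W, V):
    # eqs. (14)-(15): differentiate through the maximizing pattern only
    x, t = max(patterns, key=lambda p: abs(discriminant(p[0], W, V) - p[1]))
    s = 1.0 if discriminant(x, W, V) - t > 0 else -1.0
    z = hidden(x, V)
    gW = [s, s * z[0], s * z[1]]
    gV = [[s * (1.0 - z[j] ** 2) * W[j + 1] * xc for xc in (1.0, x[0], x[1])]
          for j in range(2)]
    return gW, gV

gW, gV = subgradient(W, V)
h = 1e-6
max_diff = 0.0
for i in range(3):                          # output-layer components
    Wp, Wm = list(W), list(W)
    Wp[i] += h
    Wm[i] -= h
    fd = (error(Wp, V) - error(Wm, V)) / (2 * h)
    max_diff = max(max_diff, abs(fd - gW[i]))
for j in range(2):                          # hidden-layer components
    for c in range(3):
        Vp = [row[:] for row in V]
        Vm = [row[:] for row in V]
        Vp[j][c] += h
        Vm[j][c] -= h
        fd = (error(W, Vp) - error(W, Vm)) / (2 * h)
        max_diff = max(max_diff, abs(fd - gV[j][c]))
```

With these weights the maximizing pattern is unique with a comfortable margin, so central differences and the analytic subgradient agree to several decimal places; at a genuine tie the finite-difference quotient would instead straddle two elements of the generalized gradient.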
\nBased on the piecewise-linear network employed by [12], let the i-th output of the network be given by the expression \n\nyi = \u03a3 k=1..n uik xk + \u03a3 j=1..h wij ( \u03a3 k=1..n vjk xk + vj0 ) + wi0, (16) \n\nwith an l\u221e-based error function \n\nE(w) = max over x \u2208 X, 1 \u2264 i \u2264 m of |yi(x) - ti(x)|. (17) \n\nOnce again using the generalized gradient calculus from [5, Proposition 2.3.12], a single subgradient g \u2208 \u2202E(w) is given by g = sgn(yi'(x') - ti'(x')) \u2207w yi'(x') for some maximizing i', x', where \u2207w yi'(x') has components \n\nxk w.r.t. ui'k; 1 w.r.t. wi'0; \u03a3 k vjk xk + vj0 w.r.t. wi'j; wi'j w.r.t. vj0; wi'j xk w.r.t. vjk; and 0 elsewhere. (18) \n\n(Note that j = 1,2,...,h, k = 1,2,...,n.) \nIn all cases the \u03b5 of the stopping criterion is set at 10^-4. On the XOR problem, over 20 test runs using a randomly initialized 2-2-1 network, an average of 43 function and subgradient evaluations were required. The minimum number of function and subgradient evaluations required in the test runs was 30 and the maximum was 60. On the 838 encoder problem, over 20 test runs using a randomly initialized 8-3-8 network, an average of 445 function and subgradient evaluations were required. For this problem, the minimum number of function and subgradient evaluations required in the test runs was 386 and the maximum was 502. \n\n5 CONCLUSIONS \n\nWe have demonstrated the viability of employing NSO for training networks in cases where standard procedures, with their implicit smoothness assumption, would have difficulties or find the task impossible. The particular nonsmooth examples we considered involved an error function based on the l\u221e norm, for the case of a network with sigmoidal characteristics and a network with a piecewise-linear characteristic. \nNonsmooth optimization problems can be dealt with in many different ways. 
A possible alternative approach to the one presented here (one that works for most NSO problems) is to express the problem as a composite function and then solve it using the exact penalty method (termed composite NSO) [11]. Fletcher [11, p. 358] states that in practice this can require a great deal of storage or be too complicated to formulate. In contrast, the BT algorithm solves the more general basic NSO problem and so can be more widely applied than techniques based on composite functions. The BT algorithm is simpler to set up, but this can be at the cost of algorithm complexity and a computational overhead. The BT algorithm, however, does retain the gradient descent flavour of backpropagation because it uses the generalized gradient concept along with a chain rule for computing these (generalized) gradients. Nongradient-based and stochastic methods for NSO do exist, but they were not considered here because they do not retain the gradient-based deterministic flavour. It would be useful to see if these other techniques are faster for practical problems. \nThe message should be clear, however: smooth gradient techniques should be treated with suspicion when a nonsmooth problem is encountered, and in general the more complicated nonsmooth methods should be employed. \n\nReferences \n\n[1] P. Burrascano, \"A norm selection criterion for the generalized delta rule,\" IEEE Transactions on Neural Networks 2 (1991), 125-130. \n[2] N. J. Redding, \"Some Aspects of Representation and Learning in Artificial Neural Networks,\" University of Queensland, PhD Thesis, June 1991. \n[3] T. Krishnan, \"On the threshold order of a Boolean function,\" IEEE Transactions on Electronic Computers EC-15 (1966), 369-372. \n[4] M. L. Brady, R. Raghavan & J. Slawny, \"Backpropagation fails to separate where perceptrons succeed,\" IEEE Transactions on Circuits and Systems 36 (1989). \n[5] F. H. 
Clarke, Optimization and Nonsmooth Analysis, Canadian Mathematical Society Series of Monographs and Advanced Texts, John Wiley & Sons, New York, NY, 1983. \n[6] H. Schramm & J. Zowe, \"A version of the bundle idea for minimizing a nonsmooth function: conceptual ideas, convergence analysis, numerical results,\" SIAM Journal on Optimization (1991), to appear. \n[7] V. F. Dem'yanov & V. N. Malozemov, Introduction to Minimax, John Wiley & Sons, New York, NY, 1974. \n[8] P. Wolfe, \"A method of conjugate subgradients for minimizing nondifferentiable functions,\" in Nondifferentiable Optimization, M. L. Balinski & P. Wolfe, eds., Mathematical Programming Study #3, North-Holland, Amsterdam, 1975, 145-173. \n[9] C. Lemarechal, \"Nondifferentiable Optimization,\" in Optimization, G. L. Nemhauser, A. H. G. Rinnooy Kan & M. J. Todd, eds., Handbooks in Operations Research and Management Science #1, North-Holland, Amsterdam, 1989, 529-572. \n[10] K. C. Kiwiel, Methods of Descent for Nondifferentiable Optimization, Lect. Notes in Math. #1133, Springer-Verlag, New York-Heidelberg-Berlin, 1985. \n[11] R. Fletcher, Practical Methods of Optimization, second edition, John Wiley & Sons, New York, NY, 1987. \n[12] R. Batruni, \"A multilayer neural network with piecewise-linear structure and backpropagation learning,\" IEEE Transactions on Neural Networks 2 (1991), 395-403. \n", "award": [], "sourceid": 494, "authors": [{"given_name": "Nicholas", "family_name": "Redding", "institution": null}, {"given_name": "T.", "family_name": "Downs", "institution": null}]}