{"title": "Exponentially many local minima for single neurons", "book": "Advances in Neural Information Processing Systems", "page_first": 316, "page_last": 322, "abstract": null, "full_text": "Exponentially many local minima for single neurons \n\nPeter Auer \n\nMark Herbster \n\nManfred K. Warmuth \n\nDepartment of Computer Science \n\nSanta Cruz, California \n\n{pauer,mark,manfred}@cs.ucsc.edu \n\nAbstract \n\nWe show that for a single neuron with the logistic function as the transfer function, the number of local minima of the error function based on the square loss can grow exponentially in the dimension. \n\n1 INTRODUCTION \n\nConsider a single artificial neuron with $d$ inputs. The neuron has $d$ weights $w \\in R^d$. The output of the neuron for an input pattern $x \\in R^d$ is $\\hat{y} = \\phi(x \\cdot w)$, where $\\phi : R \\to R$ is a transfer function. For a given sequence of training examples $((x_t, y_t))_{1 \\le t \\le m}$, each consisting of a pattern $x_t \\in R^d$ and a desired output $y_t \\in R$, the goal of the training phase for neural networks consists of minimizing the error function with respect to the weight vector $w \\in R^d$. This function is the sum of the losses between outputs of the neuron and the desired outputs, summed over all training examples. In notation, the error function is \n\n$$E(w) = \\sum_{t=1}^{m} L(y_t, \\phi(x_t \\cdot w)),$$ \n\nwhere $L : R \\times R \\to [0, \\infty)$ is the loss function. \n\nA common example of a transfer function is the logistic function $\\mathrm{logistic}(z) = \\frac{1}{1 + e^{-z}}$, which has the bounded range $(0, 1)$. In contrast, the identity function $\\mathrm{id}(z) = z$ has unbounded range. One of the most common loss functions is the square loss $L(y, \\hat{y}) = (y - \\hat{y})^2$. Other examples are the absolute loss $|y - \\hat{y}|$ and the entropic loss $y \\ln \\frac{y}{\\hat{y}} + (1 - y) \\ln \\frac{1 - y}{1 - \\hat{y}}$. \n\nWe show that for the square loss and the logistic function the error function of a single neuron for $n$ training examples may have $\\lfloor n/d \\rfloor^d$ local minima. 
More generally, this holds for any loss and transfer function for which the composition of the loss function with the transfer function (in notation $L(y, \\phi(x \\cdot w))$) is continuous and has bounded range. This proves that for any transfer function with bounded range exponentially many local minima can occur when the loss function is the square loss. \n\nFigure 1: Error Function with 25 Local Minima (16 Visible), Generated by 10 Two-Dimensional Examples. \n\nThe sequences of examples that we use in our proofs have the property that they are non-realizable in the sense that there is no weight vector $w \\in R^d$ for which the error function is zero, i.e. the neuron cannot produce the desired output for all examples. We show with some minimal assumptions on the loss and transfer functions that for a single neuron there can be no local minima besides the global minimum if the examples are realizable. \n\nIf the transfer function is the logistic function then it has often been suggested in the literature to use the entropic loss in artificial neural networks in place of the square loss [BW88, WD88, SLF88, Wat92]. In that case the error function of a single neuron is convex and thus has only one minimum even in the non-realizable case. We generalize this observation by defining a matching loss for any differentiable increasing transfer function $\\phi$: \n\n$$L_\\phi(y, \\hat{y}) = \\int_{\\phi^{-1}(y)}^{\\phi^{-1}(\\hat{y})} (\\phi(z) - y) \\, dz.$$ \n\nThe loss is the area depicted in Figure 2a. If $\\phi$ is the identity function then $L_\\phi$ is the square loss; likewise, if $\\phi$ is the logistic function then $L_\\phi$ is the entropic loss. For the matching loss the gradient descent update for minimizing the error function for a sequence of examples is simply \n\n$$w_{new} := w_{old} - \\eta \\sum_{t=1}^{m} (\\phi(x_t \\cdot w_{old}) - y_t) x_t,$$ \n\nwhere $\\eta$ is a positive learning rate. Also the second derivatives are easy to calculate for this general setting: $\\frac{\\partial^2 L_\\phi(y_t, \\phi(x_t \\cdot w))}{\\partial w_i \\partial w_j} = \\phi'(x_t \\cdot w) x_{t,i} x_{t,j}$. Thus, if $H_t(w)$ is the Hessian of $L_\\phi(y_t, \\phi(x_t \\cdot w))$ with respect to $w$ then $v^T H_t(w) v = \\phi'(x_t \\cdot w)(v \\cdot x_t)^2$. Thus $H_t$ is positive semi-definite for any increasing differentiable transfer function. Clearly $\\sum_{t=1}^{m} H_t(w)$ is the Hessian of the error function $E(w)$ for a sequence of $m$ examples and it is also positive semi-definite. It follows that for any differentiable increasing transfer function the error function with respect to the matching loss is always convex. \n\nFigure 2: (a) The Matching Loss Function $L_\\phi$. (b) The Square Loss becomes Saturated, the Entropic Loss does not. \n\nWe show that in the case of one neuron the logistic function paired with the square loss can lead to exponentially many minima. It is open whether the number of local minima grows exponentially for some natural data. However there is another problem with the pairing of the logistic and the square loss that makes it hard to optimize the error function with gradient based methods. This is the problem of flat regions. Consider one example $(x, y)$ consisting of a pattern $x$ (such that $x$ is not equal to the all zero vector) and the desired output $y$. Then the square loss $(\\mathrm{logistic}(x \\cdot w) - y)^2$, for $y \\in [0, 1]$ and $w \\in R^d$, turns flat as a function of $w$ when $\\hat{y} = \\mathrm{logistic}(x \\cdot w)$ approaches zero or one (for example see Figure 2b where $d = 1$ and $y = 0$). It is easy to see that for all bounded transfer functions with a finite number of minima and corresponding bounded loss functions, the same phenomenon occurs. 
In other words, the composition $L(y, \\phi(x \\cdot w))$ of the square loss with any bounded transfer function $\\phi$ which has a finite number of extrema turns flat as $|x \\cdot w|$ becomes large. Similarly, for multiple examples the error function $E(w)$ as defined above becomes flat. In flat regions the gradients with respect to the weight vector $w$ are small, and thus gradient-based updates of the weight vector may have a hard time moving the weight vector out of these flat regions. This phenomenon can easily be observed in practice and is sometimes called \"saturation\" [Hay94]. In contrast, if the logistic function is paired with the entropic loss (see Figure 2b), then the error function turns flat only at the global minimum. The same holds for any increasing differentiable transfer function and its matching loss function. \n\nA number of previous papers discussed conditions necessary and sufficient for multiple local minima of the error function of single neurons or otherwise small networks [WD88, SS89, BRS89, Blu89, SS91, GT92]. This previous work only discusses the occurrence of multiple local minima whereas in this paper we show that the number of such minima can grow exponentially with the dimension. Also the previous work has mainly been limited to the demonstration of local minima in networks or neurons that have used the hyperbolic tangent or logistic function with the square loss. Here we show that exponentially many minima occur whenever the composition of the loss function with the transfer function is continuous and bounded. \n\nThe paper is outlined as follows. After some preliminaries in the next section, we give formal statements and proofs of the results mentioned above in Section 3. At first (Section 3.1) we show that $n$ one-dimensional examples might result in $n$ local minima of the error function (see e.g. Figure 3a for the error function of two one-dimensional examples). From the local minima in one dimension it follows easily that $n$ $d$-dimensional examples might result in $\\lfloor n/d \\rfloor^d$ local minima of the error function (see Figure 1 and discussion in Section 3.2). \n\nFigure 3: (a) Error Function for the Logistic Transfer Function and the Square Loss with Examples ((10, .55), (.7, .25)). (b) Sets of Minima can be Combined. \n\nWe then consider neurons with a bias (Section 4), i.e. we add an additional input that is clamped to one. The error function for a sequence of examples $S = ((x_t, y_t))_{1 \\le t \\le m}$ is now \n\n$$E_S(B, w) = \\sum_{t=1}^{m} L(y_t, \\phi(B + w \\cdot x_t)),$$ \n\nwhere $B$ denotes the bias, i.e. the weight of the input that is clamped to one. We can prove that the error function might have $\\lfloor n/2d \\rfloor^d$ local minima if loss and transfer function are symmetric. This holds for example for the square loss and the logistic transfer function. The proofs are omitted due to space constraints. They are given in the full paper [AHW96], together with additional results for general loss and transfer functions. \n\nFinally we show in Section 5 that, with minimal assumptions on transfer and loss functions, there is only one minimum of the error function if the sequence of examples is realizable by the neuron. \n\nThe essence of the proofs is quite simple. At first observe that if loss and transfer function are bounded and the domain is unbounded, then there exist areas of saturation where the error function is essentially flat. Furthermore the error function is \"additive\", i.e. 
the error function produced by examples in $S \\cup S'$ is simply the error function produced by the examples in $S$ added to the error function produced by the examples in $S'$, $E_{S \\cup S'} = E_S + E_{S'}$. Hence the local minima of $E_S$ remain local minima of $E_{S \\cup S'}$ if they fall into an area of saturation of $E_{S'}$. Similarly, the local minima of $E_{S'}$ remain local minima of $E_{S \\cup S'}$ as well (see Figure 3b). In this way sets of local minima can be combined. \n\n2 PRELIMINARIES \n\nWe introduce the notion of minimum-containing set which will prove useful for counting the minima of the error function. \n\nDefinition 2.1 Let $f : R^d \\to R$ be a continuous function. Then an open and bounded set $U \\subseteq R^d$ is called a minimum-containing set for $f$ if for each $w$ on the boundary of $U$ there is a $w^* \\in U$ such that $f(w^*) < f(w)$. \n\nObviously any minimum-containing set contains a local minimum of the respective function. Furthermore each of $n$ disjoint minimum-containing sets contains a distinct local minimum. Thus it is sufficient to find $n$ disjoint minimum-containing sets in order to show that a function has at least $n$ local minima. \n\n3 MINIMA FOR NEURONS WITHOUT BIAS \n\nWe will consider transfer functions $\\phi$ and loss functions $L$ which have the following property: \n\n(P1): The transfer function $\\phi : R \\to R$ is non-constant. The loss function $L : \\phi(R) \\times \\phi(R) \\to [0, \\infty)$ has the property that $L(y, y) = 0$ and $L(y, \\hat{y}) > 0$ for all $y \\ne \\hat{y} \\in \\phi(R)$. Finally the function $L(\\cdot, \\phi(\\cdot)) : \\phi(R) \\times R \\to [0, \\infty)$ is continuous and bounded. \n\n3.1 ONE MINIMUM PER EXAMPLE IN ONE DIMENSION \n\nTheorem 3.1 Let $\\phi$ and $L$ satisfy (P1). Then for all $n \\ge 1$ there is a sequence of $n$ examples $S = ((x_1, y), \\ldots, (x_n, y))$, $x_t \\in R$, $y \\in \\phi(R)$, such that $E_S(w)$ has $n$ distinct local minima. 
\n\nSince $L(y, \\phi(w))$ is continuous and non-constant there are $w^-, w^*, w^+ \\in R$ such that the values $\\phi(w^-), \\phi(w^*), \\phi(w^+)$ are all distinct. Furthermore we can assume without loss of generality that $0 < w^- < w^* < w^+$. Now set $y = \\phi(w^*)$. If the error function $L(y, \\phi(w))$ has infinitely many local minima then Theorem 3.1 follows immediately, e.g. by setting $x_1 = \\ldots = x_n = 1$. If $L(y, \\phi(w))$ has only finitely many minima then $\\lim_{w \\to \\infty} L(y, \\phi(w)) = L(y, \\phi(\\infty))$ exists since $L(y, \\phi(w))$ is bounded and continuous. We use this fact in the following lemma. It states that we get a new minimum-containing set by adding an example in the area of saturation of the error function. \n\nLemma 3.2 Assume that $\\lim_{w \\to \\infty} L(y, \\phi(w))$ exists. Let $S = ((x_1, y_1), \\ldots, (x_n, y_n))$ be a sequence of examples and $0 < w_1^- < w_1^* < w_1^+ < \\ldots < w_n^- < w_n^* < w_n^+$ such that $E_S(w_t^-) > E_S(w_t^*)$ and $E_S(w_t^*) < E_S(w_t^+)$ for $t = 1, \\ldots, n$. Let $S' = ((x_0, y), (x_1, y_1), \\ldots, (x_n, y_n))$ where $x_0$ is sufficiently large. Furthermore let $w_0^* = w^*/x_0$ and $w_0^\\pm = w^\\pm/x_0$ (where $w^-, w^*, w^+, y = \\phi(w^*)$ are as above). Then $0 < w_0^- < w_0^* < w_0^+ < w_1^- < w_1^* < w_1^+ < \\ldots < w_n^- < w_n^* < w_n^+$ and \n\n$$E_{S'}(w_t^-) > E_{S'}(w_t^*) \\text{ and } E_{S'}(w_t^*) < E_{S'}(w_t^+) \\text{ for } t = 0, \\ldots, n. \\qquad (1)$$ \n\nProof. We have to show that for all $x_0$ sufficiently large condition (1) is satisfied, i.e. that \n\n$$\\lim_{x_0 \\to \\infty} E_{S'}(w_t^-) > \\lim_{x_0 \\to \\infty} E_{S'}(w_t^*) \\text{ and } \\lim_{x_0 \\to \\infty} E_{S'}(w_t^*) < \\lim_{x_0 \\to \\infty} E_{S'}(w_t^+) \\text{ for } t = 0, \\ldots, n. \\qquad (2)$$ \n\nWe get \n\n$$\\lim_{x_0 \\to \\infty} E_{S'}(w_0^*) = L(y, \\phi(w^*)) + \\lim_{x_0 \\to \\infty} E_S(w^*/x_0) = L(y, \\phi(w^*)) + E_S(0),$$ \n\nrecalling that $w_0^* = w^*/x_0$ and $S' = S \\cup (x_0, y)$. Analogously \n\n$$\\lim_{x_0 \\to \\infty} E_{S'}(w_0^\\pm) = L(y, \\phi(w^\\pm)) + E_S(0).$$ \n\nThus equation (2) holds for $t = 0$. For $t = 1, \\ldots, n$ we get \n\n$$\\lim_{x_0 \\to \\infty} E_{S'}(w_t^*) = \\lim_{x_0 \\to \\infty} L(y, \\phi(w_t^* x_0)) + E_S(w_t^*) = L(y, \\phi(\\infty)) + E_S(w_t^*)$$ \n\nand \n\n$$\\lim_{x_0 \\to \\infty} E_{S'}(w_t^\\pm) = L(y, \\phi(\\infty)) + E_S(w_t^\\pm).$$ \n\nSince $E_S(w_t^*) < E_S(w_t^\\pm)$ for $t = 1, \\ldots, n$, the lemma follows. $\\Box$ \n\nProof of Theorem 3.1. The theorem follows by induction from Lemma 3.2 since each interval $(w_t^-, w_t^+)$ is a minimum-containing set for the error function. $\\Box$ \n\nRemark. Though the proof requires the magnitude of the examples to be arbitrarily large,$^1$ in practice local minima show up for even moderately sized $w$ (see Figure 3a). \n\n3.2 CURSE OF DIMENSIONALITY: THE NUMBER OF MINIMA MIGHT GROW EXPONENTIALLY WITH THE DIMENSION \n\nWe show how the 1-dimensional minima of Theorem 3.1 can be combined to obtain $d$-dimensional minima. \n\nLemma 3.3 Let $f : R \\to R$ be a continuous function with $n$ disjoint minimum-containing sets $U_1, \\ldots, U_n$. Then the sets $U_{t_1} \\times \\ldots \\times U_{t_d}$, $t_j \\in \\{1, \\ldots, n\\}$, are $n^d$ disjoint minimum-containing sets for the function $g : R^d \\to R$, $g(x_1, \\ldots, x_d) = f(x_1) + \\ldots + f(x_d)$. \n\nProof. Omitted. $\\Box$ \n\nTheorem 3.4 Let $\\phi$ and $L$ satisfy (P1). Then for all $n \\ge 1$ there is a sequence of examples $S = ((x_1, y), \\ldots, (x_n, y))$, $x_t \\in R^d$, $y \\in \\phi(R)$, such that $E_S(w)$ has $\\lfloor n/d \\rfloor^d$ distinct local minima. \n\nProof. By Lemma 3.2 there exists a sequence of one-dimensional examples $S' = ((x_1, y), \\ldots, (x_{\\lfloor n/d \\rfloor}, y))$ such that $E_{S'}(w)$ has $\\lfloor n/d \\rfloor$ disjoint minimum-containing sets. Thus by Lemma 3.3 the error function $E_S(w)$ has $\\lfloor n/d \\rfloor^d$ disjoint minimum-containing sets where $S = (((x_1, 0, \\ldots, 0), y), \\ldots, ((x_{\\lfloor n/d \\rfloor}, 0, \\ldots, 0), y), \\ldots, ((0, \\ldots, x_1), y), \\ldots, ((0, \\ldots, x_{\\lfloor n/d \\rfloor}), y))$. 
$\\Box$ \n\n4 MINIMA FOR NEURONS WITH A BIAS \n\nTheorem 4.1 Let the transfer function $\\phi$ and the loss function $L$ satisfy $\\phi(B_0 + z) - \\phi_0 = \\phi_0 - \\phi(B_0 - z)$ and $L(\\phi_0 + y, \\phi_0 + \\hat{y}) = L(\\phi_0 - y, \\phi_0 - \\hat{y})$ for some $B_0, \\phi_0 \\in R$ and all $z \\in R$, $y, \\hat{y} \\in \\phi(R)$. Furthermore let $\\phi$ have a continuous second derivative and assume that the first derivative of $\\phi$ at $B_0$ is non-zero. At last let $\\frac{\\partial^2}{\\partial \\hat{y}^2} L(y, \\hat{y})$ be continuous in $y$ and $\\hat{y}$, $L(y, y) = 0$ for all $y \\in \\phi(R)$, and $\\left(\\frac{\\partial^2}{\\partial \\hat{y}^2} L(y, \\hat{y})\\right)(\\phi_0, \\phi_0) > 0$. Then for all $n \\ge 1$ there is a sequence of examples $S = ((x_1, y_1), \\ldots, (x_n, y_n))$, $x_t \\in R^d$, $y_t \\in \\phi(R)$, such that $E_S(B, w)$ has $\\lfloor n/2d \\rfloor^d$ distinct local minima. \n\nNote that the square loss along with either the hyperbolic tangent or logistic transfer function satisfies the conditions of the theorem. \n\n$^1$ There is a parallel proof where the magnitudes of the examples may be arbitrarily small. \n\n5 ONE MINIMUM IN THE REALIZABLE CASE \n\nWe show that when transfer and loss function are monotone and the examples are realizable then there is only a single minimal surface. A sequence of examples $S$ is realizable if $E_S(w) = 0$ for some $w \\in R^d$. \n\nTheorem 5.1 Let $\\phi$ and $L$ satisfy (P1). Furthermore let $\\phi$ be monotone and $L$ such that $L(y, y + r_1) \\le L(y, y + r_2)$ for $0 \\le r_1 \\le r_2$ or $0 \\ge r_1 \\ge r_2$. Assume that for some sequence of examples $S$ there is a weight vector $w_0 \\in R^d$ such that $E_S(w_0) = 0$. Then for each $w_1 \\in R^d$ the function $h(a) = E_S((1 - a) w_0 + a w_1)$ is increasing for $a \\ge 0$. \n\nThus each minimum $w_1$ can be connected with $w_0$ by the line segment $\\overline{w_0 w_1}$ such that $E_S(w) = 0$ for all $w$ on $\\overline{w_0 w_1}$. \n\nProof of Theorem 5.1. Let $S = ((x_1, y_1), \\ldots, (x_n, y_n))$. Then $h(a) = \\sum_{t=1}^{n} L(y_t, \\phi(w_0 \\cdot x_t + a (w_1 - w_0) \\cdot x_t))$. Since $y_t = \\phi(w_0 \\cdot x_t)$ it suffices to show that $L(\\phi(z), \\phi(z + a r))$ is monotonically increasing in $a \\ge 0$ for all $z, r \\in R$. Let $0 \\le a_1 \\le a_2$. Since $\\phi$ is monotone we get $\\phi(z + a_1 r) = \\phi(z) + r_1$, $\\phi(z + a_2 r) = \\phi(z) + r_2$ where $0 \\le r_1 \\le r_2$ or $0 \\ge r_1 \\ge r_2$. Thus $L(\\phi(z), \\phi(z + a_1 r)) \\le L(\\phi(z), \\phi(z + a_2 r))$. $\\Box$ \n\nAcknowledgments \n\nWe thank Mike Dooley, Andrew Klinger and Eduardo Sontag for valuable discussions. Peter Auer gratefully acknowledges support from the FWF, Austria, under grant J01028-MAT. Mark Herbster and Manfred Warmuth were supported by NSF grant IRI-9123692. \n\nReferences \n\n[AHW96] P. Auer, M. Herbster, and M. K. Warmuth. Exponentially many local minima for single neurons. Technical Report UCSC-CRL-96-1, Univ. of Calif. Computer Research Lab, Santa Cruz, CA, 1996. In preparation. \n\n[Blu89] E.K. Blum. Approximation of boolean functions by sigmoidal networks: Part i: Xor and other two-variable functions. Neural Computation, 1:532-540, February 1989. \n\n[BRS89] M.L. Brady, R. Raghavan, and J. Slawny. Back propagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36(5):665-674, May 1989. \n\n[BW88] E. Baum and F. Wilczek. Supervised learning of probability distributions by neural networks. In D.Z. Anderson, editor, Neural Information Processing Systems, pages 52-61, New York, 1988. American Institute of Physics. \n\n[GT92] Marco Gori and Alberto Tesi. On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(1):76-86, 1992. \n\n[Hay94] S. Haykin. Neural Networks: a Comprehensive Foundation. Macmillan, New York, NY, 1994. \n\n[SLF88] S. A. Solla, E. Levin, and M. Fleisher. Accelerated learning in layered neural networks. Complex Systems, 2:625-639, 1988. \n\n[SS89] E.D. Sontag and H.J. Sussmann. Backpropagation can give rise to spurious local minima even for networks without hidden layers. Complex Systems, 3(1):91-106, February 1989. \n\n[SS91] E.D. Sontag and H.J. Sussmann. Back propagation separates where perceptrons do. Neural Networks, 4(3), 1991. \n\n[Wat92] R. L. Watrous. A comparison between squared error and relative entropy metrics using several optimization algorithms. Complex Systems, 6:495-505, 1992. \n\n[WD88] B.S. Wittner and J.S. Denker. Strategies for teaching layered networks classification tasks. In D.Z. Anderson, editor, Neural Information Processing Systems, pages 850-859, New York, 1988. American Institute of Physics. \n", "award": [], "sourceid": 1028, "authors": [{"given_name": "Peter", "family_name": "Auer", "institution": null}, {"given_name": "Mark", "family_name": "Herbster", "institution": null}, {"given_name": "Manfred K.", "family_name": "Warmuth", "institution": null}]}