{"title": "Remarks on Interpolation and Recognition Using Neural Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 939, "page_last": 945, "abstract": null, "full_text": "REMARKS  ON INTERPOLATION AND \nRECOGNITION  USING  NEURAL NETS \n\nEduardo D.  Sontag\u00b7 \nSYCON - Center for  Systems and  Control \nRutgers  University \nNew  Brunswick,  N J  08903 \n\nAbstract \n\nWe consider different  types  of single-hidden-Iayer feedforward  nets:  with \nor  without  direct  input  to  output  connections,  and  using  either  thresh(cid:173)\nold  or  sigmoidal activation functions.  The  main results  show  that  direct \nconnections in  threshold nets  double  the  recognition  but not  the interpo(cid:173)\nlation power, while using sigmoids  rather than thresholds allows (at least) \ndoubling  both.  Various results are also given on VC dimension and  other \nmeasures of recognition capabilities. \n\n1 \n\nINTRODUCTION \n\nIn this work we continue to develop the theme of comparing threshold and sigmoidal \nfeedforward  nets.  In  (Sontag and  Sussmann,  1989)  we  showed that  the  \"general(cid:173)\nized  delta rule\"  (backpropagation) can give rise  to pathological behavior -namely, \nthe  existence  of spurious  local  minima  even  when  no  hidden  neurons  are  used,(cid:173)\nin  contrast  to  the  situation  that  holds  for  threshold  nets.  On  the  other  hand,  in \n(Sontag and Sussmann,  1989)  we  remarked that  provided that  the  right  variant  be \n'Used,  separable  sets  do  give  rise  to  globally  convergent  back propagation,  in  com(cid:173)\nplete analogy to the classical perceptron learning theorem.  These results and those \nobtained by other authors probably settle most general questions about the case of \nno hidden  units,  so  the  next step  is  to look  at  the case of single  hidden  layers.  In \n(Sontag,  1989)  we  announced  the fact  that sigmoidal  activations  (at  least)  double \nrecognition  power.  Here  we  provide details,  and  we  make several further  remarks \non  this as well as on the topic of interpolation. \nNets  with  one  hidden  layer  are  known  to  be  in  principle  sufficient  for  arbitrary \nrecognition tasks.  This follows from the approximation theorems proved by various \n\n.\"E-mail:  sontag@hilbert.rutgers.edu \n\n\f940 \n\nSontag \n\nauthors:  (Funahashi,1988),  (Cybenko,1989),  and  (Hornik  et. al.,  1989).  However, \nwhat is far  less  clear is  how  many neurons are  needed for  achieving a  given recog(cid:173)\nnition,  interpolation, or approximation objective.  This is of importance both in its \npractical aspects (having rough estimates of how many neurons will be needed is es(cid:173)\nsential when applying back propagation) and in evaluating generalization properties \n(larger  nets  tend  to  lead  to  poorer  generalization).  It is  known  and  easy  to  prove \n(see  for  instance  (Arai,  1989),  (Chester,  1990))  that  one  can  basically  interpolate \nvalues at any n + 1 points using an n-neuron net, and in particular that any n + 1-\npoint set can be dichotomized by such nets.  Among other facts,  we  point out here \nthat allowing direct  input  to output connections  permits  doubling  the  recognition \npower to 2n, and the same result is achieved if sigmoidal neurons are used  but such \ndirect  connections are not allowed.  
The dimension of the input space (that is, the number of "input units") can influence the number of neurons needed, at least for dichotomy problems for suitably chosen sets. In particular, Baum had shown some time back (Baum, 1988) that the VC dimension of threshold nets with a fixed number of hidden units is at least proportional to this dimension. We give lower bounds, in dimension two, at least doubling the VC dimension if sigmoids or direct connections are allowed.

Lack of space precludes the inclusion of proofs; references to technical reports are given as appropriate. A full-length version of this paper is also available from the author.

2  DICHOTOMIES

The first few definitions are standard. Let N be a positive integer. A dichotomy or two-coloring (S_-, S_+) on a set S ⊆ R^N is a partition S = S_- ∪ S_+ of S into two disjoint subsets. A function f : R^N → R will be said to implement this dichotomy if it holds that

    f(u) > 0 for u ∈ S_+   and   f(u) < 0 for u ∈ S_-.

Let F be a class of functions from R^N to R, assumed to be nontrivial, in the sense that for each point u ∈ R^N there is some f_1 ∈ F so that f_1(u) > 0 and some f_2 ∈ F so that f_2(u) < 0. This class shatters the set S ⊆ R^N if each dichotomy on S can be implemented by some f ∈ F.

Here we consider, for any class of functions F as above, the following measures of classification power. First we introduce \bar{μ} and \underline{μ}, dealing with "best" and "worst" cases respectively: \bar{μ}(F) denotes the largest integer l ≥ 1 (possibly ∞) so that there is at least some set S of cardinality l in R^N which can be shattered by F, while \underline{μ}(F) is the largest integer l ≥ 1 (possibly ∞) so that every set of cardinality l can be shattered by F. Note that by definition, \underline{μ}(F) ≤ \bar{μ}(F) for every class F.

In particular, the definitions imply that no set of cardinality \bar{μ}(F) + 1 can be shattered, and that there is at least some set of cardinality \underline{μ}(F) + 1 which cannot be shattered. The integer \bar{μ} is usually called the Vapnik-Chervonenkis (VC) dimension of the class F (see for instance (Baum, 1988)), and appears in formalizations of learning in the distribution-free sense.

A set may fail to be shattered by F because it is very special (see the example below with colinear points). In that sense, a more robust measure is useful: μ(F) is the largest integer l ≥ 1 (possibly ∞) for which the class of sets S that can be shattered by F is dense, in the sense that given any l-element set S = {s_1, ..., s_l} there are points \tilde{s}_i arbitrarily close to the respective s_i's such that \tilde{S} = {\tilde{s}_1, ..., \tilde{s}_l} can be shattered by F. Note that

    \underline{μ}(F) ≤ μ(F) ≤ \bar{μ}(F)                    (1)

for all F.

To obtain an upper bound m for μ(F) one needs to exhibit an open class of sets of cardinality m + 1 none of which can be shattered.

Take as an example the class F consisting of all affine functions f(x, y) = ax + by + c on R^2. Since any three points can be shattered by an affine map provided that they are not colinear (just choose a line ax + by + c = 0 that separates any point which is colored differently from the rest), it follows that 3 ≤ μ. On the other hand, no set of four points can ever be dichotomized, which implies that \bar{μ} ≤ 3 and therefore the conclusion μ = \bar{μ} = 3 for this class. (The negative statement can be verified by a case by case analysis: if the four points form the vertices of a 4-gon, color them in "XOR" fashion, alternate vertices of the same color; if 3 form a triangle and the remaining one is inside, color the extreme points differently from the remaining one; if all are colinear then use an alternating coloring.) Finally, since there is some set of 3 points which cannot be dichotomized (any set of three colinear points is like this), but every set of two can, \underline{μ} = 2.
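
The example above can be checked numerically. The following Python sketch (added here as an illustration; the helper name, the point coordinates, and the use of a linear-programming feasibility test are my choices, not anything from the paper) tests each coloring of a small planar set for implementability by an affine function, confirming that a non-colinear triple is shattered while the two "XOR" colorings of a 4-gon fail.

    import itertools
    import numpy as np
    from scipy.optimize import linprog

    def affinely_separable(points, labels):
        """Check whether some f(x, y) = a*x + b*y + c satisfies
        labels[i] * f(p_i) >= 1 for all points (a strict separation, rescaled to margin 1).
        Feasibility is tested as a linear program in (a, b, c)."""
        A_ub = np.array([[-l * x, -l * y, -l] for (x, y), l in zip(points, labels)])
        b_ub = -np.ones(len(points))
        res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * 3, method="highs")
        return res.success

    triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]            # not colinear
    square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]  # vertices of a 4-gon

    for name, pts in [("triangle", triangle), ("square", square)]:
        colorings = list(itertools.product([-1, 1], repeat=len(pts)))
        ok = sum(affinely_separable(pts, labels) for labels in colorings)
        print(name, ":", ok, "of", len(colorings), "colorings implementable")
    # Expected: triangle 8 of 8 (shattered); square 14 of 16 (the two XOR colorings fail).
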
We shall say that F is robust if, whenever S can be shattered by F, every small enough perturbation of S can also be shattered. For a robust class and l = μ(F), every set in an open dense subset in the above topology, i.e. almost every set of l elements, can be shattered.

3  NETS

We define a "neural net" as a function of a certain type, corresponding to the idea of feedforward interconnections, via additive links, of neurons each of which has a scalar response or activation function θ.

Definition 3.1  Let θ : R → R be any function. A function f : R^N → R is a single-hidden-layer neural net with k hidden neurons of type θ and N inputs, or just a (k, θ)-net, if there are real numbers w_0, w_1, ..., w_k, τ_1, ..., τ_k and vectors v_0, v_1, ..., v_k ∈ R^N such that, for all u ∈ R^N,

    f(u) = w_0 + v_0·u + Σ_{i=1}^{k} w_i θ(v_i·u − τ_i)            (2)

where the dot indicates inner product. A net with no direct i/o connections is one for which v_0 = 0.

For fixed θ, and under mild assumptions on θ, such neural nets can be used to approximate uniformly arbitrary continuous functions on compacts. In particular, they can be used to implement arbitrary dichotomies.

In neural net practice, one often takes θ to be the standard sigmoid σ(x) = 1/(1 + e^{−x}) or, equivalently, up to translations and change of coordinates, the hyperbolic tangent tanh(x). Another usual choice is the hardlimiter, threshold, or Heaviside function

    H(x) = 0 if x ≤ 0,   1 if x > 0,

which can be approximated well by σ(γx) when the "gain" γ is large. Yet another possibility is the use of the piecewise linear function

    π(x) = −1 if x ≤ −1,   1 if x > 1,   x otherwise.
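
To make Definition 3.1 concrete, here is a minimal Python sketch of a (k, θ)-net as in equation (2), together with the three activation functions just discussed; the function and variable names are mine and the specific parameter values are only illustrative, not taken from the paper.

    import numpy as np

    def sigmoid(x):                    # standard sigmoid
        return 1.0 / (1.0 + np.exp(-x))

    def heaviside(x):                  # threshold activation H, with H(x) = 0 for x <= 0
        return np.where(x > 0, 1.0, 0.0)

    def pw_linear(x):                  # piecewise linear activation
        return np.clip(x, -1.0, 1.0)

    def k_theta_net(u, w0, v0, W, V, tau, theta):
        """Evaluate f(u) = w0 + v0.u + sum_i W[i] * theta(V[i].u - tau[i]),
        a single-hidden-layer (k, theta)-net as in equation (2).
        Setting v0 = 0 gives a net with no direct input-output connections."""
        u = np.asarray(u, dtype=float)
        hidden = theta(V @ u - tau)            # the k hidden-unit activations
        return w0 + np.dot(v0, u) + np.dot(W, hidden)

    # Example: a (2, H)-net with one input (N = 1) and no direct connections.
    f = lambda x: k_theta_net([x], w0=0.0, v0=np.zeros(1),
                              W=np.array([1.0, -2.0]),
                              V=np.array([[1.0], [1.0]]),
                              tau=np.array([0.5, 1.5]),
                              theta=heaviside)
    print([float(f(x)) for x in (0.0, 1.0, 2.0)])   # a step function taking values 0, 1, -1
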
Most analysis has been done for H and no direct connections, but numerical techniques typically use the standard sigmoid (or equivalently tanh). The activation π will be useful as an example for which sharper bounds can be obtained. The examples σ and π, but not H, are particular cases of the following more general type of activation function:

Definition 3.2  A function θ : R → R will be called a sigmoid if these two properties hold:

(S1)  t_+ := lim_{x→+∞} θ(x) and t_− := lim_{x→−∞} θ(x) exist, and t_+ ≠ t_−.

(S2)  There is some point c such that θ is differentiable at c and θ'(c) = μ ≠ 0.

All the examples above lead to robust classes, in the sense defined earlier. More precisely, assume that θ is continuous except for at most finitely many points x, and it is left continuous at such x, and let F be the class of (k, θ)-nets, for any fixed k. Then F is robust, and the same statement holds for nets with no direct connections.

4  CLASSIFICATION RESULTS

We let μ(k, θ, N) denote μ(F), where F is the class of (k, θ)-nets in R^N with no direct connections, and similarly for \bar{μ} and \underline{μ}; a superscript d is used for the class of arbitrary such nets (with possible direct connections from input to output). The lower measure \underline{μ} is independent of dimension:

Lemma 4.1  For each k, θ, N, \underline{μ}(k, θ, N) = \underline{μ}(k, θ, 1) and \underline{μ}^d(k, θ, N) = \underline{μ}^d(k, θ, 1).

This justifies denoting these quantities just as \underline{μ}(k, θ) and \underline{μ}^d(k, θ) respectively, as we do from now on, and giving proofs only for N = 1.

Lemma 4.2  For any sigmoid θ, and for each k, N,

    \underline{μ}(k + 1, θ, N) ≥ \underline{μ}^d(k, H, N)

and similarly for μ and \bar{μ}.

The main results on classification will be as follows.

Theorem 1  For any sigmoid θ, and for each k,

    \underline{μ}(k, H) = k + 1
    \underline{μ}^d(k, H) = 2k + 2
    \underline{μ}(k, θ) ≥ 2k.

Theorem 2  For each k,

    4⌊k/2⌋ ≤ μ(k, H, 2) ≤ 2k + 1
    μ^d(k, H, 2) ≤ 4k + 3.

Theorem 3  For any sigmoid θ, and for each k,

    2k + 1 ≤ \bar{μ}(k, H, 2)
    4k + 3 ≤ \bar{μ}^d(k, H, 2)
    4k − 1 ≤ \bar{μ}(k, θ, 2).

These results are proved in (Sontag, 1990a). The first inequality in Theorem 2 follows from the results in (Baum, 1988), who in fact established a lower bound of 2N⌊k/2⌋ for μ(k, H, N) (and hence for \bar{μ} too), for every N, not just N = 2 as in the theorem above. We conjecture, but have as yet been unable to prove, that direct connections or sigmoids should also improve these bounds by at least a factor of 2, just as in the two-dimensional case and in the worst-case analysis. Because of Lemma 4.2, the last statements in Theorems 1 and 3 are consequences of the previous two.

5  SOME PARTICULAR ACTIVATION FUNCTIONS

Consider the last inequality in Theorem 1. For arbitrary sigmoids, this is far too conservative, as the number \underline{μ} can be improved considerably from 2k, even made infinite (see below). We conjecture that for the important practical case θ(x) = σ(x) it is close to optimal, but the only upper bounds that we have are still too high. For the piecewise linear function π, at least, one has equality:

Lemma 5.1  \underline{μ}(k, π) = 2k.

It is worth remarking that there are sigmoids θ, as differentiable as wanted, even real-analytic, for which all classification measures are infinite. Of course, such a function θ is so complicated that there is no reasonably "finite" implementation for it. This remark is only of theoretical interest, to indicate that, unless further restrictions are made on (S1)-(S2), much better bounds can be obtained. (If only μ and \bar{μ} are desired to be infinite, one may also take the simpler example θ(x) = sin(x). Note that for any l rationally independent real numbers x_i, the vectors of the form (sin(γx_1), ..., sin(γx_l)), with γ real, form a dense subset of [−1, 1]^l, so all dichotomies on {x_1, ..., x_l} can be implemented with (1, sin)-nets.)
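
The parenthetical remark above can be checked numerically. The following Python sketch (added for illustration; the particular points, the search range, and the grid step are arbitrary choices of mine) brute-forces a single gain γ for f(x) = sin(γx) and counts how many of the 2^l sign patterns on a rationally independent set are realized; since the sign of f determines the two-coloring, each realized pattern is a dichotomy implemented by a (1, sin)-net, and with a fine enough grid over a long enough range the search should find all of them.

    import itertools
    import numpy as np

    # Rationally independent abscissas: 1, sqrt(2), sqrt(3), sqrt(5).
    xs = np.array([1.0, np.sqrt(2.0), np.sqrt(3.0), np.sqrt(5.0)])
    targets = set(itertools.product([-1, 1], repeat=len(xs)))

    found = {}
    for gamma in np.arange(0.1, 2000.0, 0.01):       # brute-force search over the gain
        pattern = tuple(int(s) for s in np.sign(np.sin(gamma * xs)))
        if 0 not in pattern and pattern not in found:
            found[pattern] = gamma
            if len(found) == len(targets):
                break

    print("realized", len(found), "of", len(targets), "dichotomies with f(x) = sin(gamma*x)")
    for pattern, gamma in sorted(found.items()):
        print(pattern, "at gamma =", round(gamma, 2))
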
Lemma 5.2  There is some sigmoid θ, which can be taken to be an analytic function, so that \underline{μ}(1, θ) = ∞.

6  INTERPOLATION

We now consider the following approximate interpolation problem. Assume given a sequence of k (distinct) points x_1, ..., x_k in R^N, any ε > 0, and any sequence of real numbers y_1, ..., y_k, as well as some class F of functions from R^N to R. We ask if there exists some

    f ∈ F  so that  |f(x_i) − y_i| < ε  for each i.                    (3)

Let λ(F) be the largest integer k ≥ 1, possibly infinite, so that for every set of data as above (3) can be solved. Note that, obviously, λ(F) ≤ \underline{μ}(F). Just as in Lemma 4.1, λ is independent of the dimension N when applied to nets. Thus we let λ^d(k, θ) and λ(k, θ) be respectively the values of λ(F) when applied to (k, θ)-nets with or without direct connections.

We now summarize properties of λ. The next result (see (Sontag, 1991), as well as the full version of this paper, for a proof) should be compared with Theorem 1. The main difference is in the second equality. Note that one can prove λ(k, θ) ≥ λ^d(k − 1, H), in complete analogy with the case of \underline{μ}, but this is not sufficient anymore to be able to derive the last inequality in the Theorem from the second equality.

Theorem 4  For any continuous sigmoid θ, and for each k,

    λ(k, H) = k + 1
    λ^d(k, H) = k + 2
    λ(k, θ) ≥ 2k − 1.

Remark 6.1  Thus we can approximately interpolate any 2k − 1 points using k sigmoidal neurons. It is not hard to prove as a corollary that, for the standard sigmoid, this approximate interpolation property holds in the following stronger sense: for an open dense set of 2k − 1 points, one can achieve an open dense set of values; the proof involves looking first at points with rational coordinates, and using that on such points one is dealing basically with rational functions (after a diffeomorphism), plus some theory of semialgebraic sets. We conjecture that one should be able to interpolate at 2k points. Note that for k = 2 this is easy to achieve: just choose the slope d so that some z_i − z_{i+1} becomes zero and the z_i are allowed to be nonincreasing or nondecreasing. The same proof, changing the signs if necessary, gives the wanted net. For some examples, it is quite easy to get 2k points. For instance, λ(k, π) = 2k for the piecewise linear sigmoid π.
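
As a purely numerical illustration of the last inequality in Theorem 4 (this sketch is mine, not the paper's constructive proof; the sample points, the optimizer, and the tolerance are arbitrary choices), one can fit k = 3 standard-sigmoid units to 2k − 1 = 5 prescribed values by minimizing the squared interpolation error with a few random restarts; the error typically drops below any modest ε.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])      # 2k - 1 = 5 points, k = 3 hidden units
    ys = np.array([1.0, -1.0, 2.0, 0.0, 1.5])     # arbitrary target values

    def net(params, x):
        """(3, sigma)-net with no direct connections: w0 + sum_i w_i * sigma(a_i x - t_i)."""
        w0, w, a, t = params[0], params[1:4], params[4:7], params[7:10]
        return w0 + np.sum(w * 1.0 / (1.0 + np.exp(-(np.outer(x, a) - t))), axis=1)

    def loss(params):
        return np.sum((net(params, xs) - ys) ** 2)

    best = None
    for _ in range(20):                           # a few random restarts
        res = minimize(loss, rng.normal(scale=2.0, size=10), method="BFGS")
        if best is None or res.fun < best.fun:
            best = res

    print("max |f(x_i) - y_i| =", np.max(np.abs(net(best.x, xs) - ys)))
    # With these settings the maximum error typically ends up far below, say, eps = 1e-3.
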
7  FURTHER REMARKS

The main conclusion from Theorem 1 is that sigmoids at least double recognition power for arbitrary sets. It may be the case that \bar{μ}(k, σ, N)/\bar{μ}(k, H, N) ≈ 2 for all N; this is true for N = 1 and is strongly suggested by Theorem 3 (the first bound appears to be quite tight). Unfortunately the proof of this theorem is based on a result from (Asano et al., 1990) regarding arrangements of points in the plane, a fact which does not generalize to dimension three or higher.

One may also compare the power of nets with and without direct connections, or threshold vs sigmoidal processors, on Boolean problems. For instance, it is a trivial consequence of the given results that parity on n bits can be computed with ⌈n/2⌉ hidden sigmoidal units and no direct connections, though requiring (apparently, though this is an open problem) n thresholds. In addition, for some families of Boolean functions, the gap between sigmoidal nets and threshold nets may be infinitely large (Sontag, 1990a). See (Sontag, 1990b) for representation properties of two-hidden-layer nets.

Acknowledgements

This work was supported in part by Siemens Corporate Research, and in part by the CAIP Center, Rutgers University.

References

Arai, M., "Mapping abilities of three-layer neural networks," Proc. IJCNN Int. Joint Conf. on Neural Networks, Washington, June 18-22, 1989, IEEE Publications, 1989, pp. I-419-424.

Asano, T., J. Hershberger, J. Pach, E.D. Sontag, D. Souvaine, and S. Suri, "Separating Bi-Chromatic Points by Parallel Lines," Proceedings of the Second Canadian Conference on Computational Geometry, Ottawa, Canada, 1990, pp. 46-49.

Baum, E.B., "On the capabilities of multilayer perceptrons," J. Complexity 4(1988): 193-215.

Chester, D., "Why two hidden layers are better than one," Proc. Int. Joint Conf. on Neural Networks, Washington, DC, Jan. 1990, IEEE Publications, 1990, pp. I-265-268.

Cybenko, G., "Approximation by superpositions of a sigmoidal function," Math. Control, Signals, and Systems 2(1989): 303-314.

Funahashi, K., "On the approximate realization of continuous mappings by neural networks," Proc. Int. Joint Conf. on Neural Networks, IEEE Publications, 1988, pp. I-641-648.

Hornik, K.M., M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks 2(1989): 359-366.

Sontag, E.D., "Sigmoids distinguish better than Heavisides," Neural Computation 1(1989): 470-472.

Sontag, E.D., "On the recognition capabilities of feedforward nets," Report SYCON-90-03, Rutgers Center for Systems and Control, April 1990.

Sontag, E.D., "Feedback Stabilization Using Two-Hidden-Layer Nets," Report SYCON-90-11, Rutgers Center for Systems and Control, October 1990.

Sontag, E.D., "Capabilities and training of feedforward nets," in Theory and Applications of Neural Networks (R. Mammone and J. Zeevi, eds.), Academic Press, NY, 1991, to appear.

Sontag, E.D., and H.J. Sussmann, "Back propagation can give rise to spurious local minima even for networks without hidden layers," Complex Systems 3(1989): 91-106.

Sontag, E.D., and H.J. Sussmann, "Backpropagation separates where perceptrons do," Neural Networks (1991), to appear.
", "award": [], "sourceid": 436, "authors": [{"given_name": "Eduardo", "family_name": "Sontag", "institution": null}]}