{"title": "Threshold Network Learning in the Presence of Equivalences", "book": "Advances in Neural Information Processing Systems", "page_first": 879, "page_last": 886, "abstract": null, "full_text": "Threshold Network Learning in the Presence of Equivalences\n\nJohn Shawe-Taylor\n\nDepartment of Computer Science\n\nRoyal Holloway and Bedford New College\n\nUniversity of London\n\nEgham, Surrey TW20 0EX, UK\n\nAbstract\n\nThis paper applies the theory of Probably Approximately Correct (PAC) learning to multiple output feedforward threshold networks in which the weights conform to certain equivalences. It is shown that the sample size for reliable learning can be bounded above by a formula similar to that required for single output networks with no equivalences. The best previously obtained bounds are improved for all cases.\n\n1 INTRODUCTION\n\nThis paper develops the results of Baum and Haussler [3] bounding the sample sizes required for reliable generalisation of a single output feedforward threshold network. They prove their result using the theory of Probably Approximately Correct (PAC) learning introduced by Valiant [11]. They show that for $0 < \\epsilon \\le 1/2$, if a sample of size\n\n$$m \\ge m_0 = \\frac{64W}{\\epsilon} \\log \\frac{64N}{\\epsilon}$$\n\nis loaded into a feedforward network of linear threshold units with $N$ nodes and $W$ weights, so that a fraction $1 - \\epsilon/2$ of the examples are correctly classified, then with confidence approaching certainty the network will correctly classify a fraction $1 - \\epsilon$ of future examples drawn according to the same distribution. A similar bound was obtained for the case when the network correctly classified the whole sample. The results below will imply a significant improvement to both of these bounds.
\n\nIn many cases training can be simplified if known properties of a problem can be incorporated into the structure of a network before training begins. One such technique is described by Shawe-Taylor [9], though many similar techniques have been applied, as for example in TDNNs [6]. The effect of these restrictions is to constrain groups of weights to take the same value, and learning algorithms are adapted to respect this constraint.\n\nIn this paper we consider the effect of this restriction on the generalisation performance of the networks and in particular the sample sizes required to obtain a given level of generalisation. This extends the work described above by Baum and Haussler [3] by improving their bounds and also improving the results of Shawe-Taylor and Anthony [10], who consider generalisation of multiple-output threshold networks. The remarkable fact is that in all cases the formula obtained is the same, where we now understand the number of weights $W$ to be the number of weight classes, but $N$ is still the number of computational nodes.\n\n2 DEFINITIONS AND MAIN RESULTS\n\n2.1 SYMMETRY AND EQUIVALENCE NETWORKS\n\nWe begin with a definition of threshold networks. To simplify the exposition it is convenient to incorporate the threshold value into the set of weights. This is done by creating a distinguished input that always has value 1 and is called the threshold input. The following is a formal notation for these systems.\n\nA network $\\mathcal{N} = (C, I, O, n_0, E)$ is specified by a set $C$ of computational nodes, a set $I$ of input nodes, a subset $O \\subseteq C$ of output nodes and a node $n_0 \\in I$, called the threshold node.\n
The connectivity is given by a set $E \\subseteq (C \\cup I) \\times C$ of connections, with $\\{n_0\\} \\times C \\subseteq E$.\n\nWith network $\\mathcal{N}$ we associate a weight function $w$ from the set of connections to the real numbers. We say that the network $\\mathcal{N}$ is in state $w$. For input vector $i$ with values in some subset of the set $\\mathbb{R}$ of real numbers, the network computes a function $F_{\\mathcal{N}}(w, i)$.\n\nAn automorphism $\\gamma$ of a network $\\mathcal{N} = (C, I, O, n_0, E)$ is a bijection of the nodes of $\\mathcal{N}$ which fixes $I$ setwise and $\\{n_0\\} \\cup O$ pointwise, such that the induced action fixes $E$ setwise. We say that an automorphism $\\gamma$ preserves the weight assignment $w$ if $w_{ji} = w_{(\\gamma j)(\\gamma i)}$ for all $i \\in I \\cup C$, $j \\in C$. Let $\\gamma$ be an automorphism of a network $\\mathcal{N} = (C, I, O, n_0, E)$ and let $i$ be an input to $\\mathcal{N}$. We denote by $i^\\gamma$ the input whose value on input $k$ is that of $i$ on input $\\gamma^{-1}k$.\n\nThe following theorem is a natural generalisation of part of the Group Invariance Theorem of Minsky and Papert [8] to multi-layer perceptrons.\n\nTheorem 2.1 [9] Let $\\gamma$ be a weight preserving automorphism of the network $\\mathcal{N} = (C, I, O, n_0, E)$ in state $w$. Then for every input vector $i$\n\n$$F_{\\mathcal{N}}(w, i) = F_{\\mathcal{N}}(w, i^\\gamma).$$\n\nFollowing this theorem it is natural to consider the concept of a symmetry network [9]. This is a pair $(\\mathcal{N}, \\Gamma)$, where $\\mathcal{N}$ is a network and $\\Gamma$ a group of weight preserving automorphisms of $\\mathcal{N}$. We will also refer to the automorphisms as symmetries. For a symmetry network $(\\mathcal{N}, \\Gamma)$, we term the orbits of the connections $E$ under the action of $\\Gamma$ the weight classes.\n\nFinally we introduce the concept of an equivalence network.\n
This definition abstracts from the symmetry networks precisely those properties we require to obtain our results. The class of equivalence networks is, however, far larger than that of symmetry networks and includes many classes of networks studied by other researchers [6, 7].\n\nDefinition 2.2 An equivalence network is a threshold network in which an equivalence relation is defined on both weights and nodes. The two relations are required to be compatible in that weights in the same class are connected to nodes in the same class, while nodes in the same class have the same set of input weight connection types. The weights in an equivalence class are at all times required to remain equal.\n\nNote that every threshold network can be viewed as an equivalence network by taking the trivial equivalence relations. We now show that symmetry networks are indeed equivalence networks with the same weight classes and give a further technical lemma. For both lemmas proofs are omitted.\n\nLemma 2.3 A symmetry network $(\\mathcal{N}, \\Gamma)$ is an equivalence network, where the equivalence classes are the orbits of connections and nodes respectively.\n\nLemma 2.4 Let $\\mathcal{N}$ be an equivalence network and $\\mathcal{C}$ be the set of classes of nodes. Then there is an indexing of the classes, $G_i$, $i = 1, \\ldots, n$, such that nodes in $G_i$ do not have connections from nodes in $G_j$ for $j \\ge i$.\n\n2.2 MAIN RESULTS\n\nWe are now in a position to state our main results. Note that throughout this paper log means natural logarithm, while an explicit subscript is used for other bases.\n\nTheorem 2.5 Let $\\mathcal{N}$ be an equivalence network with $W$ weight classes and $N$ computational nodes.\n
If the network correctly computes a function on a set of $m$ inputs drawn independently according to a fixed probability distribution, where\n\n$$m \\ge m_0(\\epsilon, \\delta) = \\frac{1}{\\epsilon (1 - \\sqrt{\\epsilon})} \\left[ \\log\\left( \\frac{1.3}{\\delta} \\right) + 2W \\log\\left( \\frac{6\\sqrt{N}}{\\epsilon} \\right) \\right],$$\n\nthen with probability at least $1 - \\delta$ the error rate of the network will be less than $\\epsilon$ on inputs drawn according to the same distribution.\n\nTheorem 2.6 Let $\\mathcal{N}$ be an equivalence network with $W$ weight classes and $N$ computational nodes. If the network correctly computes a function on a fraction $1 - (1 - \\gamma)\\epsilon$ of $m$ inputs drawn independently according to a fixed probability distribution, where\n\n$$m \\ge m_0(\\epsilon, \\delta, \\gamma) = \\frac{1}{\\gamma^2 \\epsilon (1 - \\sqrt{\\epsilon/N})} \\left[ 4 \\log\\left( \\frac{4}{\\delta} \\right) + 6W \\log\\left( \\frac{4N}{\\gamma^{2/3} \\epsilon} \\right) \\right],$$\n\nthen with probability at least $1 - \\delta$ the error rate of the network will be less than $\\epsilon$ on inputs drawn according to the same distribution.\n\n3 THEORETICAL BACKGROUND\n\n3.1 DEFINITIONS AND PREVIOUS RESULTS\n\nIn order to present results for binary outputs ($\\{0, 1\\}$ functions) and larger ranges in a unified way we will consider throughout the task of learning the graph of a function. All the definitions reduce to the standard ones when the outputs are binary.\n\nWe consider learning from examples as selecting a suitable function from a set $H$ of hypotheses, being functions from a space $X$ to a set $Y$, which has at most countable size. At all times we consider an (unknown) target function\n\n$$c : X \\to Y$$\n\nwhich we are attempting to learn. To this end the space $X$ is required to be a probability space $(X, \\Sigma, \\mu)$, with appropriate regularity conditions so that the sets considered are measurable [4].\n
In particular the hypotheses should be measurable when $Y$ is given the discrete topology, as should the error sets defined below. The space $S = X \\times Y$ is equipped with a $\\sigma$-algebra $\\Sigma \\times 2^Y$ and measure $\\nu = \\nu(\\mu, c)$, defined by its value on sets of the form $U \\times \\{y\\}$:\n\n$$\\nu(U \\times \\{y\\}) = \\mu\\left( U \\cap c^{-1}(y) \\right).$$\n\nUsing this measure the error of a hypothesis is defined to be\n\n$$\\mathrm{er}_\\nu(h) = \\nu\\{ (x, y) \\in S \\mid h(x) \\ne y \\}.$$\n\nThe introduction of $\\nu$ allows us to consider samples being drawn from $S$, as they will automatically reflect the output value of the target. This approach freely generalises to stochastic concepts though we will restrict ourselves to target functions for the purposes of this paper. The error of a hypothesis $h$ on a sample $\\mathbf{x} = ((x_1, y_1), \\ldots, (x_m, y_m)) \\in S^m$ is defined to be\n\n$$\\mathrm{er}_{\\mathbf{x}}(h) = \\frac{1}{m} \\left| \\{ i \\mid h(x_i) \\ne y_i \\} \\right|.$$\n\nWe also define the VC dimension of a set of hypotheses by reference to the product space $S$. Consider a sample $\\mathbf{x} = ((x_1, y_1), \\ldots, (x_m, y_m)) \\in S^m$ and the function\n\n$$\\mathbf{x}^* : H \\to \\{0, 1\\}^m,$$\n\ngiven by $\\mathbf{x}^*(h)_i = 1$ if and only if $h(x_i) = y_i$, for $i = 1, \\ldots, m$. We can now define the growth function $B_H(m)$ as\n\n$$B_H(m) = \\max_{\\mathbf{x} \\in S^m} \\left| \\{ \\mathbf{x}^*(h) \\mid h \\in H \\} \\right| \\le 2^m.$$\n\nThe Vapnik-Chervonenkis dimension of a hypothesis space $H$ is defined as\n\n$$\\mathrm{VCdim}(H) = \\begin{cases} \\infty, & \\text{if } B_H(m) = 2^m \\text{ for all } m; \\\\ \\max\\{ m \\mid B_H(m) = 2^m \\}, & \\text{otherwise.} \\end{cases}$$\n\nIn the case of a threshold network $\\mathcal{N}$, the set of functions obtainable using all possible weight assignments is termed the hypothesis space of $\\mathcal{N}$ and we will refer to it as $\\mathcal{N}$. For a threshold network $\\mathcal{N}$, we also introduce the state growth function $S_{\\mathcal{N}}(m)$.\n
This is defined by first considering all computational nodes to be output nodes, and then counting different output sequences:\n\n$$S_{\\mathcal{N}}(m) = \\max_{\\mathbf{x} = (i_1, \\ldots, i_m) \\in X^m} \\left| \\{ (F_{\\mathcal{N}'}(w, i_1), F_{\\mathcal{N}'}(w, i_2), \\ldots, F_{\\mathcal{N}'}(w, i_m)) \\mid w : E \\to \\mathbb{R} \\} \\right|,$$\n\nwhere $X = [0, 1]^{|I|}$ and $\\mathcal{N}'$ is obtained from $\\mathcal{N}$ by setting $O = C$. We clearly have that for all $\\mathcal{N}$ and $m$, $B_{\\mathcal{N}}(m) \\le S_{\\mathcal{N}}(m)$.\n\nTheorem 3.1 [2] If a hypothesis space $H$ has growth function $B_H(m)$ then for any $\\epsilon > 0$ and $k > m$ and\n\n$$0 < r < 1 - \\frac{1}{\\sqrt{\\epsilon k}}$$\n\nthe probability that there is a function in $H$ which agrees with a randomly chosen $m$ sample and has error greater than $\\epsilon$ is less than\n\n$$\\frac{\\epsilon k (1 - r)^2}{\\epsilon k (1 - r)^2 - 1} B_H(m + k) \\exp\\left\\{ -r \\epsilon \\frac{km}{m + k} \\right\\}.$$\n\nThis result can be used to obtain the following bound on sample size required for PAC learnability of a hypothesis space with VC dimension $d$. The theorem improves the bounds reported by Blumer et al. [4].\n\nTheorem 3.2 [2] If a hypothesis space $H$ has finite VC dimension $d > 1$, then there is $m_0 = m_0(\\epsilon, \\delta)$ such that if $m > m_0$ then the probability that a hypothesis consistent with a randomly chosen sample of size $m$ has error greater than $\\epsilon$ is less than $\\delta$. A suitable value of $m_0$ is\n\n$$m_0 = \\frac{1}{\\epsilon (1 - \\sqrt{\\epsilon})} \\left[ \\log\\left( \\frac{d}{(d - 1)\\delta} \\right) + 2d \\log\\left( \\frac{6}{\\epsilon} \\right) \\right].$$\n\nFor the case when we allow our hypothesis to incorrectly compute the function on a small fraction of the training sample, we have the following result. Note that we are still considering the discrete metric and so in the case where we are considering multiple output feedforward networks a single output in error would count as an overall error.\n\nTheorem 3.3 [10] Let $0 < \\epsilon < 1$ and $0 < \\gamma \\le 1$.\n
Suppose $H$ is a hypothesis space of functions from an input space $X$ to a possibly countable set $Y$, and let $\\nu$ be any probability measure on $S = X \\times Y$. Then the probability (with respect to $\\nu^m$) that, for $\\mathbf{x} \\in S^m$, there is some $h \\in H$ such that\n\n$$\\mathrm{er}_\\nu(h) > \\epsilon \\quad \\text{and} \\quad \\mathrm{er}_{\\mathbf{x}}(h) \\le (1 - \\gamma)\\, \\mathrm{er}_\\nu(h)$$\n\nis at most\n\n$$4 B_H(2m) \\exp\\left( -\\frac{\\gamma^2 \\epsilon m}{4} \\right).$$\n\nFurthermore, if $H$ has finite VC dimension $d$, this quantity is less than $\\delta$ for\n\n$$m > m_0(\\epsilon, \\delta, \\gamma) = \\frac{1}{\\gamma^2 \\epsilon (1 - \\sqrt{\\epsilon})} \\left[ 4 \\log\\left( \\frac{4}{\\delta} \\right) + 6d \\log\\left( \\frac{4}{\\gamma^{2/3} \\epsilon} \\right) \\right].$$\n\n4 THE GROWTH FUNCTION FOR EQUIVALENCE NETWORKS\n\nWe will bound the number of output sequences $B_{\\mathcal{N}}(m)$ for a number $m$ of inputs by the number of distinct state sequences $S_{\\mathcal{N}}(m)$ that can be generated from the $m$ inputs by different weight assignments. This follows the approach taken in [10].\n\nTheorem 4.1 Let $\\mathcal{N}$ be an equivalence network with $W$ weight equivalence classes and a total of $N$ computational nodes. Then we can bound $S_{\\mathcal{N}}(m)$ by\n\n$$S_{\\mathcal{N}}(m) \\le \\left( \\frac{emN}{W} \\right)^W.$$\n\nIdea of Proof: Let $G_i$, $i = 1, \\ldots, n$, be the equivalence classes of nodes indexed as guaranteed by Lemma 2.4 with $|G_i| = c_i$ and the number of inputs for nodes in $G_i$ being $n_i$ (including the threshold input). Denote by $\\mathcal{N}_j$ the network obtained by taking only the first $j$ node equivalence classes. We omit a proof by induction that\n\n$$S_{\\mathcal{N}_j}(m) \\le \\prod_{i=1}^{j} B_i(m c_i),$$\n\nwhere $B_i$ is the growth function for nodes in the class $G_i$.\n\nUsing the well known bound on the growth function of a threshold node with $n_i$ inputs we obtain\n\n$$S_{\\mathcal{N}}(m) \\le \\prod_{i=1}^{n} \\left( \\frac{e m c_i}{n_i} \\right)^{n_i}.$$\n\nConsider the function $f(x) = x \\log x$.\n
This is a convex function and so for a set of values $x_1, \\ldots, x_M$, we have that the average of $f(x_i)$ is greater than or equal to $f$ applied to the average of $x_i$. Consider taking the $x$'s to be $c_i$ copies of $n_i/c_i$ for each $i = 1, \\ldots, n$. There are $\\sum_i c_i = N$ values in all and their average is $\\sum_i n_i / N = W/N$, so we obtain\n\n$$\\frac{1}{N} \\sum_{i=1}^{n} n_i \\log \\frac{n_i}{c_i} \\ge \\frac{W}{N} \\log \\frac{W}{N}$$\n\nor\n\n$$\\sum_{i=1}^{n} n_i \\log \\frac{c_i}{n_i} \\le W \\log \\frac{N}{W},$$\n\nand so\n\n$$S_{\\mathcal{N}}(m) \\le \\left( \\frac{emN}{W} \\right)^W,$$\n\nas required. $\\Box$\n\nThe bounds we have obtained make it possible to bound the Vapnik-Chervonenkis dimension of equivalence networks. Though we will not need these results, we give them here for completeness.\n\nProposition 4.2 The Vapnik-Chervonenkis dimension of an equivalence network with $W$ weight classes and $N$ computational nodes is bounded by\n\n$$2W \\log_2 eN.$$\n\n5 PROOF OF MAIN RESULTS\n\nUsing the results of the last section we are now in a position to prove Theorems 2.5 and 2.6.\n\nProof of Theorem 2.5: (Outline) We use Theorem 3.1 which bounds the probability that a hypothesis with error greater than $\\epsilon$ can match an $m$-sample. Substituting our bound on the growth function of an equivalence network and choosing $k$ and $r$ as in [1], we obtain the following bound on the probability\n\n$$\\frac{d}{d - 1} \\left( \\frac{e^4 \\epsilon m^2}{W^2} \\right)^W N^W \\exp(-\\epsilon m).$$\n\nBy choosing $m > m_0$ where $m_0$ is given by\n\n$$m_0 = m_0(\\epsilon, \\delta) = \\frac{1}{\\epsilon (1 - \\sqrt{\\epsilon})} \\left[ \\log\\left( \\frac{1.3}{\\delta} \\right) + 2W \\log\\left( \\frac{6\\sqrt{N}}{\\epsilon} \\right) \\right]$$\n\nwe guarantee that the above probability is less than $\\delta$ as required. $\\Box$\n\nOur second main result can be obtained more directly.\n\nProof of Theorem 2.6: (Outline) We use Theorem 3.3 which bounds the probability that a hypothesis with error greater than $\\epsilon$ can match all but a fraction $(1 - \\gamma)$ of an $m$-sample.\n
The bound on the sample size is obtained from the probability bound by using the inequality for $B_H(2m)$. By adjusting the parameters we will convert the probability expression to that obtained by substituting our growth function. We can then read off a sample size by the corresponding substitution in the sample size formula. Consider setting $d = W$, $\\epsilon = \\epsilon'/N$ and $m = N m'$. With these substitutions the sample size formula is\n\n$$m' = \\frac{1}{\\gamma^2 \\epsilon' (1 - \\sqrt{\\epsilon'/N})} \\left[ 4 \\log\\left( \\frac{4}{\\delta} \\right) + 6W \\log\\left( \\frac{4N}{\\gamma^{2/3} \\epsilon'} \\right) \\right]$$\n\nas required. $\\Box$\n\n6 CONCLUSION\n\nThe problem of training feedforward neural networks remains a major hurdle to the application of this approach to large scale systems. A very promising technique for simplifying the training problem is to include equivalences in the network structure which can be justified by a priori knowledge of the application domain. This paper has extended previous results concerning sample sizes for feedforward networks to cover so called equivalence networks in which weights are constrained in this way. At the same time we have improved the sample size bounds previously obtained for standard threshold networks [3] and multiple output networks [10].\n\nThe results are of the same order as previous results and imply similar bounds on the Vapnik-Chervonenkis dimension, namely $2W \\log_2 eN$. They perhaps give circumstantial evidence for the conjecture that the $\\log_2 eN$ factor in this expression is real, in that the same expression obtains even if the number of computational nodes is increased by expanding the equivalence classes of weights. Equivalence networks may be a useful area to search for high growth functions and perhaps show that for certain classes the VC dimension is $O(W \\log N)$.
\n\nReferences\n\n[1] Martin Anthony, Norman Biggs and John Shawe-Taylor, Learnability and Formal Concept Analysis, RHBNC Department of Computer Science, Technical Report, CSD-TR-624, 1990.\n\n[2] Martin Anthony, Norman Biggs and John Shawe-Taylor, The learnability of formal concepts, Proc. COLT '90, Rochester, NY (eds Mark Fulk and John Case) (1990) 246-257.\n\n[3] Eric Baum and David Haussler, What size net gives valid generalization, Neural Computation, 1 (1) (1989) 151-160.\n\n[4] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler and Manfred K. Warmuth, Learnability and the Vapnik-Chervonenkis dimension, JACM, 36 (4) (1989) 929-965.\n\n[5] David Haussler, preliminary extended abstract, COLT '89.\n\n[6] K. Lang and G.E. Hinton, The development of TDNN architecture for speech recognition, Technical Report CMU-CS-88-152, Carnegie-Mellon University, 1988.\n\n[7] Y. le Cun, A theoretical framework for back propagation, in D. Touretzky, editor, Connectionist Models: A Summer School, Morgan-Kaufmann, 1988.\n\n[8] M. Minsky and S. Papert, Perceptrons, expanded edition, MIT Press, Cambridge, USA, 1988.\n\n[9] John Shawe-Taylor, Building Symmetries into Feedforward Network Architectures, Proceedings of First IEE Conference on Artificial Neural Networks, London, 1989, 158-162.\n\n[10] John Shawe-Taylor and Martin Anthony, Sample Sizes for Multiple Output Feedforward Networks, Network, 2 (1991) 107-117.\n\n[11] Leslie G. Valiant, A theory of the learnable, Communications of the ACM, 27 (1984) 1134-1142.\n", "award": [], "sourceid": 510, "authors": [{"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}]}