{"title": "Generalisation of A Class of Continuous Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 267, "page_last": 273, "abstract": null, "full_text": "Generalisation of A Class of Continuous Neural Networks\n\nJohn Shawe-Taylor\nDept of Computer Science,\nRoyal Holloway, University of London,\nEgham, Surrey TW20 0EX, UK\nEmail: john@dcs.rhbnc.ac.uk\n\nJieyu Zhao*\nIDSIA, Corso Elvezia 36,\n6900-Lugano, Switzerland\nEmail: jieyu@carota.idsia.ch\n\nAbstract\n\nWe propose a way of using boolean circuits to perform real valued computation in a way that naturally extends their boolean functionality. The functionality of multiple fan-in threshold gates in this model is shown to mimic that of a hardware implementation of continuous Neural Networks. A Vapnik-Chervonenkis dimension and sample size analysis for the systems is performed giving best known sample sizes for a real valued Neural Network. Experimental results confirm the conclusion that the sample sizes required for the networks are significantly smaller than for sigmoidal networks.\n\n1 Introduction\n\nRecent developments in complexity theory have addressed the question of complexity of computation over the real numbers. More recently attempts have been made to introduce some computational cost related to the accuracy of the computations [5]. The model proposed in this paper weakens the computational power still further by relying on classical boolean circuits to perform the computation using a simple encoding of the real values. 
Using this encoding we also show that TC^0 circuits interpreted in the model correspond to a Neural Network design referred to as Bit Stream Neural Networks, which have been developed for hardware implementation [8].\n\nWith the perspective afforded by the general approach considered here, we are also able to analyse the Bit Stream Neural Networks (or indeed any other adaptive system based on the technique), giving VC dimension and sample size bounds for PAC learning. The sample sizes obtained are very similar to those for threshold networks, despite their being derived by very different techniques. They give the best bounds for neural networks involving smooth activation functions, being significantly lower than the bounds obtained recently for sigmoidal networks [4, 7].\n\n*Work performed while at Royal Holloway, University of London.\n\nWe subsequently present simulation results showing that Bit Stream Neural Networks based on the technique can be used to solve a standard benchmark problem. The results of the simulations support the theoretical finding that for the same sample size generalisation will be better for the Bit Stream Neural Networks than for classical sigmoidal networks. It should also be stressed that the approach is very general, being applicable to any boolean circuit, and by its definition employs compact digital hardware. This fact motivates the introduction of the model, though it will not play an important part in this paper.\n\n2 Definitions and Basic Results\n\nA boolean circuit is a directed acyclic graph whose nodes are referred to as gates, with a single output node of out-degree zero. The nodes with in-degree zero are termed input nodes. 
The nodes that are not input nodes are computational nodes. There is a boolean function associated with each computational node of arity equal to its in-degree. The function computed by a boolean network is determined by assigning (input) values to its input nodes and performing the function at each computational node once its input values are determined. The result is the value at the output node. The class TC^0 is defined to be those functions that can be computed by a family of polynomially sized Boolean circuits with unrestricted fan-in and constant depth, where the gates are either NOT or THRESHOLD.\n\nIn order to use the boolean circuits to compute with real numbers we use the method of stochastic computing to encode real numbers as bit streams. The encoding we will use is to consider the stream of binary bits, for which the 1's are generated independently at random with probability p, as representing the number p. This is referred to as a Bernoulli sequence of probability p. In this representation, the multiplication of two independently generated streams can be achieved by a simple AND gate, since the probability of a 1 on the output stream is equal to p_1 p_2, where p_1 is the probability of a 1 on the first input stream and p_2 is the probability of a 1 on the second input stream. Hence, in this representation the boolean circuit consisting of a single AND gate can compute the product of its two inputs.\n\nMore background information about stochastic computing can be found in the work of Gaines [1]. The analysis we provide is made by treating the calculations as exact real valued computations. 
In a practical (hardware) implementation real bit streams would have to be generated [3] and the question of the accuracy of a delivered result arises.\n\nIn the applications considered here the output values are used to determine a binary value by comparing with a threshold of 0.5. Unless the actual output is exactly 1 or 0 (which can happen), then however many bits are collected at the output there is a slight probability that an incorrect classification will be made. Hence, the number of bits required is a function of the difference between the actual output and 0.5 and the level of confidence required in the correctness of the classification.\n\nDefinition 1 The real function computed by a boolean circuit C, which computes the boolean function\n\nf_C : {0,1}^n → {0,1},\n\nis the function\n\ng_C : [0,1]^n → [0,1],\n\nobtained by coding each input independently as a Bernoulli sequence and interpreting the output as a similar sequence.\n\nHence, by the discussion above we have for the circuit C consisting of a single AND gate, the function g_C is given by g_C(x_1, x_2) = x_1 x_2.\n\nWe now give a proposition showing that the definition of real computation given above is well-defined and generalises the Boolean computation performed by the circuit.\n\nProposition 2 The bit stream on the output of a boolean circuit computing a real function is a Bernoulli sequence. 
The real function g_C computed by an n-input boolean circuit C can be expressed in terms of the corresponding boolean function f_C as follows:\n\ng_C(x) = Σ_{α ∈ {0,1}^n} P_x(α) f_C(α), where P_x(α) = ∏_{i=1}^{n} x_i^{α_i} (1 - x_i)^{(1 - α_i)}.\n\nIn particular, g_C restricted to {0,1}^n is equal to f_C.\n\nProof: The output bit stream is a Bernoulli sequence, since the behaviour at each time step is independent of the behaviour at previous time steps, assuming the input sequences are independent. Let the probability of a 1 in the output sequence be p. Hence, g_C(x) = p. At any given time the input to the circuit must be one of the 2^n possible binary vectors α. P_x(α) gives the probability of the vector α occurring. Hence, the expected value of the output of the circuit is given in the proposition statement, but by the properties of a Bernoulli sequence this value is also p. The final claim holds since P_α(α) = 1, while P_α(α') = 0 for α ≠ α'. ∎\n\nHence, the function computed by a circuit can be denoted by a polynomial of degree n, though the representation given above may involve exponentially many terms. This representation will therefore only be used for theoretical analysis.\n\n3 Bit Stream Neural Networks\n\nIn this section we describe a neural network model based on stochastic computing and show that it corresponds to taking TC^0 circuits in the framework considered in Section 2.\n\nA Stochastic Bit Stream Neuron is a processing unit which carries out very simple operations on its input bit streams. All input bit streams are combined with their corresponding weight bit streams and then the weighted bits are summed up. The final total is compared to a threshold value. If the sum is larger than the threshold the neuron gives an output 1, otherwise 0.
\n\nThere are two different versions of the Stochastic Bit Stream Neuron corresponding to the different data representations. The definitions are given as follows.\n\nDefinition 3 (AND-SBSN): An n-input AND version Stochastic Bit Stream Neuron has n weights in the range [-1,1] and n inputs in the range [0,1], which are all unipolar representations of Bernoulli sequences. An extra sign bit is attached to each weight Bernoulli sequence. The threshold θ is an integer lying between -n and n which is randomly generated according to the threshold probability density function φ(θ). The computations performed during each operational cycle are\n\n(1) combining respectively the n bits from n input Bernoulli sequences with the corresponding n bits from n weight Bernoulli sequences using the AND operation.\n\n(2) assigning n weight sign bits to the corresponding output bits of the AND gate, summing up all the n signed output bits and then comparing the total with the randomly generated threshold value. If the total is not less than the threshold value, the AND-SBSN outputs 1, otherwise it outputs 0.\n\nWe can now present the main result characterising the functionality of a Stochastic Bit Stream Neural Network as the real function of a TC^0 circuit.\n\nTheorem 4 The functionality of a family of feedforward networks of Bit Stream Neurons with constant depth organised into layers with interconnections only between adjacent layers corresponds to the function g_C for a TC^0 circuit C of depth twice that of the network. The number of input streams is equal to the number of network inputs while the number of parameters is at most twice the number of weights.
\n\nProof: Consider first an individual neuron. We construct a circuit whose real functionality matches that of the neuron. The circuit has two layers. The first consists of a series of AND gates. Each gate links one input line of the neuron with its corresponding weight input. The outputs of these gates are linked into a threshold gate with fixed threshold 2d for the AND-SBSN, where d is the number of input lines to the neuron. The threshold distribution of the AND-SBSN is now simulated by having a series of 2d additional inputs to the threshold gate. The number of additional input streams required to simulate the threshold depends on how general a distribution is allowed for the threshold. We consider three cases:\n\n1. If the threshold is fixed (i.e. not programmable), then no additional inputs are required, since the actual threshold can be suitably adapted.\n\n2. If the threshold distribution is always focussed on one value (which can be varied), then an additional ⌈log_2(2d)⌉ (⌈log_2(d)⌉) inputs are required to specify the binary value of this number. A circuit feeding the corresponding number of 1's to the threshold gate is not hard to construct.\n\n3. In the fully general case any series of 2d + 1 (d + 1) numbers summing to one can be assigned as the probabilities of the possible values\n\nφ(0), φ(1), ..., φ(t),\n\nwhere t = 2d for the AND-SBSN. We now construct a circuit which takes t input streams and passes the 1-bits to the threshold gate of all the inputs up to the first input stream carrying a 0. No further input is passed to the threshold gate. In other words\n\nThreshold gate receives s bits of input ⇔ input streams 1, ..., s have bit 1 and either s = t or input stream s + 1 has input 0.
\n\nWe now set the probability p_s of stream s as follows:\n\np_1 = 1 - φ(0),\n\np_s = (1 - Σ_{i=0}^{s-1} φ(i)) / (1 - Σ_{i=0}^{s-2} φ(i)) for s = 2, ..., t.\n\nWith these values the probability of the threshold gate receiving s bits is φ(s) as required.\n\nThis completes the replacement of a single neuron. Clearly, we can replace all neurons in a network in the same manner and construct a network with the required properties provided connections do not 'shortcut' layers, since this would create interactions between bits in different time slots. ∎\n\n4 VC Dimension and Sample Sizes\n\nIn order to perform a VC Dimension and sample size analysis of the Bit Stream Neural Networks described in the previous section we introduce the following general framework.\n\nDefinition 5 For a set G of smooth functions f : R^n × R^l → R, the class F is defined as\n\nF = F_G = {f_w | f_w(x) = f(x, w), f ∈ G}.\n\nThe corresponding classification class obtained by taking a fixed set of s of the functions from G, thresholding the corresponding functions from F at 0 and combining them (with the same parameter vector) in some logical formula will be denoted H_s(F). We will denote H_1(F) by H(F).\n\nIn our case we will consider a set of circuits 𝒞 each with n + l input connections, n labelled as the input vector and l identified as parameter input connections. Note that if circuits have too few input connections, we can pad them with dummy ones. The set G will then be the set\n\nG = G_𝒞 = {g_C | C ∈ 𝒞},\n\nwhile F_{G_𝒞} will be denoted by F_𝒞.\n\nWe now quote some of the results of [7] which uses the techniques of Karpinski and MacIntyre [4] to derive sample sizes for classes of smoothly parametrised functions.
\n\nProposition 6 [7] Let G be the set of polynomials p of degree at most d with p : R^n × R^l → R and\n\nF = F_G = {p_w | p_w(x) = p(x, w), p ∈ G}.\n\nHence, there are l adjustable parameters and the input dimension is n. Then the VC-dimension of the class H_s(F) is bounded above by\n\nlog_2(2(2d)^l) + 17 l log_2(s).\n\nCorollary 7 For a set of circuits 𝒞, with n input connections and l parameter connections, the VC-dimension of the class H_s(F_𝒞) is bounded above by\n\nlog_2(2(2(n + l))^l) + 17 l log_2(s).\n\nProof: By Proposition 2 the function g_C computed by a circuit C with t input connections has the form\n\ng_C(x) = Σ_{α ∈ {0,1}^t} P_x(α) f_C(α), where P_x(α) = ∏_{i=1}^{t} x_i^{α_i} (1 - x_i)^{(1 - α_i)}.\n\nHence, g_C(x) is a polynomial of degree t. In the case considered the number t of input connections is n + l. The result follows from the proposition. ∎\n\nProposition 8 [7] Let G be the set of polynomials p of degree at most d with p : R^n × R^l → R and\n\nF = F_G = {p_w | p_w(x) = p(x, w), p ∈ G}.\n\nHence, there are l adjustable parameters and the input dimension is n. If a function h ∈ H_s(F) correctly computes a function on a sample of m inputs drawn independently according to a fixed probability distribution, where\n\nm ~ \"\",(e, 0)  =  e(1 ~ y'\u20ac) \n\n[Uln ( 4e~) + In (2l/(~ - 1)) 1 \n\nthen with probability at least 1 - δ the error rate of h will be less than ε on inputs drawn according to the same distribution.
\n\nCorollary 9 For a set of circuits 𝒞, with n input connections and l parameter connections, if a function h ∈ H_s(F_𝒞) correctly computes a function on a sample of m inputs drawn independently according to a fixed probability distribution, where\n\nm ~ \"\",(e, 0)  =  e(1 ~ y\u20ac) \n\n[Uln ( 4eJs~n +l)) + In Cl/(~ - 1)) 1 \n\nthen with probability at least 1 - δ the error rate of h will be less than ε on inputs drawn according to the same distribution.\n\nProof: As in the proof of the previous corollary, we need only observe that the functions g_C for C ∈ 𝒞 are polynomials of degree at most n + l. ∎\n\nNote that the best known sample sizes for threshold networks are given in [6]:\n\nm ~ \"\",(e, 0)  =  e(1 ~ y'\u20ac) \n\n[2Wln (6~) + In (l/(lo- 1)) 1 ' \n\nwhere W is the number of adaptable weights (parameters) and N is the number of computational nodes in the network. Hence, the bounds given above are almost identical to those for threshold networks, despite the underlying techniques used to derive them being entirely different.\n\nOne surprising fact about the above results is that the VC dimension and sample sizes are independent of the complexity of the circuit (except in as much as it must have the required number of inputs). Hence, additional layers of fixed computation cannot increase the sample complexity above the bound given.\n\n5 Simulation Results\n\nThe Monk's problems, which were the basis of a first international comparison of learning algorithms, are derived from a domain in which each training example is represented by six discrete-valued attributes. Each problem involves learning a binary function defined over this domain, from a sample of training examples of this function. 
The 'true' concepts underlying each Monk's problem are given by:\n\nMONK-1: (attribute_1 = attribute_2) or (attribute_5 = 1)\n\nMONK-2: (attribute_i = 1) for EXACTLY TWO i ∈ {1, 2, ..., 6}\n\nMONK-3: (attribute_5 = 3 and attribute_4 = 1) or (attribute_5 ≠ 4 and attribute_2 ≠ 3)\n\nThere are 124, 169 and 122 samples in the training sets of MONK-1, MONK-2 and MONK-3 respectively. The testing set has 432 patterns. The network had 17 input units, 10 hidden units, 1 output unit, and was fully connected. Two networks were used for each problem. The first was a standard multi-layer perceptron with sigmoid activation function trained using the backpropagation algorithm (BP Network).\n\nThe second network had the same architecture, but used bit stream neurons in place of sigmoid ones (BSN Network). The functionality of the neurons was simulated using probability generating functions to compute the probability values of the bit streams output at each neuron. The backpropagation algorithm was adapted to train these networks by computing the derivative of the output probability value with respect to the individual inputs to that neuron [8].\n\nExperiments were performed with and without noise in the training examples. There is 5% additional noise (misclassifications) in the training set of MONK-3. 
\nThe results for the Monk's problems using the moment generating function simulation are shown as follows:\n\n          BP Network            BSN Network\n          training   testing    training   testing\nMONK-1    100%       86.6%      100%       97.7%\nMONK-2    100%       84.2%      100%       100%\nMONK-3    97.1%      83.3%      98.4%      98.6%\n\nIt can be seen that the generalisation of the BSN network is much better than that of a general multilayer backpropagation network. The results on the MONK-3 problem are extremely good. The results reported by Hassibi and Stork [2] using a sophisticated weight pruning technique are only 93.4% correct for the training set and 97.2% correct for the testing set.\n\nReferences\n\n[1] B. R. Gaines, Stochastic Computing Systems, Advances in Information Systems Science 2 (1969) pp. 37-172.\n\n[2] B. Hassibi and D. G. Stork, Second order derivatives for network pruning: Optimal brain surgeon, Advances in Neural Information Processing Systems, Vol 5 (1993) pp. 164-171.\n\n[3] P. Jeavons, D. A. Cohen and J. Shawe-Taylor, Generating Binary Sequences for Stochastic Computing, IEEE Trans on Information Theory, 40 (3) (1994) pp. 716-720.\n\n[4] M. Karpinski and A. MacIntyre, Bounding VC-Dimension for Neural Networks: Progress and Prospects, Proceedings of EuroCOLT'95, 1995, pp. 337-341, Springer Lecture Notes in Artificial Intelligence, 904.\n\n[5] P. Koiran, A Weak Version of the Blum, Shub and Smale Model, ESPRIT Working Group NeuroCOLT Technical Report Series, NC-TR-94-5, 1994.\n\n[6] J. Shawe-Taylor, Threshold Network Learning in the Presence of Equivalences, Proceedings of NIPS 4, 1991, pp. 879-886.\n\n[7] J. 
Shawe-Taylor, Sample Sizes for Sigmoidal Networks, to appear in the Proceedings of Eighth Conference on Computational Learning Theory, COLT'95, 1995.\n\n[8] J. Shawe-Taylor, P. Jeavons and M. van Daalen, Probabilistic Bit Stream Neural Chip: Theory, Connection Science, Vol 3, No 3, 1991.\n", "award": [], "sourceid": 1163, "authors": [{"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}, {"given_name": "Jieyu", "family_name": "Zhao", "institution": null}]}