{"title": "Performance of Connectionist Learning Algorithms on 2-D SIMD Processor Arrays", "book": "Advances in Neural Information Processing Systems", "page_first": 810, "page_last": 817, "abstract": null, "full_text": "810 \n\nNunez and Fortes \n\nPerformance of Connectionist Learning Algorithms on 2-D SIMD Processor Arrays \n\nFernando J. Nunez* and Jose A.B. Fortes \n\nSchool of Electrical Engineering \n\nPurdue University \n\nWest Lafayette, IN 47907 \n\nABSTRACT \n\nThe mapping of the back-propagation and mean field theory learning algorithms onto a generic 2-D SIMD computer is described. This architecture proves to be well suited to these applications, since efficiencies close to the optimum can be attained. Expressions for the learning rates are given and then particularized to the DAP array processor. \n\n1  INTRODUCTION \nThe digital simulation of connectionist learning algorithms is flexible and accurate. However, with the exception of very small networks, conventional computer architectures spend a lot of time executing the simulation software. Parallel computers can be used to reduce the execution time. Vector-pipelined computers, multiprocessors, and array processors are some of the most important classes of parallel computers [3]. Connectionist or neural net (NN) learning algorithms have been mapped onto all of them. \n\nThe focus of this contribution is on the mapping of the back-propagation (BP) and mean field theory (MFT) learning algorithms onto the subclass of SIMD computers with the processors arranged in a square two-dimensional mesh and interconnected by nearest-neighbor links. \n\nThe material is organized as follows. In section 2, the execution cost of BP and MFT on sequential computers is found. 
Two-dimensional SIMD processor arrays are described in section 3, and the costs of the two dominant operations in the simulations are derived. In section 4 the mapping of BP and MFT is discussed and expressions for the learning rates are obtained. These expressions are particularized to the DAP computer in section 5. Section 6 concludes this work. \n\n* Current address: Motorola Inc., 1301 E Algonquin Rd., Schaumburg, IL 60196 \n\n2  BACK-PROPAGATION AND MEAN FIELD THEORY \nIn this paper, two learning algorithms, BP [7] and MFT [4], and 3-layer nets are considered. The number of neurons in the input, hidden, and output layer is I, H, and O respectively. BP has been used in many applications; NETtalk [8] is probably the best known. MFT can also be used to learn arbitrary mappings between two sets and, remarkably, to find approximate solutions to hard optimization problems much more efficiently than a Boltzmann Machine does [4,5]. \n\nThe output of a neuron i will be denoted as $v_i$ and called value: $v_i = f(\\sum_{j \\neq i} a_{ij} v_j - \\theta_i)$. The summation represents the net input received and will be called activation. The neuron threshold is $\\theta_i$. A sigmoid-like function f is applied to find the value. The weight of the link from neuron j to neuron i is $a_{ij}$. Since input patterns are the values of the I layer, only neuron values and activations of the H and O layers must be computed. In BP, the activation error and the value error of the H and O layers are calculated and used to change the weights. 
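As a minimal sketch of the neuron model just described, the following Python fragment computes the values of one layer from the values of the layer that feeds it. The names sigmoid and layer_values, and the choice of the logistic function for f, are illustrative assumptions; the text only requires f to be sigmoid-like.

```python
import math

def sigmoid(x):
    # Logistic function: one possible sigmoid-like choice for f (an assumption).
    return 1.0 / (1.0 + math.exp(-x))

def layer_values(weights, inputs, thresholds):
    # weights[i][j] plays the role of a_ij, the weight of the link
    # from neuron j (feeding layer) to neuron i (layer being computed).
    values = []
    for row, theta in zip(weights, thresholds):
        activation = sum(a * v for a, v in zip(row, inputs))  # net input
        values.append(sigmoid(activation - theta))            # value
    return values

# Two input neurons feeding two hidden neurons (hypothetical numbers).
hidden = layer_values([[0.5, -0.5], [1.0, 1.0]], [1.0, 1.0], [0.0, 2.0])
```

Each inner sum is one multiply/accumulate per connection, which is the unit of work counted by the execution-time estimates in this section.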
\n\nIn a conventional computer, the execution time of BP is approximately the time spent in finding the activations, back-propagating the activation error of the O layer, and modifying the I-H and H-O weights. The result is $(2I + 3O)H t_m$, where $t_m$ is the time required to perform a multiply/accumulate operation. Since the net has (I + O)H connections, the learning rate in connections per second (CPS) is: \n\n$$N_{BP} = \\frac{I + O}{(2I + 3O) t_m} \\text{ CPS}$$ \n\nIn the MFT algorithm, the weight increments can be computed only from the neuron values in equilibrium at the end of the clamped and free annealing phases. It is assumed that in both phases there are A annealing temperatures and that E iterations are enough to reach equilibrium at each temperature [4,5]. With these changes, MFT is now a deterministic algorithm where the annealing phases are composed of AE sweeps. The MFT execution time can be approximated by the time spent in computing activations in the annealing loops. Taking into account that in the clamped phase only the H layer is updated, and that in the free phase both the H and O layers change their values, the MFT learning performance is found to be: \n\n$$N_{MFT} = \\frac{N_{BP}}{AE} \\text{ CPS}$$ \n\nMFT is AE times more expensive than BP. However, the learning qualities of both algorithms are different and such a direct comparison is simplistic. \n\n3  2-D SIMD PROCESSOR ARRAYS \nTwo-dimensional single instruction multiple data stream (2-D SIMD) computers are very efficient in the simulation of NN learning algorithms. They can provide massive parallelism at low cost. An SIMD computer is an array of processing elements (PEs) that execute the same instruction in each cycle. 
There is a single control unit that broadcasts instructions to all the PEs. SIMD architectures operate in a synchronous, lock-step fashion [3]. They are also called array processors because their raison d'être is to operate on vectors and matrices. \n\nExample SIMD computers are the Illiac-IV, the Massively Parallel Processor (MPP), the Connection Machine (CM), and the Distributed Array Processor (DAP). With the exception of the CM, whose PE interconnection topology is a hypercube, the other three machines are 2-D SIMD arrays because their PEs are interconnected by a 2-D mesh with wrap-around links (figure 1). \n\nFigure 1: A 2-D SIMD Processor Array \n\nEach PE has its own local memory. The instruction has an address field to access it. The array memory space can be seen as a 3-D volume. This volume is generated by the PE plane, and the depth is the number of memory words that each PE can address. When the control unit issues an address, a plane of the memory volume is being referenced. Thus, square blocks of PxP elements are the natural addressing unit of 2-D SIMD processor arrays. There is an activity bit register in each PE to disable the execution of instructions. This is useful to perform operations with a subset of the PEs. It is assumed that there is no overlapping between data processing and data moving operations. In other words, PEs can be either performing some operation on data (this includes accessing the local memory) or exchanging data with other processors. 
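To make the role of the activity bits concrete, here is a small sequential sketch in Python of one lock-step instruction applied to a memory plane. The function name simd_add and the 2x2 array size are hypothetical and chosen only for illustration; a real array executes the operation in all PEs at once.

```python
def simd_add(plane_a, plane_b, activity):
    # Every PE receives the same add instruction. A PE whose activity
    # bit is cleared keeps its old value instead of storing the result.
    return [[x + y if on else x for x, y, on in zip(row_a, row_b, row_m)]
            for row_a, row_b, row_m in zip(plane_a, plane_b, activity)]

# A tiny 2x2 array of PEs, each holding one word of two memory planes.
plane_a = [[1, 2], [3, 4]]
plane_b = [[10, 10], [10, 10]]
activity = [[True, False], [True, True]]
result = simd_add(plane_a, plane_b, activity)
```

In this sketch only the PE in the first row, second column is disabled, so its word of plane_a is left unchanged.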
\n\n3.1  MAPPING THE TWO BASIC OPERATIONS \nIt is characteristic of array processors that the way data is allocated into the PE memories has a very important effect on performance. For our purposes, two data structures must be considered: vectors and matrices. The storage of vectors is illustrated in figure 2-a. There are two modes: row and column. A vector is split into P-element subvectors stored in the same memory plane. Very large vectors will require two or more planes. The storage of matrices is also very simple. They must be divided into square PxP blocks (figure 2-b). The shading in figure 2 indicates that, in general, the sizes of vectors and matrices do not fit the array dimensions perfectly. \n\nFigure 2: (a) Vector and (b) Matrix Storage \n\nThe execution time of BP and MFT in a 2-D SIMD computer is spent, almost completely, in matrix-vector multiply (MVM) and vector outer multiply/accumulate (VOM) operations. They can be decomposed into the following simpler operations involving PxP blocks. \na) Addition (+): $C = A + B$ such that $c_{ij} = a_{ij} + b_{ij}$. \nb) Point multiply/accumulate (·): $C' = C + A \\cdot B$ such that $c'_{ij} = c_{ij} + a_{ij} b_{ij}$. \nc) Unit rotation: The result block has the same elements as the original, but rotated one place in one of the four possible directions (N, E, W, and S). \nd) Row (column) broadcast: The result of the row (column) broadcast of a vector x stored in row (column) mode is a block X such that $x_{ij} = x_j$ ($= x_i$). \nThe time required to execute a, b, c, and d will be denoted as $t_a$, $t_m$, $t_r$, and $t_b$ respectively. Next, let us see how the operation $y = Ax$ (MVM) is decomposed into simpler steps using the operations above. 
Assume that x and y are P-element vectors, and A is a PxP block. \n\n1) Row-broadcast vector x. \n2) Point multiply $Y = A \\cdot X$. \n3) Row addition of block Y, $y_i = \\sum_{j=1}^{P} y_{ij} = \\sum_{j=1}^{P} a_{ij} x_j$. This requires $\\lceil \\log_2 P \\rceil$ steps. In each step multiple rotations and one addition are performed. Figure 3 shows how eight values in the same row are added using the recursive doubling technique. Note that the number of rotations doubles in each step. The cost is $P t_r + \\log_2 P \\, t_a$. Row addition is an inefficient operation because of the large cost due to communication. Fortunately, for larger data its importance can be diminished by using the scheduling described next. \n\nFigure 3: Recursive Doubling \n\nSuppose that x, y, and A have dimensions m = MP, n = NP, and nxm respectively. Then, $y = Ax$ must be partitioned into a sequence of non-partitioned block operations like the one explained above. We can write: \n\n$$y^i = \\sum_{j=1}^{M} A^{ij} x^j = \\sum_{j=1}^{M} (A^{ij} \\cdot X^j) u = \\Big( \\sum_{j=1}^{M} A^{ij} \\cdot X^j \\Big) u$$ \n\nIn this expression, $y^i$ and $x^j$ represent the i-th and j-th P-element subvectors of y and x respectively, and $A^{ij}$ is the PxP block of A with indices i and j. Block $X^j$ is the result of row-broadcasting $x^j$ (x is stored in row mode). Finally, u is a vector with all its P elements equal to 1. Note that in the second term M column additions are implicit, while only one is required in the third term because blocks instead of vectors are accumulated. Since y has N subvectors, and the M subvectors of x are broadcast only once, the total cost of the MVM operation is: \n\nAfter a similar development, the cost of the VOM ($A' = A + y x^T$) operation is: \n\nIf the number of neurons in each layer is not an integer multiple of P, the storage and execution efficiencies decrease. This effect is less important in large networks. \n\n4  LEARNING RATES ON 2-D SIMD COMPUTERS \n\n4.1  BACK-PROPAGATION \nThe neuron values, activations, value errors, activation errors, and thresholds of the H and O layers are organized as vectors. The weights are grouped into two matrices: I-H and H-O. Then, the scalar operations of the original algorithm are transformed into matrix-vector operations. \n\nFrom now on, the size of the input, hidden, and output layers will be IP, HP, and OP. As commented before, the execution time is mostly spent in computing activations, values, their errors, and in changing the weights. To compute activations, and to back-propagate the activation error of the O layer, MVM operations are performed. The change of weights requires VOM operations. After substituting the expressions of the previous section, the time required to learn a pattern simulating BP on a 2-D SIMD computer is: \n\nThe time spent in data communication is given by the factors in $t_r$ and $t_b$. The larger they are, the smaller is the efficiency. For array processors with fast broadcast facilities, and for nets large enough in terms of the array dimensions, the efficiency grows since a smaller fraction of the total execution time is dedicated to moving data. Since the net has $(I + O)HP^2$ connections, the learning rate is $P^2$ times greater than using a single PE: \n\n$$N_{SIMD-BP} = \\frac{(I + O)P^2}{(2I + 3O) t_m} \\text{ CPS}$$ \n\n4.2  MEAN FIELD THEORY \nThe operations outside the annealing loops can be neglected with small error. In consequence, only the computation of activations in the clamped and free annealing phases is accounted for: \n\n$$AE \\big( (2I + 3O)H t_m + (2I + H + 2O) t_b + (2H + O)(P t_r + \\log_2 P \\, t_a) \\big)$$ \n\nUnder the same favorable conditions mentioned above, the learning rate is: \n\n$$N_{SIMD-MFT} = \\frac{(I + O)P^2}{AE (2I + 3O) t_m} \\text{ CPS}$$ \n\n5  LEARNING PERFORMANCE ON THE DAP \nThe DAP is a commercial 2-D SIMD processor array developed by ICL. It is a massively parallel computer with bit-level PEs built around a single-bit full adder. In addition to the 2-D PE interconnection mesh, there are row and column broadcast buses that allow the direct transfer of data from any processor row or column to an edge register. Many instructions require a single clock cycle, leading to very efficient codings of loop bodies. The DAP-510 computer features $2^5 \\times 2^5$ PEs with a maximum local memory of 1 Mbit per PE. The DAP-610 has $2^6 \\times 2^6$ PEs, and the maximum local memory is 64 Kbit. The clock cycle in both machines is 100 ns [1]. \n\nWith bit-level processors it is possible to tailor the precision of fixed-point computations to the minimum required by the application. The costs in cycles required by several basic operations are given below. These expressions are functions of the number of bits of the operands, which has been assumed to be the same for all of them: b bits. \n\nThe time required by the DAP to perform a block addition, point multiplication/accumulation, and broadcast is $t_a = 2b$, $t_m = 2b^2$, and $t_b = 8b$ clock cycles respectively. On the other hand, $P + 2b \\log_2 P$ cycles is the duration of a row addition. Let us take b = 8 bits, and AE = 24. These values have been found adequate in many applications. Then, the maximum learning rates of the DAP-610 (P = 64) are: \n\nBP: 100-160 MCPS \n\nMFT: 4.5-6.6 MCPS \n\nwhere MCPS = $10^6$ CPS. These figures are 4 times smaller for the DAP-510. It is worth mentioning that the performance decreases quadratically with b. The two learning rates of each algorithm correspond to the worst and best case topology. \n\n5.1  EXAMPLES \nLet us consider a one-thousand neuron net with 640, 128, and 256 neurons in the input, hidden, and output layer. For the DAP-610 we have I = 10, H = 2, and O = 4. The other parameters are the same as those used above. After substituting, we see that the communication costs are less than 10% of the total, demonstrating the efficiency of the DAP in this type of application. The learning rates are: \n\nBP: 140 MCPS \n\nMFT: 5.8 MCPS \n\nNETtalk [8] is frequently used as a benchmark in order to compare the performance achieved on different computers. Here, a network with similar dimensions is considered: 224 input, 64 hidden, and 32 output neurons. These dimensions fit perfectly into the DAP-510 since P = 32. As before, a data precision of 8 bits has been taken. However, the fact that the input patterns are binary has been exploited to obtain some savings. \n\nThe performance reached in this case is 50 MCPS. Even though NETtalk is a relatively small network, only 30% of the total execution time is spent in data communication. If the DAP-610 were used, somewhat less than 200 MCPS would be learnt since the output layer is smaller than P, which causes some inefficiency. 
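The learning rates quoted for the one-thousand-neuron example of this section can be checked with a short calculation. The Python sketch below evaluates the rate expressions of section 4 with the communication terms neglected, using the DAP multiply/accumulate cost of 2b^2 cycles and the 100 ns clock cycle given in the text. The function name bp_rate_mcps is hypothetical.

```python
def bp_rate_mcps(I, O, P, b, cycle_s=100e-9):
    # Maximum BP learning rate with communication neglected:
    # (I + O) P^2 / ((2I + 3O) t_m), where t_m = 2 b^2 clock cycles.
    t_m = 2 * b * b * cycle_s
    return (I + O) * P * P / ((2 * I + 3 * O) * t_m) / 1e6

# DAP-610 (P = 64), 8-bit operands, layers of 640 and 256 neurons (I = 10, O = 4).
bp = bp_rate_mcps(I=10, O=4, P=64, b=8)
# MFT is AE times more expensive than BP; AE = 24 as assumed in the text.
mft = bp / 24
```

Under these assumptions bp evaluates to about 140 MCPS and mft to about 5.8 MCPS, in line with the figures given for this example. The hidden layer size does not appear because it cancels between the number of connections and the number of multiply/accumulate operations.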
\n\nFinally, BP learning rates of the DAP-610 with 8- and 16-bit operands are compared below to those obtained by other machines [2,6]: \n\nCOMPUTER              MCPS \nVAX 780               0.027 \nCRAY-2                7 \nCM (65K PEs)          13 \nDAP-610 (8 bits)      100-160 \nDAP-610 (16 bits)     25-40 \n\n6  CONCLUSIONS \nTwo-dimensional SIMD array processors are well suited to the simulation of connectionist learning algorithms like BP and MFT. These architectures can execute them at nearly optimum speed if the network is large enough and there is full connectivity between layers. Other, much more costly parallel architectures are outperformed. \n\nThe mapping approach described in this paper can be easily extended to any network topology with dense blocks in its global interconnection matrix. However, it is obvious that 2-D SIMD arrays are not a good option to simulate networks with random sparse connectivity. \n\nAcknowledgements \nThis work has been supported by the Ministry of Education and Science of Spain. \n\nReferences \n[1] (1988) AMT DAP Series, Technical Overview. Active Memory Technology. \n[2] G. Blelloch & C. Rosenberg. (1987) Network Learning on the Connection Machine. Proc. 10th Joint Conf. on Artificial Intelligence, IJCAI Inc. \n[3] K. Hwang & F. Briggs. (1984) Computer Architecture and Parallel Processing, McGraw-Hill. \n[4] C. Peterson & J. Anderson. (1987) A Mean Field Theory Learning Algorithm for Neural Networks. Complex Systems, 1:995-1019. \n[5] C. Peterson & B. Soderberg. (1989) A New Method For Mapping Optimization Problems onto Neural Networks. Int'l J. of Neural Systems, 1(1):3-22. \n[6] D. Pomerleau, G. Gusciora, D. Touretzky & H.T. Kung. 
(1988) Neural Network Simulation at Warp Speed: How We Got 17 Million Connections per Second. Proc. IEEE Int'l Conf. on Neural Networks, II:143-150. \n[7] D. Rumelhart, G. Hinton & R. Williams. (1986) Learning Representations by Back-Propagating Errors. Nature, 323:533-536. \n[8] T. Sejnowski & C. Rosenberg. (1987) Parallel Networks that Learn to Pronounce English Text. Complex Systems, 1:145-168. \n", "award": [], "sourceid": 256, "authors": [{"given_name": "Fernando", "family_name": "Nu\u00f1ez", "institution": null}, {"given_name": "Jos\u00e9", "family_name": "Fortes", "institution": null}]}