{"title": "An Efficient Implementation of the Back-propagation Algorithm on the Connection Machine CM-2", "book": "Advances in Neural Information Processing Systems", "page_first": 801, "page_last": 809, "abstract": null, "full_text": "An Efficient Implementation of the Back-propagation Algorithm on the Connection Machine CM-2\n\nXiru Zhang (footnote 1), Michael Mckenna, Jill P. Mesirov, David L. Waltz\n\nThinking Machines Corporation\n\n245 First Street, Cambridge, MA 02142-1214\n\nABSTRACT\n\nIn this paper, we present a novel implementation of the widely used Back-propagation neural net learning algorithm on the Connection Machine CM-2, a general purpose, massively parallel computer with a hypercube topology. This implementation runs at about 180 million interconnections per second (IPS) on a 64K processor CM-2. The main interprocessor communication operation used is 2D nearest neighbor communication. The techniques developed here can be easily extended to implement other algorithms for layered neural nets on the CM-2, or on other massively parallel computers which have 2D or higher degree connections among their processors.\n\n1 Introduction\n\nHigh-speed simulation of large artificial neural nets has become an important tool for solving real world problems and for studying the dynamic behavior of large populations of interconnected processing elements [3, 2]. 
This work is intended to provide such a simulation tool for a widely used neural net learning algorithm, the Back-propagation (BP) algorithm [7].\n\nThe hardware we have used is the Connection Machine\u00ae CM-2 (footnote 2). On a 64K processor CM-2 our implementation runs at 40 million Weight Updates Per Second (WUPS) (footnote 3) for training, or 180 million Interconnections Per Second (IPS) for the forward pass, where IPS is defined in the DARPA Neural Network Study [2] as \"the number of multiply-and-add operations that can be performed in a second\" [on a Back-propagation network]. We believe that the techniques developed here can be easily extended to implement other algorithms for layered neural nets on the CM-2, or on other massively parallel machines which have 2D or higher degree connections among their processors.\n\nFootnote 1: This author is also a graduate student at the Computer Science Department, Brandeis University, Waltham, MA 02254-9110.\n\nFootnote 2: Connection Machine is a registered trademark of Thinking Machines Corporation.\n\n2 The Connection Machine\n\nThe Connection Machine CM-2 is a massively parallel computer with up to 65,536 processors. Each processor has a single-bit processing unit and 64K or 256K bits of local RAM. The processors run in SIMD mode. They are connected in an n-cube topology, which permits highly efficient n-dimensional grid communications. The system software also provides scan and spread operations: e.g., when n\u00b7m processors are connected as an n x m 2D grid, the summation (product, max, etc.) of a \"parallel variable\" value in all the processors on a row of the grid (footnote 4) takes only O(log m) time. 
It is possible to turn off any subset of the processors so that instructions will only be performed by those processors that are currently active. On the CM-2, every 32 processors share a floating point processing unit, and a 32-bit number can be stored across 32 processors (i.e., one bit per processor). These 32 processors can each access this 32-bit number as if it were stored in its own memory. This is a way of sharing data among processors locally. The CM-2 uses a conventional computer such as a SUN-4, VAX or Symbolics Lisp Machine as a front-end machine. Parallel extensions to the familiar programming languages LISP, C, and FORTRAN, via the front-end, allow the user to program the Connection Machine and the front-end system.\n\n3 The Back-propagation Algorithm\n\nThe Back-propagation [7] algorithm works on layered, feed-forward networks (BP net for short in the following discussion), where the processing units are arranged in layers: there are an input layer, an output layer, and one or more \"hidden layers\" (layers between the input and output layers). A BP net computes its output in the following fashion: first an input pattern is set as the output of the units at the input layer; then one layer at a time, from the input to hidden to output layer, the units compute their outputs by applying an activation function to the weighted sum of their inputs (which are the outputs of the units at the lower layer(s) that are connected to them). The weights come from the links between the units. 
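\n\nAs a compact restatement of this computation (a sketch only; the symbols $x_j$, $o^{(l)}_j$ and $w^{(l)}_{jk}$ are illustrative names introduced here, not notation from the paper, and the logistic function is assumed as one common choice of activation function F):\n\n$$o^{(0)}_j = x_j, \\qquad o^{(l)}_j = F\\Big(\\sum_k w^{(l)}_{jk} \\, o^{(l-1)}_k\\Big) \\ \\text{for } l \\geq 1, \\qquad F(x) = \\frac{1}{1 + e^{-x}}$$\n\nHere $x_j$ is component j of the input pattern, $o^{(l)}_j$ is the output of unit j at layer l, and $w^{(l)}_{jk}$ is the weight of the link from unit k at layer l-1 to unit j at layer l.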
\n\nThe Back-propagation algorithm \"trains\" a BP net by adjusting the link weights of the net using a set of \"training examples.\" Each training example consists of an input pattern and an ideal output pattern that the user wants the network to produce for that input. The weights are adjusted based on the difference between the ideal output and the actual output of the net. This can be seen as a gradient descent process in the weight space.\n\nFootnote 3: This includes the time required to read in the input pattern, propagate activation forward through the network, read in the ideal output pattern, propagate the error signal backward through the network, compute the weight changes, and change the weights.\n\nFootnote 4: That is, to add together one value from each processor on a row of the grid and distribute the sum into all the processors on the same row.\n\n[Figure 1 diagram: output layer, hidden layer and input layer, each with nodes numbered 0 to m-1.]\n\nFigure 1: A 3-layer, fully-connected Back-propagation network that has the same number (m) of nodes at each layer.\n\nAfter the training is done, the BP net can be applied to inputs that are not in the set of training examples. For a new input pattern IP, the network tends to produce an output similar to the training example whose input is similar to IP. This can be used for interpolation, approximation, or generalization from examples depending on the goal of the user [4].\n\n4 The Implementation\n\nIn this section, we explain our implementation by presenting a simple example: a three-layer fully-connected BP network that has the same number of nodes at each layer. It is straightforward to extend it to general cases. 
For a more detailed discussion, see reference [8].\n\n4.1 A Simple Case\n\nFigure 1 shows a fully-connected 3-layer BP network with m nodes on each layer. In the following discussion, we will use $N_{i,j}$ to denote the jth node (from the left) on layer i, $i \\in \\{0, 1, 2\\}$, $j \\in \\{0, 1, \\ldots, m-1\\}$; $W^{i,j}_{k,h}$ is the weight of the link from node $N_{k,h}$ to node $N_{i,j}$, and $\\delta_{i,j}$ is the error at node $N_{i,j}$.\n\nFirst, assume we have exactly m processors. We store a \"column\" of the network in each processor. That is, processor j contains nodes $N_{0,j}$, $N_{1,j}$ and $N_{2,j}$. It also contains the weights of the links going into $N_{1,j}$ and $N_{2,j}$ (i.e., $W^{1,j}_{0,k}$ and $W^{2,j}_{1,k}$ for $k \\in \\{0, 1, \\ldots, m-1\\}$). See Figure 2. The Back-propagation algorithm consists of three steps: (1) forward pass to compute the network output; (2) backward propagation to compute the errors at each node; and (3) weight update to adjust the weights based on the errors. These steps are implemented as follows:\n\n[Figure 2 diagram labels: output nodes, hidden nodes, input nodes, link weights, multiply-accumulate-rotate.]\n\nFigure 2: The layout of the example network.\n\n4.1.1 Forward Pass: $\\mathrm{Output}(N_{i,j}) = F\\left(\\sum_{k=0}^{m-1} W^{i,j}_{i-1,k} \\cdot \\mathrm{Output}(N_{i-1,k})\\right)$\n\nWe implement forward pass as follows:\n\n1. Set the input node values; there is one input node per processor.\n\n2. 
In each processor, multiply the input node value by the link weight between the input node and the hidden node that is in the same processor; then accumulate the product in the hidden node.\n\n3. Rotate the input node values: each processor sends its input node value to its nearest left neighbor processor, and the leftmost processor sends its value to the rightmost processor; i.e., do a left-circular-shift.\n\n4. Repeat the multiply-accumulate-rotate cycle of steps 2-3 m times; every hidden node $N_{1,j}$ will then contain $\\sum_{k=0}^{m-1} W^{1,j}_{0,k} \\cdot \\mathrm{Output}(N_{0,k})$. Now apply the activation function F to that sum. (See Figure 2.)\n\n5. Repeat steps 2-4 for the output layer, using the hidden layer as the input.\n\n4.1.2 Backward Propagation\n\nFor the output layer, $\\delta_{2,k}$, the error at each node $N_{2,k}$, is computed by\n\n$\\delta_{2,k} = \\mathrm{Output}(N_{2,k}) \\cdot (1 - \\mathrm{Output}(N_{2,k})) \\cdot (\\mathrm{Target}(N_{2,k}) - \\mathrm{Output}(N_{2,k}))$,\n\nwhere $\\mathrm{Target}(N_{2,k})$ is the ideal output for node $N_{2,k}$. This error can be computed in place, i.e., no inter-processor communication is needed. For the hidden layer,\n\n$\\delta_{1,j} = \\mathrm{Output}(N_{1,j}) \\cdot (1 - \\mathrm{Output}(N_{1,j})) \\cdot \\sum_{k=0}^{m-1} W^{2,k}_{1,j} \\cdot \\delta_{2,k}$\n\nTo compute $\\sum_{k=0}^{m-1} W^{2,k}_{1,j} \\cdot \\delta_{2,k}$ for the hidden nodes, we perform a multiply-accumulate-rotate operation similar to the forward pass, but from the top down. Notice that the weights between a hidden node and the output nodes are in different processors. So, instead of rotating the $\\delta_{2,k}$'s at the output layer, we rotate the partial sum of products for the hidden nodes: at the beginning every hidden node $N_{1,j}$ has an accumulator $A_j$ with initial value 0 in processor j. We do a left-circular-shift on the $A_j$'s. When $A_j$ moves to processor k, we set $A_j \\leftarrow A_j + W^{2,k}_{1,j} \\cdot \\delta_{2,k}$. 
After m rotations, $A_j$ will return to processor j and its value will be $\\sum_{k=0}^{m-1} W^{2,k}_{1,j} \\cdot \\delta_{2,k}$.\n\n4.1.3 Weight Update: $\\Delta W^{i,j}_{k,h} = \\eta \\cdot \\delta_{i,j} \\cdot \\mathrm{Output}(N_{k,h})$\n\n$\\Delta W^{i,j}_{k,h}$ is the weight increment for $W^{i,j}_{k,h}$, $\\eta$ is the \"learning rate\" and $\\delta_{i,j}$ is the error for node $N_{i,j}$, which is computed in the backward propagation step and is stored in processor j. The weight update step is done as follows:\n\n1. In each processor j, for the weights between the input layer and hidden layer, compute the weight update $\\Delta W^{1,j}_{0,k} = \\eta \\cdot \\delta_{1,j} \\cdot \\mathrm{Output}(N_{0,k})$ (footnote 5), and add $\\Delta W^{1,j}_{0,k}$ to $W^{1,j}_{0,k}$ (footnote 6).\n\n2. Rotate the input node values as in step 3 of the forward pass.\n\n3. Repeat the above two steps m times, until all the weights between the input layer and the hidden layer are updated.\n\n4. Do the above for the weights between the hidden layer and the output layer also.\n\nWe can see that the basic operation is the same for all three steps of the Back-propagation algorithm, i.e., multiply-accumulate-rotate. On the CM-2, multiply, add (for accumulate) and circular-shift (for rotate) take roughly the same amount of time, independent of the size of the machine. So the CM-2 spends only about 1/3 of its total time doing communication in our implementation.\n\nFootnote 5: Initially k = j, but the input node values will be rotated around in later steps, so $k \\neq j$ in general.\n\nFootnote 6: $W^{1,j}_{0,k}$ is in the same processor as $\\Delta W^{1,j}_{0,k}$: all the weights going into node $N_{1,j}$ are in processor j. Also, we can accumulate $\\Delta W^{i,j}_{k,h}$ for several training patterns instead of updating $W^{i,j}_{k,h}$ every time. We can also keep the previous weight change and add a \"momentum\" term here. (Our implementation actually does all these. They are omitted here to simplify the explanation of the basic ideas.) 
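\n\nTo make the rotation schedule of the forward pass concrete, the following is a worked restatement under stated assumptions: processors are indexed 0 to m-1 with indices taken modulo m, the cycle counter t and the partial sum $S^{(t)}_j$ are illustrative symbols introduced here, and each left-circular-shift moves a value from processor j+1 to processor j.\n\n$$S^{(0)}_j = 0, \\qquad S^{(t+1)}_j = S^{(t)}_j + W^{1,j}_{0,k_t} \\cdot \\mathrm{Output}(N_{0,k_t}), \\qquad k_t = (j + t) \\bmod m, \\qquad t = 0, 1, \\ldots, m-1$$\n\nBecause $k_t$ takes each value in $\\{0, 1, \\ldots, m-1\\}$ exactly once, $S^{(m)}_j = \\sum_{k=0}^{m-1} W^{1,j}_{0,k} \\cdot \\mathrm{Output}(N_{0,k})$. For example, with m = 3, processor 1 uses the input values of nodes $N_{0,1}$, $N_{0,2}$ and $N_{0,0}$ in cycles 0, 1 and 2, so every processor performs one multiply-accumulate per cycle and no input value is needed by two processors at the same time.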
\n\n4.2 Replication of Networks\n\nUsually, there are more processors on the CM-2 than the width of a BP network. Suppose the network width is m and there are n\u00b7m processors; then we make n copies of the network on the CM-2, and do the forward pass and backward propagation for different training patterns on each copy of the network. For the weight update step, we can sum up the weight changes from different copies of the network (i.e., from different training patterns), then update the weights in all the copies by this sum. This is equivalent to updating the weights after n training patterns on a single copy of the BP network.\n\nOn the CM-2, every 32 processors can share the same set of data (see section 2). We make use of this feature and store the BP network weights across sets of 32 processors. Thus each processor only needs to allocate one bit for each weight. Also, since the weight changes from different training patterns are additive, there is no need to add them up in advance: each copy of the network can update (add to) the weights separately, as long as no two or more copies of the network update the same weight at the same time. (Our implementation guarantees that no such weight update conflict can occur.) See Figure 3.\n\nWe call the 32 copies of the network that share the same set of weights a block. When the number of copies n > 32, say n = 32\u00b7q, then there will be q blocks on the CM-2. We need to sum up the weight changes from different blocks before updating the weights in each block. This summation takes a very small portion of the total running time (much less than 1%). So the time increase can usually be ignored when there is more than one block. 
(footnote 7) Thus, the implementation speeds up essentially linearly as the number of processors increases.\n\nFootnote 7: The summation is done using the scan and spread operations (see section 2), so its time increases only logarithmically with the number of blocks. Usually there are only a few blocks, thus we could use the nearest neighbor communication here instead without much loss of performance.\n\n5 An Example: Character Image Recovery\n\nIn this example, a character, such as A, is encoded as a 16 x 16 pixel array. A 3-layer fully-connected network with 256 input nodes, 128 hidden nodes and 256 output nodes is trained with 64 character pixel arrays, each of which is used both as the input pattern and the ideal output pattern. After the training is done (maximum_error < 0.15) (footnote 8), some noisy character images are fed into the network. The network is then used to remove the noise (to recover the images). We can also use the network recursively, feeding the network output back as the input.\n\nFigure 4a shows the ideal outputs (odd columns) and the actual outputs (even columns) of the network after the training. Figure 4b shows corrupted character image inputs (odd columns) and the recovered images (even columns). The corrupted inputs have 30% noise, i.e., 30% of the pixels take random values in each image. We can see that most of the characters are recovered.\n\nFootnote 8: This training took about 400 cycles. 
\n\n[Figure 3 diagram labels: Network 1, Network 2, ..., Network N; input nodes; output nodes; shared weights; parallel weight-update.]\n\nFigure 3: Replication of a BP network and parallel update of network weights. In the weight update step, the nodes in each copy of the BP network loop through the weights going into them in the following fashion: in the first loop, Network 1 updates the first weight, Network 2 updates the second weight, ..., Network N updates the Nth weight; in general, in the Jth loop, Network I updates the [Mod(I + J, N)]th weight. In this way, it is guaranteed that no two networks update the same weight at the same time. When the total number of weights going into each node is greater than N, we repeat the above loop.\n\n[Figure 4 images: panels (a) and (b) of character images.]\n\nFigure 4: (a) Ideal outputs (in odd columns) and the actual after-training outputs (in even columns) of a network with 256 input nodes, 128 hidden nodes and 256 output nodes trained with character images. (b) Noisy inputs (in odd columns) and the corresponding outputs (\"cleaned-up\" images) produced by the network.\n\nComputer          BP performance (IPS)\nCM-2              180 M\nCray X-MP         50 M\nWARP (10)         17 M (WUPS)\nANZA plus         10 M\nTRW MK V (16)     10 M\nButterfly (64)    8 M\nSAIC SIGMA-1      5-8 M\nTI Odyssey        5 M\nConvex C-1        3.6 M\nVAX 8600          2 M\nSUN 3             250 K\nSymbolics 3600    35 K\n\nTable 1: Comparison of BP implementations on different computers.\n\nIn this example, we used a 4K processor CM-2. The BP network had 256 x 128 + 128 x 256 = 65,536 weights. 
We made 64 copies of the network on the CM-2, so there were 2 blocks. One weight update cycle (footnote 9) took 1.66 seconds. Thus the performance is $(65,536 \\times 64) \\div 1.66 \\approx 2,526,689$ weight updates per second (WUPS). Within the 1.66 seconds, the communication between the two blocks took 0.0023 seconds. If we run a network of the same size on a 64K processor CM-2 (footnote 10), there will be 32 blocks, and the inter-block communication will take $0.0023 \\times \\frac{\\log 32}{\\log 2} = 0.0115$ seconds (footnote 11). The overall performance will then be:\n\n$(16 \\times 65,536 \\times 64) \\div (1.66 + 0.0115) \\approx 40,148,888$ WUPS\n\nThe forward pass took 22% of the total time. Thus if we ran the forward pass alone, the speed would be $40,148,888 \\div 0.22 \\approx 182,494,940$ IPS.\n\nFootnote 9: See footnote 3 for definition.\n\nFootnote 10: Assume we have enough training patterns to fill up the CM-2.\n\nFootnote 11: We use scan and spread operations here, so the time used increases logarithmically.\n\n6 Comparison With Other Implementations\n\nThis implementation of the Back-propagation algorithm on the CM-2 runs much more efficiently than previous CM implementations (e.g., see [1], [6]). Table 1 lists the speeds of Back-propagation on different machines (obtained from references [2] and [5]).\n\n7 Summary\n\nIn this paper, we have shown an example of efficient implementation of neural net algorithms on the Connection Machine CM-2. We used Back-propagation because it is the most widely implemented, and many researchers have used it as a benchmark. The techniques developed here can be easily adapted to implement other algorithms on layered neural nets.\n\nThe main communication operation used in this work is the 2D grid nearest neighbor communication. 
The facility for a group of processors on the CM-2 to share data is important in reducing the amount of space required to store network weights and the communication between different copies of the network. These points should be kept in mind when one tries to use the techniques described here on other machines.\n\nThe main lesson we learned from this work is that implementing an algorithm efficiently on a massively parallel machine often requires re-thinking the algorithm to explore its parallel nature, rather than just a straightforward translation of serial implementations.\n\nAcknowledgement\n\nMany thanks to Alex Singer, who read several drafts of this paper and helped improve it. Lennart Johnsson helped us solve a critical problem. Discussions with other members of the Mathematical and Computational Sciences Group at Thinking Machines Corporation also helped in many ways.\n\nReferences\n\n[1] Louis G. Ceci, Patrick Lynn, and Phillip E. Gardner. Efficient Distribution of Back-Propagation Models on Parallel Architectures. Tech. Report CU-CS-409-88, Dept. of Computer Science, University of Colorado, September 1988.\n\n[2] MIT Lincoln Laboratory. DARPA Neural Network Study. Final Report, July 1988.\n\n[3] Special Issue on Artificial Neural Systems. IEEE Computer, March 1988.\n\n[4] Tomaso Poggio and Federico Girosi. A Theory of Networks for Approximation and Learning. A.I. Memo 1140, MIT AI Lab, July 1989.\n\n[5] Dean A. Pomerleau, George L. Gusciora, David S. Touretzky, and H. T. Kung. Neural Network Simulation at Warp Speed: How We Got 17 Million Connections Per Second. In IEEE Int. Conf. on Neural Networks, July 1988. San Diego, CA.\n\n[6] Charles R. Rosenberg and Guy Blelloch. An Implementation of Network Learning on the Connection Machine. 
In Proceedings of the Tenth International Joint Conference on Artificial Intelligence, Milan, Italy, 1987.\n\n[7] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing, chapter 8. MIT Press, 1986.\n\n[8] Xiru Zhang, Michael Mckenna, Jill P. Mesirov, and David L. Waltz. An Efficient Implementation of The Back-Propagation Algorithm On the Connection Machine CM-2. Technical Report RL-89-1, Thinking Machines Corp., 245 First St., Cambridge, MA 02114, 1989.\n", "award": [], "sourceid": 281, "authors": [{"given_name": "Xiru", "family_name": "Zhang", "institution": null}, {"given_name": "Michael", "family_name": "McKenna", "institution": null}, {"given_name": "Jill", "family_name": "Mesirov", "institution": null}, {"given_name": "David", "family_name": "Waltz", "institution": null}]}