{"title": "An Analog VLSI Splining Network", "book": "Advances in Neural Information Processing Systems", "page_first": 1008, "page_last": 1014, "abstract": null, "full_text": "An Analog VLSI Splining Network\n\nDaniel B. Schwartz and Vijay K. Samalam\n\nGTE Laboratories, Inc.\n\n40 Sylvan Rd.\n\nWaltham, MA 02254\n\nAbstract\n\nWe have produced a VLSI circuit capable of learning to approximate arbitrary smooth functions of a single variable using a technique closely related to splines. The circuit effectively has 512 knots spaced on a uniform grid and has full support for learning. The circuit can also be used to approximate multi-variable functions as a sum of splines.\n\nAn interesting and as yet nearly untapped set of applications for VLSI implementation of neural network learning systems can be found in adaptive control and non-linear signal processing. In most such applications, the learning task consists of approximating a real function of a small number of continuous variables from discrete data points. Special purpose hardware is especially interesting for applications of this type since they generally require real-time on-line learning and there can be stiff constraints on the power budget and size of the hardware. Frequently, the already difficult learning problem is made more complex by the non-stationary nature of the underlying process.\n\nConventional feed-forward networks with sigmoidal units are clearly inappropriate for applications of this type. Although they have exhibited remarkable performance in some types of time series prediction problems (for example, Weigend, 1990 and Atlas, 1990), their learning rates in general are too slow for on-line learning. 
On-line performance can be improved most easily by using networks with a more constrained architecture, effectively making the learning problem easier by giving the network a hint about the learning task. Networks that build local representations of the data, such as radial basis functions, are excellent candidates for these types of problems. One great advantage of such networks is that they require only a single layer of units. If the position and width of the units are fixed, the learning problem is linear in the coefficients and local. By local we mean that the computation of a weight change requires only information that is locally available to each weight, a highly desirable property for VLSI implementation. If the learning algorithm is allowed to adjust both the position and width of the units, then many of the advantages of locally tuned units are lost.\n\nA number of techniques have been proposed for the determination of the width and placement of the units. One of the most direct is to center a unit at every data point and to adjust the widths of the units so the receptive fields overlap with those of neighboring data points (Broomhead, 1989). The proliferation of units can be limited by using unsupervised clustering techniques to clump the data followed by the allocation of units to fit the clumps (Moody, 1989). Others have advocated assigning new units only when the error on a new data point is larger than a threshold and otherwise making small adjustments in the weights and parameters of the existing units (Platt, 1990). All of these methods suffer from the common problem of requiring an indeterminate quantity of resources, in contrast with the fixed resources available from most VLSI circuits. 
Even worse, when used with non-stationary processes a mechanism is needed to deallocate units as well as to allocate them. The resource allocation/deallocation problem is a serious barrier to implementing these algorithms as autonomous VLSI microsystems.\n\nA Splining Network\n\nTo avoid the resource allocation problem we propose a network that uses all of its weights and units regardless of the problem. We avoid over-parameterization of the training data by building constraints on smoothness into the network, thus reducing the number of degrees of freedom available to the training process. In its simplest guise, the network approximates arbitrary 1-d smooth functions with a linear superposition of locally tuned units spaced on a uniform grid,\n\n$$g(z) = \\sum_i w_i f_\\sigma(z - i\\,\\Delta z) \\qquad (1)$$\n\nwhere $\\sigma$ is the radius of the unit's receptive field and the $w_i$ are the weights. $f_\\sigma$ is a bump of width $\\sigma$ such as a gaussian or a cubic spline basis function. Mathematically the network is closely related to function approximation using B-splines (Lancaster, 1986) with uniformly spaced knots. However, in B-spline interpolation the overlap of the basis functions is normally determined by the degree of the spline, whereas we use the degree of overlap as a free parameter to constrain the smoothness of the network's output. As mentioned earlier, the network is linear in its weights, so gradient descent with a quadratic cost function (LMS) is an effective training procedure.\n\nThe weights needed for this network can easily be implemented in CMOS with an array of transconductance amplifiers. The amplifiers are wired as voltage followers with their outputs tied together and the weights are represented by voltages $V_i$ at the non-inverting inputs of the amplifiers. 
If the outputs of the locally tuned units are represented by unipolar currents $I_i$, these currents can be used to bias the transconductance amplifiers and the result is (Mead, 1989)\n\n$$V_{out} = \\frac{\\sum_i I_i V_i}{\\sum_i I_i}$$\n\nprovided that care is taken to control the non-linearities of the amplifiers. However, while the weights have a simple implementation in analog VLSI circuitry, the input units do not. A number of circuits exist whose transfer characteristics can be shaped to be a suitable bump but none of those known to the authors allow the width of the bump to be adjusted over a wide range without the use of resistors.\n\nGenerating the Receptive Fields\n\nInput units with tunable receptive fields can be generated quite efficiently by breaking them up into two layers of circuitry as shown in figure 1. The input layer place encodes the input signal, i.e. only one or perhaps a small cluster of units is active at a time. The output of the place encoding units either injects or controls the injection of current into the laterally connected spreading layer. The elements in the spreading layer all contain ground terminals and the current sunk by each one determines the bias current applied to the associated weight. Clearly, the distribution of currents flowing to ground through the spreading layer forms a smooth bump such that when excitation is applied to tap j of the spreading layer,\n\n$$I_i = I_0 f_\\sigma(i - j)$$\n\nwhere $f_\\sigma(i)$ is the bump called for by equation 1.\n\n[Figure 1 labels: input, place encoding, spreading layer, weight, output.]\n\nFigure 1: An architecture that allows the width and shape of the receptive fields to be varied over a wide range. The elements of the 'spreading layer' are passive and can sink current to ground.\n\n
In our earliest realizations of this network the input layer was a crude flash A-to-D converter and the input to the circuit was analog. In the current generation the input is digital with the place encoding performed by a conventional address decoder. If desired, input quantization can be avoided by using a layer of amplifiers that generate smooth bumps of fixed width to generate the input place encoding.\n\nThe simplest candidate to implement the spreading layer in conventional CMOS is a set of diode connected n-channel transistors laterally connected by n-channel pass transistors. The gate voltages of the diode connected transistors determine the bias currents $I_i$ of the weights. Ignoring the body effect and assuming weak inversion in the current sink, this type of network tends to give bumps with rather sharp peaks, $I_i \\approx \\sum_j I_0 e^{-a|i|}$, where $|i|$ is the distance from the point where the excitation is applied. Figure 2 shows a more sophisticated version of this circuit in which the output of the place encoding units applies excitation to the spreading network through a p-channel transistor. The shape of the bumps can be softened by limiting the amount of current drawn by the current sinks with an n-channel cascode transistor in series with the current sink.\n\n[Figure 2 labels: to weights, bias voltages, from place encoder.]\n\nFigure 2: A schematic of a section of the spreading layer. Roughly speaking, the n-channel pass transistor controls the extent of the tails of the bumps and the p-channel pass transistor and the cascode transistor control its width.\n\nSome experimental results for this type of circuit are shown in figure 3a. More control can be obtained by using complementary pass transistors. 
The use of p-channel pass transistors alone unexpectedly results in bumps that are nearly square (figure 3b). These can be smoothed by using both flavors of pass transistor simultaneously (figure 3c).\n\nThe Weights\n\nAs described earlier, the implementation of the output weights is based on the computation of means by the well known follower-aggregation circuit. With typical transconductance amplifiers, this averaging is linear only when the voltages being averaged are distributed over a voltage range of no more than a few times $U_0 = kT/e$ in weak inversion. In the circuits described here the linear range has been widened to nearly a volt by reducing the transconductance of the readout amplifiers through the combination of low width to length ratio input transistors and relatively large tail currents.\n\nThe weights $V_i$ are stored on MOS capacitors and are programmed by the gated transconductance amplifier shown in figure 4. Since this amplifier computes the\n\n[Figure 3 plots: panels a, b, c; horizontal axis: Tap Number.]\n\nFigure 3: Experimental measurements of the receptive field shapes obtained from different types of networks. (a) n-channel transistors for several gate voltages. (b) p-channel transistors for several gate voltages. (c) Both n-channel and p-channel pass transistors.\n\nFigure 4: Schematic of an output weight including the circuitry to generate weight updates. 
To minimize leakage and charge injection simultaneously, the pass transistors used to gate the weight change amplifier are of minimum size and a separate transistor turns off the output transistors of the amplifier.\n\ndifference between the target voltage and the actual output of the network, the learning rule is just LMS,\n\n$$\\Delta V_i = \\frac{g_i \\tau}{C} \\left( V_{target} - V_{out} \\right)$$\n\nwhere $C$ is the capacitance of the storage capacitor and $\\tau$ is the duration of weight changes. The transconductance $g_i$ of the weight change amplifier is determined by the strength of the excitation current from the spreading layer, $g_i \\propto I_i$ in weak inversion. Since the weight changes are governed by the strengths of the excitation currents from the spreading layer, clusters of weights are changed at a time. This enhances the fault tolerance of the circuit since the group of weights surrounding a bad one can compensate for it.\n\nExperimental Evaluation\n\nSeveral different chips have been fabricated in 2 μm p-well CMOS and tested to evaluate the principles described here. The most recent of these has 512 weights arranged in a 64 x 8 matrix connected to form a one dimensional array. The active area of this chip is 4.1 mm x 3.7 mm. The input signal is digital with the place encoding performed by a conventional address decoder. To maximize the flexibility of the chip, the excitation is applied to the spreading layer by a register located in each cell. By writing to multiple registers between resets, the spreading layer can be excited at multiple points simultaneously. This feature allows the chip to be treated as a single 1-dimensional spline with 512 weights or, for example, as the sum of four distinct 1-dimensional splines each made up of 128 weights. 
One of the most noticeable virtues of this design is the simplicity of the layout due to the absence of any clear distinction between 'weights' and 'units'. The primitive cell consists of a register, a piece of the spreading network, a weight change amplifier, a storage capacitor and an output amplifier. All but a tiny fraction of the chip is a tiling of this primitive cell. The excess circuitry consists of the address decoders, a timing circuit to control the duration of weight changes and some biasing circuitry for the spreading layer.\n\nTo execute LMS learning, the user need only provide a sequence of target voltages and a current proportional to the duration of weight changes. Under reasonable operating conditions a weight update cycle takes less than 1 μs, implying a weight change rate of $5 \\times 10^8$ connections/second. The response of the chip to a single weight change after initialization is shown in figure 5a. One feature of this plot is striking: even though the distribution of offsets in the individual amplifiers has a variance of 13 mV, the ripple in the output of the chip is about 1 mV. For some computations, it appears the limiting factor on the accuracy of the chip is the rate of weight decay, about 10 mV/s.\n\nAs a more strenuous test of the functionality of the chip we trained it to predict chaotic time series generated by the well known logistic equation,\n\n$$x_{t+1} = 4 a x_t (1 - x_t), \\quad a < 1.$$\n\nSome experimental results for the mean prediction error are shown in figure 5b. In these experiments, a mean prediction error of 3% is achieved, which is well above the intrinsic accuracy of the circuit. 
A detailed examination of the error rate as a function of the size and shape of the bumps indicates that the problem lies in the long tails exhibited by the spreading layer when the n-channel pass transistors are turned on. This tail falls off very slowly due to the body effect. One remedy to this problem is to actively bias the gates of the n-channel pass transistors to be a programmed offset above their source voltages (Mead, 1989). A simpler solution is to subtract a fixed current from each of the bias currents defined by the spreading layer. This solution costs a mere 4 transistors and has the added benefit of guaranteeing that the bumps will always have finite support.\n\n[Figure 5 plots: (a) horizontal axis: input value; (b) horizontal axis: time.]\n\nFigure 5: Some experimental results from a splining circuit. (a) The response of the circuit to learning one data point after initialization of the weights to a constant value. (b) Experimental mean prediction error while learning a chaotic time series.\n\nConclusion\n\nWe have demonstrated that neural network learning can be efficiently mapped onto analog VLSI provided that the network architecture and training procedure are tailored to match the constraints imposed by VLSI. Besides the computational speed and low power consumption (300pA) that follow directly from this mapping onto VLSI, the circuit also demonstrates intrinsic fault tolerance to defects in the weights.\n\nAcknowledgements\n\nThis work was initially inspired by a discussion with A. G. Barto and R. S. Sutton. A discussion with J. 
Moody was also helpful.\n\nReferences\n\n[1] L. Atlas, R. Cole, Y. Muthusamy, A. Lippman, J. Connor, D. Park, M. El-Sharkawi, and R. J. Marks II. A performance comparison of trained multi-layer perceptrons and trained classification trees. IEEE Proceedings, 1990.\n\n[2] D. S. Broomhead and D. Lowe. Multivariable function interpolation and adaptive networks. Complex Systems, 2:321-355, 1988.\n\n[3] P. Lancaster and K. Salkauskas. Curve and Surface Fitting. Academic Press, 1986.\n\n[4] C. Mead. Analog VLSI and Neural Systems. Addison-Wesley, 1989.\n\n[5] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2), 1989.\n\n[6] J. Platt. A resource-allocating neural network for function interpolation. In Richard P. Lippman, John Moody, and David S. Touretzky, editors, Advances in Neural Information Processing Systems 3, 1991.\n\n[7] A. S. Weigend, B. A. Huberman, and D. E. Rumelhart. Predicting the future: A connectionist approach. International Journal of Neural Systems, 3, 1990.\n", "award": [], "sourceid": 342, "authors": [{"given_name": "Daniel", "family_name": "Schwartz", "institution": null}, {"given_name": "Vijay", "family_name": "Samalam", "institution": null}]}