{"title": "Learning in Higher-Order \"Artificial Dendritic Trees\"", "book": "Advances in Neural Information Processing Systems", "page_first": 490, "page_last": 497, "abstract": null, "full_text": "490\n\nBell\n\nLearning in higher-order 'artificial dendritic trees'\n\nTony Bell\n\nArtificial Intelligence Laboratory\n\nVrije Universiteit Brussel\n\nPleinlaan 2, B-1050 Brussels, BELGIUM\n\n(tony@arti.vub.ac.be)\n\nABSTRACT\n\nIf neurons sum up their inputs in a non-linear way, as some simulations suggest, how is this distributed fine-grained non-linearity exploited during learning? How are all the small sigmoids in synapse, spine and dendritic tree lined up in the right areas of their respective input spaces? In this report, I show how an abstract atemporal highly nested tree structure, with a quadratic transfer function associated with each branchpoint, can self-organise using only a single global reinforcement scalar to perform binary classification tasks. The procedure works well, solving the 6-multiplexer and a difficult phoneme classification task as well as back-propagation does, and faster. Furthermore, it does not calculate an error gradient, but uses a statistical scheme to build moving models of the reinforcement signal.\n\n1. INTRODUCTION\nThe computational territory between the linearly summing McCulloch-Pitts neuron and the non-linear differential equations of Hodgkin & Huxley is relatively sparsely populated. Connectionists use variants of the former and computational neuroscientists struggle with the exploding parameter spaces provided by the latter. 
However, evidence from biophysical simulations suggests that the voltage transfer properties of synapses, spines and dendritic membranes involve many detailed non-linear interactions, not just a squashing function at the cell body. Real neurons may indeed be higher-order nets.\n\nFor the computationally-minded, higher-order interactions mean, first of all, quadratic terms. This contribution presents a simple learning principle for a binary tree with a logistic/quadratic transfer function at each node. These functions, though highly nested, are shown to be capable of changing their shape in concert. The resulting tree structure receives inputs at its leaves and, at the top, outputs an estimate of the probability that the input pattern is a member of one of two classes.\n\nA number of other schemes exist for learning in higher-order neural nets. Sigma-Pi units, higher-order threshold logic units (Giles & Maxwell, 87) and product units (Durbin & Rumelhart, 89) are all examples of units which learn coefficients of non-linear functions. Product unit networks, like Radial Basis Function nets, consist of a layer of non-linear transformations, followed by a normal Perceptron-style layer. The scheme presented here has more in common with the work reviewed in Barron (88) (see also Tenorio 90) on polynomial networks in that it uses low order polynomials in a tree of low degree. The differences lie in a global rather than layer-by-layer learning scheme, and a transfer function derived from a gaussian discriminant function.\n\n2. 
THE ARTIFICIAL DENDRITIC TREE (ADT)\nThe network architecture in Figure 1(a) is that of a binary tree which propagates real number values from its leaf nodes (or inputs) to its root node, which is the output. In this simple formulation, the tree is construed as a binary classifier. The output node signals a number between 0 and 1 which represents the probability that the pattern presented to the tree was a member of the positive rather than the negative class of patterns. Because the input patterns may have extremely high dimension and the tree is, at least initially, constrained to be binary, the depth of the tree may be significant, at least more than one might like to back-propagate through. A transfer function is associated with each 'hidden' node of the tree and the output node. This will hereafter be referred to as a Z-function, for the simple reason that it takes in two variables X and Y, and outputs Z. A cascade of Z-functions performs the computation of the tree and the learning procedure consists of changing these functions. The tree is referred to as an Artificial Dendritic Tree or ADT with the same degree of licence that one may talk of Artificial Neural Networks, or ANNs.\n\nFigure 1: (a) an Artificial Dendritic Tree, (b) a 1D Z-node, (c) a 2D Z-node, (d) a 1D Z-function constructed from 2 gaussians, (e) approximating a step function.\n\n2.1. THE TRANSFER FUNCTION\nThe idea behind the Z-function is to allow the two variables arriving at a node to interact locally in a non-linear way which contributes to the global computation of the tree. The transfer function is derived from statistical considerations. 
To simplify, consider the one-dimensional case of a variable X travelling on a wire as in Figure 1(b). A statistical estimation procedure could observe the distribution of values of X when the global pattern was positive or negative and derive a decision rule from these. In Figure 1(d), the two density functions $f^+(x)$ and $f^-(x)$ are plotted. Where they meet, the local computation must answer that, based on its information, the global pattern is positively classified with probability 0.5. Assuming that there are equal numbers of positive and negative patterns (ie: that the a priori probability of positive is 0.5), it is easy to see that the conditional probability of being in the positive class given our value for X is given by equation (1).\n\n$$z(x) = P[\\text{class}=+\\text{ve} \\mid x] = \\frac{f^+(x)}{f^+(x)+f^-(x)} \\qquad (1)$$\n\nThis can also be derived from Bayesian reasoning (Therrien, 89). The form of $z(x)$ is shown with the thick line in Figure 1(d) for the given $f^+(x)$ and $f^-(x)$. If $f^+(x)$ and $f^-(x)$ can be usefully approximated by normal (gaussian) curves as plotted above, then (1) translates into (2):\n\n$$z(x) = \\frac{1}{1+e^{-\\text{input}}}, \\qquad \\text{input} = \\beta^-(x) - \\beta^+(x) + \\ln\\left[\\frac{\\alpha^-}{\\alpha^+}\\right] \\qquad (2)$$\n\nThis can be obtained by substituting equation (4) below into (1) using the definitions of $\\alpha$ and $\\beta$ given. The exact form $\\alpha$ and $\\beta$ take depends on the number of variables input. The first striking thing is that the form of (2) is exactly that of the back-propagation logistic function. The second is that input is a quadratic polynomial expression. 
For Z-functions with 2 inputs $(x,y)$, using formulas (4.2), it takes the form:\n\n$$w_1x^2 + w_2y^2 + w_3xy + w_4x + w_5y + w_6 \\qquad (3)$$\n\nThe $w$'s can be thought of as weights just as in backprop, defining a 6D space of transfer functions. However, optimising the $w$'s directly through gradient descent may not be the best idea (though this is what Tenorio does), since for any error function $E$, $\\partial E/\\partial w_1 = x\\,\\partial E/\\partial w_4$ and $\\partial E/\\partial w_3 = y\\,\\partial E/\\partial w_4$. That is, the axes of the optimisation are not independent of each other. There are, however, two sets of 5 independent parameters which the $w$'s in (3) are actually composed from if we calculate input from (4.2). These are $\\mu_x^+$, $\\sigma_x^+$, $\\mu_y^+$, $\\sigma_y^+$ and $r^+$, denoting the means, standard deviations and correlation coefficient defining the two-dimensional distribution of $(x,y)$ values which should be positively classified. The other 5 variables define the negative distribution.\n\nThus 2 gaussians (hereafter referred to as the positive and negative models) define a quadratic transfer function (called the Z-function) which can be interpreted as expressing conditional probability of positive class membership. The shape of these functions can be altered by changing the statistical parameters defining the distributions which underlie them. In Figure 1(d), a 1-dimensional Z-function is seen to be sigmoidal, though it need not be monotonic at all. Figure 2(b)-(h) shows a selection of 2D Z-functions. In general the Z-function divides its N-dimensional input space with an (N-1)-dimensional hypersurface. In 2D, this will be an ellipse, a parabola, a hyperbola or some combination of the three. Although the dividing surface is quadratic, the Z-function is still a logistic or squashing function. 
The exponent input is actually equivalent to the log likelihood ratio, or $\\ln(f^+(x)/f^-(x))$, commonly used in statistics.\n\nIn this work, 2-dimensional gaussians are used to generate Z-functions. There are compelling reasons for this. One-dimensional Z-functions are of little use since they do not reduce information. Z-functions of dimension higher than 1 perform optimal class-based information reduction by propagating conditional probabilities of class membership. But 2D Z-functions using 2D gaussians are of particular interest because they include in their function space all boolean functions of two variables (or at least analogue versions of these functions). For example, the gaussians which would come to represent the positive and negative exemplar patterns for XOR are drawn as ellipses in Figure 2(a). They have equal means and variances but the negative exemplar patterns are correlated while the positive ones are anti-correlated. These models automatically give rise to the XOR surface in Figure 2(b) if put through equation (2). An interesting observation is that a problem of Nth order (XOR is 2nd order, 3-parity is 3rd order etc) can be solved by a polynomial of degree N (Figure 2d). Since 2nd degree polynomials like (3) are used in our system, there is one step up in power from 1st degree systems like the Perceptron. Thus 3-parity is to the Z-function unit what XOR is to the Perceptron (in this case not quadratically separable).
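The XOR construction described above can be checked numerically. What follows is a minimal sketch, not the original implementation: the function names and the model parameters (means 0.5, standard deviations 0.5, correlations -0.8 and +0.8) are assumptions chosen for illustration. It evaluates equation (2) with alpha and beta taken from the two-dimensional definitions (4.2.1) and (4.2.2).

```python
import math


def gaussian_terms(x, y, mx, my, sx, sy, r):
    # Return (alpha, beta) of a bivariate gaussian f = exp(-beta) / alpha,
    # following the two-dimensional forms (4.2.1) and (4.2.2).
    alpha = 2.0 * math.pi * sx * sy * math.sqrt(1.0 - r * r)
    dx = (x - mx) / sx
    dy = (y - my) / sy
    beta = (dx * dx + dy * dy - 2.0 * r * dx * dy) / (2.0 * (1.0 - r * r))
    return alpha, beta


def z_function(x, y, pos, neg):
    # Equation (2): logistic of beta_neg - beta_pos + ln(alpha_neg / alpha_pos).
    a_pos, b_pos = gaussian_terms(x, y, *pos)
    a_neg, b_neg = gaussian_terms(x, y, *neg)
    net = b_neg - b_pos + math.log(a_neg / a_pos)
    return 1.0 / (1.0 + math.exp(-net))


# Assumed XOR models: equal means and variances, opposite correlations.
# Each tuple is (mean_x, mean_y, sd_x, sd_y, correlation).
POS = (0.5, 0.5, 0.5, 0.5, -0.8)  # positive model, anti-correlated
NEG = (0.5, 0.5, 0.5, 0.5, 0.8)   # negative model, correlated

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, y, round(z_function(x, y, POS, NEG), 3))
```

With these assumed parameters the output is close to 1 at (0,1) and (1,0), close to 0 at (0,0) and (1,1), and 0.5 at the shared mean, where the two densities meet.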
\n\nA GAUSSIAN IS:\n\n$$f(x) = \\frac{1}{\\alpha}e^{-\\beta(x)} \\qquad (4)$$\n\nin one dimension:\n\n$$\\alpha = (2\\pi)^{1/2}\\sigma_x \\qquad (4.1.1)$$\n\n$$\\beta(x) = \\frac{(x-\\mu_x)^2}{2\\sigma_x^2} \\qquad (4.1.2)$$\n\nin two dimensions:\n\n$$\\alpha = 2\\pi\\sigma_x\\sigma_y(1-r^2)^{1/2} \\qquad (4.2.1)$$\n\n$$\\beta(x,y) = \\frac{1}{2(1-r^2)}\\left[\\frac{(x-\\mu_x)^2}{\\sigma_x^2} + \\frac{(y-\\mu_y)^2}{\\sigma_y^2} - \\frac{2r(x-\\mu_x)(y-\\mu_y)}{\\sigma_x\\sigma_y}\\right] \\qquad (4.2.2)$$\n\nin n dimensions:\n\n$$\\alpha = (2\\pi)^{n/2}|K|^{1/2} \\qquad (4.n.1)$$\n\n$$\\beta(\\mathbf{x}) = \\frac{1}{2}(\\mathbf{x}-\\mathbf{m})^T K^{-1}(\\mathbf{x}-\\mathbf{m}) \\qquad (4.n.2)$$\n\nwhere $\\mu_x = E[x]$ is the expected value or mean of $x$; $\\sigma_x^2 = E[x^2]-\\mu_x^2$ is the variance of $x$; $r = (E[xy]-\\mu_x\\mu_y)/(\\sigma_x\\sigma_y)$ is the correlation coefficient of a bivariate gaussian; $\\mathbf{m} = E[\\mathbf{x}]$ is the mean vector of a multivariate gaussian; and $K = E[(\\mathbf{x}-\\mathbf{m})(\\mathbf{x}-\\mathbf{m})^T]$ is the covariance matrix of a multivariate gaussian, with $|K|$ its determinant.\n\nFigure 2: (a) two anti-correlated gaussians seen from above, (b) the resulting Z-function, (c)-(h) some other 2D Z-functions. (i) 3-parity in a cube cannot be solved by a 3D Z-function (j) but yields to a cascade of 2D ones (k).\n\n2.2. THE LEARNING PROCEDURE\nIf gaussians are used to model the distribution of inputs x which give positive and negative classification errors, rather than just the distribution of positively and negatively classified x, then it is possible to formulate an incremental learning procedure for training Z-functions. This procedure enables the system to deal with data which is not gaussianly distributed.\n\n2.2.1. Without hidden units: learning a step function.\nA simple example illustrates this principle. Consider a network consisting entirely of a 1-dimensional Z-function, as in Figure 1(b). 
The input is a real number from 0 to 1 and the output is to be a step function, such that 0.5-1.0 is classed positively (output 1.0) and 0.0-0.5 should output 0.0. The 4 parameters of the Z-function ($\\mu^+$, $\\mu^-$, $\\sigma^+$, $\\sigma^-$) are initialised randomly and example patterns are presented to the 'tree'. On each presentation $t$, the error $\\delta$ in the response is calculated by $\\delta_t \\leftarrow d_t - o_t$, the desired minus the actual output at time $t$, and 2 of the parameters are altered. If the error is positive, the positive model is altered, otherwise the negative model is altered. Changing a model consists of 'sliding' the estimates of the appropriate first and second moments ($E[x]$ and $E[x^2]$) according to a 'moving-average' scheme:\n\n$$E[x]_t \\leftarrow \\epsilon\\delta_t x_t + (1-\\epsilon\\delta_t)E[x]_{t-1} \\qquad (5.1)$$\n\n$$E[x^2]_t \\leftarrow \\epsilon\\delta_t x_t^2 + (1-\\epsilon\\delta_t)E[x^2]_{t-1} \\qquad (5.2)$$\n\nwhere $\\epsilon$ is a plasticity or learning rate, $x_t$ is the value input and $E[x]_{t-1}$ was the previous estimate of the mean value of $x$ for the appropriate gaussian. This rule means that at any moment, the parameters determining the positive and negative models are weighted averages of recent inputs which have generated errors. The influence which a particular input has had decays over time. This algorithm was run with $\\epsilon=0.1$. After 100 random numbers had been presented, with error signals from the step-function changing the models, the models come to represent well the distribution of positive and negative inputs. At this stage the models and their associated Z-function are those shown in Figure 1(d). But now, most of the error reinforcement will be coming from a small region around 0.5, which means that since the gaussians are modelling the errors, they will be drawn towards the centre and become narrower. 
This has the effect, Figure 1(e), of increasing the gain of the sigmoidal Z-function. In the limit, it will converge to a perfect step function as the gaussians become infinitesimally separated delta functions. This initial demonstration shows the automatic gain adjustment property of the Z-function.\n\n2.2.2. With hidden units: the 6-multiplexer.\nThe first example showed how a 1D Z-function can minimise error by modelling it. This example shows how a cascade of 2-dimensional Z-functions can co-operate to solve a 3rd order problem. A 6-multiplexer circuit receives as input 6 bits, 4 of which are data bits and 2 are address bits. If the address bits are 00, it must output the contents of the first data bit, if 01, the second, 10 the third and 11 the fourth. There are 64 different input patterns. Choosing a tree architecture is a difficult problem in general, but the first step is to choose one which we know can solve the problem. This is illustrated in Figure 3(a). This is an architecture for which there exists a solution using binary Boolean functions.\n\nThe tree's solution was arrived at as follows: each node was initialised with 10 random values: $E[x]$, $E[y]$, $E[x^2]$, $E[y^2]$ and $E[xy]$ for each of its positive and negative models. The learning rate $\\epsilon$ was set to 0.02 and input patterns were generated and propagated up to the top node, where an error measurement was made. The error was then broadcast globally to all nodes, each one, in effect, being told to respond more positively (or negatively) should the same circumstances arise again, and adjusting its Z-function in the same way as equations (5). 
This time, however, 5 parameters were adjusted per node per presentation, instead of 2. Again, which model (positive or negative) is adjusted depends on the sign of the error at the top of the tree.\n\nThe tree learns after about 200 random bit patterns are presented (7 seconds on a Symbolics). After 300 presentations (the state depicted in Figure 3a), the mean squared error is falling steadily to zero. An adequate back-propagation network takes 6000 presentations to converge on a solution. The solution achieved is a rather messy combination of half-hearted XORs and NXORs, and ambiguous AND/ORs. The problem was tried with different trees. In general any tree of sufficient richness can solve the problem, though larger trees take longer. Trees for which no nice solutions exist, ie: those with fewer than 6 well-chosen inputs from the address bits, can sometimes still perform rather well. A tree with straight convergence, only one contact per address bit, can still quickly approach 80% performance, but further training is destructive. Figure 3(b) shows a tree trained to output 1 if half or more of its 8 inputs were on.\n\nFigure 3: Solving the 6-multiplexer (a) and the 8-majority predicate (b)\n\n2.2.3. Phoneme classification.\nA good question was whether such a tree could perform well on a large problem, so a typical back-propagation application was attempted. Space does not permit a full account here, but the details appear in Bell (89). The data came from 100 speakers speaking the confusable E-set phonemes (B, D, E and V). 
This was the same data as that used by Lang & Hinton (88). Four trees were built out of 192 input units and the trees trained using a learning schedule of $\\epsilon$ falling from 0.01 to 0.001 over the course of 30 presentations of each of 668 training patterns. Generalisation to a test set was 88.5%, 0.5% worse than an equivalently simple backprop net. A more sophisticated backprop net, using time-delays and multiresolution training, could reach 93% generalisation. Thirty epochs with the trees took some 16 hours on a Sun 3-260 whereas the backprop experiments were performed on a Convex supercomputer. The conclusion from these experiments is that trees some 8 levels deep are capable of almost matching normal back-propagation on a large classification task in a fraction of the training time. Attempts to build time-symmetry into the trees have not so far been successful.\n\n3. DISCUSSION\nEven within the context of other connectionist learning procedures, there is something of an air of mystery about this one. The apparatus of gradient descent, either for individual units or for the whole tree, is absent or at least hidden.\n\n3.1. HOW DOES IT WORK?\nIt is necessary to reflect on the effect of modelling errors. Models of errors are an attempt to push a node's outputs towards the edge of its parent's input square. Where the model is perfect, it is simple for the node above to model the model by applying a sigmoid, and so on to the top of the tree, where the error disappears. But the modelling is actually done in a totally distributed and collaborative way. 
The identification of 1.0 with positive error (top output too small) means that Z-functions are more likely to be monotonic towards (1,1) the further they are from the inputs.\n\nTwo standard problems are overcome in unusual ways. The first, credit assignment, is solved because different Z-functions are able to model different errors, giving them different roles. Although all nodes receive the same feedback, some changes to a node's model will be swiftly undone when the new errors that result from them begin to be broadcast. Other nodes can change freely either because they are not yet essential to the computation or because there exist alterations of their models tolerable to the nodes above. The second problem is stability. In backprop, the way the error diffuses through the net ensures that the upper weights are slaved to the lower ones because the lower are changing more slowly. In this system, the upper nodes are slaved to the lower ones because they are explicitly modelling their activities. Conversely, the lower nodes will never be allowed to change too quickly since the errors generated by sluggish top nodes will throw them back into the behaviour the top nodes expect. For a low enough learning rate $\\epsilon$, the solutions are stable.\n\nAmongst the real problems with this system are the following. First, the credit assignment problem is not solved for units receiving the same input variables, making many normal connectionist architectures impossible. Second, the system can only deal with 2 classes. Third, as with other algorithms, choice of architecture is a 'black art'.\n\n3.2. BIOPHYSICS & REAL NEURONS\nThe name 'Artificial Dendritic Tree' is perhaps overdoing it. 
The tree has no dynamic properties, activation flows in only one direction, the branchpoints of the tree routinely implement XOR and the 'cell' as a whole implements phoneme recognition (only a small step from grandmothers). The title was kept because what drove the work was a search for a computational explanation of how fine-grained local non-linearities of low degree could combine in a learning process. Work in computational neuroscience, in particular with compartmental models (Koch & Poggio 87; Rall & Segev 88; Segev et al 89; Shepherd & Brayton 87), has shown that it is likely that many non-linear effects take place between synapse and soma. Synaptic transfer functions can be sigmoidal, spines with active channels may mutually excite each other (even implement boolean computations) and inhibitory inputs can 'veto' firing in a highly non-linear fashion (silent inhibition). The dendritic membrane itself is filled with active ion channels, whose boosting or quenching properties depend in a complex way on the intracellular voltage levels or Ca2+ concentration (itself dependent on voltage). Thus we may be able to consider the membrane itself as a distributed processing system, meaning that the synapses are no longer the privileged sites of learning which they have tended to be since Hebb. Active channels can serve to implement threshold functions just as well at the dendritic branchpoints as at the soma, where they generate spikes. There are many different kinds of ion channel (Yamada et al, 89) with inhomogeneous distributions over the dendritic tree. 
A neuron's DNA may generate a certain 'base set' of channel proteins that span a non-linear function space just as our parameters span the Z-function space. The properties of a part of dendritic membrane could be seen as a point in channel space. Viewed this way, the neuron becomes one large computer. When one considers the Purkinje cell of the cerebellum with 100,000 inputs, as many spines, a massive arborisation full of active channels, many of them Ca-permeable or Ca-dependent, with spiking and plateau potentials occurring in the dendritic tree, the notion that the cell may be implementing a 99,999-dimensional hyperplane starts to recede. There is an extra motivation for considering the cell as a complex computer. Algorithms such as back-propagation would require feedback circuits to send error. If the cell is the feedback unit, then reinforcement can occur as a spike at the soma reinvades the dendritic tree. Thus nerves may not spike just for axonal purposes, but also to penetrate the electrotonic length of the dendrites. This was thought to be a component of Hebbian learning at the synapses, but it could be the basis of more if the dendritic membrane computes.\n\n4. Acknowledgements\nTo Kevin Lang for the speech data and to Rolf Pfeifer and Luc Steels for support. Further credits in Bell (90). The author is funded by ESPRIT B.R.A. 3234.\n\n5. References\nBarron A & Barron R (88) Statistical Learning Networks: a unifying view, in Wegman E (ed) Proc. 20th Symp. on Comp. Science & Statistics [see also this volume]\n\nBell T (89) Artificial Dendritic Learning, in Almeida L. (ed) Proc. EURASIP Workshop on Neural Networks. 
Lecture Notes in Computer Science, Springer-Verlag. [also VUB AI-lab Memo 89-20].\n\nDurbin R & Rumelhart D (89) Product Units: A Computationally Powerful and Biologically Plausible Extension to Backpropagation Nets. Neural Computation 1\n\nGiles C.L. & Maxwell T (87) Learning, invariance and generalisation in high-order neural networks. Applied Optics vol 26, no. 23\n\nKoch C & Poggio T (87) Biophysics of Computational Systems: Neurons, synapses and membranes, in G. Edelman et al (eds) Synaptic Function. John Wiley.\n\nLang K & Hinton G (88) The Development of the Time-Delay Neural Network Architecture for Speech Recognition. Tech Report CMU-CS-88-152\n\nRall W & Segev I (88) Excitable Dendritic Spine Clusters: non-linear synaptic processing, in R. Cotterill (ed) Computer Simulation in Brain Science. Camb. U.P.\n\nSegev I, Fleshman J & Burke R (89) Compartmental Models of Complex Neurons, in Methods in Neuronal Modelling, Koch C & Segev I (eds) MIT Press 1989\n\nShepherd G & Brayton R (87) Logic operations are properties of computer simulated interactions between excitable dendritic spines. Neuroscience, vol 21, no. 1, 1987\n\nTenorio M & Lee W (90) Self-Organizing Network for Optimal Supervised Learning, IEEE Transactions on Neural Networks, 1990 [see also this volume]\n\nTherrien C (89) Decision Estimation and Classification.\n\nYamada W, Koch C & Adams P (89) Multiple Channels and Calcium Dynamics, in Methods in Neuronal Modelling, Koch C & Segev I (eds) MIT Press 1989.", "award": [], "sourceid": 202, "authors": [{"given_name": "Tony", "family_name": "Bell", "institution": null}]}