{"title": "A Network of Localized Linear Discriminants", "book": "Advances in Neural Information Processing Systems", "page_first": 1102, "page_last": 1109, "abstract": null, "full_text": "A Network of Localized Linear Discriminants \n\nMartin S. Glassman \n\nSiemens Corporate Research \n\n755 College Road East \nPrinceton, NJ 08540 \n\nmsg@siemens.siemens.com \n\nAbstract \n\nThe localized linear discriminant network (LLDN) has been designed to address classification problems containing relatively closely spaced data from different classes (encounter zones [1], the accuracy problem [2]). Locally trained hyperplane segments are an effective way to define the decision boundaries for these regions [3]. The LLD uses a modified perceptron training algorithm for effective discovery of separating hyperplane/sigmoid units within narrow boundaries. The basic unit of the network is the discriminant receptive field (DRF), which combines the LLD function with Gaussians representing the dispersion of the local training data with respect to the hyperplane. The DRF implements a local distance measure [4], and obtains the benefits of networks of localized units [5]. A constructive algorithm for the two-class case is described which incorporates DRF's into the hidden layer to solve local discrimination problems. The output unit produces a smoothed, piecewise linear decision boundary. Preliminary results indicate the ability of the LLDN to efficiently achieve separation when boundaries are narrow and complex, in cases where both the \"standard\" multilayer perceptron (MLP) and k-nearest neighbor (KNN) yield high error rates on training data. \n\n1  The LLD Training Algorithm and DRF Generation \n\nThe LLD is defined by the hyperplane normal vector V and its \"midpoint\" M (a translated origin [1] near the center of gravity of the training data in feature space). 
Incremental corrections to V and M accrue for each training token feature vector Y_j in the training set, as illustrated in figure 1 (exaggerated magnitudes). The surface of the hyperplane is appropriately moved either towards or away from Y_j by rotating V, and shifting M along the axis defined by V. M is always shifted towards Y_j in the \"radial\" direction R_j (which is the component of D_j orthogonal to V, where D_j = Y_j - M): \n\n[Figure 1 panels: token on correct side of hyperplane; token on wrong side of hyperplane] \n\nFigure 1: LLD incremental correction vectors associated with training token Y_j are shown above, and the corresponding LLD update rules below: \n\nΔV = μ(n) Σ_j ΔV_j = μ(n) Σ_j (-S_c W_c δ_j) D_j / ||D_j|| \n\nΔM_V = γ(n) Σ_j ΔM_Vj = γ(n) Σ_j (-S_c W_c δ_j) V \n\nΔM_R = β(n) Σ_j ΔM_Rj = β(n) Σ_j (W_c δ_j) R_j \n\nThe batch mode summation is over tokens in the local training set, and n is the iteration index. The polarity of ΔV_j and ΔM_Vj is set by S_c (c = the class of Y_j), where S_c = 1 if Y_j is classified correctly, and S_c = -1 if not. Corrections for each token are scaled by a sigmoidal error term: δ_j = 1/(1 + exp((S_c η/λ) |V^T D_j|)), a function of the distance of the token to the plane, the sign of S_c, and a data-dependent scaling parameter: λ = |V^T[B_0 - B_1]|, where η is a fixed (experimental) scaling parameter. The scaling of the sigmoid is proportional to an estimate of the boundary region width along the axis of V. 
B_c is a weighted average of the class c token vectors: B_c(n + 1) = (1 - α) B_c(n) + α W_c Σ_{j∈c} ε_{j,c}(n) Y_j(n), where ε_{j,c} is a sigmoid with the same scaling as δ_j, except that it is centered on B_c instead of M, emphasizing tokens of class c nearest the hyperplane surface. For small η's, B_c will settle near the cluster center of gravity, and for large η's, B_c will approach the tokens closest to the hyperplane surface. (The rate of the movement of B_c is limited by the value of α, which is not critical.) The inverse of the number of tokens in class c, W_c, balances the weight of the corrections from each class. If a more Bayesian-like solution is required, the slope of δ can be made class dependent (for example, replacing η with η_c ∝ W_c). Since the slope of the sigmoid error term is limited and distribution dependent, the use of W_c, along with the nonlinear weighting of tokens near the hyperplane surface, is important for the development of separating planes in relatively narrow boundaries (the assumption is that the distributions near these boundaries are non-Gaussian). The setting of η simultaneously (for convenience) controls the focus on the \"inner edges\" of the class clusters and the slope of the sigmoid relative to the distance between the inner edges, with some resultant control over generalization performance. This local scaling of the error also aids the convergence rate. The range of good values for η has been found to be reasonably wide, and identical values have been used successfully with speech, ECG, and synthetic data; it could also be set/optimized using cross-validation. Separate adaptive learning rates (μ(n), γ(n), and β(n)) are used in order to take advantage of the distinct nature of the geometric function of each component. 
Convergence is also improved by maintaining M within the local region; this controls the rate at which the hyperplane can sweep through the boundary region, making the effect of ΔV more predictable. The LLD normal vector update is simply: V(n + 1) = (V(n) + ΔV) / ||V(n) + ΔV||, so that V is always normalized to unit magnitude. The midpoint is just shifted: M(n + 1) = M(n) + ΔM_R + ΔM_V. \n\n[Figure 2 labels: lambda: estimate of the boundary region width; sigma(V): dispersion of the training data in the discriminant direction (V); sigma(R): dispersion of the training data in all directions orthogonal to V] \n\nFigure 2: Vectors and parameters associated with the DRF for class c, for LLD k \n\nDRF's are used to localize the response of the LLD to the region of feature space in which it was trained, and are constructed after completion of LLD training. Each DRF represents one class, and the localizing component of the DRF is a Gaussian function based on simple statistics of the training data for that class. Two measures of the dispersion of the data are used: σ_V (\"normal\" dispersion), obtained using the mean average deviation of the lengths of P_{j,k,c}, and σ_R (\"radial\" dispersion), obtained correspondingly using the O_{j,k,c}'s. (As shown, P_{j,k,c} is the normal component, and O_{j,k,c} the radial component of Y_j - B_{k,c}.) The output in response to an input vector Y_j from the class c DRF associated with the LLD k is φ_{j,k,c}: \n\nφ_{j,k,c} = Θ_{k,c} (S_{j,k} - 0.5) / exp(d_{V,j,k,c}^2 + d_{R,j,k,c}^2) \n\nTwo components of the DRF incorporate the LLD discriminant; one is the sigmoid error function used in training the LLD but shifted down to a value of zero at the hyperplane surface. The other is Θ_{k,c}, which is 1 if Y_j is on the class c side of LLD k, and zero if not. (In retrospect, for generalization performance, it may not be desirable to introduce this discontinuity to the discriminant component.) The contribution of the Gaussian is based on the normal and radial dispersion weighted distances of the input vector to B_{k,c}: d_{V,j,k,c} = ||P_{j,k,c}|| / σ_{V,k,c}, and d_{R,j,k,c} = ||O_{j,k,c}|| / σ_{R,k,c}. \n\n2  Network Construction \n\nSegmentation of the boundary between classes is accomplished by \"growing\" LLD's within the boundary region. An LLD is initialized using a closely spaced pair of tokens from each class. The LLD is grown by adding nearby tokens to the training set, using the k-nearest neighbors to the LLD midpoint at each growth stage as candidates for permanent inclusion. Candidate DRF's are generated after incremental training of the LLD to accommodate each new candidate token. Two error measures are used to assess the effect of each candidate: the peak value of δ_j over the local training set, and ϖ, which is a measure of misclassification error due to the receptive fields of the candidate DRF's extending over the entire training set. The candidate token with the lowest average ϖ is permanently added, as long as both its δ_j and ϖ are below fixed thresholds. Growth of the LLD is halted if no candidate has both error measures below threshold. 
The δ_j and ϖ thresholds directly affect the granularity of the DRF representation of the data; they need to be set to minimize the number of DRF's generated, while allowing sufficient resolution of local discrimination problems. They should perhaps be adaptive so as to encourage coarse grained solutions to develop before fine grain structure. \n\nFigure 3: Four \"snapshots\" in the growth of an LLD/DRF pair. The upper two are \"close-ups.\" The initial LLD/DRF pair is shown in the upper left, along with the seed pair. Filled rectangles and ellipses represent the tokens from each class in the permanent local training set at each stage. The large markers are the B points, and the cross is the LLD midpoint. The amplitude of the DRF outputs is coded in grey scale. \n\nAt this point the DRF's are fixed and added to the network; this represents the addition of two new localized features available for use by the network's output layer in solving the global discrimination problem. In this implementation, the output \"layer\" is a single LLD used to generate a two-class decision. The architecture is shown below: \n\n[Figure 4 diagram labels: input data; LLD's; localized features; output discriminant function (LLD with sigmoid); error measure on training tokens used to seed new LLD's or halt training] \n\nFigure 4: LLDN architecture for a two-dimensional, two-class problem \n\nThe output unit is completely retrained after addition of a new DRF pair, using the entire training set. The output of the network to the input Y_j is: φ_j = 1/(1 + exp((η/λ_o) V^T[Φ_j - M])), where λ_o = |V^T[B_0 - B_1]|, and Φ_j = [φ_{j,1}, ..., φ_{j,p}] is the p dimensional vector of DRF outputs presented to the output unit. V is the output LLD normal vector, M the midpoint, and the B_c's the cluster edge points in the internal feature space. The output error for each token is then used to select a new seed pair for development of the next LLD/DRF pair. If all tokens are classified with sufficient confidence, of course, construction of the LLDN is complete. There are three possibilities for insufficient confidence: a token is covered by a DRF of the wrong class, it is not yet covered sufficiently by any DRF's, or it is in a region of \"conflict\" between DRF's of different classes. A heuristic is used to prevent the repeated selection of the same seed pair tokens, since there is no guarantee that a given DRF will significantly reduce the error for the data it covers after output unit retraining. This heuristic alternates between the types of error and the class for selection of the primary seed token. Redundancy in DRF shapes is also minimized by error-weighting the dispersion computations so that the resultant Gaussian focuses more on the higher error regions of the local training data. A simple but reasonably effective pruning algorithm was incorporated to further eliminate unnecessary DRF's. \n\nFigure 5: Network response plots illustrating network development. The upper two sequences, beginning with the first LLD/DRF pair, and the bottom two plots show final network responses for these two problems. A solution to a harder version of the nested squares problem is on the lower left. 
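The output unit response given in Section 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `output_unit`, `phi_vec`, `edge0`, and `edge1` are hypothetical, vectors are plain lists, the toy values are invented, and the sign in the exponent is taken from the formula as printed, so which class maps to outputs above 0.5 depends on the orientation of V.

```python
import math

def dot(a, b):
    # Inner product of two equal-length vectors stored as lists.
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    # Elementwise difference a - b.
    return [x - y for x, y in zip(a, b)]

def output_unit(phi_vec, v, m, edge0, edge1, eta):
    # Hypothetical helper: response of the output LLD to a vector of DRF outputs.
    # lam_o estimates the boundary width along v from the two cluster edge points.
    lam_o = abs(dot(v, sub(edge0, edge1)))
    # Sigmoid of the scaled signed distance of phi_vec from the midpoint m.
    return 1.0 / (1.0 + math.exp((eta / lam_o) * dot(v, sub(phi_vec, m))))

# Toy values (assumed, not from the paper): two DRF outputs, unit normal along axis 0.
v = [1.0, 0.0]
m = [0.0, 0.0]
edge0, edge1 = [0.5, 0.0], [-0.5, 0.0]
at_midpoint = output_unit([0.0, 0.0], v, m, edge0, edge1, eta=2.0)
one_side = output_unit([1.0, 0.0], v, m, edge0, edge1, eta=2.0)
other_side = output_unit([-1.0, 0.0], v, m, edge0, edge1, eta=2.0)
print(at_midpoint, one_side, other_side)
```

A feature vector lying on the output hyperplane maps to 0.5, and points at equal distances on either side map symmetrically below and above it. The sketch assumes the two edge points differ along v, since otherwise lam_o is zero and the scaling is undefined.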
\n\n3  Experimental Results \n\nThe first experiment demonstrates comparative convergence properties of the LLD and a single hyperplane trained by the standard generalized delta rule (GDR) method (no hidden units, single output unit \"network\" is used) on 14 linearly separable, minimal consonant pair data sets. The data is 256 dimensional (time/frequency matrix, described in [6]), with 80 exemplars per consonant. The results compare the best performance obtainable from each technique. The LLD converges roughly 12 times faster in iteration counts. The GDR often fails to completely separate f/th, f/v, and s/sh; in the results in figure 6 it fails on the f/th data set at a plateau of 25% error. In both experiments described in this paper, networks were run for relatively long times to ensure confidence in declaring failure to solve the problem. \n\nFigure 6: Training a single hyperplane \n\nFigure 7: Error rates vs. geometries \n\nThe second experiment involves complete networks on synthetic two-dimensional problems. Two examples of the nested squares problem (random distributions of tokens near the surface of squares of alternating class, 400 tokens total) are shown in figure 5. 
Two parameters controlling data set generation are explored: the relative boundary region width, and the relative offset from the origin of the data set center of gravity (while keeping the upper right corner of the outside square near the (1,1) coordinate); all data is kept within the unit square (except for geometry number 2). Relative boundary widths of 29%, 4.4%, and 1% are used with offsets of 0%, 76%, and 94%. The best results over parameter settings are reported for each network for each geometry. Four MLP architectures were used: 2:16:1, 2:32:1, 2:64:1, and 2:16:16:1; all of these converge to a solution for the easiest problem (wide boundaries, no offset), but all eventually fail as the boundaries narrow and/or the offset increases. The worst performing net (2:64:1) fails for 7/8 problems (maximum error rate of 49%); the best net (2:16:16:1) fails in 3/8 (maximum of 24% error). The LLDN is 1 to 3 orders of magnitude faster in cpu time when the MLP does converge, even though the LLDN does not use adaptive learning rates in this experiment. (The average running time for the LLDN was 34 minutes; for the MLP's it was 3481 minutes [Stardent 3040, single cpu], which includes non-converging runs. The 2:16:16:1 net did, however, take 4740 minutes to solve problem 6, which was solved in 7 minutes by the LLDN.) The best LLDN's converge to zero errors over the problem set (fig. 6), and are not too sensitive to parameter variation, which primarily affects convergence time and number of DRF's generated. In contrast, finding good values for learning rate and momentum for the MLP's for each problem was a time-consuming process. The effect of random weight initialization in the MLP is not known because of the long running times required. 
The KNN error rate was estimated using the leave-one-out method, yielding error rates of 0%, 10.5%, and 38.75% (for the best k's) respectively for the three values of boundary width. The LLDN is insensitive to offset and scale (like the KNN) because of the use of the local origin (M) and error scaling (λ). While global offset and scaling problems for the MLP can be ameliorated through normalization and origin translation, this method cannot guarantee elimination of local offset and scaling problems. The LLDN's utilization of DRF's was reasonably efficient, with the smallest networks (after pruning) using 20, 32, and 54 DRF's for the three boundary widths. A simple pruning algorithm, which starts up after convergence, iteratively removes the DRF's with the lowest connection weights to the output unit (which is retrained after each link is removed). A range of roughly 20% to 40% of the DRF's were removed before developing misclassification errors on the training sets. The LLDN was also tested on the \"two-spirals\" problem, which is known to be difficult for the standard MLP methods. Because of the boundary segmentation process, solution of the two-spirals problem was straightforward for the LLDN, and could be tuned to converge in as little as 2.5 minutes on an Apollo DN10000. The solution shown in fig. 5 uses 50 DRF's (not pruned). The generalization pattern is relatively \"nice\" (for training on the sparse version of the data set), and perhaps demonstrates the practical nature of the smoothed piecewise linear boundary for nonlinear problems. \n\n4  Discussion \n\nThe effect of LLDN parameters on generalization performance needs to be studied. 
In the nested squares problem it is clear that the MLP's will have better generalization when they converge; this illustrates the potential utility of a multi-scale approach to developing localized discriminants. A number of extensions are possible: Localized feature selection can be implemented by simply zeroing components of V. The DRF Gaussians could model the radial dispersion of the data more effectively (in greater than two dimensions) by generating principal component axes which are orthogonal to V. Extension to the multiclass case can be based on DRF sets developed for discrimination between each class and all other classes, using the DRF's as features for a multi-output classifier. The use of multiple hidden layers offers the prospect of more complex localized receptive fields. Improvement in generalization might be gained by including a procedure for merging neighboring DRF's. While it is felt that the LLD parameters should remain fixed, it may be advantageous to allow adjustment of the DRF Gaussian dispersions as part of the output layer training. A stopping rule for LLD training needs to be developed so that adaptive learning rates can be utilized effectively. This rule may also be useful in identifying poor token candidates early in the incremental LLD training. \n\nReferences \n\n[1] J. Sklansky and G.N. Wassel. Pattern Classifiers and Trainable Machines. Springer Verlag, New York, 1981 \n\n[2] S. Makram-Ebeid, J.A. Sirat, and J.R. Viala. A rationalized error backpropagation learning algorithm. Proc. IJCNN, 373-380, 1988 \n\n[3] J. Sklansky and Y. Park. Automated design of multiple-class piecewise linear classifiers. Journal of Classification, 6:195-222, 1989 \n\n[4] R.D. Short and K. Fukunaga. A new nearest neighbor distance measure. Proc. Fifth Intl. Conf. on Pattern Rec., 81-88 \n\n[5] R. Lippmann. 
A critical overview of neural network pattern classifiers. Neural Networks for Signal Processing (IEEE), 267-275, 1991 \n\n[6] M.S. Glassman and M.B. Starkey. Minimal consonant pair discrimination for speech therapy. Proc. European Conf. on Speech Comm. and Tech., 273-276, 1989 \n\n", "award": [], "sourceid": 525, "authors": [{"given_name": "Martin", "family_name": "Glassman", "institution": null}]}