{"title": "Information Measure Based Skeletonisation", "book": "Advances in Neural Information Processing Systems", "page_first": 1080, "page_last": 1087, "abstract": null, "full_text": "Information Measure Based Skeletonisation\n\nSowmya Ramachandran\n\nDepartment of Computer Science\n\nUniversity of Texas at Austin\n\nAustin, TX 78712-1188\n\nLorien Y. Pratt*\n\nDepartment of Computer Science\n\nRutgers University\n\nNew Brunswick, NJ 08903\n\nAbstract\n\nAutomatic determination of proper neural network topology by trimming over-sized networks is an important area of study, which has previously been addressed using a variety of techniques. In this paper, we present Information Measure Based Skeletonisation (IMBS), a new approach to this problem where superfluous hidden units are removed based on their information measure (IM). This measure, borrowed from decision tree induction techniques, reflects the degree to which the hyperplane formed by a hidden unit discriminates between training data classes. We show the results of applying IMBS to three classification tasks and demonstrate that it removes a substantial number of hidden units without significantly affecting network performance.\n\n1 INTRODUCTION\n\nNeural networks can be evaluated based on their learning speed, the space and time complexity of the learned network, and generalisation performance. Pruning over-sized networks (skeletonisation) has the potential to improve networks along these dimensions as follows:\n\n\u2022 Learning Speed: Empirical observation indicates that networks which have been constrained to have fewer parameters lack flexibility during search, and so tend to learn slower. 
Training a network that is larger than necessary and trimming it back to a reduced architecture could lead to improved learning speed.\n\n\u2022 Network Complexity: Skeletonisation improves both space and time complexity by reducing the number of weights and hidden units.\n\n\u2022 Generalisation: Skeletonisation could constrain networks to generalise better by reducing the number of parameters used to fit the data.\n\n*This work was partially supported by DOE #DE-FG02-91ER61129, through subcontract #097P753 from the University of Wisconsin.\n\nVarious techniques have been proposed for skeletonisation. One approach [Hanson and Pratt, 1989, Chauvin, 1989, Weigend et al., 1991] is to add a cost term or bias to the objective function. This causes weights to decay to zero unless they are reinforced. Another technique is to measure the increase in error caused by removing a parameter or a unit, as in [Mozer and Smolensky, 1989, Le Cun et al., 1990]. Parameters that have the least effect on the error may be pruned from the network.\n\nIn this paper, we present Information Measure Based Skeletonisation (IMBS), an alternate approach to this problem, in which superfluous hidden units in a single hidden-layer network are removed based on their information measure (IM). This idea is somewhat related to that presented in [Siestma and Dow, 1991], though we use a different algorithm for detecting superfluous hidden units.
\n\nWe also demonstrate that when IMBS is applied to a vowel recognition task, to a subset of the Peterson-Barney 10-vowel classification problem, and to a heart disease diagnosis problem, it removes a substantial number of hidden units without significantly affecting network performance.\n\n2 IM AND THE HIDDEN LAYER\n\nSeveral decision tree induction schemes use a particular information-theoretic measure, called IM, of the degree to which an attribute separates (discriminates between the classes of) a given set of training data [Quinlan, 1986]. IM is a measure of the information gained by knowing the value of an attribute for the purpose of classification. The higher the IM of an attribute, the greater the uniformity of class data in the subsets of feature space it creates.\n\nA useful simplification of the sigmoidal activation function used in back-propagation networks [Rumelhart et al., 1986] is to reduce this function to a threshold by mapping activations greater than 0.5 to 1 and less than 0.5 to 0. In this simplified model, the hidden units form hyperplanes in the feature space which separate data. Thus, they can be considered analogous to binary-valued attributes, and the IM of each hidden unit can be calculated as in decision tree induction [Quinlan, 1986].\n\nFigure 1 shows the training data for a fabricated two-feature, two-class problem and a possible configuration of the hyperplanes formed by each hidden unit at the end of training. Hyperplane h1's higher IM corresponds to the fact that it separates the two classes better than h2.\n\n[Figure 1 plot: training data points labelled with classes 0 and 1, with the hyperplanes of the hidden units; one surviving label reads IM = .0115]\n\nFigure 1: Hyperplanes and their IM. Arrows indicate regions where hidden units have activations > 0.5.
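\n\nAs a sketch of how such a value could be computed, assume that IM is the information gain of [Quinlan, 1986] and consider the two-class case of Figure 1. The symbols p, n, p_i and n_i are notation introduced here for illustration and do not appear in the original text: a data set holds p patterns of one class and n of the other, and a thresholded hidden unit h splits it into the patterns with activation > 0.5 (p_1, n_1) and the remaining patterns (p_0, n_0).\n\n```latex\nI(p, n) = -\\frac{p}{p+n}\\log_2\\frac{p}{p+n} - \\frac{n}{p+n}\\log_2\\frac{n}{p+n}\n\n\\mathrm{IM}(h) = I(p, n) - \\sum_{i \\in \\{0, 1\\}} \\frac{p_i + n_i}{p + n}\\, I(p_i, n_i)\n```\n\nUnder this assumption, and with the convention 0 log 0 = 0, a unit that perfectly separates 4 patterns of each class has IM = 1 bit, while a unit that leaves 2 patterns of each class on both sides has IM = 0.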
\n\n3 IM TO DETECT SUPERFLUOUS HIDDEN UNITS\n\nOne of the important goals of training is to adjust the set of hyperplanes formed by the hidden layer so that they separate the training data.\u00b9 We define superfluous units as those whose corresponding hyperplanes are not necessary for the proper separation of training data. For example, in Figure 1, hyperplane h2 is superfluous because:\n\n1. h1 separates the data better than h2 and\n2. h2 does not separate the data in either of the two regions created by h1.\n\nThe IMBS algorithm to identify superfluous hidden units, shown in Figure 2, recursively finds hidden units that are necessary to separate the data and classifies the rest as superfluous. It is similar to the decision tree induction algorithm in [Quinlan, 1986].\n\nThe hidden layer is skeletonised by removing the superfluous hidden units. Since the removal of these units perturbs the inputs to the output layer, the network will have to be trained further after skeletonisation to recover lost performance.\n\n4 RESULTS\n\nWe have tested IMBS on three classification problems, as follows:\n\n1. Train a network to an acceptable level of performance.\n2. Identify and remove superfluous hidden units.\n3. Train the skeletonised network further to an acceptable level of performance.\n\nWe will refer to the stopping point of training at step 1 as the skeletonisation point (SP); further training will be referred to in terms of SP + number of training epochs.\n\n\u00b9 This again is not strictly true for hidden units with sigmoidal activation, but holds for the approximate model.\n\nInput:\n    Training data\n    Hidden unit activations for each training data pattern.\n\nOutput:\n    List of superfluous hidden units.
\n\nMethod:\n\nmain ident-superfluous-hu\nbegin\n    data-set \u2190 training data\n    useful-hu-list \u2190 nil\n    pick-best-hu(data-set, useful-hu-list)\n    output hidden units that are not in useful-hu-list\nend\n\nprocedure pick-best-hu(data-set, useful-hu-list)\nbegin\n    if all the data in data-set belong to the same class then return\n    calculate IM of each hidden unit\n    h1 \u2190 hidden unit with best IM\n    add h1 to useful-hu-list\n    ds1 \u2190 all the data in data-set for which h1 has an activation of > .5\n    ds2 \u2190 all the data in data-set for which h1 has an activation of <= .5\n    pick-best-hu(ds1, useful-hu-list)\n    pick-best-hu(ds2, useful-hu-list)\nend\n\nFigure 2: IMBS: An Algorithm for Identifying Superfluous Hidden Units\n\nFor each problem, data was divided into a training set and a test set. Several networks were run for a few epochs with different back-propagation parameters \u03b7 (learning rate) and \u03b1 (momentum) to determine their locally optimal values.\n\nFor each problem, we chose an initial architecture and trained 10 networks with different random initial weights for the same number of epochs. The performances of the original (i.e. the network before skeletonisation) and the skeletonised networks, measured as number of correct classifications of the training and test sets, were measured both at SP and after further training. The retrained skeletonised network was compared with the original network at SP as well as the original network that had been trained further for the same number of weight updates.\u00b2 All training was via the standard back-propagation algorithm with a sigmoidal activation function and updates after every pattern presentation [Rumelhart et al., 1986]. 
A paired T-test [Siegel, 1988] was used to measure the significance of the difference in performance between the skeletonised and original networks. Our experimental results are summarised in Figure 3, and Tables 1 and 2; detailed experimental conditions are given below.\n\n\u00b2 This was ensured by adjusting the number of epochs a network was trained after skeletonisation according to the number of hidden units in the network. Thus, a network with 10 hidden units was trained on twice as many epochs as one with 20 hidden units.\n\n[Figure 3 plots: three panels (PB Vowel, Robinson vowel, Heart disease), each with Weight Updates on the horizontal axis]\n\nFigure 3: Summary of experimental results. Circles represent skeletonised networks; triangles represent unskeletonised networks for comparison. Note that when performance drops upon skeletonisation, the original performance level is recovered within a few weight updates. In all cases, hidden unit count is reduced.\n\n4.1 PETERSON-BARNEY DATA\n\nIMBS was first evaluated on a 3-class subset of the Peterson-Barney 10-vowel classification data set, originally described in [Peterson and Barney, 1952], and recreated by [Watrous, 1991]. This data consists of the formant values F1 and F2 for each of two repetitions of each of ten vowels by 76 speakers (1520 utterances). The vowels were pronounced in isolated words consisting of the consonant \"h\", followed by a vowel, followed by \"d\". 
This set was randomly divided into a 2/3, 1/3 training/test split, with 298 and 150 patterns, respectively.\n\nOur initial architecture was a fully connected network with 2 input units, one hidden layer with 20 units, and 3 output units. We trained the networks with \u03b7 = 1.0 and \u03b1 = 0.001 until the TSS (total sum of squared error) scores seemed to reach a plateau. The networks were trained for 2000 epochs and then skeletonised.\n\nThe skeletonisation procedure removed an average of 10.1 (50.5%) hidden units. Though the average performance of the skeletonised networks was worse than that of the original, this difference was not statistically significant (p = 0.001).\n\n4.2 ROBINSON VOWEL RECOGNITION\n\nUsing data from [Robinson, 1989], we trained networks to perform speaker independent recognition of the 11 steady-state vowels of British English using a training set of LPC-derived log area ratios. Training and test sets were as used by [Robinson, 1989], with 528 and 462 patterns, respectively.\n\nThe initial network architecture was fully connected, with 10 input units, 11 output units, and 30 hidden units. Networks were trained with \u03b7 = 1.0 and \u03b1 = 0.01, until the performance on the training set exceeded 95%. The networks were trained for 1500 epochs and then skeletonised. The skeletonisation procedure removed an average of 5.8 (19.3%) hidden units. The difference in performance was not statistically significant (p = 0.001).\n\nTable 1: Performance of unskeletonised networks\n\nTable 2: Mean difference in the number of correct classifications between the original and skeletonised networks. Positive differences indicate that the original network did better after further training. 
The numbers in parentheses indicate the 99.9% confidence intervals for the mean.\n\nProblem | Original | Skeletonised | Mean difference, training set | Mean difference, test set\nPeterson-Barney | SP | SP | 3.10 [-0.83, 7.03] | -0.10 [-2.05, 1.84]\nPeterson-Barney | SP | SP+1010 | -0.1 [-1.76, 1.56] | 0.7 [-0.73, 2.13]\nPeterson-Barney | SP+500 | SP+1010 | 0.20 [-1.52, 1.91] | 0.30 [-1.30, 1.90]\nRobinson Vowel | SP | SP | 1.70 [-2.40, 5.80] | 2.40 [-2.39, 7.19]\nRobinson Vowel | SP | SP+620 | -8.2 [-20.33, 3.93] | -4.4 [-18.26, 9.46]\nRobinson Vowel | SP+500 | SP+620 | -0.30 [-3.15, 2.55] | -0.30 [-8.36, 7.76]\nHeart Disease | SP | SP | 20.80 [-5.66, 47.26] | 12.20 [-1.65, 26.05]\nHeart Disease | SP | SP+33 | 0 [-4.28, 4.28] | 0 [-2.85, 2.85]\nHeart Disease | SP+14 | SP+33 | 0.60 [-4.55, 5.75] | 0.40 [-3.03, 3.83]\n\n4.3 HEART DISEASE DATA\n\nUsing a 14-attribute set of diagnosis information, we trained networks on a heart disease diagnosis problem [Detrano et al., 1989]. Training and test data were chosen randomly in a 2/3, 1/3 split of 820 and 410 patterns, respectively. The initial networks were fully connected, with 25 input units, one hidden layer with 20 units, and 2 output units. The networks were trained with \u03b1 = 1.25 and \u03b7 = 0.005. Training was stopped when the TSS scores seemed to reach a plateau. The networks were trained for 300 epochs and then skeletonised.\n\nThe skeletonisation procedure removed an average of 9.6 (48%) hidden units. Here, removing superfluous units degraded the performance by an average of 2.5% on the training set and 3.0% on the test set. However, after being trained further for only 30 epochs, the skeletonised networks recovered to do as well as the original networks.
\n\n5 CONCLUSION AND EXTENSIONS\n\nWe have introduced an algorithm, called IMBS, which uses an information measure borrowed from decision tree induction schemes to skeletonise over-sized back-propagation networks. Empirical tests showed that IMBS removed a substantial percentage of hidden units without significantly affecting the network performance.\n\nPotential extensions to this work include:\n\n\u2022 Using decision tree reduction schemes to allow for trimming not only superfluous hyperplanes, but also those responsible for overfitting the training data, in an effort to improve generalisation.\n\n\u2022 Extending IMBS to better identify superfluous hidden units under conditions of less than 100% performance on the training data.\n\n\u2022 Extending IMBS to work for networks with more than one hidden layer.\n\n\u2022 Performing more rigorous empirical evaluation.\n\n\u2022 Making IMBS less sensitive to the hyperplane-as-threshold assumption. In particular, a model with variable-width hyperplanes (depending on the sigmoidal gain) may be effective.\n\nAcknowledgements\n\nOur thanks to Haym Hirsh and Tom Lee for insightful comments on earlier drafts of this paper, to Christian Roehr for an update to the IMBS algorithm, and to Vince Sgro, David Lubinsky, David Loewenstern and Jack Mostow for feedback on later drafts. Matthias Pfister, M.D., of University Hospital in Zurich, Switzerland was responsible for collection of the heart disease data. We used software distributed with [McClelland and Rumelhart, 1988] for many of our simulations.\n\nReferences\n\n[Chauvin, 1989] Chauvin, Y. 1989. A back-propagation algorithm with optimal use of hidden units. In Touretzky, D. 
S., editor 1989, Advances in Neural Information Processing Systems 1. Morgan Kaufmann, San Mateo, CA. 519-526.\n\n[Detrano et al., 1989] Detrano, R.; Janosi, A.; Steinbrunn, W.; Pfisterer, M.; Schmid, J.; Sandhu, S.; Guppy, K.; Lee, S.; and Froelicher, V. 1989. International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology 64:304-310.\n\n[Hanson and Pratt, 1989] Hanson, Stephen Jose and Pratt, Lorien Y. 1989. Comparing biases for minimal network construction with back-propagation. In Touretzky, D. S., editor 1989, Advances in Neural Information Processing Systems 1. Morgan Kaufmann, San Mateo, CA. 177-185.\n\n[Le Cun et al., 1990] Le Cun, Yann; Denker, John; Solla, Sara A.; Howard, Richard E.; and Jackel, Lawrence D. 1990. Optimal brain damage. In Touretzky, D. S., editor 1990, Advances in Neural Information Processing Systems 2. Morgan Kaufmann, San Mateo, CA.\n\n[McClelland and Rumelhart, 1988] McClelland, James L. and Rumelhart, David E. 1988. Explorations in Parallel Distributed Processing: A Handbook of Models, Programs, and Exercises. Cambridge, MA, The MIT Press.\n\n[Mozer and Smolensky, 1989] Mozer, Michael C. and Smolensky, Paul 1989. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Touretzky, D. S., editor 1989, Advances in Neural Information Processing Systems 1. Morgan Kaufmann, San Mateo, CA. 107-115.\n\n[Peterson and Barney, 1952] Peterson, and Barney, 1952. Control methods used in a study of the vowels. J. Acoust. Soc. Am. 24(2):175-184.\n\n[Quinlan, 1986] Quinlan, J. R. 1986. Induction of decision trees. Machine Learning 1(1):81-106.
\n\n[Robinson, 1989] Robinson, Anthony John 1989. Dynamic Error Propagation Networks. Ph.D. Dissertation, Cambridge University, Engineering Department.\n\n[Rumelhart et al., 1986] Rumelhart, D.; Hinton, G.; and Williams, R. 1986. Learning representations by back-propagating errors. Nature 323:533-536.\n\n[Siegel, 1988] Siegel, Andrew F. 1988. Statistics and data analysis: An Introduction. John Wiley and Sons. chapter 15, 336-339.\n\n[Siestma and Dow, 1991] Siestma, Jocelyn and Dow, Robert J. F. 1991. Creating artificial neural networks that generalize. Neural Networks 4:67-79.\n\n[Watrous, 1991] Watrous, Raymond L. 1991. Current status of Peterson-Barney vowel formant data. Journal of the Acoustical Society of America 89(3):2459-60.\n\n[Weigend et al., 1991] Weigend, Andreas S.; Rumelhart, David E.; and Huberman, Bernardo A. 1991. Generalization by weight-elimination with application to forecasting. In Lippmann, R. P.; Moody, J. E.; and Touretzky, D. S., editors 1991, Advances in Neural Information Processing Systems 3. Morgan Kaufmann, San Mateo, CA. 875-882.", "award": [], "sourceid": 484, "authors": [{"given_name": "Sowmya", "family_name": "Ramachandran", "institution": null}, {"given_name": "Lorien", "family_name": "Pratt", "institution": null}]}