{"title": "Active Learning in Multilayer Perceptrons", "book": "Advances in Neural Information Processing Systems", "page_first": 295, "page_last": 301, "abstract": null, "full_text": "Active Learning in Multilayer Perceptrons\n\nKenji Fukumizu\n\nInformation and Communication R&D Center, Ricoh Co., Ltd.\n3-2-3, Shin-yokohama, Yokohama, 222 Japan\n\nE-mail: fuku@ic.rdc.ricoh.co.jp\n\nAbstract\n\nWe propose an active learning method with hidden-unit reduction, which is devised specially for multilayer perceptrons (MLP). First, we review our active learning method, and point out that many Fisher-information-based methods applied to MLP have a critical problem: the information matrix may be singular. To solve this problem, we derive the singularity condition of an information matrix, and propose an active learning technique that is applicable to MLP. Its effectiveness is verified through experiments.\n\n1  INTRODUCTION\n\nWhen one trains a learning machine using a set of data given by the true system, its ability can be improved if one selects the training data actively. In this paper, we consider the problem of active learning in multilayer perceptrons (MLP). First, we review our method of active learning (Fukumizu et al., 1994), in which we prepare a probability distribution and obtain training data as samples from the distribution. This methodology leads us to an information-matrix-based criterion similar to other existing ones (Fedorov, 1972; Pukelsheim, 1993).\n\nActive learning techniques have recently been used with neural networks (MacKay, 1992; Cohn, 1994). Our method, however, like many others, has a crucial problem: the required inverse of an information matrix may not exist (White, 1989).\n\nWe propose an active learning technique which is applicable to three-layer perceptrons. 
Developing a theory on the singularity of a Fisher information matrix, we present an active learning algorithm which keeps the information matrix nonsingular. We demonstrate the effectiveness of the algorithm through experiments.\n\n2  STATISTICALLY OPTIMAL TRAINING DATA\n\n2.1  A CRITERION OF OPTIMALITY\n\nWe review the criterion of statistically optimal training data (Fukumizu et al., 1994). We consider the regression problem in which the target system maps a given input $x$ to $y$ according to\n\n$$y = f(x) + Z,$$\n\nwhere $f(x)$ is a deterministic function from $\\mathbb{R}^L$ to $\\mathbb{R}^M$, and $Z$ is a random variable whose law is the normal distribution $N(0, \\sigma^2 I_M)$ ($I_M$ is the $M \\times M$ unit matrix). Our objective is to estimate the true function $f$ as accurately as possible.\n\nLet $\\{f(x; \\theta)\\}$ be a parametric model for estimation. We use the maximum likelihood estimator (MLE) $\\hat{\\theta}$ for training data $\\{(x^{(\\nu)}, y^{(\\nu)})\\}_{\\nu=1}^{N}$, which minimizes the sum of squared errors in this case. In theoretical derivations, we assume that the target function $f$ is included in the model and equal to $f(\\cdot\\,; \\theta_0)$.\n\nWe make a training example by choosing $x^{(\\nu)}$ to try, observing the resulting output $y^{(\\nu)}$, and pairing them. The problem of active learning is how to determine input data $\\{x^{(\\nu)}\\}_{\\nu=1}^{N}$ to minimize the estimation error after training. Our approach is a statistical one using a probability for training, $r(x)$, and choosing $\\{x^{(\\nu)}\\}_{\\nu=1}^{N}$ as independent samples from $r(x)$ to minimize the expectation of the MSE in the actual environment (EMSE), given as Eq. (1).\n\nIn Eq. (1), $Q$ is the environmental probability which gives input vectors to the true system in the actual environment, and $E_{\\{(x^{(\\nu)}, y^{(\\nu)})\\}}$ means the expectation on training data. 
Eq. (1), therefore, shows the average error of the trained machine that is used as a substitute of the true function in the actual environment.\n\n2.2  REVIEW OF AN ACTIVE LEARNING METHOD\n\nUsing statistical asymptotic theory, Eq. (1) is approximated as follows:\n\n$$\\mathrm{EMSE} = \\sigma^2 + \\frac{\\sigma^2}{N} \\mathrm{Tr}\\left[I(\\theta_0) J^{-1}(\\theta_0)\\right] + O(N^{-3/2}), \\qquad (2)$$\n\nwhere the matrices $I$ and $J$ are (Fisher) information matrices defined by\n\n$$I(\\theta) = \\int I(x; \\theta)\\, dQ(x), \\qquad J(\\theta) = \\int I(x; \\theta)\\, r(x)\\, dx.$$\n\nThe essential part of Eq. (2) is $\\mathrm{Tr}[I(\\theta_0) J^{-1}(\\theta_0)]$, which depends on the unavailable parameter $\\theta_0$. We have proposed a practical algorithm in which we replace $\\theta_0$ with $\\hat{\\theta}$, prepare a family of probabilities $\\{r(x; v) \\mid v: \\text{parameter}\\}$ to choose training samples, and optimize $v$ and $\\hat{\\theta}$ iteratively (Fukumizu et al., 1994).\n\nActive Learning Algorithm\n\n1. Select an initial training data set $D_{[0]}$ from $r(x; v_{[0]})$, and compute $\\hat{\\theta}_{[0]}$.\n2. $k := 1$.\n3. Compute the optimal $v = v_{[k]}$ to minimize $\\mathrm{Tr}[I(\\hat{\\theta}_{[k-1]}) J^{-1}(\\hat{\\theta}_{[k-1]})]$.\n4. Choose $N_k$ new training data from $r(x; v_{[k]})$ and let $D_{[k]}$ be a union of $D_{[k-1]}$ and the new data.\n5. Compute the MLE $\\hat{\\theta}_{[k]}$ based on the training data set $D_{[k]}$.\n6. $k := k + 1$ and go to 3.\n\nThe above method utilizes a probability to generate training data. It has the advantage of generating many data points in one step, compared to existing methods in which only one data point is chosen in each step, though their criteria are similar to each other.\n\n3  SINGULARITY OF AN INFORMATION MATRIX\n\n3.1  A PROBLEM ON ACTIVE LEARNING IN MLP\n\nHereafter, we focus on active learning in three-layer perceptrons with $H$ hidden units, $\\mathcal{N}_H = \\{f(x; \\theta)\\}$. 
The map $f(x; \\theta)$ is defined by\n\n$$f_i(x; \\theta) = \\sum_{j=1}^{H} w_{ij}\\, s\\Big(\\sum_{k=1}^{L} u_{jk} x_k + \\zeta_j\\Big) + \\eta_i, \\quad (1 \\leq i \\leq M), \\qquad (3)$$\n\nwhere $s(t)$ is the sigmoidal function: $s(t) = 1/(1 + e^{-t})$.\n\nOur active learning method, like many others, requires the inverse of an information matrix $J$. The information matrix of MLP, however, is not always invertible (White, 1989). Statistical algorithms utilizing the inverse, then, cannot be applied directly to MLP (Hagiwara et al., 1993). Such problems do not arise in linear models, which almost always have a nonsingular information matrix.\n\n3.2  SINGULARITY OF AN INFORMATION MATRIX OF MLP\n\nThe following theorem shows that the information matrix of a three-layer perceptron is singular if and only if the network has redundant hidden units. We can deduce that if the information matrix is singular, we can make it nonsingular by eliminating redundant hidden units without changing the input-output map.\n\nTheorem 1  Assume $r(x)$ is continuous and positive at any $x$. Then, the Fisher information matrix $J$ is singular if and only if at least one of the following three conditions is satisfied:\n(1) $u_j := (u_{j1}, \\ldots, u_{jL})^T = 0$, for some $j$.\n(2) $w_j := (w_{1j}, \\ldots, w_{Mj}) = 0^T$, for some $j$.\n(3) For different $j_1$ and $j_2$, $(u_{j_1}, \\zeta_{j_1}) = (u_{j_2}, \\zeta_{j_2})$ or $(u_{j_1}, \\zeta_{j_1}) = -(u_{j_2}, \\zeta_{j_2})$.\n\nA rough sketch of the proof is shown below. The complete proof will appear in a forthcoming paper (Fukumizu, 1996).\n\nRough sketch of the proof. An information matrix is singular if and only if the derivatives $\\{\\partial f(x; \\theta) / \\partial \\theta_a\\}_a$ are linearly dependent. The sufficiency can be proved easily. To show the necessity, we show that the derivatives are linearly independent if none of the three conditions is satisfied. Assume a linear relation:\n\n$$\\sum_{j=1}^{H} \\alpha_{ij}\\, s(u_j \\cdot x + \\zeta_j) + \\alpha_{i0} + \\sum_{j=1}^{H} \\sum_{k=1}^{L} \\beta_{jk} w_{ij}\\, s'(u_j \\cdot x + \\zeta_j)\\, x_k + \\sum_{j=1}^{H} \\beta_{j0} w_{ij}\\, s'(u_j \\cdot x + \\zeta_j) = 0, \\quad (1 \\leq i \\leq M). \\qquad (4)$$\n\nWe can show there exists a basis of $\\mathbb{R}^L$, $(x^{(1)}, \\ldots, x^{(L)})$, such that $u_j \\cdot x^{(l)} \\neq 0$ for all $j$ and $l$, and $u_{j_1} \\cdot x^{(l)} + \\zeta_{j_1} \\neq \\pm(u_{j_2} \\cdot x^{(l)} + \\zeta_{j_2})$ for $j_1 \\neq j_2$ and all $l$. We replace $x$ in Eq. (4) by $x^{(l)} t$ ($t \\in \\mathbb{R}$). Let $m_j^{(l)} := u_j \\cdot x^{(l)}$, $S_j^{(l)} := \\{z \\in \\mathbb{C} \\mid z = ((2n+1)\\pi\\sqrt{-1} - \\zeta_j)/m_j^{(l)},\\ n \\in \\mathbb{Z}\\}$, and $D^{(l)} := \\mathbb{C} - \\bigcup_j S_j^{(l)}$. The points in $S_j^{(l)}$ are the singularities of $s(m_j^{(l)} z + \\zeta_j)$. We define holomorphic functions on $D^{(l)}$ as\n\n$$\\phi_i^{(l)}(z) := \\sum_{j=1}^{H} \\alpha_{ij}\\, s(m_j^{(l)} z + \\zeta_j) + \\alpha_{i0} + \\sum_{j=1}^{H} \\sum_{k=1}^{L} \\beta_{jk} w_{ij}\\, s'(m_j^{(l)} z + \\zeta_j)\\, x_k^{(l)} z + \\sum_{j=1}^{H} \\beta_{j0} w_{ij}\\, s'(m_j^{(l)} z + \\zeta_j), \\quad (1 \\leq i \\leq M).$$\n\nFrom Eq. (4), we have $\\phi_i^{(l)}(t) = 0$ for all $t \\in \\mathbb{R}$. Using standard arguments on isolated singularities of holomorphic functions, we know that the points of $S_j^{(l)}$ are removable singularities of $\\phi_i^{(l)}(z)$, and finally obtain\n\n$$w_{ij} \\sum_{k=1}^{L} \\beta_{jk} x_k^{(l)} = 0, \\quad w_{ij} \\beta_{j0} = 0, \\quad \\alpha_{ij} = 0, \\quad \\alpha_{i0} = 0.$$\n\nIt is easy to see $\\beta_{jk} = 0$. This completes the proof.\n\n3.3  REDUCTION PROCEDURE\n\nWe introduce the following reduction procedure based on Theorem 1. Used during BP training, it eliminates redundant hidden units and keeps the information matrix nonsingular. The criterion of elimination is very important, because excessive elimination of hidden units degrades the approximation capacity. We propose an algorithm which does not increase the mean squared error on average. In the following, let $s_j := s(\\hat{u}_j \\cdot x + \\hat{\\zeta}_j)$ and $\\epsilon(N) := A/N$ for a positive number $A$.\n\nReduction Procedure\n\n1. If $\\|\\hat{w}_j\\|^2 \\int (s_j - s(\\hat{\\zeta}_j))^2 dQ < \\epsilon(N)$, then eliminate the $j$th hidden unit, and $\\hat{\\eta}_i \\to \\hat{\\eta}_i + \\hat{w}_{ij} s(\\hat{\\zeta}_j)$ for all $i$.\n2. If $\\|\\hat{w}_j\\|^2 \\int (s_j)^2 dQ < \\epsilon(N)$, then eliminate the $j$th hidden unit.\n3. If $\\|\\hat{w}_{j_2}\\|^2 \\int (s_{j_2} - s_{j_1})^2 dQ < \\epsilon(N)$ for different $j_1$ and $j_2$, then eliminate the $j_2$th hidden unit and $\\hat{w}_{ij_1} \\to \\hat{w}_{ij_1} + \\hat{w}_{ij_2}$ for all $i$.\n4. 
If $\\|\\hat{w}_{j_2}\\|^2 \\int (1 - s_{j_2} - s_{j_1})^2 dQ < \\epsilon(N)$ for different $j_1$ and $j_2$, then eliminate the $j_2$th hidden unit and $\\hat{w}_{ij_1} \\to \\hat{w}_{ij_1} - \\hat{w}_{ij_2}$, $\\hat{\\eta}_i \\to \\hat{\\eta}_i + \\hat{w}_{ij_2}$ for all $i$.\n\nFrom Theorem 1, we know that $w_j$, $u_j$, $(u_{j_2}, \\zeta_{j_2}) - (u_{j_1}, \\zeta_{j_1})$, or $(u_{j_2}, \\zeta_{j_2}) + (u_{j_1}, \\zeta_{j_1})$ can be reduced to 0 if the information matrix is singular. Let $\\tilde{\\theta} \\in \\mathcal{N}_K$ denote the reduced parameter obtained from $\\hat{\\theta}$ according to the above procedure. The above four conditions are, then, given by calculating $\\int \\|f(x; \\tilde{\\theta}) - f(x; \\hat{\\theta})\\|^2 dQ$.\n\nWe briefly explain how the procedure keeps the information matrix nonsingular and does not increase EMSE with high probability. First, suppose $\\det J(\\theta_0) = 0$; then there exists $\\theta_0^K \\in \\mathcal{N}_K$ ($K < H$) such that $f(x; \\theta_0) = f(x; \\theta_0^K)$ and $\\det J(\\theta_0^K) \\neq 0$ in $\\mathcal{N}_K$. The elimination of hidden units up to $K$, of course, does not increase the EMSE. Therefore, we have only to consider the case in which $\\det J(\\theta_0) \\neq 0$ and hidden units are eliminated.\n\nSuppose $\\int \\|f(x; \\theta_0^K) - f(x; \\theta_0)\\|^2 dQ > O(N^{-1})$ for any reduced parameter $\\theta_0^K$ from $\\theta_0$. The probability of satisfying $\\int \\|f(x; \\hat{\\theta}) - f(x; \\tilde{\\theta})\\|^2 dQ < A/N$ is very small for a sufficiently small $A$. Thus, the elimination of hidden units occurs with very small probability. Next, suppose $\\int \\|f(x; \\theta_0^K) - f(x; \\theta_0)\\|^2 dQ = O(N^{-1})$. Let $\\tilde{\\theta} \\in \\mathcal{N}_K$ be a reduced parameter made from $\\hat{\\theta}$ with the same procedure as we obtain $\\theta_0^K$ from $\\theta_0$. We will show, for a sufficiently small $A$,\n\n$$E\\left[\\int \\|f(x; \\hat{\\theta}_K) - f(x; \\theta_0)\\|^2 dQ\\right] \\leq E\\left[\\int \\|f(x; \\hat{\\theta}) - f(x; \\theta_0)\\|^2 dQ\\right],$$\n\nwhere $\\hat{\\theta}_K$ is the MLE computed in $\\mathcal{N}_K$. We write $\\theta = (\\theta^{(1)}, \\theta^{(2)})$, in which $\\theta^{(2)}$ is changed to 0 in reduction, changing the coordinate system if necessary. 
The Taylor expansion and asymptotic theory give\n\n$$E\\left[\\int \\|f(x; \\hat{\\theta}_K) - f(x; \\theta_0)\\|^2 dQ\\right] \\approx \\int \\|f(x; \\theta_0^K) - f(x; \\theta_0)\\|^2 dQ + \\frac{\\sigma^2}{N} \\mathrm{Tr}[I_{11}(\\theta_0^K) J_{11}^{-1}(\\theta_0^K)],$$\n\n$$E\\left[\\int \\|f(x; \\hat{\\theta}) - f(x; \\tilde{\\theta})\\|^2 dQ\\right] \\approx \\int \\|f(x; \\theta_0^K) - f(x; \\theta_0)\\|^2 dQ + \\frac{\\sigma^2}{N} \\mathrm{Tr}[I_{22}(\\theta_0^K) J_{22}^{-1}(\\theta_0)],$$\n\nwhere $I_{ii}$ and $J_{ii}$ denote the local information matrices w.r.t. $\\theta^{(i)}$ ($i = 1, 2$). Thus,\n\n$$E\\left[\\int \\|f(x; \\hat{\\theta}) - f(x; \\theta_0)\\|^2 dQ\\right] - E\\left[\\int \\|f(x; \\hat{\\theta}_K) - f(x; \\theta_0)\\|^2 dQ\\right] \\approx -E\\left[\\int \\|f(x; \\hat{\\theta}) - f(x; \\tilde{\\theta})\\|^2 dQ\\right] + \\frac{\\sigma^2}{N} \\mathrm{Tr}[I_{22}(\\theta_0^K) J_{22}^{-1}(\\theta_0)] - \\frac{\\sigma^2}{N} \\mathrm{Tr}[I_{11}(\\theta_0^K) J_{11}^{-1}(\\theta_0^K)] + E\\left[\\int \\|f(x; \\hat{\\theta}) - f(x; \\theta_0)\\|^2 dQ\\right].$$\n\nSince the sum of the last two terms is positive, the l.h.s. is positive if $E[\\int \\|f(x; \\hat{\\theta}) - f(x; \\tilde{\\theta})\\|^2 dQ] < B/N$ for a sufficiently small $B$. Although we cannot know the value of this expectation, we can make the probability that this inequality holds very high by taking a small $A$.\n\n4  ACTIVE LEARNING WITH REDUCTION PROCEDURE\n\nThe reduction procedure keeps the information matrix nonsingular and makes the active learning algorithm applicable to MLP even with surplus hidden units.\n\nActive Learning with Hidden Unit Reduction\n\n1. Select an initial training data set $D_{[0]}$ from $r(x; v_{[0]})$, and compute $\\hat{\\theta}_{[0]}$.\n2. $k := 1$, and do REDUCTION PROCEDURE.\n3. Compute the optimal $v = v_{[k]}$ to minimize $\\mathrm{Tr}[I(\\hat{\\theta}_{[k-1]}) J^{-1}(\\hat{\\theta}_{[k-1]})]$, using the steepest descent method.\n4. Choose $N_k$ new training data from $r(x; v_{[k]})$ and let $D_{[k]}$ be a union of $D_{[k-1]}$ and the new data.\n5. Compute the MLE $\\hat{\\theta}_{[k]}$ based on the training data $D_{[k]}$ using BP with REDUCTION PROCEDURE.\n6. $k := k + 1$ and go to 3.\n\nThe BP with reduction procedure is applicable not only to active learning, but also to a variety of statistical techniques that require the inverse of an information matrix. We do not discuss it in this paper, however.\n\n
Figure 1: Active/Passive Learning: $f(x) = s(x)$. [Panels: learning curves of active and passive learning, each with its [Av-Sd, Av+Sd] range; learning curve and # of hidden units. Horizontal axis: The Number of Training Data.]\n\n5  EXPERIMENTS\n\nWe demonstrate the effect of the proposed active learning algorithm through experiments. First, we use a three-layer model with 1 input unit, 3 hidden units, and 1 output unit. The true function $f$ is an MLP network with 1 hidden unit. The information matrix is therefore singular at $\\theta_0$. The environmental probability, $Q$, is a normal distribution $N(0, 4)$. We evaluate the generalization error in the actual environment using the following mean squared error of the function values:\n\n$$\\int \\|f(x; \\hat{\\theta}) - f(x)\\|^2 dQ.$$\n\nWe set the deviation in the true system to $\\sigma = 0.01$. As a family of distributions for training $\\{r(x; v)\\}$, a mixture model of 4 normal distributions is used. In each step of active learning, 100 new samples are added. A network is trained using online BP, presented with all training data 10000 times in each step, and the reduction procedure is applied once every 100 cycles between the 5000th and 10000th cycles. We run 30 trainings, changing the seed of random numbers. For comparison, we train a network passively based on training samples given by the probability $Q$.\n\nFig. 1 shows the averaged learning curves of active/passive learning and the number of hidden units in a typical learning curve. The advantage of the proposed active learning algorithm is clear. 
We find that the algorithm has the expected effects on a simple, ideal approximation problem.\n\nSecond, we apply the algorithm to a problem in which the true function is not included in the MLP model. We use an MLP with 4 input units, 7 hidden units, and 1 output unit. The true function is given by $f(x) = \\mathrm{erf}(x_1)$, where $\\mathrm{erf}(t)$ is the error function. The graph of the error function resembles that of the sigmoidal function, while they never coincide under any affine transforms. We set $Q = N(0, 25 \\times I_4)$. We train a network actively/passively based on 10 data sets, and evaluate the MSEs of function values. Other conditions are the same as those of the first experiment.\n\nFig. 2 shows the averaged learning curves and the number of hidden units in a typical learning curve. We find that the active learning algorithm reduces the errors even though the theoretical condition is not perfectly satisfied in this case. This suggests the robustness of our active learning algorithm.\n\nFigure 2: Active/Passive Learning: $f(x) = \\mathrm{erf}(x_1)$. [Panels: learning curves of active and passive learning; learning curve and # of hidden units. Horizontal axis: The Number of Training Data.]\n\n6  CONCLUSION\n\nWe review statistical active learning methods and point out a problem in their application to MLP: the required inverse of an information matrix does not exist if the network has redundant hidden units. 
We characterize the singularity condition of an information matrix and propose an active learning algorithm which is applicable to MLP with any number of hidden units. The effectiveness of the algorithm is verified through computer simulations, even when the theoretical assumptions are not perfectly satisfied.\n\nReferences\n\nD. A. Cohn. (1994) Neural network exploration using optimal experiment design. In J. Cowan et al. (eds.), Advances in Neural Information Processing Systems 6, 679-686. San Mateo, CA: Morgan Kaufmann.\n\nV. V. Fedorov. (1972) Theory of Optimal Experiments. NY: Academic Press.\n\nK. Fukumizu. (1996) A Regularity Condition of the Information Matrix of a Multilayer Perceptron Network. Neural Networks, to appear.\n\nK. Fukumizu, & S. Watanabe. (1994) Error Estimation and Learning Data Arrangement for Neural Networks. Proc. IEEE Int. Conf. Neural Networks: 777-780.\n\nK. Hagiwara, N. Toda, & S. Usui. (1993) On the problem of applying AIC to determine the structure of a layered feed-forward neural network. Proc. 1993 Int. Joint Conf. Neural Networks: 2263-2266.\n\nD. MacKay. (1992) Information-based objective functions for active data selection. Neural Computation 4(4):305-318.\n\nF. Pukelsheim. (1993) Optimal Design of Experiments. NY: John Wiley & Sons.\n\nH. White. (1989) Learning in artificial neural networks: A statistical perspective. Neural Computation 1(4):425-464.", "award": [], "sourceid": 1140, "authors": [{"given_name": "Kenji", "family_name": "Fukumizu", "institution": null}]}