{"title": "Data Visualization and Feature Selection: New Algorithms for Nongaussian Data", "book": "Advances in Neural Information Processing Systems", "page_first": 687, "page_last": 693, "abstract": "", "full_text": "Data Visualization and Feature Selection: New Algorithms for Nongaussian Data\n\nHoward Hua Yang and John Moody\n\nOregon Graduate Institute of Science and Technology\n20000 NW Walker Rd., Beaverton, OR 97006, USA\n\nhyang@ece.ogi.edu, moody@cse.ogi.edu, FAX: 503 7481406\n\nAbstract\n\nData visualization and feature selection methods are proposed based on the joint mutual information and ICA. The visualization methods can find many good 2-D projections for high-dimensional data interpretation, which cannot be easily found by other existing methods. The new variable selection method is found to be better at eliminating redundancy in the inputs than other methods based on simple mutual information. The efficacy of the methods is illustrated on a radar signal analysis problem, both to find 2-D viewing coordinates for data visualization and to select inputs for a neural network classifier.\n\nKeywords: feature selection, joint mutual information, ICA, visualization, classification.\n\n1 INTRODUCTION\n\nVisualization of input data and feature selection are intimately related. A good feature selection algorithm can identify meaningful coordinate projections for low-dimensional data visualization. Conversely, a good visualization technique can suggest meaningful features to include in a model.\n\nInput variable selection is the most important step in the model selection process. Given a target variable, a set of input variables can be selected as explanatory variables using prior knowledge. However, many irrelevant input variables cannot be ruled out by prior knowledge alone. 
Too many input variables irrelevant to the target variable will not only severely complicate the model selection/estimation process but also damage the performance of the final model.\n\nSelecting input variables after model specification is a model-dependent approach [6]. However, such methods can be very slow if the model space is large. To reduce the computational burden in the estimation and selection processes, we need model-independent approaches that select input variables before model specification. One such approach is the \u03b4-test [7]. Other approaches are based on the mutual information (MI) [2, 3, 4], which is very effective in evaluating the relevance of each input variable but fails to eliminate redundant variables.\n\nIn this paper, we focus on the model-independent approach to input variable selection based on joint mutual information (JMI). The increment from MI to JMI is the conditional MI. Although the conditional MI was used in [4] to show the monotonic property of the MI, it was not used for input selection.\n\nData visualization is very important for humans to understand the structural relations among variables in a system. It is also a critical step in eliminating unrealistic models. We give two methods for data visualization: one is based on the JMI and the other on Independent Component Analysis (ICA). Both methods perform better than some existing methods, such as those based on PCA and canonical correlation analysis (CCA), for nongaussian data.\n\n2 Joint mutual information for input/feature selection\n\nLet Y be a target variable and let the X_i be inputs. 
The relevance of a single input is measured by the MI\n\nI(X_i; Y) = K(p(x_i, y) || p(x_i) p(y))\n\nwhere K(p || q) is the Kullback-Leibler divergence of two probability functions p and q, defined by K(p(x) || q(x)) = sum_x p(x) log(p(x)/q(x)).\n\nThe relevance of a set of inputs is defined by the joint mutual information\n\nI(X_i, ..., X_k; Y) = K(p(x_i, ..., x_k, y) || p(x_i, ..., x_k) p(y)).\n\nGiven two selected inputs X_j and X_k, the conditional MI is defined by\n\nI(X_i; Y | X_j, X_k) = sum_{x_j, x_k} p(x_j, x_k) K(p(x_i, y | x_j, x_k) || p(x_i | x_j, x_k) p(y | x_j, x_k)).\n\nI(X_i; Y | X_j, ..., X_k), conditioned on more than two variables, is defined similarly.\n\nThe conditional MI is always non-negative since it is a weighted average of Kullback-Leibler divergences. It has the following property:\n\nI(X_1, ..., X_{n-1}, X_n; Y) - I(X_1, ..., X_{n-1}; Y) = I(X_n; Y | X_1, ..., X_{n-1}) >= 0.\n\nTherefore, I(X_1, ..., X_{n-1}, X_n; Y) >= I(X_1, ..., X_{n-1}; Y), i.e., adding the variable X_n never decreases the mutual information. The information gained by adding a variable is measured by the conditional MI.\n\nWhen X_n and Y are conditionally independent given X_1, ..., X_{n-1}, the conditional MI between X_n and Y is\n\nI(X_n; Y | X_1, ..., X_{n-1}) = 0,    (1)\n\nso X_n provides no extra information about Y when X_1, ..., X_{n-1} are known. In particular, when X_n is a function of X_1, ..., X_{n-1}, the equality (1) holds. This is the reason why the joint MI can be used to eliminate redundant inputs.\n\nThe conditional MI is useful when the input variables cannot be distinguished by the mutual information I(X_i; Y). For example, assume I(X_1; Y) = I(X_2; Y) = I(X_3; Y), and the problem is to select (X_1, X_2), (X_1, X_3) or (X_2, X_3). 
Since\n\nI(X_1, X_2; Y) - I(X_1, X_3; Y) = I(X_2; Y | X_1) - I(X_3; Y | X_1),\n\nwe should choose (X_1, X_2) rather than (X_1, X_3) if I(X_2; Y | X_1) > I(X_3; Y | X_1). Otherwise, we should choose (X_1, X_3). All possible comparisons are represented by a binary tree in Figure 1.\n\nTo estimate I(X_1, ..., X_k; Y), we need to estimate the joint probability p(x_1, ..., x_k, y). This suffers from the curse of dimensionality when k is large. Sometimes, we may not be able to estimate high-dimensional MI because of a shortage of samples. Further work is needed to estimate high-dimensional joint MI based on parametric and non-parametric density estimation when the sample size is not large enough.\n\nIn some real-world problems, such as mining large databases and radar pulse classification, the sample size is large. Since the parametric densities for the underlying distributions are unknown, it is better to use non-parametric methods such as histograms to estimate the joint probability and the joint MI, to avoid the risk of specifying a wrong or overly complicated model for the true density function.\n\n[Figure 1 diagram: binary tree of conditional MI comparisons whose leaves are the input pairs.]\n\nFigure 1: Input selection based on the conditional MI.\n\nIn this paper, we use the joint mutual information I(X_i, X_j; Y) instead of the mutual information I(X_i; Y) to select inputs for a neural network classifier. Another application is to select the two inputs most relevant to the target variable for data visualization. 
\n\n3 Data visualization methods\n\nWe present supervised data visualization methods based on joint MI and discuss unsupervised methods based on ICA.\n\nThe most natural way to visualize high-dimensional input patterns is to display them using two of the existing coordinates, where each coordinate corresponds to one input variable. The inputs that are most relevant to the target variable correspond to the best coordinates for data visualization. Let (i*, j*) = arg max_{(i,j)} I(X_i, X_j; Y). Then the coordinate axes (X_{i*}, X_{j*}) should be used for visualizing the input patterns, since the corresponding inputs achieve the maximum joint MI. To find the maximum I(X_{i*}, X_{j*}; Y), we need to evaluate every joint MI I(X_i, X_j; Y) for i < j. The number of evaluations is O(n^2).\n\nNoticing that I(X_i, X_j; Y) = I(X_i; Y) + I(X_j; Y | X_i), we can first maximize the MI I(X_i; Y), then maximize the conditional MI. This algorithm is suboptimal, but requires only n - 1 evaluations of the joint MIs. Sometimes, this is equivalent to exhaustive search. One such example is given in the next section.\n\nSome existing methods for visualizing high-dimensional patterns are based on dimensionality reduction methods such as PCA and CCA, which find new coordinates in which to display the data. The new coordinates found by PCA and CCA are orthogonal in Euclidean space and in the space with the Mahalanobis inner product, respectively. However, these two methods are not suitable for visualizing nongaussian data because the projections on the PCA or CCA coordinates are not statistically independent for nongaussian vectors. Since the JMI method is model-independent, it is better for analyzing nongaussian data.\n\nBoth CCA and maximum joint MI are supervised methods, while the PCA method is unsupervised. An alternative to these methods is ICA for visualizing clusters [5]. ICA is a technique for transforming a set of variables into a new set of variables so that the statistical dependency among the transformed variables is minimized. The version of ICA that we use here is based on the algorithms in [1, 8]. It discovers a non-orthogonal basis that minimizes the mutual information between projections on the basis vectors. We shall compare these methods in a real-world application.\n\n4 Application to Signal Visualization and Classification\n\n4.1 Joint mutual information and visualization of radar pulse patterns\n\nOur goal is to design a classifier for radar pulse recognition. Each radar pulse pattern is a 15-dimensional vector. We first compute the joint MIs, then use them to select inputs for the visualization and classification of radar pulse patterns.\n\nA set of radar pulse patterns is denoted by D = {(z^i, y^i) : i = 1, ..., N}, which consists of patterns in three different classes. Here, each z^i is in R^15 and each y^i is in {1, 2, 3}.\n\n[Figure 2 plots: (a) mutual information and conditional MI given X2 vs. input variable index; (b) joint MI vs. bundle number.]\n\nFigure 2: (a) MI vs. conditional MI for the radar pulse data; maximizing the MI and then the conditional MI with O(n) evaluations gives I(X_{i_1}, X_{j_1}; Y) = 1.201 bits. 
(b) The joint MI for the radar pulse data; maximizing the joint MI gives I(X_{i*}, X_{j*}; Y) = 1.201 bits with O(n^2) evaluations of the joint MI. (i_1, j_1) = (i*, j*) in this case.\n\nLet i_1 = arg max_i I(X_i; Y) and j_1 = arg max_{j != i_1} I(X_j; Y | X_{i_1}). From Figure 2(a), we obtain (i_1, j_1) = (2, 9) and I(X_{i_1}, X_{j_1}; Y) = I(X_{i_1}; Y) + I(X_{j_1}; Y | X_{i_1}) = 1.201 bits. If the total number of inputs is n, then the number of evaluations for computing the mutual information I(X_i; Y) and the conditional mutual information I(X_j; Y | X_{i_1}) is O(n).\n\nTo find the maximum I(X_{i*}, X_{j*}; Y), we evaluate every I(X_i, X_j; Y) for i < j. These MIs are shown by the bars in Figure 2(b), where the i-th bundle displays the MIs I(X_i, X_j; Y) for j = i+1, ..., 15.\n\nIn order to compute the joint MIs, the MI and the conditional MI are evaluated O(n) and O(n^2) times, respectively. The maximum joint MI is I(X_{i*}, X_{j*}; Y) = 1.201 bits. Generally, we only know that I(X_{i_1}, X_{j_1}; Y) <= I(X_{i*}, X_{j*}; Y). But in this particular application, the equality holds. This suggests that sometimes we can use an efficient algorithm with only linear complexity to find the optimal coordinate axis view (X_{i*}, X_{j*}). The joint MI also gives other good sets of coordinate axis views with high joint MI values.\n\n[Figure 3 plots: four 2-D scatter plots of the patterns labeled by class (1, 2, 3); recoverable axis labels: first principal component in (a), X2 in (b), first LD in (c).]\n\nFigure 3: (a) Data visualization by two principal components; the spatial relation between patterns is not clear. (b) Use the optimal coordinate axis view (X_{i*}, X_{j*}) found via joint MI to project the radar pulse data; the patterns are well spread to give a better view on the spatial relation between patterns and the boundary between classes. (c) The CCA method. (d) The ICA method.\n\nEach bar in Figure 2(b) is associated with a pair of inputs. Those pairs with high joint MI give good coordinate axis views for data visualization. Figure 3 shows that the data visualizations by the maximum JMI and ICA are better than those by PCA and CCA because the data are nongaussian.\n\n4.2 Radar pulse classification\n\nNow we train a two-layer feed-forward network to classify the radar pulse patterns. Figure 3 shows that it is very difficult to separate the patterns by using just two inputs. 
We shall use all inputs or four selected inputs. The data set D is divided into a training set D_1 and a test set D_2 consisting of 20 percent of the patterns in D. The network trained on the data set D_1 using all input variables is denoted by\n\nY = f(X_1, ..., X_n; W_1, W_2, \u0398)\n\nwhere W_1 and W_2 are weight matrices and \u0398 is a vector of thresholds for the hidden layer.\n\nFrom the data set D, we estimate the mutual information I(X_i; Y) and select i_1 = arg max_i I(X_i; Y). Given X_{i_1}, we estimate the conditional mutual information I(X_j; Y | X_{i_1}) for j != i_1. We choose the three inputs X_{i_2}, X_{i_3} and X_{i_4} with the largest conditional MI. We found the quartet (i_1, i_2, i_3, i_4) = (1, 2, 3, 9). The two-layer feed-forward network trained on D_1 with the four selected inputs is denoted by\n\nY = g(X_1, X_2, X_3, X_9; W_1', W_2', \u0398').\n\nThere are 1365 choices for selecting 4 input variables out of 15. To set a reference performance for networks with four inputs, we choose 20 quartets at random from the set Q = {(j_1, j_2, j_3, j_4) : 1 <= j_1 < j_2 < j_3 < j_4 <= 15}. For each quartet (j_1, j_2, j_3, j_4), a two-layer feed-forward network is trained using the inputs (X_{j_1}, X_{j_2}, X_{j_3}, X_{j_4}). These networks are denoted by\n\nY = h_i(X_{j_1}, X_{j_2}, X_{j_3}, X_{j_4}; W_1'', W_2'', \u0398''),  i = 1, 2, ..., 20.\n\n[Figure 4 plots: training and testing error rates (ER) for (a) the 20 quartets and the four selected inputs X1, X2, X3 and X9, and (b) the four selected inputs and all inputs.]\n\nFigure 4: (a) The error rates of the network with four inputs (X_1, X_2, X_3, X_9) selected by the joint MI are well below the average error rates (with error bars attached) of the 20 networks with different input quartets randomly selected; this shows that the input quartet (X_1, X_2, X_3, X_9) is rare but informative. (b) The network with the inputs (X_1, X_2, X_3, X_9) converges faster than the network with all inputs. The former uses 65% fewer parameters (weights and thresholds) and 73% fewer inputs than the latter. The classifier with the four best inputs is less expensive to construct and use, in terms of data acquisition costs, training time, and computing costs for real-time application.\n\nThe mean and the variance of the error rates of the 20 networks are then computed. All networks have seven hidden units. The training and testing error rates of the networks at each epoch are shown in Figure 4, where we see that the network with four inputs selected by the joint MI performs better than the networks with randomly selected input quartets and converges faster than the network with all inputs. The network with fewer inputs is not only faster in computing but also less expensive in data collection.\n\n5 CONCLUSIONS\n\nWe have proposed data visualization and feature selection methods based on the joint mutual information and ICA.\n\nThe maximum JMI method can find many good 2-D projections for visualizing high-dimensional data which cannot be easily found by other existing methods. Both the maximum JMI method and the ICA method are very effective for visualizing nongaussian data. 
\n\nThe variable selection method based on the JMI is found to be better at eliminating redundancy in the inputs than other methods based on simple mutual information. Input selection methods based on mutual information (MI) have been useful in many applications, but they have two disadvantages. First, they cannot distinguish inputs when all of them have the same MI. Second, they cannot eliminate the redundancy in the inputs when one input is a function of other inputs. In contrast, our new input selection method based on the joint MI offers significant advantages in these two respects.\n\nWe have successfully applied these methods to visualize radar patterns and to select inputs for a neural network classifier to recognize radar pulses. We found a smaller yet more robust neural network for radar signal analysis using the JMI.\n\nAcknowledgement: This research was supported by grant ONR N00014-96-1-0476.\n\nReferences\n\n[1] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 757-763. MIT Press, Cambridge, MA, 1996.\n\n[2] G. Barrows and J. Sciortino. A mutual information measure for feature selection with application to pulse classification. In IEEE Intern. Symposium on Time-Frequency and Time-Scale Analysis, pages 249-253, 1996.\n\n[3] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. on Neural Networks, 5(4):537-550, July 1994.\n\n[4] B. Bonnlander. Nonparametric selection of input variables for connectionist learning. Technical report, PhD Thesis. University of Colorado, 1996.\n\n[5] C. Jutten and J. Herault. 
Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24:1-10, 1991.\n\n[6] J. Moody. Prediction risk and architecture selection for neural network. In V. Cherkassky, J. H. Friedman, and H. Wechsler, editors, From Statistics to Neural Networks: Theory and Pattern Recognition Applications. NATO ASI Series F, Springer-Verlag, 1994.\n\n[7] H. Pi and C. Peterson. Finding the embedding dimension and variable dependencies in time series. Neural Computation, 6:509-520, 1994.\n\n[8] H. H. Yang and S. Amari. Adaptive on-line learning algorithms for blind separation: Maximum entropy and minimum mutual information. Neural Computation, 9(7):1457-1482, 1997.\n", "award": [], "sourceid": 1779, "authors": [{"given_name": "Howard", "family_name": "Yang", "institution": null}, {"given_name": "John", "family_name": "Moody", "institution": null}]}