{"title": "Constructing Heterogeneous Committees Using Input Feature Grouping: Application to Economic Forecasting", "book": "Advances in Neural Information Processing Systems", "page_first": 921, "page_last": 927, "abstract": null, "full_text": "Constructing Heterogeneous Committees Using Input Feature Grouping: Application to Economic Forecasting\n\nYuansong Liao and John Moody\n\nDepartment of Computer Science, Oregon Graduate Institute, P.O. Box 91000, Portland, OR 97291-1000\n\nAbstract\n\nThe committee approach has been proposed for reducing model uncertainty and improving generalization performance. The advantage of committees depends on (1) the performance of individual members and (2) the correlational structure of errors between members. This paper presents an input grouping technique for designing a heterogeneous committee. With this technique, all input variables are first grouped based on their mutual information. Statistically similar variables are assigned to the same group. Each member's input set is then formed by input variables extracted from different groups. Our designed committees have less error correlation between their members, since each member observes a different combination of input variables. The individual members' feature sets contain less redundant information, because highly correlated variables are not combined. The member feature sets contain almost complete information, since each set contains a feature from each information group. An empirical study for a noisy and nonstationary economic forecasting problem shows that committees constructed by our proposed technique outperform committees formed using several existing techniques. 
\n\n1 Introduction\n\nThe committee approach has been widely used to reduce model uncertainty and improve generalization performance. Developing methods for generating candidate committee members is a very important direction of committee research. Good candidate members of a committee should have (1) good (not necessarily excellent) individual performance and (2) small residual error correlations with other members.\n\nMany techniques have been proposed to reduce residual correlations between members. These include resampling the training and validation data [3], adding randomness to data [7], and decorrelation training [8]. These approaches are only effective for certain models and problems. Genetic algorithms have also been used to generate good and diverse members [6].\n\nInput feature selection is one of the most important stages of the model learning process. It has a crucial impact both on the learning complexity and the generalization performance. It is essential that a feature vector gives sufficient information for estimation. However, too many redundant input features not only burden the whole learning process, but also degrade the achievable generalization performance.\n\nInput feature selection for individual estimators has received a lot of attention because of its importance. However, there has not been much research on feature selection for estimators in the context of committees. Previous research found that giving committee members different input features is very useful for improving committee performance [4], but is difficult to implement [9]. The feature selection problem for committee members is conceptually different than for single estimators. 
When using committees for estimation, as we stated previously, committee members not only need to have reasonable performance themselves, but should also make decisions independently.\n\nWhen all committee members are trained to model the same underlying function, it is difficult for committee members to optimize both criteria at the same time. In order to generate members that provide a good balance between the two criteria, we propose a feature selection approach, called input feature grouping, for committee members. The idea is to give each member estimator of a committee a rich but distinct feature set, in the hope that each member will generalize independently with reduced error correlations.\n\nThe proposed method first groups input features using a hierarchical clustering algorithm based on their mutual information, such that features in different groups are less related to each other and features within a group are statistically similar to each other. Then the feature set for each committee member is formed by selecting a feature from each group. Our empirical results demonstrate that forming a heterogeneous committee using input feature grouping is a promising approach.\n\n2 Committee Performance Analysis\n\nThere are many ways to construct a committee. In this paper, we are mainly interested in heterogeneous committees whose members have different input feature sets. Committee members are given different subsets of the available feature set. They are trained independently, and the committee output is either a weighted or unweighted combination of individual members' outputs.\n\nIn the following, we analyze the relationship between committee errors and average member errors from the regression point of view and discuss how the residual correlations between members affect the committee error. 
We define the training data $\\mathcal{D} = \\{(X^\\beta, Y^\\beta);\\ \\beta = 1, 2, \\ldots, N\\}$ and the test data $\\mathcal{T} = \\{(X^\\mu, Y^\\mu);\\ \\mu = 1, 2, \\ldots, \\infty\\}$, where both are assumed to be generated by the model $Y = t(X) + \\epsilon$, $\\epsilon \\sim N(0, \\sigma^2)$. The data $\\mathcal{D}$ and $\\mathcal{T}$ are independent, and inputs are drawn from an unknown distribution. Assume that a committee has $K$ members. Denote the available input features as $X = [x_1, x_2, \\ldots, x_m]$, the feature sets for the $i$th and $j$th members as $X_i = [x_{i_1}, x_{i_2}, \\ldots, x_{i_{m_i}}]$ and $X_j = [x_{j_1}, x_{j_2}, \\ldots, x_{j_{m_j}}]$ respectively, where $X_i \\subseteq X$, $X_j \\subseteq X$ and $X_i \\neq X_j$, and the mapping functions of the $i$th and $j$th member models trained on data from $\\mathcal{D}$ as $f_i(X_i)$ and $f_j(X_j)$. Define the model error $e_i^\\mu = t^\\mu - f_i(X_i^\\mu)$, for all $\\mu = 1, 2, 3, \\ldots, \\infty$ and $i = 1, 2, \\ldots, K$.\n\nThe MSE of a committee is\n\n$$E_c = \\mathcal{E}_\\mu\\left[\\left(\\frac{1}{K}\\sum_{i=1}^{K} e_i^\\mu\\right)^2\\right] = \\frac{1}{K^2}\\sum_{i=1}^{K}\\mathcal{E}_\\mu\\left[(e_i^\\mu)^2\\right] + \\frac{1}{K^2}\\sum_{i \\neq j}\\mathcal{E}_\\mu\\left[e_i^\\mu e_j^\\mu\\right], \\tag{1}$$\n\nand the average MSE made by the committee members acting individually is\n\n$$E_{ave} = \\frac{1}{K}\\sum_{i=1}^{K}\\mathcal{E}_\\mu\\left[(e_i^\\mu)^2\\right], \\tag{2}$$\n\nwhere $\\mathcal{E}_\\mu[\\cdot]$ denotes the expectation over all test data $\\mathcal{T}$. Using Jensen's inequality, we get $E_c \\leq E_{ave}$, which indicates that the performance of a committee is always equal to or better than the average performance of its members.\n\nWe define the average model error correlation as $C = \\frac{1}{K(K-1)}\\sum_{i \\neq j}\\mathcal{E}_\\mu\\left[e_i^\\mu e_j^\\mu\\right]$, and then have\n\n$$E_c = \\frac{1}{K}E_{ave} + \\frac{K-1}{K}C = \\left(\\frac{1}{K} + \\frac{K-1}{K}q\\right)E_{ave}, \\tag{3}$$\n\nwhere $q = C / E_{ave}$. We consider the following four cases of $q$:\n\n\u2022 Case 1: $-\\frac{1}{K-1} \\leq q < 0$. In this case, the model errors between members are anti-correlated, which might be achieved through decorrelation training.\n\n\u2022 Case 2: $q = 0$. 
In this case, the model errors between members are uncorrelated, and we have $E_c = \\frac{1}{K} E_{ave}$. That is to say, a committee can do much better than the average performance of its members.\n\n\u2022 Case 3: $0 < q < 1$. If $E_{ave}$ is bounded above, then as the committee size $K \\to \\infty$ we have $E_c \\to q E_{ave}$. This gives the asymptotic limit of a committee's performance. As the size of a committee goes to infinity, the committee error approaches the average model error correlation $C$. The difference between $E_c$ and $E_{ave}$ is determined by the ratio $q$.\n\n\u2022 Case 4: $q = 1$. In this case, $E_c$ is equal to $E_{ave}$. This happens only when $e_i = e_j$ for all $i, j = 1, \\ldots, K$. It is obvious that there is no advantage to combining a set of models that act identically.\n\nIt is clear from the analyses above that a committee shows its advantage when the ratio $q$ is less than one. The smaller the ratio $q$ is, the better the committee performs compared to the average performance of its members. For the committee to achieve substantial improvement over a single model, committee members not only should have small errors individually, but also should have small residual correlations between each other.\n\n3 Input Feature Grouping\n\nOne way to construct a feature subset for a committee member is by randomly picking a certain number of features from the original feature set. The advantage of this method is that it is simple. However, by randomly selecting subsets we have no control over each member's performance or over the residual correlation between members.\n\nInstead of randomly picking a subset of features for a member, we propose an input feature grouping method for forming committee member feature sets. 
The input grouping method first groups features based on a relevance measure in such a way that features in different groups are less related to one another and features within a group are more related to one another.\n\nAfter grouping, there are two ways to form member feature sets. One method is to construct the feature set for each member by selecting a feature from each group. When a member's feature set is formed in this way, each member has enough information to make decisions, and its feature set has less redundancy. This is the method we use in this paper.\n\nAnother way is to use each group as the feature set for a committee member. In this method each member will only have partial information. This is likely to hurt individual members' performance. However, because the input features for different members are less dependent, these members tend to make decisions more independently. There is always a trade-off between increasing members' independence and hurting individual members' performance. If there is no redundancy among input feature representations, removing several features may hurt individual members' performance badly, and the overall committee performance will be hurt even though members make decisions independently. This method is currently under investigation.\n\nThe mutual information $I(x_i; x_j)$ between two input variables $x_i$ and $x_j$ is used as the relevance measure to group inputs. The mutual information $I(x_i; x_j)$, which is defined in equation 4, measures the dependence between the two random variables.\n\n$$I(x_i; x_j) = \\sum_{x_i, x_j} P(x_i, x_j) \\log \\frac{P(x_i, x_j)}{P(x_i) P(x_j)} \\tag{4}$$\n\nIf features $x_i$ and $x_j$ are highly dependent, $I(x_i; x_j)$ will be large. 
Because the mutual information measures arbitrary dependencies between random variables, it has been effectively used for feature selection in complex prediction tasks [1], where methods based on linear relations like the correlation are likely to make mistakes. The fact that the mutual information is independent of the coordinates chosen permits a robust estimation.\n\n4 Empirical Studies\n\nWe apply the input grouping method to predict the one-month rate of change of the Index of Industrial Production (IP), one of the key measures of economic activity. It is computed and published monthly. Figure 1 plots monthly IP data from 1967 to 1993.\n\nNine macroeconomic time series, whose names are given in Table 1, are used for forecasting IP. Macroeconomic forecasting is a difficult task because data are usually limited, and these series are intrinsically very noisy and nonstationary. These series are preprocessed before they are applied to the forecasting models. The representation used for input series is the first difference on one month time scales of the logged series. For example, the notation IP.L.D1 represents IP.L.D1 \u2261 ln(IP(t)) - ln(IP(t-1)). The target series is IP.L.FD1, which is defined as IP.L.FD1 \u2261 ln(IP(t+1)) - ln(IP(t)). The data set has been one of our benchmarks for various studies [5, 10].\n\nIndex of Industrial Production: 1967-1993\n\nFigure 1: U.S. Index of Industrial Production (IP) for the period 1967 to 1993. Shaded regions denote official recessions, while unshaded regions denote official expansions. The boundaries for recessions and expansions are determined by the National Bureau of Economic Research based on several macroeconomic series. 
As is evident for IP, business cycles are irregular in magnitude, duration, and structure, making prediction of IP an interesting challenge.\n\nSeries  Description\nIP      Index of Industrial Production\nSP      Standard & Poor's 500\nDL      Index of Leading Indicators\nM2      Money Supply\nCP      Consumer Price Index\nCB      Moody's Aaa Bond Yield\nHS      Housing Starts\nTB3     3-month Treasury Bill Yield\nTr      Yield Curve Slope: (10-Year Bond Composite)-(3-Month Treasury Bill)\n\nTable 1: Input data series. Data are taken from the Citibase database.\n\nDuring the grouping procedure, measures of mutual information between all pairs of input variables are computed first. A simple histogram method is used to calculate these estimates. Then a hierarchical clustering algorithm [2] is applied to these values to group inputs. Hierarchical clustering proceeds by a series of successive fusions of the nine input variables into groups. At any particular stage, the process fuses the variables or groups of variables which are closest, based on their mutual information estimates. The distance between two groups is defined as the average of the distances between all pairs of individuals in the two groups. The result is presented by a tree which illustrates the fusions made at each successive level (see Figure 2). From the clustering tree, it is clear that we can break the input variables into four groups: (IP.L.D1, DL.L.D1) measure recent economic changes, (SP.L.D1) reflects recent stock market momentum, (CB.D1, TB3.D1, Tr.D1) give interest rate information, and (M2.L.D1, CP.L.D1, HS.L.D1) provide inflation information. The grouping algorithm meaningfully clusters the nine input series.\n\nFigure 2: Variable grouping based on mutual information. The y-axis shows the distance.\n\nEighteen different subsets of features can be generated from the four groups by selecting a feature from each group (2 x 1 x 3 x 3 = 18). Each subset is given to a committee member. For example, the subsets (IP.L.D1, SP.L.D1, CB.D1, M2.L.D1) and (DL.L.D1, SP.L.D1, TB3.D1, M2.L.D1) are used as feature sets for different committee members. A committee thus has eighteen members in total. Data from Jan. 1950 to Dec. 1979 is used for training and validation, and data from Jan. 1980 to Dec. 1989 is used for testing. Each member is a linear model that is trained using neural net techniques.\n\nWe compare the input grouping method with three other committee member generating methods: baseline, random selection, and bootstrapping. The baseline method is to train a committee member using all the input variables. Members are only different in their initial weights. The bootstrapping method also trains a member using all the input features, but each member has different bootstrap replicates of the original training data as its training and validation sets. The random selection method constructs a feature set for a member by randomly picking a subset from the available features. For comparison with the grouping method, each committee generated by these three methods has 18 members.\n\nTwenty runs are performed for each of the four methods in order to get reliable performance measures. Figure 3 shows the boxplots of normalized MSE for the four methods. The grouping method gives the best result, and the performance improvement is significant compared to other methods. 
The grouping method outperforms the random selection method by meaningfully grouping the input features. It is interesting to note that the heterogeneous committee methods, grouping and random selection, perform better than homogeneous methods for this data set. One of the reasons for this is that giving different members different input sets increases their model independence. Another reason could be that the problem becomes easier to model because of smaller feature sets.\n\n5 Conclusions\n\nThe performance of a committee depends on both the performance of individual members and the correlational structure of errors between members. An empirical study for a noisy and nonstationary economic forecasting problem has demonstrated that committees constructed by input variable grouping outperform committees formed by randomly selecting member input variables. They also outperform committees without any input variable manipulation.\n\nCommittee Performance Comparison (20 runs). 1: Grouping, 2: Random selection, 3: Baseline, 4: Bootstrapping\n\nFigure 3: Comparison between four different committee member generating methods. The proposed grouping method gives the best result, and the performance improvement is significant compared to the other three methods.\n\nReferences\n\n[1] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. on Neural Networks, 5(4), July 1994.\n\n[2] B. Everitt. Cluster Analysis. Heinemann Educational Books, 1974.\n\n[3] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-40, 1996.\n\n[4] K.J. Cherkauer. 
Human expert-level performance on a scientific image analysis task by a system using combined artificial neural networks. In P. Chan, editor, Working Notes of the AAAI Workshop on Integrating Multiple Learned Models, pages 15-21. 1996.\n\n[5] J. Moody, U. Levin, and S. Rehfuss. Predicting the U.S. index of industrial production. In Proceedings of the 1993 Parallel Applications in Statistics and Economics Conference, Zeist, The Netherlands. Special issue of Neural Network World, 3(6):791-794, 1993.\n\n[6] D. Opitz and J. Shavlik. Generating accurate and diverse members of a neural-network ensemble. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA, 1996.\n\n[7] Y. Raviv and N. Intrator. Bootstrapping with noise: An effective regularization technique. Connection Science, 8(3-4):355-72, 1996.\n\n[8] B. E. Rosen. Ensemble learning using decorrelated neural networks. Connection Science, 8(3-4):373-83, 1996.\n\n[9] K. Tumer and J. Ghosh. Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3-4):385-404, December 1996.\n\n[10] L. Wu and J. Moody. A smoothing regularizer for feedforward and recurrent neural networks. Neural Computation, 8(3):463-491, 1996.\n", "award": [], "sourceid": 1704, "authors": [{"given_name": "Yuansong", "family_name": "Liao", "institution": null}, {"given_name": "John", "family_name": "Moody", "institution": null}]}