{"title": "Inference for the Generalization Error", "book": "Advances in Neural Information Processing Systems", "page_first": 307, "page_last": 313, "abstract": null, "full_text": "Inference for the Generalization Error \n\nClaude Nadeau \nCIRANO \n2020, University, \nMontreal, Qc, Canada, H3A 2A5 \njcnadeau@altavista.net \n\nYoshua Bengio \nCIRANO and Dept. IRO \nUniversit\u00e9 de Montr\u00e9al \nMontreal, Qc, Canada, H3C 3J7 \nbengioy@iro.umontreal.ca \n\nAbstract \n\nTo compare learning algorithms, experimental results reported in the machine learning literature often use statistical tests of significance. Unfortunately, most of these tests do not take into account the variability due to the choice of training set. We perform a theoretical investigation of the variance of the cross-validation estimate of the generalization error that takes into account the variability due to the choice of training sets. This allows us to propose two new ways to estimate this variance. We show, via simulations, that these new statistics perform well relative to the statistics considered by Dietterich (Dietterich, 1998). \n\n1  Introduction \n\nWhen applying a learning algorithm (or comparing several algorithms), one is typically interested in estimating its generalization error. Its point estimation is straightforward through cross-validation. Providing a variance estimate of that estimate, so that hypothesis testing and/or confidence intervals are possible, is more difficult, especially, as pointed out in (Hinton et al., 1995), if one wants to take into account the variability due to the choice of the training sets (Breiman, 1996). A notable effort in that direction is Dietterich's work (Dietterich, 1998). Careful investigation of the variance to be estimated allows us to provide new variance estimates, which turn out to perform well. 
\n\nLet us first lay out the framework in which we shall work. We assume that data are available in the form $Z_1^n = \\{Z_1, \\ldots, Z_n\\}$. For example, in the case of supervised learning, $Z_i = (X_i, Y_i) \\in \\mathcal{Z} \\subseteq \\mathbb{R}^{p+q}$, where $p$ and $q$ denote the dimensions of the $X_i$'s (inputs) and the $Y_i$'s (outputs). We also assume that the $Z_i$'s are independent with $Z_i \\sim P(Z)$. Let $\\mathcal{L}(D; Z)$, where $D$ represents a subset of size $n_1 \\leq n$ taken from $Z_1^n$, be a function $\\mathcal{Z}^{n_1} \\times \\mathcal{Z} \\to \\mathbb{R}$. For instance, this function could be the loss incurred by the decision that a learning algorithm trained on $D$ makes on a new example $Z$. We are interested in estimating ${}_{n}\\mu \\equiv E[\\mathcal{L}(Z_1^n; Z_{n+1})]$, where $Z_{n+1} \\sim P(Z)$ is independent of $Z_1^n$. The subscript $n$ stands for the size of the training set ($Z_1^n$ here). The above expectation is taken over $Z_1^n$ and $Z_{n+1}$, meaning that we are interested in the performance of an algorithm rather than the performance of the specific decision function it yields on the data at hand. According to Dietterich's taxonomy (Dietterich, 1998), we deal with problems of type 5 through 8 (evaluating learning algorithms) rather than type 1 through 4 (evaluating decision functions). We call ${}_{n}\\mu$ the generalization error even though it can also represent an error difference: \n\n\u2022  Generalization error \nWe may take \n\n$\\mathcal{L}(D; Z) = \\mathcal{L}(D; (X, Y)) = Q(F(D)(X), Y)$,    (1) \n\nwhere $F(D)$ ($F(D) : \\mathbb{R}^p \\to \\mathbb{R}^q$) is the decision function obtained when training an algorithm on $D$, and $Q$ is a loss function measuring the inaccuracy of a decision. For instance, we could have $Q(\\hat{y}, y) = I[\\hat{y} \\neq y]$, where $I[\\cdot]$ is the indicator function, for classification problems and $Q(\\hat{y}, y) = \\|\\hat{y} - y\\|^2$, where $\\|\\cdot\\|$ is the Euclidean norm, for \"regression\" problems. 
In that case ${}_{n}\\mu$ is what most people call the generalization error. \n\u2022  Comparison of generalization errors \nSometimes, we are not interested in the performance of algorithms per se, but instead in how two algorithms compare with each other. In that case we may want to consider \n\n$\\mathcal{L}(D; Z) = \\mathcal{L}(D; (X, Y)) = Q(F_A(D)(X), Y) - Q(F_B(D)(X), Y)$,    (2) \n\nwhere $F_A(D)$ and $F_B(D)$ are decision functions obtained when training two algorithms ($A$ and $B$) on $D$, and $Q$ is a loss function. In this case ${}_{n}\\mu$ would be a difference of generalization errors as outlined in the previous example. \n\nThe generalization error is often estimated via some form of cross-validation. Since there are various versions of the latter, we lay out the specific form we use in this paper. \n\u2022  Let $S_j$ be a random set of $n_1$ distinct integers from $\\{1, \\ldots, n\\}$ ($n_1 < n$). Here $n_1$ represents the size of the training set and we shall let $n_2 = n - n_1$ be the size of the test set. \n\u2022  Let $S_1, \\ldots, S_J$ be independent such random sets, and let $S_j^c = \\{1, \\ldots, n\\} \\setminus S_j$ denote the complement of $S_j$. \n\u2022  Let $Z_{S_j} = \\{Z_i \\mid i \\in S_j\\}$ be the training set obtained by subsampling $Z_1^n$ according to the random index set $S_j$. The corresponding test set is $Z_{S_j^c} = \\{Z_i \\mid i \\in S_j^c\\}$. \n\u2022  Let $L(j, i) = \\mathcal{L}(Z_{S_j}; Z_i)$. According to (1), this could be the error an algorithm trained on the training set $Z_{S_j}$ makes on example $Z_i$. According to (2), this could be the difference of such errors for two different algorithms. \n\u2022  Let $\\hat{\\mu}_j = \\frac{1}{K} \\sum_{k=1}^{K} L(j, i_k^j)$, where $i_1^j, \\ldots, i_K^j$ are randomly and independently drawn from $S_j^c$. Here we draw $K$ examples from the test set $Z_{S_j^c}$ with replacement and compute the average error committed. The notation does not convey the fact that $\\hat{\\mu}_j$ depends on $K$, $n_1$ and $n_2$. \n\u2022  Let $\\hat{\\mu}_j^\\infty = \\lim_{K \\to \\infty} \\hat{\\mu}_j = \\frac{1}{n_2} \\sum_{i \\in S_j^c} L(j, i)$ denote what $\\hat{\\mu}_j$ becomes as $K$ increases without bound. 
Indeed, when sampling infinitely often from $Z_{S_j^c}$, each $Z_i$ ($i \\in S_j^c$) is chosen with relative frequency $\\frac{1}{n_2}$, yielding the usual \"average test error\". The use of $K$ is just a mathematical device to make the test examples sampled independently from $S_j^c$. \n\nThen the cross-validation estimate of the generalization error considered in this paper is \n\n${}_{n_1}^{n_2}\\hat{\\mu}_J^K = \\frac{1}{J} \\sum_{j=1}^{J} \\hat{\\mu}_j$. \n\nWe note that this is an unbiased estimator of ${}_{n_1}\\mu = E[\\mathcal{L}(Z_1^{n_1}; Z_{n+1})]$ (not the same as ${}_{n}\\mu$). \n\nThis paper is about the estimation of the variance of ${}_{n_1}^{n_2}\\hat{\\mu}_J^K$. We first study this variance theoretically in section 2, leading to two new variance estimators developed in section 3. Section 4 shows part of a simulation study we performed to see how the proposed statistics behave compared to statistics already in use. \n\n2  Analysis of $\\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^K]$ \nHere we study $\\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^K]$. This is important to understand why some inference procedures about ${}_{n_1}\\mu$ presently in use are inadequate, as we shall underline in section 4. This investigation also enables us to develop estimators of $\\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^K]$ in section 3. Before we proceed, we state the following useful lemma, proved in (Nadeau and Bengio, 1999). \n\nLemma 1  Let $U_1, \\ldots, U_k$ be random variables with common mean $\\beta$, common variance $\\delta$ and $\\mathrm{Cov}[U_i, U_j] = \\gamma$, $\\forall i \\neq j$. Let $\\pi = \\frac{\\gamma}{\\delta}$ be the correlation between $U_i$ and $U_j$ ($i \\neq j$). Let $\\bar{U} = k^{-1} \\sum_{i=1}^{k} U_i$ and $S_U^2 = \\frac{1}{k-1} \\sum_{i=1}^{k} (U_i - \\bar{U})^2$ be the sample mean and sample variance respectively. Then $E[S_U^2] = \\delta - \\gamma$ and $\\mathrm{Var}[\\bar{U}] = \\gamma + \\frac{\\delta - \\gamma}{k} = \\delta \\left( \\pi + \\frac{1 - \\pi}{k} \\right)$. \nTo study $\\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^K]$ we need to define the following covariances. \n\n\u2022  Let $\\sigma_0 = \\sigma_0(n_1) = \\mathrm{Var}[L(j, i)]$ when $i$ is randomly drawn from $S_j^c$. 
\n\u2022  Let $\\sigma_1 = \\sigma_1(n_1, n_2) = \\mathrm{Cov}[L(j, i), L(j, i')]$ for $i$ and $i'$ randomly and independently drawn from $S_j^c$. \n\u2022  Let $\\sigma_2 = \\sigma_2(n_1, n_2) = \\mathrm{Cov}[L(j, i), L(j', i')]$, with $j \\neq j'$, $i$ and $i'$ randomly and independently drawn from $S_j^c$ and $S_{j'}^c$ respectively. \n\u2022  Let $\\sigma_3 = \\sigma_3(n_1) = \\mathrm{Cov}[L(j, i), L(j, i')]$ for $i, i' \\in S_j^c$ and $i \\neq i'$. This is not the same as $\\sigma_1$. In fact, it may be shown that \n\n$\\sigma_1 = \\mathrm{Cov}[L(j, i), L(j, i')] = \\frac{\\sigma_0 + (n_2 - 1)\\sigma_3}{n_2} = \\sigma_3 + \\frac{\\sigma_0 - \\sigma_3}{n_2}$.    (3) \n\nLet us look at the mean and variance of $\\hat{\\mu}_j$ and ${}_{n_1}^{n_2}\\hat{\\mu}_J^K$. Concerning expectations, we obviously have $E[\\hat{\\mu}_j] = {}_{n_1}\\mu$ and thus $E[{}_{n_1}^{n_2}\\hat{\\mu}_J^K] = {}_{n_1}\\mu$. From Lemma 1, we have $\\mathrm{Var}[\\hat{\\mu}_j] = \\sigma_1 + \\frac{\\sigma_0 - \\sigma_1}{K}$, which implies \n\n$\\mathrm{Var}[\\hat{\\mu}_j^\\infty] = \\mathrm{Var}[\\lim_{K \\to \\infty} \\hat{\\mu}_j] = \\lim_{K \\to \\infty} \\mathrm{Var}[\\hat{\\mu}_j] = \\sigma_1$. \n\nIt can also be shown that $\\mathrm{Cov}[\\hat{\\mu}_j, \\hat{\\mu}_{j'}] = \\sigma_2$, $j \\neq j'$, and therefore (using Lemma 1) \n\n$\\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^K] = \\sigma_2 + \\frac{\\mathrm{Var}[\\hat{\\mu}_j] - \\sigma_2}{J} = \\sigma_2 + \\frac{\\sigma_1 + \\frac{\\sigma_0 - \\sigma_1}{K} - \\sigma_2}{J}$.    (4) \n\nWe shall often encounter $\\sigma_0, \\sigma_1, \\sigma_2, \\sigma_3$ in the future, so some knowledge about those quantities is valuable. Here's what we can say about them. \nProposition 1  For given $n_1$ and $n_2$, we have $0 \\leq \\sigma_2 \\leq \\sigma_1 \\leq \\sigma_0$ and $0 \\leq \\sigma_3 \\leq \\sigma_1$. \nProof  See (Nadeau and Bengio, 1999). \nA natural question about the estimator ${}_{n_1}^{n_2}\\hat{\\mu}_J^K$ is how $n_1$, $n_2$, $K$ and $J$ affect its variance. \nProposition 2  The variance of ${}_{n_1}^{n_2}\\hat{\\mu}_J^K$ is non-increasing in $J$, $K$ and $n_2$. \nProof  See (Nadeau and Bengio, 1999). \n\nClearly, increasing $K$ leads to smaller variance because the noise introduced by sampling with replacement from the test set disappears when this is done over and over again. Also, averaging over many train/test splits (increasing $J$) improves the estimation of ${}_{n_1}\\mu$. 
Finally, all things equal elsewhere ($n_1$ fixed among other things), the larger the size of the test sets, the better the estimation of ${}_{n_1}\\mu$. \nThe behavior of $\\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^K]$ with respect to $n_1$ is unclear, but we conjecture that in most situations it should decrease in $n_1$. Our argument goes like this. The variability in ${}_{n_1}^{n_2}\\hat{\\mu}_J^K$ comes from two sources: sampling decision rules (training process) and sampling testing examples. Holding $n_2$, $J$ and $K$ fixed freezes the second source of variation as it solely depends on those three quantities, not $n_1$. The problem to solve becomes: how does $n_1$ affect the first source of variation? It is not unreasonable to say that the decision function yielded by a learning algorithm is less variable when the training set is large. We conclude that the first source of variation, and thus the total variation (that is $\\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^K]$), is decreasing in $n_1$. We advocate the use of the estimator \n\n${}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty = \\frac{1}{J} \\sum_{j=1}^{J} \\hat{\\mu}_j^\\infty$    (5) \n\nas it is easier to compute and has smaller variance than ${}_{n_1}^{n_2}\\hat{\\mu}_J^K$ ($J$, $n_1$, $n_2$ held constant). \n\n$\\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty] = \\lim_{K \\to \\infty} \\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^K] = \\sigma_2 + \\frac{\\sigma_1 - \\sigma_2}{J} = \\sigma_1 \\left( \\rho + \\frac{1 - \\rho}{J} \\right)$,    (6) \n\nwhere $\\rho = \\frac{\\sigma_2}{\\sigma_1} = \\mathrm{Corr}[\\hat{\\mu}_j^\\infty, \\hat{\\mu}_{j'}^\\infty]$. \n\n3  Estimation of $\\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty]$ \nWe are interested in estimating ${}_{n_1}^{n_2}\\sigma_J^2 \\equiv \\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty]$, where ${}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty$ is as defined in (5). We provide two different estimators of $\\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty]$. The first is simple but may have a positive or negative bias for the actual variance. The second is meant to be conservative, that is, if our conjecture of the previous section is correct, its expected value exceeds the actual variance. \n1st Method:  Corrected Resampled t-Test.  Let us recall that ${}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty = \\frac{1}{J} \\sum_{j=1}^{J} \\hat{\\mu}_j^\\infty$. Let $\\hat{\\sigma}^2$ be the sample variance of the $\\hat{\\mu}_j^\\infty$'s. 
According to Lemma 1, \n\n$E[\\hat{\\sigma}^2] = \\sigma_1 (1 - \\rho) = \\frac{1 - \\rho}{\\rho + \\frac{1 - \\rho}{J}} \\sigma_1 \\left( \\rho + \\frac{1 - \\rho}{J} \\right) = \\frac{\\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty]}{\\frac{1}{J} + \\frac{\\rho}{1 - \\rho}}$,    (7) \n\nso that $\\left( \\frac{1}{J} + \\frac{\\rho}{1 - \\rho} \\right) \\hat{\\sigma}^2$ is an unbiased estimator of $\\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty]$. The only problem is that $\\rho = \\rho(n_1, n_2) = \\frac{\\sigma_2(n_1, n_2)}{\\sigma_1(n_1, n_2)}$, the correlation between the $\\hat{\\mu}_j^\\infty$'s, is unknown and difficult to estimate. We use a naive surrogate for $\\rho$ as follows. Let us recall that $\\hat{\\mu}_j^\\infty = \\frac{1}{n_2} \\sum_{i \\in S_j^c} \\mathcal{L}(Z_{S_j}; Z_i)$. For the purpose of building our estimator, let us make the approximation that $\\mathcal{L}(Z_{S_j}; Z_i)$ depends only on $Z_i$ and $n_1$. Then it is not hard to show (see (Nadeau and Bengio, 1999)) that the correlation between the $\\hat{\\mu}_j^\\infty$'s becomes $\\frac{n_2}{n_1 + n_2}$. Therefore our first estimator of $\\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty]$ is $\\left( \\frac{1}{J} + \\frac{\\rho_0}{1 - \\rho_0} \\right) \\hat{\\sigma}^2$ where $\\rho_0 = \\rho_0(n_1, n_2) = \\frac{n_2}{n_1 + n_2}$, that is $\\left( \\frac{1}{J} + \\frac{n_2}{n_1} \\right) \\hat{\\sigma}^2$. This will tend to overestimate or underestimate $\\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty]$ according to whether $\\rho_0 > \\rho$ or $\\rho_0 < \\rho$. Note that this first method requires essentially no more computation than that already performed to estimate the generalization error by cross-validation. \n2nd Method:  Conservative Z.  Our second method aims at overestimating $\\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty]$, which will lead to conservative inference, that is tests of hypothesis with actual size less than the nominal size. This is important because techniques currently in use have the opposite defect, that is they tend to be liberal (tests with actual size exceeding the nominal size), which is typically regarded as less desirable than conservative tests. \n\nEstimating ${}_{n_1}^{n_2}\\sigma_J^2$ unbiasedly is not trivial as hinted above. However we may estimate unbiasedly ${}_{n_1'}^{n_2}\\sigma_J^2 = \\mathrm{Var}[{}_{n_1'}^{n_2}\\hat{\\mu}_J^\\infty]$ where $n_1' = \\lfloor \\frac{n}{2} \\rfloor - n_2 < n_1$. Let ${}_{n_1'}^{n_2}\\hat{\\sigma}_J^2$ be the unbiased estimator, developed below, of the above variance. We argued in the previous section that $\\mathrm{Var}[{}_{n_1'}^{n_2}\\hat{\\mu}_J^\\infty] \\geq \\mathrm{Var}[{}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty]$. 
Therefore ${}_{n_1'}^{n_2}\\hat{\\sigma}_J^2$ will tend to overestimate ${}_{n_1}^{n_2}\\sigma_J^2$, that is $E[{}_{n_1'}^{n_2}\\hat{\\sigma}_J^2] = {}_{n_1'}^{n_2}\\sigma_J^2 \\geq {}_{n_1}^{n_2}\\sigma_J^2$. \n\nHere's how we may estimate ${}_{n_1'}^{n_2}\\sigma_J^2$ without bias. For simplicity, assume that $n$ is even. We have to randomly split our data $Z_1^n$ into two distinct data sets, $D_1$ and $D_1^c$, of size $\\frac{n}{2}$ each. Let $\\hat{\\mu}_{(1)}$ be the statistic of interest (${}_{n_1'}^{n_2}\\hat{\\mu}_J^\\infty$) computed on $D_1$. This involves, among other things, drawing $J$ train/test subsets from $D_1$. Let $\\hat{\\mu}_{(1)}^c$ be the statistic computed on $D_1^c$. Then $\\hat{\\mu}_{(1)}$ and $\\hat{\\mu}_{(1)}^c$ are independent since $D_1$ and $D_1^c$ are independent data sets, so that $\\left( \\hat{\\mu}_{(1)} - \\frac{\\hat{\\mu}_{(1)} + \\hat{\\mu}_{(1)}^c}{2} \\right)^2 + \\left( \\hat{\\mu}_{(1)}^c - \\frac{\\hat{\\mu}_{(1)} + \\hat{\\mu}_{(1)}^c}{2} \\right)^2 = \\frac{1}{2} (\\hat{\\mu}_{(1)} - \\hat{\\mu}_{(1)}^c)^2$ is an unbiased estimate of ${}_{n_1'}^{n_2}\\sigma_J^2$. This splitting process may be repeated $M$ times. This yields $D_m$ and $D_m^c$, with $D_m \\cup D_m^c = Z_1^n$, $D_m \\cap D_m^c = \\emptyset$ for $m = 1, \\ldots, M$. Each split yields a pair $(\\hat{\\mu}_{(m)}, \\hat{\\mu}_{(m)}^c)$ such that $\\frac{1}{2} (\\hat{\\mu}_{(m)} - \\hat{\\mu}_{(m)}^c)^2$ is unbiased for ${}_{n_1'}^{n_2}\\sigma_J^2$. This allows us to use the following unbiased estimator of ${}_{n_1'}^{n_2}\\sigma_J^2$: \n\n${}_{n_1'}^{n_2}\\hat{\\sigma}_J^2 = \\frac{1}{2M} \\sum_{m=1}^{M} (\\hat{\\mu}_{(m)} - \\hat{\\mu}_{(m)}^c)^2$.    (8) \n\nNote that, according to Lemma 1, $\\mathrm{Var}[{}_{n_1'}^{n_2}\\hat{\\sigma}_J^2] = \\frac{1}{4} \\mathrm{Var}[(\\hat{\\mu}_{(m)} - \\hat{\\mu}_{(m)}^c)^2] \\left( r + \\frac{1 - r}{M} \\right)$ with $r = \\mathrm{Corr}[(\\hat{\\mu}_{(i)} - \\hat{\\mu}_{(i)}^c)^2, (\\hat{\\mu}_{(j)} - \\hat{\\mu}_{(j)}^c)^2]$ for $i \\neq j$. Simulations suggest that $r$ is usually close to 0, so that the above variance decreases roughly like $\\frac{1}{M}$ for $M$ up to 20, say. The second method is therefore a bit more computation intensive, since it requires performing cross-validation $M$ times, but it is expected to be conservative. \n4  Simulation study \nWe consider five different test statistics for the hypothesis $H_0 : {}_{n_1}\\mu = \\mu_0$. 
The first three are methods already in use in the machine learning community; the last two are the new methods we put forward. They all have the following form \n\nreject $H_0$ if $\\left| \\frac{\\hat{\\mu} - \\mu_0}{\\sqrt{\\hat{\\sigma}^2}} \\right| > c$.    (9) \n\nTable 1 describes what they are (see footnote 1). We performed a simulation study to investigate the size (probability of rejecting the null hypothesis when it is true) and the power (probability of rejecting the null hypothesis when it is false) of the five test statistics shown in Table 1. We consider the problem of estimating generalization errors in the Letter Recognition classification problem (available from www.ics.uci.edu/pub/machine-learning-databases). The learning algorithms are \n\n1.  Classification tree \nWe used the function tree in Splus version 4.5 for Windows. The default arguments were used and no pruning was performed. The function predict with option type=\"class\" was used to retrieve the decision function of the tree: $F_A(Z_S)(X)$. Here the classification loss function $L_A(j, i) = I[F_A(Z_{S_j})(X_i) \\neq Y_i]$ is equal to 1 whenever this algorithm misclassifies example $i$ when the training set is $S_j$; otherwise it is 0. \n\n2.  First nearest neighbor \nWe apply the first nearest neighbor rule with a distorted distance metric to bring down the performance of this algorithm to the level of the classification tree (as in (Dietterich, 1998)). We have $L_B(j, i)$ equal to 1 whenever this algorithm misclassifies example $i$ when the training set is $S_j$; otherwise it is 0. \n\nIn addition to inference about the generalization errors ${}_{n_1}\\mu_A$ and ${}_{n_1}\\mu_B$ associated with those two algorithms, we also consider inference about ${}_{n_1}\\mu_{A-B} = {}_{n_1}\\mu_A - {}_{n_1}\\mu_B = E[L_{A-B}(j, i)]$ where $L_{A-B}(j, i) = L_A(j, i) - L_B(j, i)$. 
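
As a rough illustration of rejection rule (9), the sketch below computes the resampled t and corrected resampled t variance estimates from J lists of n2 test losses. It is a minimal sketch under stated assumptions, not the authors' code: the function names resampled_variances and t_statistic are hypothetical, and the 0/1 losses are synthetic placeholders rather than results from the Letter Recognition experiment.

```python
import math
import random
from statistics import mean, variance


def resampled_variances(losses_per_split, n1, n2):
    # losses_per_split[j] lists the losses L(j, i) for all i in the j-th test set.
    mu_j = [mean(losses) for losses in losses_per_split]  # per-split test errors
    J = len(mu_j)
    mu_hat = mean(mu_j)   # cross-validation estimate of the generalization error
    s2 = variance(mu_j)   # sample variance of the per-split errors (needs J >= 2)
    var_naive = s2 / J                        # resampled t: ignores correlation
    var_corrected = (1.0 / J + n2 / n1) * s2  # corrected resampled t
    return mu_hat, var_naive, var_corrected


def t_statistic(mu_hat, var_hat, mu0):
    # Ratio inside the absolute value of rule (9); fails if var_hat is zero.
    return (mu_hat - mu0) / math.sqrt(var_hat)


# Synthetic 0/1 losses stand in for real train/test runs (illustration only).
rng = random.Random(0)
n1, n2, J = 270, 30, 15
losses = [[1.0 if rng.random() < 0.3 else 0.0 for _ in range(n2)]
          for _ in range(J)]

mu_hat, var_naive, var_corrected = resampled_variances(losses, n1, n2)
t_naive = t_statistic(mu_hat, var_naive, mu0=0.3)
t_corrected = t_statistic(mu_hat, var_corrected, mu0=0.3)
```

Both estimates share the same sample variance, so the corrected variance is always (1 + J n2/n1) times the resampled t variance; with J = 15, n1 = 270 and n2 = 30 this factor is 1 + 15/9, about 2.67. Because the synthetic splits are drawn independently, they only exercise the arithmetic; it is the overlap between real training sets that makes the correction matter. In practice the absolute value of the statistic would be compared with the Student quantile with J - 1 degrees of freedom.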
\nFootnote 1: When comparing two classifiers, (Nadeau and Bengio, 1999) show that the t-test is closely related to McNemar's test described in (Dietterich, 1998). The 5 x 2 cv procedure was developed in (Dietterich, 1998) with solely the comparison of classifiers in mind but may trivially be extended to other problems as shown in (Nadeau and Bengio, 1999). \n\nName | $\\hat{\\mu}$ | $\\hat{\\sigma}^2$ | $c$ | $\\mathrm{Var}[\\hat{\\mu}] / E[\\hat{\\sigma}^2]$ \nt-test (McNemar) | ${}_{n_1}^{n_2}\\hat{\\mu}_1^\\infty$ | $\\frac{1}{n_2} SV(L(1, i))$ | $t_{n_2 - 1, 1 - \\alpha/2}$ | $\\frac{n_2 \\sigma_3 + (\\sigma_0 - \\sigma_3)}{\\sigma_0 - \\sigma_3} > 1$ \nresampled t | ${}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty$ | $\\frac{1}{J} \\hat{\\sigma}^2$ | $t_{J - 1, 1 - \\alpha/2}$ | $1 + J \\frac{\\rho}{1 - \\rho} > 1$ \nDietterich's 5 x 2 cv | ${}_{n/2}^{n/2}\\hat{\\mu}_1^\\infty$ | see (Dietterich, 1998) | $t_{5, 1 - \\alpha/2}$ | ? \n1: conservative Z | ${}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty$ | ${}_{n_1'}^{n_2}\\hat{\\sigma}_J^2$ | $z_{1 - \\alpha/2}$ | $\\frac{{}_{n_1}^{n_2}\\sigma_J^2}{{}_{n_1'}^{n_2}\\sigma_J^2} < 1$ \n2: corr. resampled t | ${}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty$ | $\\left( \\frac{1}{J} + \\frac{n_2}{n_1} \\right) \\hat{\\sigma}^2$ | $t_{J - 1, 1 - \\alpha/2}$ | $\\frac{1 + J \\frac{\\rho}{1 - \\rho}}{1 + J \\frac{n_2}{n_1}}$ \n\nTable 1: Description of five test statistics in relation to the rejection criteria shown in (9). $z_p$ and $t_{k,p}$ refer to the quantile $p$ of the $N(0, 1)$ and Student $t_k$ distribution respectively. $\\hat{\\sigma}^2$ is as defined above (7) and $SV(L(1, i))$ is the sample variance of the $L(1, i)$'s involved in ${}_{n_1}^{n_2}\\hat{\\mu}_1^\\infty$. The $\\mathrm{Var}[\\hat{\\mu}] / E[\\hat{\\sigma}^2]$ ratio (which comes from proper application of Lemma 1, except for Dietterich's 5 x 2 cv and the Conservative Z) indicates if a test will tend to be conservative (ratio less than 1) or liberal (ratio greater than 1). \n\nWe sample, without replacement, 300 examples from the 20000 examples available in the Letter Recognition data base. Repeating this 500 times, we obtain 500 sets of data of the form $\\{Z_1, \\ldots, Z_{300}\\}$. Once a data set $Z_1^{300} = \\{Z_1, \\ldots, Z_{300}\\}$ has been generated, we may perform hypothesis testing based on the statistics shown in Table 1. A difficulty arises however. For a given $n$ ($n = 300$ here), those methods don't aim at inference for the same generalization error. 
For instance, Dietterich's 5 x 2 cv test aims at ${}_{n/2}\\mu$, while the others aim at ${}_{n_1}\\mu$ where $n_1$ would usually be different for different methods (e.g. $n_1 = \\frac{2n}{3}$ for the t-test statistic, and $n_1 = \\frac{9n}{10}$ for the resampled t-test statistic). In order to compare the different techniques, for a given $n$, we shall always aim at ${}_{n/2}\\mu$, i.e. use $n_1 = \\frac{n}{2}$. However, for statistics involving ${}_{n_1}^{n_2}\\hat{\\mu}_J^\\infty$ with $J > 1$, normal usage would call for $n_1$ to be 5 or 10 times larger than $n_2$, not $n_1 = n_2 = \\frac{n}{2}$. Therefore, for those statistics, we also use $n_1 = \\frac{n}{2}$ and $n_2 = \\frac{n}{10}$ so that $\\frac{n_1}{n_2} = 5$. To obtain ${}_{n/2}^{n/10}\\hat{\\mu}_J^\\infty$ we simply throw out 40% of the data. For the conservative Z, we do the variance calculation as we would normally do ($n_2 = \\frac{n}{10}$ for instance) to obtain ${}_{n/2 - n_2}^{n_2}\\hat{\\sigma}_J^2 = {}_{2n/5}^{n/10}\\hat{\\sigma}_J^2$. However, in the numerator we compute both ${}_{n/2}^{n/2}\\hat{\\mu}_J^\\infty$ and ${}_{n/2}^{n/10}\\hat{\\mu}_J^\\infty$ instead of ${}_{n - n_2}^{n_2}\\hat{\\mu}_J^\\infty$, as explained above. Note that the rationale that led to the conservative Z statistics is maintained, that is ${}_{2n/5}^{n/10}\\hat{\\sigma}_J^2$ overestimates both $\\mathrm{Var}[{}_{n/2}^{n/10}\\hat{\\mu}_J^\\infty]$ and $\\mathrm{Var}[{}_{n/2}^{n/2}\\hat{\\mu}_J^\\infty]$: \n\n$E[{}_{2n/5}^{n/10}\\hat{\\sigma}_J^2] = \\mathrm{Var}[{}_{2n/5}^{n/10}\\hat{\\mu}_J^\\infty] \\geq \\mathrm{Var}[{}_{n/2}^{n/10}\\hat{\\mu}_J^\\infty] \\geq \\mathrm{Var}[{}_{n/2}^{n/2}\\hat{\\mu}_J^\\infty]$. \n\nFigure 1 shows the estimated power of different statistics when we are interested in $\\mu_A$ and $\\mu_{A-B}$. We estimate powers by computing the proportion of rejections of $H_0$. We see that tests based on the t-test or resampled t-test are liberal: they reject the null hypothesis with probability greater than the prescribed $\\alpha = 0.1$ when the null hypothesis is true. The other tests appear to have sizes that are either not significantly larger than 10% or barely so. Note that Dietterich's 5 x 2 cv is not very powerful (its curve has the lowest power at the extreme values of $\\mu_0$). 
To make a fair comparison of power between two curves, one should mentally align the size (bottom of the curve) of these two curves. Indeed, even the resampled t-test and the conservative Z that throw out 40% of the data are more powerful. That is of course due to the fact that the 5 x 2 cv method uses $J = 1$ instead of $J = 15$. \n\nFigure 1: Powers of the tests about $H_0 : \\mu_A = \\mu_0$ (left panel) and $H_0 : \\mu_{A-B} = \\mu_0$ (right panel) at level $\\alpha = 0.1$ for varying $\\mu_0$. The dotted vertical lines correspond to the 95% confidence interval for the actual $\\mu_A$ or $\\mu_{A-B}$; therefore that is where the actual size of the tests may be read. The solid horizontal line displays the nominal size of the tests, i.e. 10%. Estimated probabilities of rejection lying above the dotted horizontal line are significantly greater than 10% (at significance level 5%). Solid curves correspond to either the resampled t-test or the corrected resampled t-test. The resampled t-test is the one that has ridiculously high size. Curves with circled points are the versions of the ordinary and corrected resampled t-test and conservative Z with 40% of the data thrown away. Where it matters, $J = 15$ and $M = 10$ were used. \n\nThis is just a glimpse of a much larger simulation study. When studying the corrected resampled t-test and the conservative Z in their natural habitat ($n_1 = \\frac{9n}{10}$ and $n_2 = \\frac{n}{10}$), we see that they are usually either right on the money in terms of size, or slightly conservative. Their powers appear equivalent. The simulations were performed with $J$ up to 25 and $M$ up to 20. We found that taking $J$ greater than 15 did not improve much the power of the statistics. Taking $M = 20$ instead of $M = 10$ does not lead to any noticeable difference in the distribution of the conservative Z. Taking $M = 5$ makes the statistic slightly less conservative. 
See (Nadeau and Bengio, 1999) for further details. \n5  Conclusion \nThis paper addresses a very important practical issue in the empirical validation of new machine learning algorithms: how to decide whether one algorithm is significantly better than another one. We argue that it is important to take into account the variability due to the choice of training set. (Dietterich, 1998) had already proposed a statistic for this purpose. We have constructed two new variance estimates of the cross-validation estimator of the generalization error. These enable one to construct tests of hypothesis and confidence intervals that are seldom liberal. Furthermore, tests based on these have powers that are unmatched by any known techniques with comparable size. One of them (corrected resampled t-test) can be computed without any additional cost to the usual K-fold cross-validation estimates. The other one (conservative Z) requires $M$ times more computation, where we found sufficiently good values of $M$ to be between 5 and 10. \n\nReferences \nBreiman, L. (1996). Heuristics of instability and stabilization in model selection. Annals of Statistics, 24 (6):2350-2383. \n\nDietterich, T. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10 (7):1895-1924. \n\nHinton, G., Neal, R., Tibshirani, R., and DELVE team members (1995). Assessing learning procedures using DELVE. Technical report, University of Toronto, Department of Computer Science. \n\nNadeau, C. and Bengio, Y. (1999). Inference for the generalisation error. Technical Report in preparation, CIRANO. \n\n", "award": [], "sourceid": 1661, "authors": [{"given_name": "Claude", "family_name": "Nadeau", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": null}]}