{"title": "Principled Architecture Selection for Neural Networks: Application to Corporate Bond Rating Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 683, "page_last": 690, "abstract": null, "full_text": "Principled Architecture Selection \n\nfor Neural Networks: \n\nApplication to Corporate Bond Rating Prediction \n\nJohn Moody \n\nDepartment of Computer Science \n\nYale University \n\nP. O. Box 2158 Yale Station \n\nNew Haven, CT 06520 \n\nJoachim U tans \n\nDepartment of Electrical Engineering \n\nYale University \n\nP. O. Box 2157 Yale Station \n\nNew Haven, CT 06520 \n\nAbstract \n\nThe notion of generalization ability can be defined precisely as the pre(cid:173)\ndiction risk, the expected performance of an estimator in predicting new \nobservations. In this paper, we propose the prediction risk as a measure \nof the generalization ability of multi-layer perceptron networks and use it \nto select an optimal network architecture from a set of possible architec(cid:173)\ntures. We also propose a heuristic search strategy to explore the space of \npossible architectures. The prediction risk is estimated from the available \ndata; here we estimate the prediction risk by v-fold cross-validation and \nby asymptotic approximations of generalized cross-validation or Akaike's \nfinal prediction error. We apply the technique to the problem of predicting \ncorporate bond ratings. This problem is very attractive as a case study, \nsince it is characterized by the limited availability of the data and by the \nlack of a complete a priori model which could be used to impose a structure \nto the network architecture. \n\n1 Generalization and Prediction Risk \n\nThe notion of generalization ability can be defined precisely as the prediction risk, \nthe expected performance of an estimator is predicting new observations. Consider \na set of observations D = {(Xj, tj); j = 1 ... 
N} that are assumed to be generated \n683 \n\n\f684 \n\nMoody and Urans \n\nas \n\n( 1) \nwhere J.l(x) is an unknown function, the inputs Xj are drawn independently with \nan unknown stationary probability density function p(x), the fj are independent \nrandom variables with zero mean (l = 0) and variance (j~, and the tj are the \nobserved target values. \nThe learning or regression problem is to find an estimate jt)..(x; D) of J.l(x) given \nthe data set D from a class of predictors or models J.l)..(x) indexed by 'x. In general, \n,x E A = (5, A, W), where 5 C X denotes a chosen subset of the set of available \ninput variables X, A is a selected architecture within a class of model architectures \nA, and Ware the adjustable parameters (weights) of architecture A. \n\nThe prediction risk P(,x) is defined as the expected performance on future data and \ncan be approximated by the expected performance on a finite test set: \n\n(2) \n\nwhere (xi, ti) are new observations that were not used in constructing jt)..(x). In \nwhat follows, we shall use P(,x) as a measure of the generalization ability of a model. \nSee [4] and [6] for more detailed presentations. \n\n2 Estimates of Prediction Risk \n\nSince we cannot directly calculate the prediction risk P).., we have to estimate it \nfrom the available data D. The standard method based on test-set validation is \nnot advisable when the data set is small. In this paper we consider such a case; \nthe prediction of corporate bond ratings from a database of only 196 firms. Cross(cid:173)\nvalidation (CV) is a sample re-use method for estimating prediction risk; it makes \nmaximally efficient use of the available data. Other methods are the generalized \ncross-validation (GCV) and the final prediction error (FPE) criteria, which combine \nthe average training squared error ME with a measure of the model complexity. \nThese will be discussed in the next sections. 
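As a concrete illustration of equation (2), the test-set estimate is a single average of squared prediction errors over held-out observations. A minimal Python sketch, in which the predictor mu_hat and the test points are illustrative stand-ins rather than models from this paper:

```python
# Test-set estimate of the prediction risk P(lambda), eq. (2):
# average squared error of a fitted predictor on observations
# that were NOT used in constructing it.

def prediction_risk(mu_hat, test_set):
    """Average squared error of predictor mu_hat on held-out (x, t) pairs."""
    return sum((t - mu_hat(x)) ** 2 for x, t in test_set) / len(test_set)

# Toy example: a linear "model" evaluated on three noisy test points.
mu_hat = lambda x: 2.0 * x
test_set = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2)]
risk = prediction_risk(mu_hat, test_set)
```

An estimate of this form is only reliable when a sizable held-out set can be spared, which is precisely the difficulty for small data sets that motivates cross-validation.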
\n\n2.1 Cross-Validation \n\nCross-validation is a method that makes minimal assumptions on the statistics of the data. The idea of cross-validation can be traced back to Mosteller and Tukey [7]. For reviews, see Stone [8, 9], Geisser [5] and Eubank [4]. \n\nLet \\hat{\\mu}_{\\lambda(j)}(x) be a predictor trained using all observations except (x_j, t_j), such that \\hat{\\mu}_{\\lambda(j)}(x) minimizes \n\nASE_j = \\frac{1}{N-1} \\sum_{k \\neq j} \\left( t_k - \\hat{\\mu}_{\\lambda(j)}(x_k) \\right)^2. \n\nThen, an estimator for the prediction risk P(\\lambda) is the cross-validation average squared error \n\nCV(\\lambda) = \\frac{1}{N} \\sum_{j=1}^{N} \\left( t_j - \\hat{\\mu}_{\\lambda(j)}(x_j) \\right)^2. (3) \n\nThis form of CV(\\lambda) is known as leave-one-out cross-validation. However, CV(\\lambda) in (3) is expensive to compute for neural network models; it involves constructing N networks, each trained with N - 1 patterns. For the work described in this paper we therefore use a variation of the method, v-fold cross-validation, that was introduced by Geisser [5] and Wahba et al. [12]. Instead of leaving out only one observation for the computation of the sum in (3), we delete larger subsets of D. \n\nLet the data D be divided into v randomly selected disjoint subsets P_j of roughly equal size: \\cup_{j=1}^{v} P_j = D and \\forall i \\neq j, P_i \\cap P_j = \\emptyset. Let N_j denote the number of observations in subset P_j. Let \\hat{\\mu}_{\\lambda(P_j)}(x) be an estimator trained on all data except for (x, t) \\in P_j. Then, the cross-validation average squared error for subset j is defined as \n\nCV_{P_j}(\\lambda) = \\frac{1}{N_j} \\sum_{(x_k, t_k) \\in P_j} \\left( t_k - \\hat{\\mu}_{\\lambda(P_j)}(x_k) \\right)^2, \n\nand \n\nCV_P(\\lambda) = \\frac{1}{v} \\sum_{j} CV_{P_j}(\\lambda). (4) \n\nTypical choices for v are 5 and 10. Note that leave-one-out CV is obtained in the limit v = N. 
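The v-fold procedure of equation (4) can be sketched as follows. Here `fit` is a generic training routine returning a predictor; the mean-predictor used in the example is a toy stand-in for network training, and the data are synthetic:

```python
# Sketch of v-fold cross-validation, eq. (4): partition the data into
# v random disjoint subsets, train on all-but-one subset, and average
# the held-out squared errors.
import random

def v_fold_cv(data, fit, v=5, seed=0):
    """CV_P(lambda): average held-out squared error over v folds."""
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    folds = [data[j::v] for j in range(v)]  # disjoint, roughly equal subsets
    cv_terms = []
    for j in range(v):
        train = [p for i, fold in enumerate(folds) if i != j for p in fold]
        mu_hat = fit(train)  # predictor trained on all data except fold j
        held_out = folds[j]
        cv_terms.append(
            sum((t - mu_hat(x)) ** 2 for x, t in held_out) / len(held_out))
    return sum(cv_terms) / v

def fit_mean(train):
    """Toy 'training': predict the mean target regardless of input."""
    m = sum(t for _, t in train) / len(train)
    return lambda x: m

data = [(float(i), float(i % 3)) for i in range(30)]
cv = v_fold_cv(data, fit_mean, v=5)
```

Leave-one-out CV is recovered by calling `v_fold_cv(data, fit, v=len(data))`, at the cost of one retraining per observation.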
\n\n2.2 Generalized Cross-Validation and Final Prediction Error \n\nFor linear models, two useful criteria for selecting a model architecture are generalized cross-validation (GCV) (Wahba [11]) and Akaike's final prediction error (FPE) [1]: \n\nGCV(\\lambda) = ASE(\\lambda) \\frac{1}{\\left(1 - \\frac{S(\\lambda)}{N}\\right)^2}, \\qquad FPE(\\lambda) = ASE(\\lambda) \\frac{1 + \\frac{S(\\lambda)}{N}}{1 - \\frac{S(\\lambda)}{N}}. \n\nS(\\lambda) denotes the number of weights of model \\lambda. See [4] for a tutorial treatment. Note that although they are slightly different for small sample sizes, they are asymptotically equivalent for large N: \n\n\\hat{P}(\\lambda) \\equiv ASE(\\lambda) \\left( 1 + 2\\frac{S(\\lambda)}{N} \\right) \\approx GCV(\\lambda) \\approx FPE(\\lambda). (5) \n\nWe shall use this asymptotic estimate for the prediction risk in our analysis of the bond rating models. \n\nIt has been shown by Moody [6] that FPE, and therefore \\hat{P}(\\lambda), is an unbiased estimate of the prediction risk for the neural network models considered here, provided that (1) the noise \\epsilon_j in the observed targets t_j is independent and identically distributed, (2) weight decay is not used, and (3) the resulting model is unbiased. (In practice, however, essentially all neural network fits to data will be biased; see Moody [6].) FPE is a special case of Barron's PSE [2] and Moody's GPE [6]. Although FPE and \\hat{P}(\\lambda) are unbiased only under the above assumptions, they are much cheaper to compute than CV_P since no retraining is required. \n\n3 A Case Study: Prediction of Corporate Bond Ratings \n\nA bond is a debt security which constitutes a promise by the issuing firm to pay a given rate of interest on the original issue price and to redeem the bond at face value at maturity. Bonds are rated according to the default risk of the issuing firm by independent rating agencies such as Standard & Poor's (S&P) and Moody's Investors Service. The firm is in default if it is not able to make the promised interest payments. \n\nRepresentation of S&P Bond Ratings \n\nTable 1: Key to S&P bond ratings. We only used the range from 'AAA' ('very low default risk') to 'CCC' ('very high default risk'). (Note that AAA- is not a standard category; its inclusion was suggested to us by a Wall Street analyst.) Bonds with rating BBB- or better are \"investment grade\", while \"junk bonds\" have ratings BB+ or below. For our output representation, we assigned an integer number to each rating as shown. \n\nS&P and Moody's determine the rating from various financial variables and possibly other information, but the exact set of variables is unknown. It is commonly believed that the rating is at least to some degree judged on the basis of subjective factors and on variables not directly related to a particular firm. In addition, the method used for assigning the rating based on the input variables is unknown. The problem we consider here is to predict the S&P rating of a bond based on fundamental financial information about the issuer which is publicly available. Since the rating agencies update their bond ratings infrequently, there is considerable value in being able to anticipate rating changes before they are announced. A predictive model which maps fundamental financial factors onto an estimated rating can accomplish this. \n\nThe input data for our model consist of 10 financial ratios reflecting the fundamental characteristics of the firms. The database was prepared for us by analysts at a major financial institution. Since we did not attempt to include all information in the input variables that could possibly be related to a firm's bond rating (e.g. all fundamental or technical financial factors, or qualitative information such as quality of management), we can only attempt to approximate the S&P rating. \n\n3.1 A Linear Bond Rating Predictor \n\nFor comparison with the neural network models, we computed a standard linear regression model. 
All input variables were used to predict the rating, which is represented by a number in [0, 1]. The rating varies continuously from one category to the next higher or next lower one; this \"smoothness\" is captured in the single-output representation and should make the task easier. To interpret the network response, the output was rescaled from [0, 1] to [2, 19] and rounded to the nearest integer; 19 corresponds to a rating of 'AAA' and 2 to 'CCC' and below (see Table 1). The input variables were normalized to the interval [0, 1] since the original financial ratios differed widely in magnitude. The model predicted the rating of 21.4% of the firms correctly; for 37.2% the error was one notch and for 21.9% two notches (thus predicting 80.5% of the data within two notches of the correct target). The RMS training error was 1.93 and the estimate of the prediction risk was \\hat{P} = 2.038. \n\nFigure 1: Cross-validation error CV_P(\\lambda) and \\hat{P}(\\lambda) versus number of hidden units. \n\n3.2 Beyond Linear Regression: Prediction by Two-Layer Perceptrons \n\nThe class of models we are considering as predictors are two-layer perceptron networks with I_\\lambda input variables, H_\\lambda internal units, and a single output unit, having the form \n\n\\hat{\\mu}_\\lambda(x) = f\\left( v_0 + \\sum_{\\alpha=1}^{H_\\lambda} v_\\alpha \\, g\\left( w_{\\alpha 0} + \\sum_{\\beta=1}^{I_\\lambda} w_{\\alpha\\beta} x_\\beta \\right) \\right). (6) \n\nThe hidden units have a sigmoidal transfer function, while our single output unit uses a piecewise-linear function. 
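Equation (6) can be written out directly as a forward pass. The sketch below uses the logistic sigmoid for g and, as a simplifying assumption, the identity in place of the paper's piecewise-linear output function f; all weight values are illustrative:

```python
# Forward pass of the two-layer perceptron of eq. (6):
# mu(x) = f( v0 + sum_a v_a * g( w_a0 + sum_b w_ab * x_b ) )
import math

def sigmoid(z):
    """Logistic sigmoid, the hidden-unit transfer function g."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp(x, v0, v, w0, w, f=lambda z: z):
    """x: I_lambda inputs; v0, v: output weights; w0, w: hidden weights.
    f defaults to the identity (a stand-in for the piecewise-linear output)."""
    hidden = [sigmoid(w0[a] + sum(w[a][b] * x[b] for b in range(len(x))))
              for a in range(len(v))]
    return f(v0 + sum(v[a] * hidden[a] for a in range(len(v))))

# Tiny illustrative network: 2 inputs, 3 hidden units.
out = mlp([0.2, 0.7], v0=0.1,
          v=[0.5, -0.3, 0.2],
          w0=[0.0, 0.1, -0.1],
          w=[[1.0, -1.0], [0.5, 0.5], [-0.2, 0.3]])
```

The architecture index \lambda then amounts to choosing which inputs enter x, how many hidden units appear in `v`, and which entries of `w` are kept nonzero, which is exactly the search space explored in the next section.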
\n\n3.3 Heuristic Search over the Space of Perceptron Architectures \n\nOur proposed heuristic search algorithm over the space of perceptron architectures is as follows. First, we select the optimal number of internal units from a sequence of fully connected networks with an increasing number of hidden units. Then, using the optimal fully connected network, we prune weights and input variables in parallel, resulting in two separately pruned networks. Lastly, the two methods are combined and the resulting network is retrained to yield the final model. \n\n3.3.1 Selecting the Number of Hidden Units \n\nWe initially trained fully connected networks with all 10 available input variables but with the number of hidden units H_\\lambda varying from 2 to 11. Five-fold cross-validation and \\hat{P}(\\lambda) were used to select the number of hidden units. We compute CV_P(\\lambda) according to equation (4); the data set was partitioned into v = 5 subsets. We also computed \\hat{P}(\\lambda) according to equation (5). The results of the two methods are consistent, having a common minimum at H_\\lambda = 3 internal units (see Figure 1). Table 2 (left) shows the results for the network with H_\\lambda = 3 trained on the entire data set. A more accurate description of the performance of the model is shown in Table 2 (right), where the predictive ability is calculated from the hold-out sets of the cross-validation procedure. \n\nTraining Error, 3 Hidden Units \n|Error| (notches)   firms     %    cum. % \n0                      67   34.2     34.2 \n1                      84   42.9     77.1 \n2                      34   17.3     94.4 \n>2                     11    5.6    100.0 \nnumber of weights: 37 \nstandard deviation: 1.206 \nmean absolute deviation: 0.898 \ntraining error: 1.320 \n\nCross-Validation Error, 3 Hidden Units \n|Error| (notches)   firms     %    cum. % \n0                      54   28.6     28.6 \n1                      77   38.8     67.3 \n2                      35   17.3     84.7 \n>2                     30   15.3    100.0 \nnumber of weights: 37 \nstandard deviation: 1.630 \nmean absolute deviation: 1.148 \ncross-validation error: 1.807 \n\nTable 2: Results for the network with 3 hidden units. The standard deviation and the mean absolute deviation are computed after rescaling the output of the network to [2, 19] and rounding to the nearest integer (notches). The RMS training error is computed using the rescaled output of the network before rounding. The table also describes the predictive ability of the network by a histogram; the error column gives the number of rating categories the network was off from the correct target. The network with 3 hidden units significantly outperformed the linear regression model. On the right, cross-validation results for the network with 3 hidden units are shown. In order to predict the rating for a firm, we choose among the networks trained for the cross-validation procedure the one that was not trained using the subset the firm belongs to. Thus the results concerning the predictive ability of the model reflect the expected performance of the model trained on all the data, with new data in the cross-validation sense. \n\n3.3.2 Pruning of Input Variables via Sensitivity Analysis \n\nNext, we attempted to further reduce the number of weights of the network by eliminating some of the input variables. To test which inputs are most significant for determining the network output, we perform a sensitivity analysis. We define the \"sensitivity\" of the network model to variable \\beta as: \n\nS_\\beta = \\frac{1}{N} \\sum_{j=1}^{N} \\left( ASE(\\bar{x}_\\beta) - ASE(x_\\beta) \\right), \\quad \\text{with} \\quad \\bar{x}_\\beta = \\frac{1}{N} \\sum_{j=1}^{N} x_{\\beta j}. \n\nHere, x_{\\beta j} is the \\beta th input variable of the j th exemplar. S_\\beta measures the effect on the training ASE of replacing the \\beta th input x_\\beta by its average \\bar{x}_\\beta. Replacement of a variable by its average value removes its influence on the network output. Again we use 5-fold cross-validation and \\hat{P} to estimate the prediction risk P_\\lambda. 
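The sensitivity measure can be sketched in a few lines: clamp input \beta to its training-set mean, recompute the training ASE, and take the difference. The model below is a toy stand-in for the trained bond-rating network:

```python
# Input sensitivity S_beta: change in training ASE when input beta
# is replaced by its mean over the training set.

def ase(model, X, t):
    """Average squared error of `model` on inputs X and targets t."""
    return sum((tj - model(xj)) ** 2 for xj, tj in zip(X, t)) / len(X)

def sensitivity(model, X, t, beta):
    """S_beta: ASE with x_beta clamped to its mean, minus the original ASE."""
    n = len(X)
    mean_beta = sum(x[beta] for x in X) / n
    X_clamped = [x[:beta] + [mean_beta] + x[beta + 1:] for x in X]
    return ase(model, X_clamped, t) - ase(model, X, t)

# Toy model that uses input 0 strongly and ignores input 1 entirely.
model = lambda x: 3.0 * x[0]
X = [[0.0, 5.0], [1.0, -5.0], [2.0, 0.0]]
t = [0.0, 3.0, 6.0]
s0 = sensitivity(model, X, t, beta=0)  # important input: large S_beta
s1 = sensitivity(model, X, t, beta=1)  # ignored input: S_beta = 0
```

Inputs are then deleted in order of increasing S_beta, with CV_P and \hat{P} deciding how many deletions to keep.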
We constructed a sequence of models by deleting an increasing number of input variables in order of increasing S_\\beta. For each model, CV_P and \\hat{P} were computed; Figure 2 shows the results. A minimum was attained for the model with I_\\lambda = 8 input variables (2 inputs were removed). This reduces the number of weights by 2H_\\lambda = 6. \n\nFigure 2: \\hat{P}(\\lambda) for the sensitivity analysis and OBD. In both cases, the cross-validation error CV_P(\\lambda) has a minimum for the same \\lambda. \n\n3.3.3 Weight Pruning via \"Optimal Brain Damage\" \n\nOptimal Brain Damage (OBD) was introduced by Le Cun et al. [3] as a method to reduce the number of weights in a neural network to avoid overfitting. OBD is designed to select those weights in the network whose removal will have a small effect on the training ASE. Assuming that the original network was too large, removing these weights and retraining the now smaller network should improve the generalization performance. The method approximates the ASE at a minimum in weight space by a diagonal quadratic expansion. The saliency \n\ns_i = \\frac{1}{2} \\frac{\\partial^2 ASE}{\\partial w_i^2} w_i^2, \n\ncomputed after training has stopped, is a measure (in the diagonal approximation) of the change of the ASE when weight w_i is removed from the network. \n\nCV_P and \\hat{P} were computed to select the optimal model. We find that CV_P and \\hat{P} are minimized when 9 weights are deleted from the network using all input variables. However, some overlap exists when compared to the sensitivity analysis described above: 5 of the deleted weights would also have been removed by the sensitivity method. 
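In the diagonal approximation the saliency is straightforward to compute once the second derivatives are available. The sketch below estimates \partial^2 ASE / \partial w_i^2 by a central finite difference on a toy quadratic error surface, a stand-in for the exact diagonal Hessian terms one would compute for a trained network:

```python
# OBD saliencies in the diagonal approximation:
#   s_i = 0.5 * (d^2 ASE / d w_i^2) * w_i^2
# The curvature is estimated here by a central finite difference.

def obd_saliencies(ase_of_weights, weights, h=1e-4):
    """Saliency of each weight for an error function ase_of_weights(w)."""
    saliencies = []
    for i, wi in enumerate(weights):
        wp = weights[:]; wp[i] = wi + h
        wm = weights[:]; wm[i] = wi - h
        # central finite difference for the diagonal Hessian entry
        d2 = (ase_of_weights(wp) - 2.0 * ase_of_weights(weights)
              + ase_of_weights(wm)) / h ** 2
        saliencies.append(0.5 * d2 * wi ** 2)
    return saliencies

# Toy quadratic "training error": ASE(w) = 2*w0^2 + 0.1*w1^2,
# so weight 0 has high curvature (large saliency) and weight 1 low.
ase_fn = lambda w: 2.0 * w[0] ** 2 + 0.1 * w[1] ** 2
sal = obd_saliencies(ase_fn, [1.0, 1.0])
```

Weights with the smallest saliencies are deleted first, and CV_P and \hat{P} again decide how many deletions to accept before retraining.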
\n\nTable 3 shows the overall performance of our model when the two techniques were combined to yield the final architecture. This architecture is obtained by deleting the union of the sets of weights that were deleted using weight and input pruning separately. Note the improvement in estimated prediction performance (CV error) in Table 3 relative to Table 2. \n\nTraining Error, 3 Hidden Units, 2 Inputs and 9 Connections Removed \n|Error| (notches)   firms     %    cum. % \n0                      69   35.2     35.2 \n1                      81   41.3     76.5 \n2                      32   16.3     92.8 \n>2                     14    7.2    100.0 \nnumber of weights: 27 \nstandard deviation: 1.208 \nmean absolute deviation: 0.882 \ntraining error: 1.356 \n\nCross-Validation Error, 3 Hidden Units, 2 Inputs and 9 Connections Removed \n|Error| (notches)   firms     %    cum. % \n0                      58   29.6     29.6 \n1                      76   38.8     68.4 \n2                      37   18.9     87.2 \n>2                     26   12.8    100.0 \nnumber of weights: 27 \nstandard deviation: 1.546 \nmean absolute deviation: 1.117 \ncross-validation error: 1.697 \n\nTable 3: Results for the network with 3 hidden units with both sensitivity analysis and OBD applied. Note the improvement in CV error relative to Table 2. \n\n4 Summary \n\nOur example shows that (1) nonlinear network models can outperform linear regression models, and (2) substantial benefits in performance can be obtained by the use of principled architecture selection methods. The resulting structured networks are optimized with respect to the task at hand, even though it may not be possible to design them based on a priori knowledge. \n\nEstimates of the prediction risk offer a sound basis for assessing the performance of the model on new data and can be used as a tool for principled architecture selection. Cross-validation, GCV and FPE provide computationally feasible means of estimating the prediction risk. 
These estimates of prediction risk provide very effective criteria for selecting the number of internal units and for performing sensitivity analysis and OBD. \n\nReferences \n\n[1] H. Akaike. Statistical predictor identification. Ann. Inst. Statist. Math., 22:203-217, 1970. \n[2] A. Barron. Predicted squared error: a criterion for automatic model selection. In S. Farlow, editor, Self-Organizing Methods in Modeling. Marcel Dekker, New York, 1984. \n[3] Y. Le Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2. Morgan Kaufmann Publishers, 1990. \n[4] Randall L. Eubank. Spline Smoothing and Nonparametric Regression. Marcel Dekker, Inc., 1988. \n[5] Seymour Geisser. The predictive sample reuse method with applications. Journal of the American Statistical Association, 70(350), June 1975. \n[6] John Moody. The effective number of parameters: an analysis of generalization and regularization in nonlinear learning systems. Short version in this volume; long version to appear, 1992. \n[7] F. Mosteller and J. W. Tukey. Data analysis, including statistics. In G. Lindzey and E. Aronson, editors, Handbook of Social Psychology, Vol. 2. Addison-Wesley, 1968 (first edition 1954). \n[8] M. Stone. Cross-validatory choice and assessment of statistical predictions. J. Roy. Stat. Soc., B36, 1974. \n[9] M. Stone. Cross-validation: a review. Math. Operationsforsch. Statist., Ser. Statistics, 9(1), 1978. \n[10] Joachim Utans and John Moody. Selecting neural network architectures via the prediction risk: application to corporate bond rating prediction. In Proceedings of the First International Conference on Artificial Intelligence Applications on Wall Street. IEEE Computer Society Press, Los Alamitos, CA, 1991. \n[11] G. Wahba. Spline Models for Observational Data, volume 59 of Regional Conference Series in Applied Mathematics. SIAM Press, Philadelphia, 1990. \n[12] G. Wahba and S. Wold. A completely automatic French curve: fitting spline functions by cross-validation. Communications in Statistics, 4(1):1-17, 1975. \n", "award": [], "sourceid": 441, "authors": [{"given_name": "John", "family_name": "Moody", "institution": null}, {"given_name": "Joachim", "family_name": "Utans", "institution": null}]}