{"title": "Improving the Performance of Radial Basis Function Networks by Learning Center Locations", "book": "Advances in Neural Information Processing Systems", "page_first": 1133, "page_last": 1140, "abstract": null, "full_text": "Improving the Performance of Radial Basis Function Networks by Learning Center Locations \n\nDietrich Wettschereck \nDepartment of Computer Science \nOregon State University \nCorvallis, OR 97331-3202 \n\nThomas Dietterich \nDepartment of Computer Science \nOregon State University \nCorvallis, OR 97331-3202 \n\nAbstract \n\nThree methods for improving the performance of (Gaussian) radial basis function (RBF) networks were tested on the NETtalk task. In RBF, a new example is classified by computing its Euclidean distance to a set of centers chosen by unsupervised methods. The application of supervised learning to learn a non-Euclidean distance metric was found to reduce the error rate of RBF networks, while supervised learning of each center's variance resulted in inferior performance. The best improvement in accuracy was achieved by networks called generalized radial basis function (GRBF) networks. In GRBF, the center locations are determined by supervised learning. After training on 1000 words, RBF classifies 56.5% of letters correctly, while GRBF scores 73.4% of letters correct (on a separate test set). From these and other experiments, we conclude that supervised learning of center locations can be very important for radial basis function learning. \n\n1 Introduction \n\nRadial basis function (RBF) networks are 3-layer feed-forward networks in which each hidden unit a computes the function \n\nf_a(x) = exp(-||x - x_a||^2 / sigma^2), \n\nand the output units compute a weighted sum of these hidden-unit activations: \n\nf*(x) = sum_{a=1}^{N} c_a f_a(x). 
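The two formulas above can be illustrated with a minimal sketch (our own code, not the authors'; the names rbf_forward, centers, c, and sigma2 are ours):

```python
import numpy as np

# Minimal sketch of the RBF computation defined above (our illustration, not the
# authors' code): each hidden unit a applies a Gaussian to the squared Euclidean
# distance between the input x and its center x_a, and the output f*(x) is the
# c_a-weighted sum of those hidden-unit activations.
def rbf_forward(x, centers, c, sigma2):
    sq_dists = np.sum((centers - x) ** 2, axis=1)   # ||x - x_a||^2 for every center a
    activations = np.exp(-sq_dists / sigma2)        # f_a(x)
    return activations @ c                          # f*(x) = sum_a c_a f_a(x)
```

An input that coincides exactly with a center contributes that center's full output weight, since exp(0) = 1.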
In other words, the value of f*(x) is determined by computing the Euclidean distance between x and a set of N centers, x_a. These distances are then passed through Gaussians (with variance sigma^2 and zero mean), weighted by c_a, and summed. \n\nRadial basis function (RBF) networks provide an attractive alternative to sigmoid networks for learning real-valued mappings: (a) they provide excellent approximations to smooth functions (Poggio & Girosi, 1989), (b) their \"centers\" are interpretable as \"prototypes\", and (c) they can be learned very quickly, because the center locations (x_a) can be determined by unsupervised learning algorithms and the weights (c_a) can be computed by pseudo-inverse methods (Moody & Darken, 1989). \n\nAlthough the application of unsupervised methods to learn the center locations does yield very efficient training, there is some evidence that the generalization performance of RBF networks is inferior to that of sigmoid networks. Moody and Darken (1989), for example, report that their RBF network must receive 10 times more training data than a standard sigmoidal network in order to attain comparable generalization performance on the Mackey-Glass time-series task. \n\nThere are several plausible explanations for this performance gap. First, in sigmoid networks, all parameters are determined by supervised learning, whereas in RBF networks, typically only the learning of the output weights has been supervised. Second, the use of Euclidean distance to compute ||x - x_a|| assumes that all input features are equally important. In many applications, this assumption is known to be false, so it could yield poor results. \n\nThe purpose of this paper is twofold. 
First, we carefully tested the performance of RBF networks on the well-known NETtalk task (Sejnowski & Rosenberg, 1987) and compared it to the performance of a wide variety of algorithms that we have previously tested on this task (Dietterich, Hild, & Bakiri, 1990). The results confirm that there is a substantial gap between RBF generalization and other methods. \n\nSecond, we evaluated the benefits of employing supervised learning to learn (a) the center locations x_a, (b) weights w_i for a weighted distance metric, and (c) variances sigma_a^2 for each center. The results show that supervised learning of the center locations and weights improves performance, while supervised learning of the variances, or of combinations of center locations, variances, and weights, did not. The best performance was obtained by supervised learning of only the center locations (and the output weights, of course). \n\nIn the remainder of the paper, we first describe our testing methodology and review the NETtalk domain. Then, we present the results of our comparison of RBF with other methods. Finally, we describe the performance obtained from supervised learning of weights, variances, and center locations. \n\n2 Methodology \n\nAll of the learning algorithms described in this paper have several parameters (such as the number of centers and the criterion for stopping training) that must be specified by the user. To set these parameters in a principled fashion, we employed the cross-validation methodology described by Lang, Waibel & Hinton (1990). First, as usual, we randomly partitioned our dataset into a training set and a test set. Then, we further divided the training set into a subtraining set and a cross-validation set. 
Alternative values for the user-specified parameters were then tried while training on the subtraining set and testing on the cross-validation set. The best-performing parameter values were then employed to train a network on the full training set. The generalization performance of the resulting network was then measured on the test set. Using this methodology, no information from the test set is used to determine any parameters during training. \n\nWe explored the following parameters: (a) the number of hidden units (centers) N, (b) the method for choosing the initial locations of the centers, (c) the variance sigma^2 (when it was not subject to supervised learning), and (d) (whenever supervised training was involved) the stopping squared error per example. We tried N = 50, 100, 150, 200, and 250; sigma^2 = 1, 2, 4, 5, 10, 20, and 50; and three different initialization procedures: \n\n(a) Use a subset of the training examples, \n(b) Use an unsupervised version of the IB2 algorithm of Aha, Kibler & Albert (1991), and \n(c) Apply k-means clustering, starting with the centers from (a). \n\nFor all methods, we applied the pseudo-inverse technique of Penrose (1955), followed by Gaussian elimination, to set the output weights. \n\nTo perform supervised learning of center locations, feature weights, and variances, we applied conjugate-gradient optimization. We modified the conjugate-gradient implementation of backpropagation supplied by Barnard & Cole (1989). \n\n3 The NETtalk Domain \n\nWe tested all networks on the NETtalk task (Sejnowski & Rosenberg, 1987), in which the goal is to learn to pronounce English words by studying a dictionary of correct pronunciations. We replicated the formulation of Sejnowski & Rosenberg in which the task is to learn to map each individual letter in a word to a phoneme and a stress. 
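The output-weight computation described in Section 2 can be sketched as follows (our illustration; the names design_matrix and fit_output_weights are ours, and numpy's least-squares solver stands in for the Penrose pseudo-inverse followed by Gaussian elimination):

```python
import numpy as np

# Sketch of fitting the output weights c_a once the centers and variance are fixed
# (our illustration, not the authors' code). The hidden layer defines a design
# matrix F with F[i, a] = exp(-||x_i - x_a||^2 / sigma2); the output weights are
# then the minimum-norm least-squares solution C = F^+ targets.
def design_matrix(X, centers, sigma2):
    sq = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / sigma2)

def fit_output_weights(X, targets, centers, sigma2):
    F = design_matrix(X, centers, sigma2)
    C, _, _, _ = np.linalg.lstsq(F, targets, rcond=None)  # C = F^+ targets
    return C
```

When the centers are a subset of distinct training examples, the Gaussian design matrix is well conditioned and the fit interpolates those examples exactly.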
Two disjoint sets of 1000 words were drawn at random from the NETtalk dictionary of 20,002 words (made available by Sejnowski and Rosenberg): one for training and one for testing. The training set was further subdivided into an 800-word subtraining set and a 200-word cross-validation set. \n\nTo encode the words in the dictionary, we replicated the encoding of Sejnowski & Rosenberg (1987): Each input vector encodes a 7-letter window centered on the letter to be pronounced. Letters beyond the ends of the word are encoded as blanks. Each letter is locally encoded as a 29-bit string (one bit for each of the 26 letters and one bit each for comma, space, and period) with exactly one bit on. This gives 203 input bits, seven of which are 1 while all others are 0. \n\nEach phoneme and stress pair was encoded using the 26-bit distributed code developed by Sejnowski & Rosenberg, in which the bit positions correspond to distinctive features of the phonemes and stresses (e.g., voiced/unvoiced, stop, etc.). \n\n4 RBF Performance on the NETtalk Task \n\nWe began by testing RBF on the NETtalk task. Cross-validation training determined that peak RBF generalization was obtained with N = 250 (the number of centers), sigma^2 = 5 (constant for all centers), and the locations of the centers computed by k-means clustering. Table 1 shows the performance of RBF on the 1000-word test set in comparison with several other algorithms: nearest neighbor, the decision tree algorithm ID3 (Quinlan, 1986), sigmoid networks trained via backpropagation (160 hidden units, cross-validation training, learning rate 0.25, momentum 0.9), Wolpert's (1990) HERBIE algorithm (with weights set via mutual information), and ID3 with error-correcting output codes (ECC; Dietterich & Bakiri, 1991). \n\nTable 1: Generalization performance on the NETtalk task. 
\n\n                        % correct (1000-word test set) \nAlgorithm            Word       Letter      Phoneme     Stress \nNearest neighbor      3.3       53.1        61.1        74.0 \nRBF                   3.7       57.0*****   65.6*****   80.3***** \nID3                   9.6*****  65.6*****   78.7*****   77.2***** \nBackpropagation      13.6**     70.6*****   80.8****    81.3***** \nWolpert              15.0       72.2*       82.6*****   80.2 \nID3 + 127-bit ECC    20.0***    73.7*       85.6*****   81.1 \n\nAsterisks mark an entry that differs significantly from the one in the preceding row: * p < .05, ** p < .01, *** p < .005, **** p < .002, ***** p < .001. \n\nPerformance is shown at several levels of aggregation. The \"stress\" column indicates the percentage of stress assignments correctly classified. The \"phoneme\" column shows the percentage of phonemes correctly assigned. A \"letter\" is correct if both the phoneme and the stress are correctly assigned, and a \"word\" is correct if all letters in the word are correctly classified. Also shown are the results of a two-tailed test for the difference of two proportions, which was conducted for each row and the row preceding it in the table. \n\nFrom this table, it is clear that RBF performs substantially below virtually all of the algorithms except nearest neighbor. There is certainly room for supervised learning of RBF parameters to improve on this. \n\n5 Supervised Learning of Additional RBF Parameters \n\nIn this section, we present our supervised learning experiments. In each case, we report only the cross-validation performance. Finally, we take the best supervised learning configuration, as determined by these cross-validation scores, train it on the entire training set, and evaluate it on the test set. \n\n5.1 Weighted Feature Norm and Centers with Adjustable Widths \n\nThe first form of supervised learning that we tested was the learning of a weighted norm. In the NETtalk domain, it is obvious that the various input features are not equally important. 
In particular, the features describing the letter at the center of the 7-letter window (the letter to be pronounced) are much more important than the features describing the other letters, which are only present to provide context. One way to capture the importance of different features is through a weighted norm: \n\n||x - x_a||_w^2 = sum_i w_i (x_i - x_{a,i})^2. \n\nWe employed supervised training to obtain the weights w_i. We call this configuration RBF_FW. On the cross-validation set, RBF_FW correctly classified 62.4% of the letters (N = 200, sigma^2 = 5, center locations determined by k-means clustering). This is a 4.7 percentage point improvement over standard RBF, which on the cross-validation set classifies only 57.7% of the letters correctly (N = 250, sigma^2 = 5, center locations determined by k-means clustering). \n\nMoody & Darken (1989) suggested heuristics to set the variance of each center. They employed the inverse of the mean Euclidean distance from each center to its P nearest neighbors to determine the variance. However, they found that in most cases a global value for all variances worked best. We replicated this experiment for P = 1 and P = 4, and we compared this to simply setting the variances to a global value (sigma^2 = 5) optimized by cross-validation. The performance on the cross-validation set was 53.6% (for P = 1), 53.8% (for P = 4), and 57.7% (for the global value). \n\nIn addition to these heuristic methods, we also tried supervised learning of the variances alone (which we call RBF_sigma). On the cross-validation set, it classifies 57.4% of the letters correctly, as compared with 57.7% for standard RBF. \n\nHence, in all of our experiments, a single global value for sigma^2 gives better results than any of the techniques for setting separate values for each center. 
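The weighted norm at the heart of RBF_FW can be sketched as follows (our code, not the authors'; the name weighted_sq_dist is ours):

```python
import numpy as np

# Sketch of the weighted squared norm used by RBF_FW (our illustration, not the
# authors' code). A feature whose learned weight w_i is near zero drops out of
# the distance computation entirely, so it cannot influence which centers are
# activated by a new example.
def weighted_sq_dist(x, center, w):
    # ||x - x_a||_w^2 = sum_i w_i * (x_i - x_{a,i})^2
    return np.sum(w * (x - center) ** 2)
```

In the full network, this quantity simply replaces the unweighted ||x - x_a||^2 inside each Gaussian.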
Other researchers have obtained experimental results in other domains showing the usefulness of nonuniform variances. Hence, we must conclude that, while RBF_sigma did not perform well in the NETtalk domain, it may be valuable in other domains. \n\n5.2 Learning Center Locations (Generalized Radial Basis Functions) \n\nPoggio and Girosi (1989) suggest using gradient descent methods to implement supervised learning of the center locations, a method that they call generalized radial basis functions (GRBF). We implemented and tested this approach. On the cross-validation set, GRBF correctly classifies 72.2% of the letters (N = 200, sigma^2 = 4, centers initialized to a subset of the training data), as compared to 57.7% for standard RBF. This is a remarkable 14.5 percentage-point improvement. \n\nWe also tested GRBF with previously learned feature weights (GRBF_FW) and in combination with learning variances (GRBF_sigma). The performance of both of these methods was inferior to GRBF. For GRBF_FW, gradient search on the center locations failed to significantly improve the performance of RBF_FW networks (RBF_FW 62.4% vs. GRBF_FW 62.8%; RBF_FW 54.5% vs. GRBF_FW 57.9%). This suggests that, with the non-Euclidean fixed metric found by RBF_FW, the gradient search of GRBF_FW is getting caught in a local minimum. One explanation for this is that feature weights and adjustable centers are two alternative ways of achieving the same effect, namely, of making some features more important than others. Redundancy can easily create local minima. To understand this explanation, consider the plots in Figure 1. \n\n[Figure 1: (A) the weights of the input features as learned by RBF_FW, plotted against input number (0-203). (B) the mean squared distance between centers, computed separately for each dimension, from a GRBF network (N = 100, sigma^2 = 4), plotted against input number.] \n\nFigure 1(A) shows the weights of the input features as they were learned by RBF_FW. Features with weights near zero have no influence in the distance calculation when a new test example is classified. Figure 1(B) shows the mean squared distance between every center and every other center (computed separately for each input feature). Low values for the mean squared distance on feature i indicate that most centers have very similar values on feature i. Hence, this feature can play no role in determining which centers are activated by a new test example. In both plots, the features at the center of the window are clearly the most important. Therefore, it appears that GRBF is able to capture the information about the relative importance of features without the need for feature weights. \n\nTo explore the effect of learning the variances and center locations simultaneously, we introduced a scale factor to allow us to adjust the relative magnitudes of the gradients. We then varied this scale factor under cross-validation. Generally, the larger we set the scale factor (to increase the gradient of the variance terms), the worse the performance became. As with GRBF_FW, we see that difficulties in gradient descent training are preventing us from finding a global minimum (or even re-discovering known local minima). \n\n5.3 Summary \n\nBased on the results of this section, as summarized in Table 2, we chose GRBF as the best supervised learning configuration and applied it to the entire 1000-word training set (with testing on the 1000-word test set). 
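The supervised center updates that give GRBF its name can be sketched as a simple gradient step on the squared error (our illustration; the paper used conjugate-gradient optimization rather than the fixed-step descent shown here, and the name grbf_center_step is ours):

```python
import numpy as np

# One gradient step on the center locations for a single example (x, y),
# minimizing E = 0.5 * (f*(x) - y)^2 where f*(x) = sum_a c_a exp(-||x - x_a||^2 / sigma2).
# Our sketch, not the authors' code: the paper used conjugate-gradient optimization.
def grbf_center_step(x, y, centers, c, sigma2, lr):
    diff = x - centers                                  # (N, d): x - x_a for each center
    acts = np.exp(-np.sum(diff ** 2, axis=1) / sigma2)  # f_a(x)
    err = acts @ c - y                                  # f*(x) - y
    # dE/dx_a = err * c_a * f_a(x) * 2 * (x - x_a) / sigma2
    grad = (err * c * acts * 2.0 / sigma2)[:, None] * diff
    return centers - lr * grad
```

With a sufficiently small step size, each update moves every center so as to reduce the squared error on the current example.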
We also combined it with a 63-bit error-correcting output code to see if this would improve its performance, since error-correcting output codes have been shown to boost the performance of backpropagation and ID3. The final comparison results are shown in Table 3. The results show that GRBF is superior to RBF at all levels of aggregation. Furthermore, GRBF is statistically indistinguishable from the best method that we have tested to date (ID3 with a 127-bit error-correcting output code), except on phonemes, where it is detectably inferior, and on stresses, where it is detectably superior. GRBF with error-correcting output codes is statistically indistinguishable from ID3 with error-correcting output codes. \n\nTable 2: Percent of letters correctly classified on the 200-word cross-validation data set. \n\nMethod        % Letters Correct \nRBF                57.7 \nRBF_FW             62.4 \nRBF_sigma          57.4 \nGRBF               72.2 \nGRBF_FW            62.8 \nGRBF_sigma         67.5 \n\nTable 3: Generalization performance on the NETtalk task. \n\n                        % correct (1000-word test set) \nAlgorithm            Word     Letter    Phoneme   Stress \nRBF                   3.7     57.0      65.6      80.3 \nGRBF                 19.8**   73.8***   84.1***   82.4** \nID3 + 127-bit ECC    20.0     73.7      85.6*     81.1* \nGRBF + 63-bit ECC    19.2     74.6      85.3      82.2 \n\nAsterisks mark an entry that differs significantly from the one in the preceding row: * p < .05, ** p < .002, *** p < .001. \n\nThe near-identical performance of GRBF and the error-correcting code method, and the fact that the use of error-correcting output codes does not improve GRBF's performance significantly, suggest that the \"bias\" of GRBF (i.e., its implicit assumptions about the unknown function being learned) is particularly appropriate for the NETtalk task. 
This conjecture follows from the observation that error-correcting output codes provide a way of recovering from improper bias (such as the bias of ID3 in this task). This is somewhat surprising, since the mathematical justification for GRBF is based on the smoothness of the unknown function, which is certainly violated in classification tasks. \n\n6 Conclusions \n\nRadial basis function networks have many properties that make them attractive in comparison to networks of sigmoid units. However, our tests of RBF learning (unsupervised learning of center locations, supervised learning of output-layer weights) in the NETtalk domain found that RBF networks did not generalize nearly as well as sigmoid networks. This is consistent with results reported in other domains. \n\nHowever, by employing supervised learning of the center locations as well as the output weights, the GRBF method is able to substantially exceed the generalization performance of sigmoid networks. Indeed, GRBF matches the performance of the best known method for the NETtalk task, ID3 with error-correcting output codes, which, however, is approximately 50 times faster to train. \n\nWe found that supervised learning of feature weights (alone) could also improve the performance of RBF networks, although not nearly as much as learning the center locations. Surprisingly, we found that supervised learning of the variances of the Gaussians located at each center hurt generalization performance. Also, combined supervised learning of center locations and feature weights did not perform as well as supervised learning of center locations alone; the training process becomes stuck in local minima. For GRBF_FW, we presented data suggesting that feature weights are redundant and that they could be introducing local minima as a result. 
\n\nOur implementation of GRBF, while efficient, still gives training times comparable \nto those required for backpropagation training of sigmoid networks. Hence, an \n\n\f1140 \n\nWettschereck and Dietterich \n\nimportant open problem is to develop more efficient methods for supervised learning \nof center locations. \n\nWhile the results in this paper apply only to the NETtaik domain, the markedly \nsuperior performance of GRBF over RBF suggests that in new applications of RBF \nnetworks, it is important to consider supervised learning of center locations in order \nto obtain the best generalization performance. \n\nAcknowledgments \n\nThis research was supported by a grant from the National Science Foundation Grant \nNumber IRI-86-57316. \n\nReferences \n\nD. W. Aha, D. Kibler & M. K. Albert. (1991) Instance-based learning algorithms. \nMachine Learning 6(1):37-66. \nE. Barnard & R. A. Cole. (1989) A neural-net training program based on conjugate(cid:173)\ngradient optimization. Rep. No. CSE 89-014. Oregon Graduate Institute, Beaver(cid:173)\nton, OR. \nT. G. Dietterich & G. Bakiri. (1991) Error-correcting output codes: A general \nmethod for improving multiclass inductive learning programs. Proceedings of the \nNinth National Conference on Artificial Intelligence (AAAI-91), Anaheim, CA: \nAAAI Press. \nT. G. Dietterich, H. Hild, & G. Bakiri. (1990) A comparative study ofID3 and back(cid:173)\npropagation for English text-to-speech mapping. Proceedings of the 1990 Machine \nLearning Conference, Austin, TX. 24-31. \nK. J. Lang, A. H. Waibel & G. E. Hinton. (1990) A time-delay neural network \narchitecture for isolated word recognition. Neural Networks 3:33-43. \nJ. MacQueen. (1967) Some methods of classification and analysis of multivariate \nobservations. In LeCam, 1. M. & Neyman, J. (Eds.), Proceedings of the 5th Berkeley \nSymposium on Mathematics, Statistics, and Probability (p. 281). Berkeley, CA: \nUniversity of California Press. \nJ. Moody & C. J. Darken. 
(1989) Fast learning in networks of locally-tuned processing units. Neural Computation 1(2):281-294. \n\nR. Penrose. (1955) A generalized inverse for matrices. Proceedings of the Cambridge Philosophical Society 51:406-413. \n\nT. Poggio & F. Girosi. (1989) A theory of networks for approximation and learning. Report Number AI-1140. MIT Artificial Intelligence Laboratory, Cambridge, MA. \n\nJ. R. Quinlan. (1986) Induction of decision trees. Machine Learning 1(1):81-106. \n\nT. J. Sejnowski & C. R. Rosenberg. (1987) Parallel networks that learn to pronounce English text. Complex Systems 1:145-168. \n\nD. Wolpert. (1990) Constructing a generalizer superior to NETtalk via a mathematical theory of generalization. Neural Networks 3:445-452. \n", "award": [], "sourceid": 544, "authors": [{"given_name": "Dietrich", "family_name": "Wettschereck", "institution": null}, {"given_name": "Thomas", "family_name": "Dietterich", "institution": null}]}