{"title": "Learning with Product Units", "book": "Advances in Neural Information Processing Systems", "page_first": 537, "page_last": 544, "abstract": null, "full_text": "Comparing the prediction accuracy of artificial neural networks and other statistical models for breast cancer survival

Harry B. Burke, Department of Medicine, New York Medical College, Valhalla, NY 10595

David B. Rosen, Department of Medicine, New York Medical College, Valhalla, NY 10595

Philip H. Goodman, Department of Medicine, University of Nevada School of Medicine, Reno, Nevada 89520

Abstract

The TNM staging system has been used since the early 1960's to predict breast cancer patient outcome. In an attempt to increase prognostic accuracy, many putative prognostic factors have been identified. Because the TNM stage model cannot accommodate these new factors, the proliferation of factors in breast cancer has led to clinical confusion. What is required is a new computerized prognostic system that can test putative prognostic factors and integrate the predictive factors with the TNM variables in order to increase prognostic accuracy. Using the area under the curve of the receiver operating characteristic, we compare the accuracy of the following predictive models in terms of five year breast cancer-specific survival: pTNM staging system, principal component analysis, classification and regression trees, logistic regression, cascade correlation neural network, conjugate gradient descent neural network, probabilistic neural network, and backpropagation neural network.

1064        Harry B. Burke, David B. Rosen, Philip H. Goodman

Several statistical models are significantly more accurate than the TNM staging system. 
Logistic regression and the backpropagation neural network are the most accurate models for predicting five year breast cancer-specific survival.

1 INTRODUCTION

For over thirty years, measuring cancer outcome has been based on the TNM staging system (tumor size, number of lymph nodes with metastatic disease, and distant metastases) (Beahrs et al., 1992). There are several problems with this model (Burke and Henson, 1993). First, it is not very accurate; for breast cancer it is 44% accurate. Second, its accuracy cannot be improved because predictive variables cannot be added to the model. Third, it does not apply to all cancers. In this paper we compare computerized prediction models to determine if they can improve prognostic accuracy. Artificial neural networks (ANN) are a class of nonlinear regression and discrimination models. ANNs are being used in many areas of medicine, with several hundred articles published in the last year. Representative areas of research include anesthesiology (Westenskow et al., 1992), radiology (Tourassi et al., 1993), cardiology (Leong and Jabri, 1982), psychiatry (Palombo, 1992), and neurology (Gabor and Seyal, 1992). ANNs are being used in cancer research, including image processing (Goldberg et al., 1992), analysis of laboratory data for breast cancer diagnosis (O'Leary et al., 1992), and the discovery of chemotherapeutic agents (Weinstein et al., 1992). It should be pointed out that the analyses in this paper rely upon previously collected prognostic factors. These factors were selected for collection because they were significant in a generalized linear model such as the linear or logistic models. There is no predictive model that can improve upon linear or logistic prediction models when the predictor variables meet the assumptions of these models and there are no interactions. 
Therefore the objective of this paper is not to outperform linear or logistic models on these data. Rather, our objective is to show that, with variables selected by generalized linear models, artificial neural networks can perform as well as the best traditional models. There is no a priori reason to believe that future prognostic factors will be binary or linear, or that there will not be complex interactions between prognostic factors. A further objective of this paper is to demonstrate that artificial neural networks are likely to outperform the conventional models when there are unanticipated nonmonotonic factors or complex interactions.

2 METHODS

2.1 DATA

The Patient Care Evaluation (PCE) data set is collected by the Commission on Cancer of the American College of Surgeons (ACS). The ACS, in October 1992, requested cancer information from hospital tumor registries in the United States. The ACS asked for the first 25 cases of breast cancer seen at each institution in 1983, and it asked for follow-up information on each of these 25 patients through the date of the request. These are only cases of first breast cancer. Follow-up information included known deaths. The PCE data set contains, at best, eight year follow-up.

Prediction Accuracy of Models for Breast Cancer Survival        1065

We chose to use a five year survival end-point. This analysis is for death due to breast cancer, not all-cause mortality.

For this analysis, cases with missing data, and cases censored before five years, are not included, so that the prediction models can be compared without putting any prediction model at a disadvantage. We randomly divided the data set into training, hold-out, and testing subsets of 3,100, 2,069, and 3,102 cases, respectively. 
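The three-way partition described above can be sketched as follows. This is our own minimal illustration, not the paper's code; the function name, the fixed seed, and the representation of cases as indices are assumptions made for the example.

```python
import random

def three_way_split(cases, n_train=3100, n_holdout=2069, n_test=3102, seed=0):
    """Randomly partition the complete (non-censored) cases into
    training, hold-out, and testing subsets of fixed sizes."""
    assert len(cases) == n_train + n_holdout + n_test
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    holdout = shuffled[n_train:n_train + n_holdout]
    test = shuffled[n_train + n_holdout:]
    return train, holdout, test

# 3,100 + 2,069 + 3,102 = 8,271 complete cases in total.
train, holdout, test = three_way_split(range(8271))
```

Holding the hold-out and testing subsets fixed in this way lets every model be tuned and then evaluated on exactly the same cases.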
2.2 MODELS

The TNM stage model used in this analysis is the pathologic model (pTNM) based on the 1992 American Joint Committee on Cancer's Manual for Staging of Cancer (Beahrs et al., 1992). The pathologic model relies upon pathologically determined tumor size and lymph nodes; this contrasts with clinical staging, which relies upon the clinical examination to provide tumor size and lymph node information. To determine the overall accuracy of the TNM stage model, we compared the model's prediction for each patient, where the individual patient's prediction is the fraction of all the patients in that stage who survive, to each patient's true outcome.

Principal components analysis is a data reduction technique based on the linear combinations of predictor variables that capture the variance across patients (Jolliffe, 1986). The logistic regression analysis is performed in a stepwise manner, without interaction terms, using the statistical language S-PLUS (S-PLUS, 1991), with the continuous variable age modeled with a restricted cubic spline to avoid assuming linearity (Harrell et al., 1988). Two types of Classification and Regression Tree (CART) (Breiman et al., 1984) analyses are performed using S-PLUS. The first was a 9-node pruned tree (with 10-fold cross-validation on the deviance), and the second was a shrunk tree with 13.7 effective nodes.

The multilayer perceptron neural network training in this paper is based on the maximum likelihood function unless otherwise stated, and backpropagation refers to gradient descent. Two neural networks that are not multilayer perceptrons are tested: the Fuzzy ARTMAP neural network (Carpenter et al., 1991) and the probabilistic neural network (Specht, 1990).

2.3 ACCURACY

The measure of comparative accuracy is the area under the curve of the receiver operating characteristic (Az). 
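For binary outcomes, this area can be computed nonparametrically from the predictions alone, as the fraction of survivor/non-survivor pairs that the model ranks correctly, counting ties as one half. The sketch below is our own illustration of that pairwise computation; the function and variable names are not from the paper.

```python
def area_under_roc(predicted, outcomes):
    """Nonparametric Az: the fraction of (survivor, non-survivor) pairs
    in which the survivor receives the higher predicted probability,
    with ties counted as one half.  This pairwise form is equivalent to
    the trapezoidal approximation to the area under the ROC curve."""
    pos = [p for p, y in zip(predicted, outcomes) if y == 1]
    neg = [p for p, y in zip(predicted, outcomes) if y == 0]
    score = sum(1.0 if p > q else 0.5 if p == q else 0.0
                for p in pos for q in neg)
    return score / (len(pos) * len(neg))
```

A model that assigns every patient the same probability scores .5 under this measure, and a model that ranks every survivor above every non-survivor scores 1.0, regardless of calibration.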
Generally, the Az is a nonparametric measure of discrimination. Squared error summarizes how close each patient's predicted value is to its true outcome. The Az measures the relative goodness of the set of predictions as a whole by comparing the predicted probability of each patient with that of all other patients. The computational approach to the Az that employs the trapezoidal approximation to the area under the receiver operating characteristic curve for binary outcomes was first reported by Bamber (Bamber, 1975), and later in the medical literature by Hanley (Hanley and McNeil, 1982). It was extended by Harrell (Harrell et al., 1988) to continuous outcomes.

Table 1: PCE 1983 Breast Cancer Data: 5 Year Survival Prediction, 54 Variables.

PREDICTION MODEL                 SPECIFICATIONS                  ACCURACY*
pTNM Stages                      0, I, IIA, IIB, IIIA, IIIB, IV  .720
Principal Components Analysis    one scaling iteration           .714
CART, pruned                     9 nodes                         .753
CART, shrunk                     13.7 nodes                      .762
Stepwise Logistic Regression     with cubic splines              .776
Fuzzy ARTMAP ANN                 54-F2a, 128-1                   .738
Cascade Correlation ANN          54-21-1                         .761
Conjugate Gradient Descent ANN   54-30-1                         .774
Probabilistic ANN                bandwidth = 16s                 .777
Backpropagation ANN              54-5-1                          .784

* The area under the curve of the receiver operating characteristic.

3 RESULTS

All results are based on the independent sample not used for training (i.e., the testing data set), and all analyses employ the same testing data set. Using the PCE breast cancer data set, we can assess the accuracy of several prediction models using the most powerful of the predictor variables available in the data set (see Table 1).

Principal components analysis is not expected to be a very accurate model; with one scaling iteration, its accuracy is .714. 
Two types of classification and regression trees (CART), pruned and shrunk, demonstrate accuracies of .753 and .762, respectively. Logistic regression with cubic splines for age has an accuracy of .776. In addition to the backpropagation neural network and the probabilistic neural network, three types of neural networks are tested. Fuzzy ARTMAP's accuracy is the poorest at .738; it was too computationally intensive to be a practical model. Cascade correlation and conjugate gradient descent have the potential to do as well as backpropagation. The PNN accuracy is .777. The PNN has many interesting features, but it also has several drawbacks, including its storage requirements. The backpropagation neural network's accuracy is .784.

4 DISCUSSION

For predicting five year breast cancer-specific survival, several computerized prediction models are more accurate than the TNM stage system, and artificial neural networks are as good as the best traditional statistical models.

References

Bamber D (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic. J Math Psych 12:387-415.

Beahrs OH, Henson DE, Hutter RVP, Kennedy BJ (1992). Manual for staging of cancer, 4th ed. Philadelphia: JB Lippincott.

Burke HB, Henson DE (1993). Criteria for prognostic factors and for an enhanced prognostic system. Cancer 72:3131-5.

Breiman L, Friedman JH, Olshen RA (1984). Classification and Regression Trees. Pacific Grove, CA: Wadsworth and Brooks/Cole.

Carpenter GA, Grossberg S, Rosen DB (1991). Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks 4:759-771.

Gabor AJ, Seyal M (1992). Automated interictal EEG spike detection using artificial neural networks. 
Electroencephalogr Clin Neurophysiol 83:271-80.

Goldberg V, Manduca A, Ewert DL (1992). Improvement in specificity of ultrasonography for diagnosis of breast tumors by means of artificial intelligence. Med Phys 19:1275-81.

Hanley JA, McNeil BJ (1982). The meaning and use of the area under the receiver operating characteristic (ROC) curve. Radiology 143:29-36.

Harrell FE, Lee KL, Pollock BG (1988). Regression models in clinical studies: determining relationships between predictors and response. J Natl Cancer Inst 80:1198-1202.

Jolliffe IT (1986). Principal Component Analysis. New York: Springer-Verlag.

Leong PH, Jabri MA (1982). MATIC - an intracardiac tachycardia classification system. PACE 15:1317-31.

O'Leary TJ, Mikel UV, Becker RL (1992). Computer-assisted image interpretation: use of a neural network to differentiate tubular carcinoma from sclerosing adenosis. Modern Pathol 5:402-5.

Palombo SR (1992). Connectivity and condensation in dreaming. J Am Psychoanal Assoc 40:1139-59.

S-PLUS (1991), v 3.0. Seattle, WA: Statistical Sciences, Inc.

Specht DF (1990). Probabilistic neural networks. Neural Networks 3:109-18.

Tourassi GD, Floyd CE, Sostman HD, Coleman RE (1993). Acute pulmonary embolism: artificial neural network approach for diagnosis. Radiology 189:555-58.

Weinstein JN, Kohn KW, Grever MR, et al. (1992). Neural computing in cancer drug development: predicting mechanism of action. Science 258:447-51.

Westenskow DR, Orr JA, Simon FH (1992). Intelligent alarms reduce anesthesiologist's response time to critical faults. Anesthesiology 77:1074-9.

Learning with Product Units

Laurens R. Leerink, Australian Gilt Securities LTD, 37-49 Pitt Street, NSW 2000, Australia, laurens@sedal.su.oz.au

C. Lee Giles, NEC Research Institute, 4 Independence Way, Princeton, NJ 08540, USA, giles@research.nj.nec.com

Bill G. 
Horne, NEC Research Institute, 4 Independence Way, Princeton, NJ 08540, USA, horne@research.nj.nec.com

Marwan A. Jabri, Department of Electrical Engineering, The University of Sydney, NSW 2006, Australia, marwan@sedal.su.oz.au

Abstract

Product units provide a method of automatically learning the higher-order input combinations required for efficient learning in neural networks. However, we show that problems are encountered when using backpropagation to train networks containing these units. This paper examines these problems and proposes some atypical heuristics to improve learning. Using these heuristics, a constructive method is introduced which solves well-researched problems with significantly fewer neurons than previously reported. Secondly, product units are implemented as candidate units in the Cascade Correlation (Fahlman & Lebiere, 1990) system. This resulted in smaller networks which trained faster than when using sigmoidal or Gaussian units.

1 Introduction

It is well known that supplementing the inputs to a neural network with higher-order combinations of the inputs both increases the capacity of the network (Cover, 1965) and the ability to learn geometrically invariant properties (Giles & Maxwell, 1987).

538        Laurens Leerink, C. Lee Giles, Bill G. Horne, Marwan A. Jabri

However, there is a combinatorial explosion of higher-order terms as the number of inputs to the network increases. Yet in order to implement a certain logical function, in most cases only a few of these higher-order terms are required (Redding et al., 1993).

The product units (PUs) introduced by Durbin & Rumelhart (1989) attempt to make use of this fact. These networks have the advantage that, given an appropriate training algorithm, the units can automatically learn the higher-order terms that are required to implement a specific logical function. 
In these networks the hidden layer units compute the weighted product of the inputs, that is,

    prod_{i=1..N} x_i^{w_i}    instead of    sum_{i=1..N} x_i w_i        (1)

as in standard networks. An additional advantage of PUs is the increased information capacity of these units compared to standard summation units. It is approximately 3N (Durbin & Rumelhart, 1989), compared to 2N for a single threshold logic function (Cover, 1965), where N is the number of inputs to the unit.

The larger capacity means that the same functions can be implemented by networks containing fewer units. This is important for certain applications such as speech recognition where the data bandwidth is high or if real-time implementations are desired.

When PUs are used to process Boolean inputs, best performance is obtained (Durbin & Rumelhart, 1989) by using inputs of {+1, -1}. If the imaginary component is ignored, with these inputs the activation function is equivalent to a cosine summation function with the {-1, +1} inputs mapped to {1, 0} (Durbin & Rumelhart, 1989). In the remainder of this paper the terms product unit (PU) and cos(ine) unit will be used interchangeably, as all the problems examined have Boolean inputs.

2 Learning with Product Units

As the basic mechanism of a PU is multiplicative instead of additive, one would expect that standard neural network training methods and procedures cannot be directly applied when training these networks. This is indeed the case. If a neural network simulation environment is available, the basic functionality of a PU can be obtained by simply adding the cos function cos(pi * input) to the existing list of transfer functions. This assumes that Boolean mappings are being implemented and the appropriate {-1, +1} -> {1, 0} mapping has been performed on the input vectors. 
However, if we then attempt to train a network on the parity-6 problem shown in Durbin & Rumelhart (1989), it is found that the standard backpropagation (BP) algorithm simply does not work. We have found two main reasons for this.

The first is weight initialization. A typical first step in the backpropagation procedure is to initialize all weights to small random values. The main reason for this is to use the dynamic range of the sigmoid function and its derivative. However, the dynamic range of a PU is unlimited.

Learning with Product Units        539

Initializing the weights to small random values results in an input to the unit where the derivative is small. So apart from choosing small weights centered around n*pi with n = +-1, +-2, ..., this is the worst possible choice. In our simulations weights were initialized randomly in the range [-2, 2]. In fact, learning seems insensitive to the size of the weights, as long as they are large enough.

The second problem is local minima. Previous reports have mentioned this problem; Lapedes & Farber (1987) commented that "using sin's often leads to numerical problems, and nonglobal minima, whereas sigmoids seemed to avoid such problems". This comment summarizes our experience of training with PUs. For small problems (less than 3 inputs) backpropagation provides satisfactory training. However, when the number of inputs is increased beyond this, even with the weight initialization in the correct range, training usually ends up in a local minimum.

3 Training Algorithms

With these aspects in mind, the following training algorithms were evaluated: online and batch versions of Backpropagation (BP), Simulated Annealing (SA), a Random Search Algorithm (RSA), and combinations of these algorithms.

BP was used as a benchmark and for use in combination with the other algorithms. 
The Delta-Bar-Delta learning rate adaptation rule (Jacobs, 1988) was used along with the batch version of BP to accelerate convergence, with the parameters set to theta = 0.35, kappa = 0.05 and phi = 0.90. RSA is a global search method (i.e., the whole weight space is explored during training): weights are randomly chosen from a predefined distribution, and replaced if this results in an error decrease. SA (Kirkpatrick et al., 1983) is a standard optimization method. The operation of SA is similar to RSA, with the difference that, with a decreasing probability, solutions are accepted which increase the training error. The combinations of algorithms were chosen (BP & SA, BP & RSA) to combine the benefits of global and local search. Used in this manner, BP is used to find the local minimum. If the training error at the minimum is sufficiently low, training is terminated. Otherwise, the global method initializes the weights to another position in weight space from which local training can continue.

The BP-RSA combination requires further explanation. Several BP-(R)SA combinations were evaluated, but best performance was obtained using a fixed number of iterations of BP (in this case 120) along with one initial iteration of RSA. In this manner BP is used to move to the local minimum, and if the training error is still above the desired level, the RSA algorithm generates a new set of random weights from which BP can start again.

The algorithms were evaluated on two problems: the parity problem, and learning all logical functions of 2 and 3 inputs. The infamous parity problem is (for the product unit at least) an appropriate task. As illustrated by Durbin & Rumelhart (1989), this problem can be solved by one product unit. The question is whether the training algorithms can find a solution. The target values are {-1, +1}, and the output is taken to be correct if it has the correct sign. 
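This setup can be sketched in a few lines. The code below is our own simplified illustration, not the paper's implementation: its random search resamples one weight at a time within [-2, 2] and accepts any change that does not reduce training accuracy, and the sweep count and seed are arbitrary choices for the example.

```python
import itertools
import math
import random

def pu_sign(w, x):
    """Sign of the cosine product unit for an input x in {+1, -1}^N."""
    z = [0.0 if xi == 1 else 1.0 for xi in x]
    s = math.cos(math.pi * sum(wi * zi for wi, zi in zip(w, z)))
    return 1 if s > 0 else -1

def parity_accuracy(w):
    """Fraction of all 2^N parity patterns classified correctly; the
    target for input x is the product of its components."""
    n = len(w)
    patterns = list(itertools.product((-1, 1), repeat=n))
    correct = 0
    for x in patterns:
        target = 1
        for xi in x:
            target *= xi
        correct += pu_sign(w, x) == target
    return correct / len(patterns)

def rsa_train(n, sweeps=100, seed=0):
    """Random search: resample one weight at a time from [-2, 2],
    keeping the new value only if training accuracy does not drop."""
    rng = random.Random(seed)
    w = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    acc = parity_accuracy(w)
    for _ in range(sweeps):
        for i in range(n):
            old, w[i] = w[i], rng.uniform(-2.0, 2.0)
            new_acc = parity_accuracy(w)
            if new_acc >= acc:
                acc = new_acc
            else:
                w[i] = old
        if acc == 1.0:
            break
    return w, acc
```

Note that the all-ones weight vector is an exact solution for any N: when k inputs are -1, the activation is cos(pi * k), whose sign (-1)^k is precisely the parity target.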
The simulation results are shown in Table 1. It should be noted that one epoch of both SA and RSA involves relaxing the network across the training set for every weight, so in computational terms their n_epoch values should be multiplied by a factor of (N + 1).

Table 1: The parity-N problem: the table shows n_conv, the number of runs out of 10 that converged, and n_epoch, the average number of training epochs required when training converged.

        Online BP          Batch BP           SA                 RSA
N       n_conv  n_epoch    n_conv  n_epoch    n_conv  n_epoch    n_conv  n_epoch
6       10      30.4       7       34         10      12.6       10      15.2
8       8       101.3      2       700        10      52.8       10      45.4
10      6       203.3      0       -          10      99.9       10      74.1

For the parity problem it is clear that local learning alone does not provide good convergence. For this problem, global search algorithms have the following advantages: (1) the search space is bounded (all weights are restricted to [-2, +2]); (2) the dimension of the search space is low (a maximum of 11 weights for the problems examined); (3) the fraction of the weight space which satisfies the parity problem relative to the total bounded weight space is high.

In a second set of simulations, one product unit was trained to calculate all 2^(2^N) logical functions of the N input variables. Unfortunately, this is only practical for N in {2, 3}. For N = 2 there are only 16 functions, and a product unit has no problem learning all these functions rapidly with all four training algorithms. In comparison, a single summation unit can learn 14 (not the XOR and XNOR functions). For N = 3, a product unit is able to implement 208 of the 256 functions, while a single summation unit could only implement 104. The simulation results are displayed in Table 2. 
\n\nOnline BP \n\nBatch BP \n\nBP-RSA \n\nTable 2: Learning all logical functions of 3 inputs: The rows display nlogie , the \naverage number of logical functions implemented by a product unit and nepoeh, the \nnumber of epochs required for convergence. Ten simulations were performed for \neach of the 256 logical functions , each for a maximum of 1,000 iterations. \n\n4 Constructive Learning with Product Units \n\nSelecting the optimal network architecture for a specific application is a nontrivial \nand time-consuming task, and several algorithms have been proposed to automate \nthis process. These include pruning methods and growing algorithms. In this section \na simple method is proposed for adding PUs to the hidden layer of a three layer \nnetwork. The output layer contains a single sigmoidal unit. \n\nSeveral constructive algorithms proceed by freezing a subset of the weights and \nlimiting training to the newly added units. As mentioned earlier, for PUs a global \n\n\fLearning with Product Units \n\n541 \n\nTiling AI orithm ~ \n\nUpstart AI orithm I-t--< \nUnits >S-t \n\n81M using Pr \n\n300 \n\n250 \n\n200 \n\n150 \n\n100 \n\n50 \n\ni!: \n0 \n\n~ c \n\n.!; \n