{"title": "Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 530, "page_last": 536, "abstract": null, "full_text": "Training Knowledge-Based Neural Networks to \nRecognize Genes in DNA Sequences \n\nMichiel O. Noordewier \nComputer Science \nRutgers University \nNew Brunswick, NJ 08903 \n\nGeoffrey G. Towell \nComputer Sciences \nUniversity of Wisconsin \nMadison, WI 53706 \n\nJude W. Shavlik \nComputer Sciences \nUniversity of Wisconsin \nMadison, WI 53706 \n\nAbstract \n\nWe describe the application of a hybrid symbolic/connectionist machine \nlearning algorithm to the task of recognizing important genetic sequences. \nThe symbolic portion of the KBANN system utilizes inference rules that \nprovide a roughly-correct method for recognizing a class of DNA sequences \nknown as eukaryotic splice-junctions. We then map this \"domain theory\" \ninto a neural network and provide training examples. Using the samples, \nthe neural network's learning algorithm adjusts the domain theory so that \nit properly classifies these DNA sequences. Our procedure constitutes \na general method for incorporating preexisting knowledge into artificial \nneural networks. We present an experiment in molecular genetics that \ndemonstrates the value of doing so. \n\n1 Introduction \n\nOften one has some preconceived notions about how to perform some classification \ntask. It would be useful to incorporate this knowledge into a neural network, \nand then use some training examples to refine these approximately-correct \nrules of thumb. This paper describes the KBANN (Knowledge-Based Artificial Neural \nNetworks) hybrid learning system and demonstrates its ability to learn in the \ncomplex domain of molecular genetics. 
Briefly, KBANN uses a knowledge base of \nhierarchically-structured rules (which may be both incomplete and incorrect) to \nform an artificial neural network (ANN). In so doing, KBANN makes it possible to \napply neural learning techniques to the empirical improvement of knowledge bases. \n\nThe task to be learned is the recognition of certain DNA (deoxyribonucleic acid) \nsubsequences important in the expression of genes. A large governmental research \nprogram, called the Human Genome Initiative, has recently been undertaken to \ndetermine the sequence of DNA in humans, estimated to be 3 x 10^9 characters of \ninformation. This provides a strong impetus to develop genetic-analysis techniques \nbased solely on the information contained in the sequence, rather than in combination \nwith other chemical, physical, or genetic techniques. DNA contains the \ninformation by which a cell constructs protein molecules. The cellular expression \nof proteins proceeds by the creation of a \"message\" ribonucleic acid (mRNA) copy \nfrom the DNA template (Figure 1). This mRNA is then translated into a protein. \nOne of the most unexpected findings in molecular biology is that large pieces of the \nmRNA are removed before it is translated further [1]. \n\nDNA -> precursor mRNA -> mRNA (after splicing) -> protein -> folded protein \n\nFigure 1: Steps in the Expression of Genes \n\nThe utilized sequences (represented by boxes in Figure 1) are known as \"exons\", \nwhile the removed sequences are known as \"introns\", or intervening sequences. \nSince the discovery of such \"split genes\" over a decade ago, the nature of the \nsplicing event has been the subject of intense research. The points at which DNA \nis removed (the boundaries of the boxes in Figure 1) are known as splice-junctions. 
\nThe splice-junctions of eukaryotic^1 mRNA precursors contain patterns similar to \nthose in Figure 2. \n\nexon | intron | exon \n\n... (A/C) A G | G T (A/G) A G T ... (C/T)_6 X (C/T) A G | G (G/T) ... \n\nFigure 2: Canonical Splice-Junctions \n\nDNA is represented by a string of characters from the set {A,G,C,T}. \nIn this figure, X represents any character, slashes represent disjunctive \noptions, and subscripts indicate repetitions of a pattern. \n\nHowever, numerous other locations can resemble these canonical patterns. As a \nresult, these patterns do not by themselves reliably imply the presence of a splice-junction. \nEvidently, if junctions are to be recognized on the basis of sequence \ninformation alone, longer-range sequence information will have to be included in \nthe decision-making criteria. A central problem is therefore to determine the extent \nto which sequences surrounding splice-junctions differ from sequences surrounding \nspurious analogues. \n\n^1 Eukaryotic cells contain nuclei, unlike prokaryotic cells such as bacteria and viruses. \n\nWe have recently described a method [9, 12] that combines empirical and symbolic \nlearning algorithms to recognize another class of genetic sequences known as \nbacterial promoters. Our hybrid KBANN system was demonstrated to be superior \nto other empirical learning systems including decision trees and nearest-neighbor \nalgorithms. In addition, it was shown to more accurately classify promoters than \nthe methods currently reported in the biological literature. In this manuscript we \ndescribe the application of KBANN to the recognition of splice-junctions, and show \nthat it significantly increases generalization ability when compared to randomly-initialized, \nsingle-hidden-layer networks (i.e., networks configured in the \"usual\" \nway). 
The paper concludes with a discussion of related research and the areas \nwhich our research is currently pursuing. \n\n2 The KBANN Algorithm \n\nKBANN uses a knowledge base of domain-specific inference rules in the form of \nPROLOG-like clauses to define what is initially known about a topic. The knowledge \nbase need be neither complete nor correct; it need only support approximately \ncorrect reasoning. KBANN translates knowledge bases into ANNs in which units \nand links correspond to parts of knowledge bases. A detailed explanation of the \nprocedure used by KBANN to translate rules into an ANN can be found in [12]. \nAs an example of the KBANN method, consider the artificial knowledge base in \nFigure 3a which defines membership in category A. Figure 3b represents the hierarchical \nstructure of these rules: solid and dotted lines represent necessary and \nprohibitory dependencies, respectively. Figure 3c represents the ANN that results \nfrom the translation into a neural network of this knowledge base. Units X and Y in \nFigure 3c are introduced into the ANN to handle the disjunction in the knowledge \nbase. Otherwise, units in the ANN correspond to consequents or antecedents in \nthe knowledge base. The thick lines in Figure 3c represent the links in the ANN \nthat correspond to dependencies in the explanation. The weight on thick solid lines \nis 3, while the weight on thick dotted lines is -3. The lighter solid lines represent \nthe links added to the network to allow refinement of the initial rules. At present, \nKBANN is restricted to non-recursive, propositional (i.e., variable-free) sets of rules. \n\nNumbers beside the unit names in Figure 3c are the biases of the units. These \nbiases are set so that the unit is active if and only if the corresponding consequent \nin the knowledge base is true. \n\nAs this example illustrates, the use of KBANN to initialize ANNs has two principal \nbenefits. 
First, it indicates the features believed to be important to an example's \nclassification. Second, it specifies important derived features; through their deduction \nthe complexity of an ANN's final decision is reduced. \n\nA :- B, C. \nB :- not F, G. \nB :- not H. \nC :- I, J. \n\n(a) knowledge base; (b) hierarchical structure of the rules over F, G, H, I, J, K; (c) resulting ANN \n\nFigure 3: Translation of a Knowledge Base into an ANN \n\n3 Problem Definition \n\nThe splice-junction problem is to determine into which of the following three categories \na specified location in a DNA sequence falls: (1) exon/intron borders, referred \nto as donors, (2) intron/exon borders, referred to as acceptors, and (3) neither. To \naddress this problem we provide KBANN with two sets of information: a set of DNA \nsequences 60 nucleotides long that are classified as to the category membership of \ntheir center and a domain theory that describes when the center of a sequence \ncorresponds to one of these three categories. \n\nTable 1 contains the initial domain theory used in the splice-junction recognition \ntask. A special notation is used to specify locations in the DNA sequence. When a \nrule's antecedents refer to input features, they first state a relative location in the \nsequence vector, then the DNA symbol that must occur (e.g., @3=A). Positions \nare numbered negatively or positively depending on whether they occur before or \nafter the possible junction location. By biological convention, position numbers of \nzero are not used. The set of rules was derived in a straightforward fashion from \nthe biological literature [13]. Briefly, these rules state that a donor or acceptor \nsequence is present if characters from the canonical sequence (Figure 2) are present \nand triplets known as stop codons are absent in the appropriate positions. 
\nThe examples were obtained by taking the documented split genes from all primate \ngene entries in Genbank release 64.1 [1] that are described as complete. Each \ntraining example consists of a window that covers 30 nucleotides before and after \neach donor and acceptor site. This procedure resulted in 751 examples of acceptors \nand 745 examples of donors. Negative examples are derived from similarly-sized \nwindows, which did not cross an intron/exon boundary, sampled at random from \nthese sequences. Note that this differs from the usual practice of generating random \nsequences with base-frequency composition the same as the positive instances. \nHowever, we feel that this provides a more realistic training set, since DNA is known \nto be highly non-random [3]. Although many more negative examples were available, \nwe used approximately as many negative examples as there were donors \nand acceptors combined. Thus, the total data set we used had 3190 examples. \n\nTable 1: Knowledge Base for Splice-Junctions \n\ndonor :- @-3=M, @-2=A, @-1=G, @1=G, @2=T, @3=R, \n         @4=A, @5=G, @6=T, not(don-stop). \n\ndon-stop :- @-3=T, @-2=A, @-1=A. \ndon-stop :- @-3=T, @-2=A, @-1=G. \ndon-stop :- @-3=T, @-2=G, @-1=A. \ndon-stop :- @-4=T, @-3=A, @-2=A. \ndon-stop :- @-4=T, @-3=A, @-2=G. \ndon-stop :- @-4=T, @-3=G, @-2=A. \ndon-stop :- @-5=T, @-4=A, @-3=A. \ndon-stop :- @-5=T, @-4=A, @-3=G. \ndon-stop :- @-5=T, @-4=G, @-3=A. \n\nacceptor :- pyr-rich, @-3=Y, @-2=A, @-1=G, @1=G, @2=K, not(acc-stop). \n\npyr-rich :- 6 of (@-15=Y, @-14=Y, @-13=Y, @-12=Y, @-11=Y, \n            @-10=Y, @-9=Y, @-8=Y, @-7=Y, @-6=Y). \n\nacc-stop :- @1=T, @2=A, @3=A. \nacc-stop :- @1=T, @2=A, @3=G. \nacc-stop :- @1=T, @2=G, @3=A. \nacc-stop :- @2=T, @3=A, @4=A. \nacc-stop :- @2=T, @3=A, @4=G. \nacc-stop :- @2=T, @3=G, @4=A. \nacc-stop :- @3=T, @4=A, @5=A. \nacc-stop :- @3=T, @4=A, @5=G. \nacc-stop :- @3=T, @4=G, @5=A. \n\nR :- A. R :- G. Y :- C. Y :- T. M :- C. M :- A. K :- G. K :- T. \n\nThe network created by KBANN for the splice-junction problem has one output \nunit for each category to be learned, and four input units for each nucleotide in \nthe DNA training sequences, one for each of the four values in the DNA alphabet. In \naddition, the rules for acc-stop, don-stop, R, Y, and M are considered definitional. \nThus, the weights on the links and biases into these units were frozen. Also, the \nsecond rule only requires that six of its 11 antecedents be true. Finally, there are \nno rules in Table 1 for recognizing negative examples. So we added four unassigned \nhidden units and connected them to all of the inputs and to the output for the \nneither category. The final result is that the network created by KBANN has 286 \nunits: 3 output units, 240 input units, 31 fixed-weight hidden units, and 12 tunable \nhidden units. \n\n4 Experimental Results \n\nFigure 4 contains a learning curve plotting the percentage of errors made on a set \nof \"testing\" examples by KBANN-initialized networks, as a function of the number \nof training examples. Training examples were obtained by randomly selecting examples \nfrom the population of 3190 examples described above. Testing examples \nconsisted of all examples in the population that were not used for training. Each \ndata point represents the average of 20 repetitions of this procedure. \n\nFor comparison, the error rate for a randomly-initialized, fully-connected, two-layer \nANN with 24 hidden units is also plotted in Figure 4. (This curve is expected to have \nan error rate of 67% for zero training examples. Test results were slightly better due \nto statistical fluctuations.) Clearly, the KBANN-initialized networks learned faster \nthan randomly-initialized ANNs, making fewer than half the errors of the randomly-initialized \nANNs when there were 100 or fewer training examples. 
However, when large numbers of training examples were provided, the randomly-initialized ANNs \nhad a slightly lower error rate (5.5% vs. 6.4% for KBANN). All of the differences in \nthe figure are statistically significant. \n\nFigure 4: Learning Curve for Splice Junctions (percentage of errors on the testing examples vs. number of training examples, 0 to 2000, for the KBANN network and the randomly-weighted network) \n\n5 Related and Future Research \n\nSeveral others have investigated predicting splice-junctions. Staden [10] has devised \na weight-matrix method that uses a perceptron-like algorithm to find a weighting \nfunction that discriminates two sets (true and false) of boundary patterns in known \nsequences. Nakata et al. [7] employ a combination of methods to distinguish between \nexons and introns, including Fickett's statistical method [5]. When applied to \nhuman sequences in the Genbank database, this approach correctly identified 81% \nof true splice-junctions. Finally, Lapedes et al. [6] also applied neural networks and \ndecision-tree builders to the splice-junction task. They reported neural-network accuracies \nof 92% and claimed their neural-network approach performed significantly \nbetter than the other approaches in the literature at that time. The accuracy we report \nin this paper represents an improvement over these results. However, it should \nbe noted that these experiments were not all performed under the same conditions. \n\nOne weakness of neural networks is that it is hard to understand what they have \nlearned. 
We are investigating methods for the automatic translation into symbolic \nrules of trained KBANN-initialized networks [11]. These techniques take advantage of \nthe human-comprehensible starting configuration of KBANN's networks to create a \nsmall set of hierarchically-structured rules that accurately reflect what the network \nlearned during training. We are also currently investigating the use of richer splice-junction \ndomain theories, which we hope will improve KBANN's accuracy. \n\n6 Conclusion \n\nThe KBANN approach allows ANNs to refine preexisting knowledge, generating ANN \ntopologies that are well-suited to the task they are intended to learn. KBANN does \nthis by using a knowledge base of approximately correct, domain-specific rules to \ndetermine the ANN's structure and initial weights. This provides an alternative to \ntechniques that either shrink [2] or grow [4] networks to the \"right\" size. Our experiments \non splice-junctions, and previously on bacterial promoters [12], demonstrate \nthat the KBANN approach can substantially reduce the number of training examples \nneeded to reach a given level of accuracy on future examples. \n\nThis research was partially supported by Office of Naval Research Grant N00014-90-J-1941, National \nScience Foundation Grant IRI-9002413, and Department of Energy Grant DE-FG02-91ER61129. \n\nReferences \n\n[1] R. J. Breathnach, J. L. Mandel, and P. Chambon. Ovalbumin gene is split in chicken \nDNA. Nature, 270:314-319, 1977. \n\n[2] Y. Le Cun, J. Denker, and S. Solla. Optimal brain damage. Advances in Neural \nInformation Processing Systems 2, pages 598-605, 1990. \n\n[3] G. Dykes, R. Bambara, K. Marians, and R. Wu. On the statistical significance of \nprimary structural features found in DNA-protein interaction sites. Nucleic Acids \nResearch, 2:327-345, 1975. \n\n[4] S. Fahlman and C. Lebiere. The cascade-correlation learning architecture. 
Advances \nin Neural Information Processing Systems 2, pages 524-532, 1990. \n\n[5] J. W. Fickett. Recognition of protein coding regions in DNA sequences. Nucleic Acids \nResearch, 10:5303-5318, 1982. \n\n[6] A. Lapedes, D. Barnes, C. Burks, R. Farber, and K. Sirotkin. Application of neural \nnetworks and other machine learning algorithms to DNA sequence analysis. In \nComputers and DNA, pages 157-182. Addison-Wesley, 1989. \n\n[7] K. Nakata, M. Kanehisa, and C. DeLisi. Prediction of splice junctions in mRNA \nsequences. Nucleic Acids Research, 13:5327-5340, 1985. \n\n[8] M. C. O'Neill. Escherichia coli promoters: 1. Consensus as it relates to spacing \nclass, specificity, repeat substructure, and three dimensional organization. Journal of \nBiological Chemistry, 264:5522-5530, 1989. \n\n[9] J. W. Shavlik and G. G. Towell. An approach to combining explanation-based and \nneural learning algorithms. Connection Science, 1:233-255, 1989. \n\n[10] R. Staden. Computer methods to locate signals in DNA sequences. Nucleic Acids \nResearch, 12:505-519, 1984. \n\n[11] G. G. Towell, M. Craven, and J. W. Shavlik. Automated interpretation of knowledge \nbased neural networks. Technical report, University of Wisconsin, Computer Sciences \nDepartment, Madison, WI, 1991. \n\n[12] G. G. Towell, J. W. Shavlik, and M. O. Noordewier. Refinement of approximately \ncorrect domain theories by knowledge-based neural networks. In Proc. of the Eighth \nNational Conf. on Artificial Intelligence, pages 861-866, Boston, MA, 1990. \n\n[13] J. D. Watson, N. H. Hopkins, J. W. Roberts, J. A. Steitz, and A. M. Weiner. Molecular \nBiology of the Gene, pages 634-647, 1987. \n", "award": [], "sourceid": 387, "authors": [{"given_name": "Michiel", "family_name": "Noordewier", "institution": null}, {"given_name": "Geoffrey", "family_name": "Towell", "institution": null}, {"given_name": "Jude", "family_name": "Shavlik", "institution": null}]}