{"title": "A Neural Network to Detect Homologies in Proteins", "book": "Advances in Neural Information Processing Systems", "page_first": 423, "page_last": 430, "abstract": null, "full_text": "A Neural Network to Detect Homologies in Proteins \n\nYoshua Bengio \nSchool of Computer Science \nMcGill University \nMontreal, Canada H3A 2A7 \n\nSamy Bengio \nDepartement d'Informatique \nUniversite de Montreal \n\nYannick Pouliot \nDepartment of Biology \nMcGill University \nMontreal Neurological Institute \n\nPatrick Agin \nDepartement d'Informatique \nUniversite de Montreal \n\nABSTRACT \n\nIn order to detect the presence and location of immunoglobulin \n(Ig) domains from amino acid sequences we built a system \nbased on a neural network with one hidden layer trained with \nback propagation. The program was designed to efficiently \nidentify proteins exhibiting such domains, characterized by a \nfew localized conserved regions and a low overall homology. \nWhen the National Biomedical Research Foundation (NBRF) \nNEW protein sequence database was scanned to evaluate the \nprogram's performance, we obtained very low rates of false \nnegatives coupled with a moderate rate of false positives. \n\n1 INTRODUCTION \nTwo amino acid sequences from proteins are homologous if they can be \naligned so that many corresponding amino acids are identical or have similar \nchemical properties. Such subsequences (domains) often exhibit similar \nthree-dimensional structure. Furthermore, sequence similarity often results from \ncommon ancestors. Immunoglobulin (Ig) domains are sets of \u03b2-sheets bound \nby cysteine bonds and with a characteristic tertiary structure. Such domains \nare found in many proteins involved in immune, cell adhesion and receptor \nfunctions. 
These proteins collectively form the immunoglobulin superfamily \n(for review, see Williams and Barclay, 1987). Members of the superfamily \noften possess several Ig domains. These domains are characterized by \nwell-conserved groups of amino acids localized to specific subregions. Other \nresidues outside of these regions are often poorly conserved, such that there is \nlow overall homology between Ig domains, even though they are clearly \nmembers of the same superfamily. \nCurrent search programs incorporating algorithms such as the Wilbur-Lipman \nalgorithm (1983) or the Needleman-Wunsch algorithm (1970) and its modification \nby Smith and Waterman (1981) are ill-designed for detecting such \ndomains because they implicitly consider each amino acid to be equally \nimportant. This is not the case for residues within domains such as the Ig \ndomain, since only some amino acids are well conserved, while most are \nvariable. One solution to this problem is to use search algorithms based upon the \nstatistical occurrence of a residue at a particular position (Wang et al., 1989; \nGribskov et al., 1987). The Profile Analysis set of programs published by the \nUniversity of Wisconsin Genetics Computer Group (Devereux et al., 1984) \nrelies upon such an algorithm. Although Profile Analysis can be applied to \nsearch for domains (cf. Blaschuk, Pouliot & Holland 1990), the output from \nthese programs often suffers from a high rate of false negatives and positives. \nVariations in domain length are handled using the traditional method of \npenalties proportional to the number of gaps introduced, their length and their \nposition. This approach entails a significant amount of spurious recognition if \nthere is considerable variation in domain length to be accounted for. \nWe have chosen to address these problems by training a neural network to \nrecognize accepted Ig domains. 
Perceptrons and various types of neural networks \nhave been used previously in biological research with various degrees of \nsuccess (cf. Stormo et al., 1982; Qian and Sejnowski, 1988). Our results suggest \nthat they are well suited for detecting relatively cryptic sequence patterns \nsuch as those which characterize Ig domains. Because the design and training \nprocedure described below are relatively simple, network-based search programs \nconstitute a valid solution to problems such as searching for proteins \nassembled from the duplication of a domain. \n\n2 ALGORITHM, NETWORK DESIGN AND TRAINING \nThe network capitalizes upon data concerning the existence and localization \nof highly conserved groups of amino acids characteristic of the Ig domain. Its \ndesign is similar in several respects to neural networks we have used in the \nstudy of speech recognition (Bengio et al., 1989). Four conserved subregions \n(designated P1-P4) of the Ig domain homology were identified. These roughly \ncorrespond to \u03b2-strands B, C, E and F, respectively, of the Ig domain (see \nalso Williams and Barclay, 1988). Amino acids in these four groups are not \nnecessarily all conserved, but for each subregion they show a distribution very \ndifferent from the distribution generally observed elsewhere in these proteins. \nHence the first and most important stage of the system learns about these \njoint distributions. The program scans proteins using a window of 5 residues. \nThe first stage of the system consists of a 2-layer feedforward neural network \n(5 x 20 inputs - 8 hidden - 4 outputs; see Figure 1) trained with back propagation \n(Rumelhart et al., 1986). Better results were obtained for the recognition \nof these conserved regions with this architecture than without a hidden layer \n(similar to a perceptron). 
The second stage evaluates, based upon the stream \nof outputs generated by the first stage, whether and where a region similar to \nthe Ig domain has been detected. This stage currently uses a simple dynamic \nprogramming algorithm, in which constraints about order of subregions and \ndistance between them are explicitly programmed. We force the recognizer to \ndetect a sequence of high values (above a threshold) for the four conserved \nregions, in the correct order and such that the sum of the values obtained at \nthe four recognized regions is greater than a certain threshold. Weak penalties \nare applied for violations of distance constraints between conserved subregions \n(e.g., distance between P1 and P2, P2 and P3, etc.) based upon simple \nrules derived from our analysis of Ig domains. These rules have little impact if \nstrong homologies are detected, such that the program easily handles the large \nvariation in domain size exhibited by Ig domains. It was necessary to explicitly \nformulate these constraints given the low number of training examples as \nwell as the assumption that the distance between groups is not a critical \ndiscriminating factor. We have assumed that inter-region subsequences probably \ndo not significantly influence discrimination. \n\n[Figure 1 diagram: a window scanning 5 consecutive residues, each coded over the 20 possible amino acids, feeds 8 hidden units, which feed 4 output units representing 4 features of the Ig domain.] \n\nFigure 1: Structure of the neural network \n\nfilename: A22771.NEW \ninput sequence name: Ig epsilon chain C region - Human \nHOMOLOGY starting at 24 \nVTLGCLATGYFPEPVMVTWDTGSLNGTTMTLPATTLTLSGHYATISLLTVSGAWAKQMFTC \nEnding at 84. Score = 3.581 \nHOMOLOGY starting at 130 \nIQLLCLVSGYTPGTINITWLEDGQVMDVDLSTASTTQEGELASTQSELTLSQKHWLSDRTYTC \nEnding at 192. Score = 3.825 \nHOMOLOGY starting at 234 \nPTITCLVVDLAPSKGTVNLTWSRASGKPVNHSTRKEEKQRNGTLTVTSTLPVGTRDWIEGETYQC \nEnding at 298. Score = 3.351 \nHOMOLOGY starting at 340 \nRTLACLIQNFMPEDISVQWLHNEVQLPDARHSTTQPRKTKGSGFFVFSRLEVTRAEWEQKDEFIC \nEnding at 404. Score = 3.402 \n[The P1-P4 markers printed over each sequence are not recoverable from the extracted figure.] \n\nFigure 2: Sample output from a search of NEW. Ig domains \npresent within the constant region of an epsilon Ig chain \n(NBRF file number A22771) are listed with the position of \nP1-P4 (see text). The overall score for each domain is also listed. \n\nAs a training set we used a group of 30 proteins comprising bona fide Ig \ndomains (Williams and Barclay, 1987). In order to increase the size of the \ntraining set, additional sequences were stochastically generated by substituting \nresidues which are not in critical positions of the domain. These substitutions \nwere designed not to affect the local distribution of residues, so as to minimize \nchanges in the overall chemical character of the region. \nThe program was evaluated and optimized by scanning the NBRF protein \ndatabases (PROTEIN and NEW) version 19. Results presented below are based \nupon searches of the NEW database (except where otherwise noted) and were \ngenerated with a cutoff value of 3.0. Only complete sequences from vertebrates, \ninsects (including Drosophila melanogaster) and eukaryotic viruses \nwere scanned. This corresponds to 2422 sequences out of the 4718 present in \nthe NEW database. Trial runs with the program indicated that a cutoff threshold \nof between 2.7 and 3.0 eliminates the vast majority of false positives with \nlittle effect upon the rate of false negatives. A sample output is listed in \nFigure 2. \n\n3 RESULTS \nWhen the NEW protein sequence database of NBRF was searched as \ndescribed above, 191 proteins were identified as possessing at least one Ig \ndomain. 
A scan of the 4718 proteins comprising the NEW database required \nan average of 20 hours of CPU time on a VAX 11/780. This is comparable to \nother computationally intensive programs (e.g., Profile Analysis). When run \non a SUN 4 computer, similar searches required 1.3 hours of CPU time. This \nis sufficiently fast to allow the user to alter the cutoff threshold repeatedly \nwhen searching for proteins with low homology. \n\nTable 1: Output from a search of the NEW protein sequence database. \nDomains are sorted according to overall score. \n\n3.0017 ClAss II hlstocompatlb. ant'fen, Hl,A-OR bec:a- I chain precursor (REM) . Hu,.,.n 3.4295 \" bPPII chain V region - Mouse H 37-10 \n3.429519 bppa chlln V region - Moule H37-&4 \n3.014& NonsJMdf'k: cross\u00b7ructtng an,..,. precursor\u00b7 Human \n3.4295 Ig kappa chlln V regions - Moun Hn-C6 and H22f>2S \n3.0161 ,..teffl-dertylld growth factor receptor precursor \u00b7 Moun \n3.4331 T-uU rectPtOr alpha chain precursor V '~'on IP71) . Mouse \n3.0164 Til class I hlscocomp.alib. ,nUgen. Til-, alpha chain \u00b7 Mouse \n34572 T\u00b7ceU surface glycoprotein CO) epsilon chain - Human \n3.0164 Ta. class I hlstocomPliUb. ant.n. Tj\u00b7 b _Iphll chain\u00b7 Moust \n3.4594 T~en sI,.Ia,. gtycDprote .... CO. precursor\u00b7 Mouse \n3.0223 Vttronectln recept:or alph_ ,h.n precursor ' HUman \n3.4594 T'ul) surrace gtycoproteln lyt\u00b72 precursor\u00b7 Mouse \n3.0226 T-CtllsurfKe gtycoprotetn ly-3 precursor ' Moun \n3.0244 Klnase-,\". trlnstormlng ploteln (srd (EC 2.7. 1.\u00b7) . AVI,an urcomil VirUS \n3.4595 T-c:eII recePior .. ptta chain precursor V region (HAPO$ - Humin \n),0350 It alptt..\" chlln C region - Humin \n3.4606T-c:ell rec_or gamma-2 chlln C region eMHC& Ind MN(9) \u00b7 Mouse \nJ.OJ50 It alptt.., I chain C regIOn - Human \n3.4614 T-c:eII receptor g.nwna ch.ln C region (PfER) \u2022 Human \n3.0J\u00a70 It alph..,2 chlln C region. 
A2m( I) lilorype ' Human \n3.4614 T-c:ell receJKor gamrna-I chlln C region - Hu~n \n3.4614 T-cell receptOr gamrna-2 chlln C region - Human \n3.0-409 Gr.nulocyte-macroph.ge colonv\u00b7sUmulaUng flCcor I precursor - Moust \n3.4620 It heevy chain V regkln - Mouse H 146-2413 \nJ.04I' HLA dass 1 hlstocomparlb. ant~en. Ilph. chain precursor' Human \n3.0492 HADH-ubtquktOne ox~or.uctase (EC 1.6.S.3). chlln 5 - Fruit fly (Drosophila) \n3.4620. heavy chain V region - Mouse HI 5a-I9H4 \n3.0501 NAIlH\u00b7.biq.lnOn \u2022\u2022\u2022 Ido.ed.cu.. (Ee I.6.S.31. chain I \u2022 F .... IIy (O.os.pllilal 3.4620 19 heavy chain\" .eglon . M \u2022\u2022\u2022\u2022 H3S,C& \nl46lO I, heavy chain Pfecursor V region\u00b7 Mouse M~J3 \n3.0511 HLA clas \u2022 htstocomp.Mlb. ant'9tft. DP bet. chain precursor - clone \n3.0511 HLA cia, \u2022 hlstocompatlb. ant...,. DP4 bet. chain ptecunor - HUmin \n3.4690 T-c:eII rec_or beta- I ch.ln e regIOn' Human \n3.4690T-c:eIf receptor beta-I chain C regIOn\u00b7 Moyse \n3.0SISHLA cia, \u2022 hlstocomplltlb . \u2022 nt .... OPW4 bet. I chain ptecursor - Human \n3.0520 Class n histocompaUb . \u2022 nt'gen. HLA-OQ beta ch.ln precursor (REM) - Human 3.4690 T-c:en receptor bK~2 chain C regIOn - Hum.n \n3.4690 T-cell receptor bK~2 chain C regtOn - Human \n].0561 rroteln' ryroSlne kinase (Ee 2.7.1.1 12). lymphocyte - Moun \n3.4769 \u2022 ~3 chain e reg~on. G3m(b) allOrypa - Hum.n \n].0669 H-2 clas. hJstoc:ompaUb. ant'9tft, A\u00b7beca\u00b72 chain ptecursor - Mouse \n] .479& It k.ppa ,haln V region - Mouse H 146-2483 \n3.072] T-cell ree.,.or pnvna cham precursor'll 'eglOn (MNCI) - Mouse \n3.479& It k.ppa Ch.ln V region - Mouse H36-2 \n3.072J T~ reeepeor glfTV1'\\a cham ptecursor 'II regAon IRAeII} - Mouse \n3.479& It kappa ch.m V region - Mouse H37-62 \n3.072J T-cell ree_or glfTV1'\\a cha ... ptecursor 'II 'eglon IRAe4) . Mouse \n3.479& It kappa ch.m V region - Mouse HH\u00b712 \n3.072J T-eeR ree_or glfTV1'\\a chain ptecursor 'II region 'RAC42) . 
Mouse \n3.4110 It kappa chain V-I retlon . HUman WII( I) \n3.072J T-c:eII ree_or glfTV1'\\a chatn ptecursor 'II region (RACSo) . Mouse \n3.48-iO Peroxklase (Ee I.Il.l.n precursor - Human \n3.0750 T-ctl r\u00ab_or bet. cha'\" V region (C.F~ \u00b7 Mouse \n3.4&&& PIa~tv. 9rowth IKIM reeeptor precursor - Mouse \nJ.07&01g hefty Chain V retlon \u2022 Moule 251 .3 \n3._5 N.t<h prot ..... f \u2022\u2022 ,. fly \n3._5 N.tch pr ....... f \u2022\u2022 I. fly \n3.4983: T<\" recepror beta chain precursor V rt9lon (MT I-I) - Human \n3.491J T<eII receptOr beta-2 cheln precursor V regkMt MOlT' 4' Human \n].4991\", kappII chain Pfecursor V region - Mouse Set-. \n3.S035 Alkol ... pII .. pII .... (EC 3. 1.3.11 p.ecU\"Of \u2022 H.man \n3.5061 \"heavy choln\" 'eglo .. \u2022 M .... H 37\u00b7&2 \n\n3.0711 T-col( ,_or bOlA ch .. n \"'eglon (SUp\u00b7T 'I . Hum... \n3.0711 H\u00b7Z cia. I hi>.ocompotlb . .... Igen. Q7 olpllo ch.ln \" .. c.rs\" . Mo... \n3.0717 \"\u00b72 class I hlsrocompatlb . \u2022 nr'left, OS IIlpha ch.ln precursor - Mouse \n3.0912 MytiIn-assoclatld gtycoptoteln 11236 long form precurso' - Rilt \n] .0912 MyefIIt-.socl.rld g~oproteln 1&2]6 shon form precursor - R.t \n3.09&2 MyoIIrtoesoc ... ed 91\\'<.pr \u2022\u2022\u2022 ln precu .. or. b,aln . Ra. \n3.09&2 ~soc\".ed \"'90 glyc.pr .... n prec ...... Rat \n3.0991 Closs I hls.ocompotlb. \"\"'VOn. BolA .Iph. ch.ln prec \u2022\u2022\u2022\u2022\u2022 (BLI\u00b751 . Bovl.... \nJ.099aOass I htstocompatlb_ antigen, loLA alpha chilln precursor (BU\u00b7]) - Bovtne \nJ.I 04& H-2 clas I hlstocompatlb. IIntM1en. K\u00b7\" a6pha chilln precursor\u00b7 Mouse \nJ.I0&61g h.vy chain precursor'll regIOn - Mous. VCAM3 2 \nJ. I 128 T-cell rcepeor .Ipha ,haln precursor V region (MO I 3~ - Mouse \nJ. II29 T<ell ree_or detta chain V region ION\u00b74) . Moule \n3.1192 T<ell rcepeor bet. chain precursor'll region IVAk) - Mouse \n3.126S T -c:eU ree_or glfTV1'\\a cham ptecursor 'II regIOn IK20) \u2022 Human \n3.1 J47 T-c:eU ree_or alph. 
chilln precursor V region (HAPOS) - Human \n3. 1623 T-cen surface gtycoprotetn COl ptecurs.or . Human \n3.1623: T-c:eI surface gtvcoprolelft COl prottln precunor . Human \n3.1776 .... e-nma-3 chain C reQ1on. C]mlb) allotype Humin \n3.1931 HypothetICal proc\",n HQlf 2 \u00b7 C..,tome.;JaloY1rLls Istraln AD 169) \n3.2041 SodIum channel prottln II Rat \n3.20441g huvy chain'll re.;Jlon Afr\"An cla*fd \"og \n3.2141 SURF- I protein' Mouu \n].2207 T-cell recepc:or alpha chain pr~ Uls.or \\0' 1f\"910n (HAP 10~ . Human \n3.2300 1et.-2-mlCroglobulin pre<\",r~Or Hun-,an \n] .2300 Beu-2-mlCroglobulln. modtflfd Human \n3.2106 rreonancy-spt:Clflc bela I Qlyc opror~tn E prf'(u rs or Human \n3.2344lgE Fc receptor Ilpha ,haln prKufSor Hurnan \n3.2420 T-c:ell surflCe Qtycoprotetn C02 pte<unor Pat \n3.2422 H\u00b72 class N htuocompaub .nIl9~n I A I~OOI bf'ta cham precursor - Moun \n3.25;2 HLA elms II hlstocompaub .ntlgen . op.\"..e aloha I cham precursor\u00b7 Human \n3.2552 HLA class II hlsro(ompat lb ,ntlgf'n ~ 8 .Ipha O'laln precursor' Human \n3.2654 T-c,1/ surface glvcoproteln CO&' JI K (hJlln pfKur sor . Rat \n3.2726 Myelin PO ptoteln\u00b7 Bovtne \nJ.21141t .Iptt.., 1 chain e regton Huma\" \n3.21141g .Iph.., I chaIR C recJlon HumM \n3.2120 Thy-I membrane glycoprotein p,e<u,~or Mouse \n3.2&40 5mh clas II hlstocompilub anogen prlKurSor Ehrenberg smote-rat \n1.3039 X-lInkld chronk: granulomatous dlSeast plotem Human \n3.3013 rregnartCy-speclflC bela I Vl lyc oprot<t:ln ( pr<t:(unor Human \n) .3013 Prt9n .... cy-speclfiC beta' I gtycoprotetn (J pr KurSOf . Humiln \n3.30 .... T-cell recepror bell chain precUlsor \\I r~IOn t 16) Human \n3.3251 It pnrna- I Ill) garnma\u00b72b fe receptor p,KU'lor Mouse \n3.]414 HypodMlkll hvbr~ IQIT\u00b7cell receplOr prftuts.or \\I ff:9lon (SUp\u00b7T 1 ~ - Humiln \n3.]414. 
heavy chain precursor V II recJlon Human 71 2 \n3.]414 Ig heavy chain precursor V II reqton Human 71 \u2022 \n3.)417 Nellral celf IdhHk)n prot~ln pfftUnOr Mou se \n3.35 If Ig epsilon chatn C recJlon Human \n1 35 I I Ig epsYon chain C recJlon . HUman \n3.]S22 T-c:ell rectpc:or alpha chain V reqlon (80fl alpha I) Moun \n3.J605 lIft.ry gtycoptoteln I . Humil\" \n3.3131 T-c:eII receptor garrvnil-I chain C 'ecJlon IMNGI and MNCn - Moust \n3.]131 T-c:ell ,eceplor gamma I chain C IPglOn Mouu \n3.3861 T\u00b7cell 9IftVTIa chain precursor V rf:'CJlon ('II j) Moun \n3.4024 Ie ep51\"n chilln C recJlon . Human \n3.4024\" epSolkJn chain C region \u00b7 Human \n3.4110 Ig heavy chain V region \u00b7 Mouse Hl6-2 \n3.41 3J I, heavy chlln 'II region \u00b7 Mouse H]7\u00b760 \n3.41521g heavy chain V rec)lon . Mouse H 18-S.1 S \n3.41 S5 191 kappe chlln V region\u00b7 Mouse HP9 \n3.4171191 heavy chain'll region\u00b7 Mouse If6 \n3.4191 Ig kappa! chain'll region \u00b7 Mouse HieS .4l) I \n3.4199lg heavy cha\"l V region \u00b7 Mouse ]010 \n3.4199 \u2022 heavy cha,\" V regIon\u00b7 Mouse II CR kt I I \n3.4211 191 heavy Chal\" V r<t:9lon \u00b7 MOllse HPll and HP27 \n3.421] Prt9nancy,spt:Clflc b<t:ta\u00b7 I glycoprotein ( prKursor . Human \n3.4213 Prt9nancy\u00b7speclfK btla\u00b7 1 g tyC OPfO(tln 0 prKursor . HUman \n3421 I T\u00b7celt receplor beta chain prKUlsor V ffglon (4 C3) . Mouse \n].4211 T-cell receptor beta chain precursor 1/ region (810) Mouse \n34212 Sodium channel prott'ln II Rat \n3 429S Ig kappa ch.ln V rt'Qlon (HZ8-A.1) Mouse H28-A2 \n3429519 kilppa ch-lln V r~lon . Mous.e H I S& 89H4 \n3.429519 kappa chain V recJlon Mouse H 37 ] I I \n3.4295 Ig kappa chain V region \u00b7 MouS<t: H]] 40 \n3.429S Ig kappa chain V ft:qlon Mouse H 3 7 \") \n) 4295 ~ k.3ppa chain V rt:910n Mouse Hll 45 \n\n3.SO&2 Closs R hlsllOCompotlb. 1In_ HIA-DR botaoZ ch .... proc.rs.r (REMI . H.m ... \n\n3.5012 H-2 class. hlstocomPllttb. antigen. \u00a3.a/k bet .. 
2 chain PfKyrsor - Mouse \n3.5012 H-2 class n hlstoCOlnpallb. .,It~. E1I beu-2 ch.ln precursor' Mouse \n] .SOI2 HLA class II hlstocompallb. anll9en, OR I beta chain (cklne 69) - Humin \n3.5012 HLA class. hlsbKompKIb. antigen. OR bela chain precursor \n3.S012 HLA class II hlstocompatlb_ anUgtft. Ollt beta chain precursor A5) - Hum.n \n3.5012 HLA class I hlstocomp.lltib. antlgen, Ollt- I bet. ch .... precursor - Human \n3.5012 HLA class I hlstocompilrlb. antlten. OR-4 betll chain' Human \n3.S012 HLA class h hlstocompatlb. anrlttft, DR-5 kli chain precursor' Hum.n \n3.509419 IiIm~S chlln C region - Mouse \n3.S 144lg .lphl\u00b72 ch.\", e region. A2m( I) alforype - Human \n3.5150 Ig heavy chain V region\u00b7 Mouse H2a-A2 \n3.5180 Biliary gtycoprDtein I- Human \n3.5193 Ig heavy chain V region - Mouse H37-45 \n3 5193 Ig heavy chain V regions - Mouse HJ7\u00b780 and H]7-43 \n35211 Ig IMnbda chain ptecursor V region' Rat \n15264 Ig huvy chain V region - Mouse H ]7-62 \n] S]161g heavy chain V region - Mouse H37\u00b7 311 \n] 533419 heavy chain V region \u00b7 Mouse HH\u00b74O \n] SJ72 T'cl'll receptor beta cha,n precyrsor V region (ATlI2'2) . Human \n3 S435 Ig heavy chain V region - Mouse HleS\u00b7401 \n3 SS79 Ig heavy chain V region - Mouse H]7-14 \n35603 Ig IMnbdl\u00b72 chain e region - Ral \n3.5666 J9 heavy chain V region - Moust 8 I\u00b7 , henEallve slquence) \n] 5709ll11ary glyCoprotein I- Human \n] 5741 Nonspecific cross-reacting antigen precursor - Human \n35115 Ig epsilon chain e region\u00b7 Human \n3.5115 Ig epsilon chain e rec,kln ' Human \n3.5194 Neur.1 cell adheskln ptoteln precursor\u00b7 Mouse \n3.5912 Ig bppa chain V region - Mouse H]7-60 \n35971 Ig kilppa chain precursor'll region - Rat IR2 \n36020 Ig kappa chain V region\u00b7 Mous. 
IF6 \n] 6020 fg kappa chilin V region\u00b7 Mouse 3010 \n36027 T'cell receptor beell chain V region (K~ATU - Human \n36071 19 heavy chlln V region \u00b7 Mouse HP20 \n36071 Ig heavy chain V regktn \u2022 Mouse HP25 \n1.6120 T-cl'll receptor alptta ch.ln V regIOn (5c.en - Mouse \n36'20 T-cell receptor alptta ch.'n V region (U~ \u2022 Mouse \n3.6120 T-cell receptor alph. ch.ln Pfecursor V region (214) \u2022 Mouse \n3.6120 T-cell receptor alptt. chain ptecursor V region (4.e]) . Mouse \n] 6120 T\u00b7cell rec.pc:or Ilptt. chilin precursor'll reqlon (810) \u2022 Mouse \n].6302 HLA class 1/ hlSloCompatlb. antigen OX alpha chain prKursor - Human \n] 6302 HLA class .. hlstocompatlb. anlAgen. OQ alph. chain precursor' Humiln \n36461 T-ul! receptor alptta chilln precursor V region (HAPSIij - Human \n] 646S Ig kappa chain precursor V ch.'n - Moys. s.e,-b \n36539 Heur.1 un adhesion ptotetn precursor\u00b7 Mouse \n3.6636Ig huvy chain V region - Mouse BI -&'VI1V2 (untatlve slquence) \n] 6771 Ig kappa chain precursor V-HI regAon - Human SU\u00b7OHl\u00b76 \n36791 Ig kappa chain V region - Mouse H Ia-S415 \n36&)J Myelln-assoclatld gtvc:optoteln 11236 tong form ptecursor \u00b7 Rae: \n3.6&)) Myelln'ilSsoc\"tld g~op,oteln IB236 shon form prKursor . Rat \n3.6&)3 Myelln-MsOClatld g~oproteln precursor. brain ~ Rar \n].6&)J Myebn\u00b7assocl.r:. lar,. gtyc:oproteln precursor \u00b7 Rat \n3.7102 It kappa chain V-III 'eglon - HUman C8 \n3.7170 Ig kappa chain V-I regIon ' HUman WII(2) \n] 7341 Ig lambdl chain e region' Chicken \n] 7505 Ig hppa chain precursor V\u00b7I region\u00b7 Human Natm-6 \n] 75351g heavy chain precursor V regIOn - Mouse 129 \n] 7600 Ig lambda\u00b75 chain C region - Mouse \n3.7779 19 h~avy chain V reg60n - Mouse HP 12 \n] 790719 kappa chain V region 30S precursor - Humiln \n] 790719 kappa chain precursor '1,111- Human Nalm-6 \n] 7909 19 heavy chain V region\u00b7 Mouse HP21 \n] &017 Ntural cell adhHk)n proUtn precursor' Mouse \n] 81 ao Ig mu chain e rtglon. 
b allele\u00b7 Mouse \n3824719 epSilon chain C region - Human \n3 8247 ~ epsilon chilln e region - Human \n3 &440 ~ kilppa chll\" precursor V region\u00b7 Mouse MAkH \n3867119 klppa chain precursor II region\u00b7 Rat IRI62 \n\n\f428 \n\nBengio, Bengio, Pouliot and Agin \n\nTable 2: Efficiency of detection for some Ig superfamily pro(cid:173)\nteins present in NEW. Mean scores of recognized Ig domains \nfor each protein type are listed. Recognition efficiency is cal(cid:173)\nculated by dividing the number of proteins correctly identified \n(Le., bearing at least one Ig domain) by the total number of \nproteins identified by their file description as containing an Ig \ndomain, multiplied by 100. Numbers in parentheses indicate \nthe number of complete protein sequences of each type for \neach species. All complete sequences for light and heavy im(cid:173)\nmunoglobulin chains of human and mouse origin were \nscanned. The threshold was set at 3.0. ND: not done. \n\nMean score of \n\nRecognition emciency for \n\ndetected domains \n\n(max 4.00) \n\n3.50 \n\n3.48 \n\n3.33 \n\n3.36 \n\n3.32 \n\n3.41 \n\nIg-bearing proteins \n\n(see le2end) \n98.2 % (55) \n\n93.8 % (16) \n\nND \n\nND \n\nND \n\nND \n\nProtein \n\nImmunoglobulins, \nmouse, \nall forms \nImmunoglobulins, \nhuman, \nall forms \nH-2 class II, \nall forms \nHLA class II, \nall forms \nT-cell receptor \nchains, \nmouse, \nall forms \nT-cell receptor \nchains, \nhuman, \nall forms \n\nThe vast majority of proteins which scored above 3.0 were of human, mouse, \nrat or rabbit origin. A few viral and insect proteins also scored above the \nthreshold. All proteins in the training set and present in either the NEW or \nPROTEIN databases were detected. Proteins detected in the NEW database \nare listed in Table I and sorted according to score. Even though only human \nMHC class I and II were included in the training set, both mouse H-2 class I \nand II were detected. 
Bovine and rat transplantation antigens were also \ndetected. These proteins are homologs of human MHC antigens. For proteins which \ninclude more than one Ig domain contiguously arranged (e.g., carcinoembryonic \nantigen), all domains were detected if they were sufficiently well \nconserved. However, domains lacking a feature or possessing a degenerate \nfeature scored much lower (usually below 3.0) such that they are not recognized \nwhen using a threshold value of 3.0. Recognition of human and mouse \nimmunoglobulin sequences was used to measure recognition efficiency. The rate \nof false negatives for immunoglobulins was very low for both species (Table \n2). Table 3 lists the 13 proteins categorized as false positives detected when \nsearching with a threshold of 3.0. Relative to the total number of domains \ndetected, this corresponds to a false positive rate of 6.8%. In the strict sense \nsome of these proteins are not false positives because they do exhibit the \nexpected features of the Ig domain in the correct order. However, inter-feature \ndistances for these pseudo-domains are very different from those observed in \nbona fide Ig domains. Proteins which are rich in \u03b2-sheets, such as rat sodium \nchannel II and fruit-fly NADH-ubiquinone oxidoreductase chain 1, are also \nabundant among the set of false positives. This is not surprising since the Ig \ndomain is composed of \u03b2-strands. One solution to this problem lies in the use \nof a larger training set as well as the addition of a more intelligent second \nstage designed to evaluate inter-feature distances so as to increase the specificity \nof detection. \nTable 3: False positives obtained when searching NEW with a threshold of \n3.0. Proteins categorized as false positives are listed. See text for details. 
\n\n3.0244 Kinase-related transforming protein (src) (EC 2.7.1.-) \n3.0409 Granulocyte-macrophage colony-stimulating \n3.0492 NADH-ubiquinone oxidoreductase (EC 1.6.5.3), chain 5 \n3.0508 NADH-ubiquinone oxidoreductase (EC 1.6.5.3), chain 1 \n3.0561 Protein-tyrosine kinase (EC 2.7.1.112), lymphocyte - Mouse \n3.1931 Hypothetical protein HQLF2 - Cytomegalovirus (strain AD169) \n3.2041 Sodium channel protein II - Rat \n3.2147 SURF-1 protein - Mouse \n3.3039 X-linked chronic granulomatous disease protein - Human \n3.4840 Peroxidase (EC 1.11.1.7) precursor - Human \n3.4965 Notch protein - Fruit fly \n3.4965 Notch protein - Fruit fly \n3.5035 Alkaline phosphatase (EC 3.1.3.1) precursor - Human \n\n4 DISCUSSION \nThe detection of specific protein domains is becoming increasingly important \nsince many proteins are composed of a succession of domains. Unfortunately, \ndomains (Ig or otherwise) are often only weakly homologous with each \nother. We have designed a neural network to detect proteins which comprise \nIg domains to evaluate this approach in helping to solve this problem. \nAlternatives to neural network-based search programs exist. Search programs can \nbe designed to recognize the flanking Cys-termini regions to the exclusion of \nother domain features since these flanks are the best conserved features of Ig \ndomains (cf. Wang et al., 1989). However, even Cys-termini can exhibit poor \noverall homology and therefore generate statistically insignificant homology \nscores when analyzed with the ALIGN program (NBRF) (cf. Williams and \nBarclay, 1987). Other search programs (such as Profile Analysis) cannot efficiently \nhandle the large variations in domain size exhibited by the Ig domain \n(mostly between 45 and 70 residues). Search results become corrupted \nby high rates of false positives and negatives. 
Since the size of the \nNBRF protein databases increases considerably each year, the problem of \nfalse positives promises to become crippling if these rates are not substantially \ndecreased. In view of these problems we have found the application of a \nneural network to the detection of Ig domains to be an advantageous solution. \nAs the state of biological knowledge advances, new Ig domains can be added \nto the training set and training resumed. Neural networks can learn the statistical features \nof the conserved subregions that permit detection of an Ig domain and generalize \nto new examples of this domain that have a similar distribution. Previously \nunrecognized and possibly degenerate homologous sequences are therefore \nlikely to be detected. \nAcknowledgments \nThis research was supported by a grant from the Canadian Natural Sciences \nand Engineering Research Council to Y.B. We thank CISTI for graciously allowing \nus access to their experimental BIOMOLE system. \nReferences \nBengio Y., Cardin R., De Mori R., Merlo E. (1989) Programmable execution \nof multi-layered networks for automatic speech recognition, Communications \nof the Association for Computing Machinery, 32 (2). \nBengio Y., Cardin R., De Mori R. (1990) Speaker independent speech \nrecognition with neural networks and speech knowledge, in D.S. Touretzky \n(ed.), Advances in Neural Information Processing Systems 2. \nBlaschuk O.W., Pouliot Y., Holland P.C. (1990) Identification of a conserved \nregion common to cadherins and influenza strain A hemagglutinins. J. \nMolec. Biology, 1990, in press. \nDevereux, J., Haeberli, P. and Smithies, O. (1984) A comprehensive set of \nsequence analysis programs for the VAX. Nucl. Acids Res. 12, 387-395. \nGribskov, M., McLachlan, M., and Eisenberg, D. (1987) Profile analysis: \nDetection of distantly related proteins. Proc. Natl. Acad. Sci. 
USA 84, \n4355-4358. \nNeedleman, S. B. and Wunsch, C. D. (1970) A general method applicable to \nthe search for similarities in the amino acid sequence of two proteins. J. Mol. \nBiol. 48, 443-453. \nQian, N. and Sejnowski, T. J. (1988) Predicting the secondary structure of \nglobular proteins using neural network models. J. Mol. Biol. 202, 865-884. \nRumelhart D.E., Hinton G.E. & Williams R.J. (1986) Learning internal \nrepresentations by error propagation. Parallel Distributed Processing, Vol. 1, \nMIT Press, Cambridge, pp. 318-362. \nSmith, T. F. and Waterman, M. S. (1981) Identification of common molecular \nsubsequences. J. Mol. Biol. 147, 195-197. \nStormo, G. D., Schneider, T. D., Gold, L. and Ehrenfeucht, A. (1982) Use of the \n\"perceptron\" algorithm to distinguish translational initiation sites in E. coli. \nNucl. Acids Res. 10, 2997-3010. \nWang, H., Wu, J. and Tang, P. (1989) Superfamily expands. Nature 337, 514. \nWilbur, W. J. and Lipman, D. J. (1983) Rapid similarity searches of nucleic \nacids and protein data banks. Proc. Natl. Acad. Sci. USA 80, 726-730. \nWilliams, A. F. and Barclay, N. A. (1988) The immunoglobulin superfamily - \ndomains for cell surface recognition. Ann. Rev. Immunol. 6, 381-405. \n", "award": [], "sourceid": 214, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Samy", "family_name": "Bengio", "institution": null}, {"given_name": "Yannick", "family_name": "Pouliot", "institution": null}, {"given_name": "Patrick", "family_name": "Agin", "institution": null}]}