{"title": "Prediction of Beta Sheets in Proteins", "book": "Advances in Neural Information Processing Systems", "page_first": 917, "page_last": 923, "abstract": null, "full_text": "Prediction of Beta Sheets in Proteins\n\nAnders Krogh\nThe Sanger Centre\nHinxton, Cambs CB10 1RQ, UK.\nEmail: krogh@sanger.ac.uk\n\nSøren Kamaric Riis\nElectronics Institute, Building 349\nTechnical University of Denmark\n2800 Lyngby, Denmark\nEmail: riis@ei.dtu.dk\n\nAbstract\n\nMost current methods for prediction of protein secondary structure use a small window of the protein sequence to predict the structure of the central amino acid. We describe a new method for prediction of the non-local structure called β-sheet, which consists of two or more β-strands that are connected by hydrogen bonds. Since β-strands are often widely separated in the protein chain, a network with two windows is introduced. After training on a set of proteins the network predicts the sheets well, but there are many false positives. By using a global energy function the β-sheet prediction is combined with a local prediction of the three secondary structures α-helix, β-strand and coil. The energy function is minimized using simulated annealing to give a final prediction.\n\n1 INTRODUCTION\n\nProteins are long sequences of amino acids. There are 20 different amino acids with varying chemical properties, e.g., some are hydrophobic (dislike water) and some are hydrophilic [1]. It is convenient to represent each amino acid by a letter, so the sequence of amino acids in a protein (the primary structure) can be written as a string with a typical length of 100 to 500 letters. A protein chain folds back on itself, and the resulting 3D structure (the tertiary structure) is highly correlated to the function of the protein. 
The prediction of the 3D structure from the primary structure is one of the long-standing unsolved problems in molecular biology. As an important step on the way, a lot of work has been devoted to predicting the local conformation of the protein chain, which is called the secondary structure. Neural network methods are currently the most successful for predicting secondary structure. The approach was pioneered by Qian and Sejnowski [2] and Bohr et al. [3], and later extended in various ways; see e.g. [4] for an overview. In most of this work, only the two regular secondary structure elements α-helix and β-strand are distinguished, and everything else is labeled coil. Thus, the methods based on a local window of amino acids give a three-state prediction of the secondary structure of the central amino acid in the window.\n\nFigure 1: Left: Anti-parallel β-sheet. The vertical lines correspond to the backbone of the protein. An amino acid consists of N-Cα-C and a side chain on the Cα that is not shown (the 20 amino acids are distinguished by different side chains). In the anti-parallel sheet the directions of the strands alternate, which is here indicated quite explicitly by showing the middle strand upside down. The H-bonds between the strands are shown by |||||||| marks. A sheet has two or more strands; here the anti-parallel sheet is shown with three strands. Right: Parallel β-sheet consisting of two strands.\n\nCurrent predictions of secondary structure based on single sequences as input have accuracies of about 65-66%. 
It is widely believed that this accuracy is close to the limit of what can be done from a local window (using only single sequences as input) [5], because interactions between amino acids far apart in the protein chain are important to the structure. A good example of such non-local interactions is the β-sheet, which consists of two or more β-strands interconnected by H-bonds, see fig. 1. Often the β-strands in a sheet are widely separated in the sequence, implying that only part of the available sequence information about a β-sheet can be contained in a window of, say, 13 amino acids. This is one of the reasons why the accuracy of β-strand predictions is generally lower than the accuracy of α-helix predictions. The aim of this work is to improve prediction of secondary structures by combining local predictions of α-helix, β-strand and coil with a non-local method predicting β-sheets.\n\nOther work along the same lines includes [6], in which β-sheet predictions are done by linear methods, and [7], where a so-called density network is applied to the problem.\n\n2 A NEURAL NETWORK WITH TWO WINDOWS\n\nWe aim at capturing correlations in the β-sheets by using a neural network with two windows, see fig. 2. While window 1 is centered around amino acid number i ($a_i$), window 2 slides along the rest of the chain. When the amino acids centered in each of the two windows sit opposite each other in a β-sheet the target output is 1, and otherwise 0. After the whole protein has been traversed by window 2, window 1 is moved to the next position (i + 1) and the procedure is repeated. 
If the protein is L amino acids long this procedure yields an output value for each of the L(L - 1)/2 pairs of amino acids. We display the output in an L x L gray-scale image as shown in fig. 3. We assume symmetry of sheets, i.e., if the two windows are interchanged, the output does not change. This symmetry is ensured (approximately) during training by presenting all inputs in both directions.\n\nFigure 2: Neural network for predicting β-sheets. The network employs weight sharing to improve the encoding of the amino acids and to reduce the number of adjustable parameters.\n\nEach window of the network sees K amino acids. An amino acid is represented by a vector of 20 binary numbers all being zero, except one, which is 1. That is, the amino acid A is represented by the vector 1,0,0,...,0 and so on. This coding ensures that the input representations are uncorrelated, but it is a very inefficient coding, since 20 amino acids could in principle be represented by only 5 bits. Therefore, we use weight sharing [8] to learn a better encoding [4]. The 20 input units corresponding to one window position are fully connected to three hidden units. The 3 x (20 + 1) weights to these units are shared by all window positions, i.e., the activation of the 3 hidden units is a new learned encoding of the amino acids, so instead of being represented by 20 binary values they are represented by 3 real values. Of course the number of units for this encoding can be varied, but initial experiments showed that 3 was optimal [4]. The two windows of the network are made the same way with the same number of inputs etc. The first layers of hidden units in the two windows are fully connected to a hidden layer which is fully connected to the output unit, see fig. 2. 
Furthermore, two structurally identical networks are used: one for parallel and one for anti-parallel β-sheets.\n\nThe basis for the training set in this study is the set of 126 non-homologous protein chains used in [9], but chains forming β-sheets with other chains are excluded. This leaves us with 85 proteins in our data set. For a protein of length L only a very small fraction of the L(L - 1)/2 pairs are positive examples of β-sheet pairs. Therefore it is very important to balance the positive and negative examples to avoid the situation where the network always predicts no β-sheet. Furthermore, there are several types of negative examples with quite different frequencies: 1) two amino acids neither of which belongs to a β-sheet; 2) one in a β-sheet and one which is not in a β-sheet; 3) two sitting in β-sheets, but not opposite each other. The balancing was done in the following way. For each positive example selected at random, a negative example from each of the three categories was selected at random.\n\nIf the network does not have a second layer of hidden units, it turns out that the result is no better than a network with only one input window, i.e., the network cannot capture correlations between the two windows. Initial experiments indicated that about 10 units in the second hidden layer and two identical input windows of size K = 9 gave the best results. In fig. 3(left) the prediction of anti-parallel sheets is shown for the protein identified as 1acx in the Brookhaven Protein Data Bank [10]. First of all, one notices the checkerboard structure of the prediction of β-sheets. This is related to the structure of β-sheets. Many sheets are hydrophobic on one side and hydrophilic on the other. The side chains of the amino acids in a strand alternate between the two sides of the sheet, and this gives rise to the periodicity responsible for the pattern.\n\nFigure 3: Left: The prediction of anti-parallel β-sheets in the protein 1acx. In the upper triangle the correct structure is shown by a black square for each β-sheet pair. The lower triangle shows the prediction by the two-window network. For any pair of amino acids the network output is a number between zero (white) and one (black), and it is displayed by a linear gray-scale. The diagonal shows the prediction of α-helices. Right: The same display for parallel β-sheets in the protein 4fxn. Notice that the correct structures are lines parallel to the diagonal, whereas they are perpendicular for anti-parallel sheets. For both cases the network was trained on a training set that did not contain the protein for which the result is shown.\n\nAnother network was trained on parallel β-sheets. These are rare compared to the anti-parallel ones, so the amount of training data is limited. In fig. 3(right) the result is shown for protein 4fxn. This prediction seems better than the one obtained for anti-parallel sheets, although false positive predictions still occur at some positions with strands that do not pair. Strands that bind in parallel β-sheets are generally more widely separated in the sequence than strands in anti-parallel sheets. Therefore, one can imagine that the strands in parallel sheets have to be more correlated to find each other in the folding process, which would explain the better prediction accuracy. 
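As a rough illustration of the input scheme in this section, the following Python sketch builds the one-hot inputs for two windows and enumerates every pair of positions, presenting each pair in both orders as the text describes. It is only a sketch and not the authors' implementation: the alphabetical residue ordering, the all-zero padding at the chain ends, the toy sequence and the function names are assumptions made here for illustration, and no network is trained.

```python
from itertools import combinations

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'  # 20 one-letter codes (ordering assumed)
K = 9  # window size reported as giving the best results


def one_hot(residue):
    # 20 binary inputs, all zero except one. Any other symbol gives an
    # all-zero vector, used here for padding (an assumption).
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]


def window(sequence, center, k=K):
    # Concatenated one-hot inputs for the k residues centered on `center`.
    half = k // 2
    inputs = []
    for pos in range(center - half, center + half + 1):
        residue = sequence[pos] if 0 <= pos < len(sequence) else '-'
        inputs.extend(one_hot(residue))
    return inputs


def two_window_examples(sequence, paired):
    # Yield (input, target) for every pair i < j. `paired` is a set of
    # (i, j) positions that sit opposite each other in a sheet.
    for i, j in combinations(range(len(sequence)), 2):
        target = 1 if (i, j) in paired else 0
        w1, w2 = window(sequence, i), window(sequence, j)
        # Both orders are presented so the output becomes symmetric.
        yield w1 + w2, target
        yield w2 + w1, target


seq = 'MKVLAAGIVGLW'  # made-up toy sequence, not a real protein
examples = list(two_window_examples(seq, paired={(2, 9)}))
n_pairs = len(seq) * (len(seq) - 1) // 2  # L(L - 1)/2 pairs
```

With only one positive pair among L(L - 1)/2 candidates, even this toy case shows why the positive and negative examples have to be balanced before training.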
\n\nThe results shown in fig. 3 are fairly representative. The network misses some of the sheets, but false positives present a more severe problem. By calculating correlation coefficients we can show that the network does capture some correlations, but they seem to be weak. Based on these results, we hypothesize that the formation of β-sheets is only weakly dependent on correlations between corresponding β-strands. This is quite surprising. However weak these correlations are, we believe they can still improve the accuracy of the three-state secondary structure prediction. In order to combine local methods with the non-local β-sheet prediction, we introduce a global energy function as described below.\n\n3 A GLOBAL ENERGY FUNCTION\n\nWe use a newly developed local neural network method based on one input window [4] to give an initial prediction of the three possible structures. The output from this network is constrained by softmax [11], and can thus be interpreted as the probabilities for each of the three structures. That is, for amino acid $a_i$, it yields three numbers $p_{i,n}$, n = 1, 2 or 3, indicating the probability of α-helix ($p_{i,1}$), β-sheet ($p_{i,2}$), or coil ($p_{i,3}$). Define $s_{i,n} = 1$ if amino acid i is assigned structure n and $s_{i,n} = 0$ otherwise. Also define $h_{i,n} = \\log p_{i,n}$. We now construct the 'energy function'\n\n$$H_s(s) = -\\sum_i \\sum_n u_n h_{i,n} s_{i,n}, \\qquad (1)$$\n\nwhere weights $u_n$ are introduced for later usage. Assuming the probabilities $p_{i,n}$ are independent for any two amino acids in a sequence, this is the negative log likelihood of the assigned secondary structure represented by s, provided that $u_n = 1$. 
As it stands, alone, it is a fairly trivial energy function, because the minimum is the assignment which corresponds to the prediction with the maximum $p_{i,n}$ at each position i, i.e., the assignment of secondary structure that one would probably use anyway.\n\nFor amino acids $a_i$ and $a_j$ the logarithm of the output of the β-sheet network described previously is called $q^p_{ij}$ for parallel β-sheets and $q^a_{ij}$ for anti-parallel sheets. We interpret these numbers as the gain in energy if a β-sheet pair is formed. (As more terms are added to the energy, the interpretation as a log-likelihood function is gradually fading.) If the two amino acids form a pair in a parallel β-sheet, we set the variable $T^p_{ij}$ equal to 1, and otherwise to 0, and similarly with $T^a_{ij}$ for anti-parallel sheets. Thus the $T^a_{ij}$ and $T^p_{ij}$ are sparse binary matrices. Now the total energy of the β-sheets can be expressed as\n\n$$H_\\beta(s, T^a, T^p) = -\\sum_{ij} [C_a q^a_{ij} T^a_{ij} + C_p q^p_{ij} T^p_{ij}], \\qquad (2)$$\n\nwhere $C_a$ and $C_p$ determine the weights of the two terms in the function. Since an amino acid can only be in one structure, the dynamic T and s variables are constrained: Only $T^a_{ij}$ or $T^p_{ij}$ can be 1 for the same (i, j), and if any of them is 1 the amino acids involved must be in a β-sheet, so $s_{i,2} = s_{j,2} = 1$. Also, $s_{i,2}$ can only be 1 if there exists a j with either $T^a_{ij}$ or $T^p_{ij}$ equal to 1. Because of these constraints we have indicated an s dependence of $H_\\beta$.\n\nThe last term in our energy function introduces correlations between neighboring amino acids. The above assumption that the secondary structures of the amino acids are independent is of course a bad assumption, and we try to repair it with a term\n\n$$H_n(s) = \\sum_i \\sum_{nm} J_{nm} s_{i,n} s_{i+1,m}, \\qquad (3)$$\n\nthat introduces nearest neighbor interactions in the chain. 
A negative $J_{11}$, for instance, means that an α following an α is favored, and e.g., a positive $J_{12}$ discourages a β following an α.\n\nNow the total energy is the sum of the three terms (1)-(3),\n\n$$H_{total}(s, T^a, T^p) = H_s(s) + H_\\beta(s, T^a, T^p) + H_n(s). \\qquad (4)$$\n\nSince β-sheets are introduced in two ways, through $h_{i,2}$ and $q_{ij}$, we need the weights $u_n$ in (1) to be different from 1.\n\nThe total energy function (4) has some resemblance to a so-called Potts glass in an external field [12]. The crucial difference is that the couplings between the 'spins' $s_i$ are dependent on the dynamic variables T. Another analogy of the energy function is to image analysis, where couplings like the T's are sometimes used as edge elements.\n\n3.1 PARAMETER ESTIMATION\n\nThe energy function contains a number of parameters, $u_n$, $C_a$, $C_p$ and $J_{nm}$. These parameters were estimated by a method inspired by Boltzmann learning [13]. In the Boltzmann machine the estimation of the weights can be formulated as a minimization of the difference between the free energy of the 'clamped' system and that of the 'free-running' system [14]. If we think of our energy function as a free energy (at zero temperature), it corresponds to minimizing the difference between the energy of the correct protein structure and the minimum energy,\n\n$$C = \\sum_{\\mu=1}^{p} [H_{total}(\\hat{s}(\\mu), \\hat{T}^a(\\mu), \\hat{T}^p(\\mu)) - H_{total}(\\tilde{s}(\\mu), \\tilde{T}^a(\\mu), \\tilde{T}^p(\\mu))],$$\n\nwhere p is the total number of proteins in the training set. Here the correct structure of protein μ is called $\\hat{s}(\\mu), \\hat{T}^a(\\mu), \\hat{T}^p(\\mu)$, whereas $\\tilde{s}(\\mu), \\tilde{T}^a(\\mu), \\tilde{T}^p(\\mu)$ represents the structure that minimizes the energy $H_{total}$. By definition the second term of C is less than the first, so C is bounded from below by zero.\n\nThe cost function C is minimized by gradient descent in the parameters. This is in principle straightforward, because all the parameters appear linearly in $H_{total}$. 
\nHowever, a problem with this approach is that C is minimal when all the parameters are set to zero, because then the energy is zero. This is cured by constraining some of the parameters in $H_{total}$. We chose the constraint $\\sum_n u_n = 1$. This may not be the perfect solution from a theoretical point of view, but it works well. Another problem with this approach is that one has to find the minimum of the energy $H_{total}$ in the dynamic variables in each iteration of the gradient descent procedure. To globally minimize the function by simulated annealing each time would be very costly in terms of computer time. Instead of using the (global) minimum of the energy for each protein, we use the energy obtained by minimizing the energy starting from the correct structure. This minimization is done by a greedy algorithm in the following way. In each iteration the change in s, $T^a$, $T^p$ which results in the largest decrease in $H_{total}$ is carried out. This is repeated until any change will increase $H_{total}$. This algorithm works towards a local stability of the protein structures in the training set. We believe it is not only an efficient way of doing it, but also a very sensible way. In fact, the method may well be applicable in other models, such as Boltzmann machines.\n\n3.2 STRUCTURE PREDICTION BY SIMULATED ANNEALING\n\nAfter estimation of the parameters on which the energy function $H_{total}$ depends, we can proceed to predict the structure of new proteins. This was done using simulated annealing and the EBSA package [15]. The total procedure for prediction is,\n\n1. A neural net predicts α-helix, β-strand or coil. The logarithm of these predictions gives all the $h_{i,n}$ for that protein.\n\n2. The two-window neural networks predict the β-sheets. The result is the $q^p_{ij}$ from one network and the $q^a_{ij}$ from the other. 
\n\n3. A random configuration of s, $T^a$, $T^p$ variables is generated, from which the simulated annealing minimization of $H_{total}$ is started. During annealing, all constraints on the s, $T^a$, $T^p$ variables are strictly enforced.\n\n4. The final minimum configuration s is the prediction of the secondary structure. The β-sheets are predicted by $T^a$ and $T^p$.\n\nUsing the above scheme, an average secondary structure accuracy of 66.5% is obtained by seven-fold cross validation. This should be compared to 66.3% obtained by the local neural network based method [4] on the same data set. Although these preliminary results do not represent a significant improvement, we consider them very encouraging for future work. Because the method not only predicts the secondary structure, but also which strands actually bind to form β-sheets, even a modest result may be an important step on the way to full 3D predictions.\n\n4 CONCLUSION\n\nIn this paper we introduced several novel ideas which may be applicable in other contexts than prediction of protein structure. Firstly, we described a neural network with two input windows that was used for predicting the non-local structure called β-sheets. Secondly, we combined local predictions of α-helix, β-strand and coil with the β-sheet prediction by minimization of a global energy function. Thirdly, we showed how the adjustable parameters in the energy function could be estimated by a method similar to Boltzmann learning.\n\nWe found that correlations between β-strands in β-sheets are surprisingly weak. Using the energy function to combine predictions improves performance a little. Although we have not solved the protein folding problem, we consider the results very encouraging for future work. 
This will include attempts to improve the performance of the two-window network as well as experimenting with the energy function, and maybe adding more terms to incorporate new constraints.\n\nAcknowledgments: We would like to thank Tim Hubbard, Richard Durbin and Benny Lautrup for interesting comments on this work and Peter Salamon and Richard Frost for assisting with simulated annealing. This work was supported by a grant from the Novo Nordisk Foundation.\n\nReferences\n\n[1] C. Branden and J. Tooze, Introduction to Protein Structure (Garland Publishing, Inc., New York, 1991).\n[2] N. Qian and T. Sejnowski, Journal of Molecular Biology 202, 865 (1988).\n[3] H. Bohr et al., FEBS Letters 241, 223 (1988).\n[4] S. Riis and A. Krogh, Nordita Preprint 95/34 S, submitted to J. Comp. Biol.\n[5] B. Rost, C. Sander, and R. Schneider, J. Mol. Biol. 235, 13 (1994).\n[6] T. Hubbard, in Proc. of the 27th HICSS, edited by R. Lathrop (IEEE Computer Soc. Press, 1994), pp. 336-354.\n[7] D. J. C. MacKay, in Maximum Entropy and Bayesian Methods, Cambridge 1994, edited by J. Skilling and S. Sibisi (Kluwer, Dordrecht, 1995).\n[8] Y. Le Cun et al., Neural Computation 1, 541 (1989).\n[9] B. Rost and C. Sander, Proteins 19, 55 (1994).\n[10] F. Bernstein et al., J. Mol. Biol. 112, 535 (1977).\n[11] J. Bridle, in Neural Information Processing Systems 2, edited by D. Touretzky (Morgan Kaufmann, San Mateo, CA, 1990), pp. 211-217.\n[12] K. Fischer and J. Hertz, Spin Glasses (Cambridge University Press, 1991).\n[13] D. Ackley, G. Hinton, and T. Sejnowski, Cognitive Science 9, 147 (1985).\n[14] J. Hertz, A. Krogh, and R. Palmer, Introduction to the Theory of Neural Computation (Addison-Wesley, Redwood City, 1991).\n[15] R. 
Frost, SDSC EBSA, C Library Documentation, version 2.1. SDSC Techreport.", "award": [], "sourceid": 1082, "authors": [{"given_name": "Anders", "family_name": "Krogh", "institution": null}, {"given_name": "Soren", "family_name": "Riis", "institution": null}]}