Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Sean O'rourke, Gal Chechik, Robin Friedman, Eleazar Eskin*

Amino acid profiles, which capture position-specific mutation prob- abilities, are a richer encoding of biological sequences than the in- dividual sequences themselves. However, profile comparisons are much more computationally expensive than discrete symbol com- parisons, making profiles impractical for many large datasets. Fur- thermore, because they are such a rich representation, profiles can be difficult to visualize. To overcome these problems, we propose a discretization for profiles using an expanded alphabet representing not just individual amino acids, but common profiles. By using an extension of information bottleneck (IB) incorporating constraints and priors on the class distributions, we find an informationally optimal alphabet. This discretization yields a concise, informative textual representation for profile sequences. Also alignments be- tween these sequences, while nearly as accurate as the full profile- profile alignments, can be computed almost as quickly as those between individual or consensus sequences. A full pairwise align- ment of SwissProt would take years using profiles, but less than 3 days using a discrete IB encoding, illustrating how discrete en- coding can expand the range of sequence problems to which profile information can be applied.

1 Introduction

One of the most powerful techniques in protein analysis is the comparison of a target amino acid sequence with phylogenetically related or homologous proteins. Such comparisons give insight into which portions of the protein are important by revealing the parts that were conserved through natural selection. While mutations in non-functional regions may be harmless, mutations in functional regions are often lethal. For this reason, functional regions of a protein tend to be conserved between organisms while non-functional regions diverge.

```
Department of Computer Science and Engineering, University of California San Diego Department of Computer Science, Stanford University
```

Many of the state-of-the-art protein analysis techniques incorporate homologous sequences by representing a set of homologous sequences as a probabilistic profile, a sequence of the marginal distributions of amino acids at each position in the sequence. For example, Yona et al.[10] uses profiles to align distant homologues from the SCOP database[3]; the resulting alignments are similar to results from structural alignments, and tend to reflect both secondary and tertiary protein structure. The PHD algorithm[5] uses profiles purely for structure prediction. PSIBLAST[6] uses them to refine database searches.

Although profiles provide a lot of information about the sequence, the use of pro- files comes at a steep price. While extremely efficient string algorithms exist for aligning protein sequences (Smith-Waterman[8]) and performing database queries (BLAST[6]), these algorithms operate on strings and are not immediately applica- ble to profile alignment or profile database queries. While profile-based methods can be substantially more accurate than sequence-based ones, they can require at least an order of magnitude more computation time, since substitution penalties must be calculated by computing distances between probability distributions. This makes profiles impractical for use with large bioinformatics databases like SwissProt, which recently passed 150,000 sequences. Another drawback of profile as compared to string representations is that it is much more difficult to visually interpret a sequence of 20 dimensional vectors than a sequence of letters.

Discretizing the profiles addresses both of these problems. First, once a profile is rep- resented using a discrete alphabet, alignment and database search can be performed using the efficient string algorithms developed for sequences. For example, when aligning sequences of 1000 elements, runtime decreases from 20 seconds for profiles to 2 for discrete sequences. Second, by representing each class as a letter, discretized profiles can be presented in plain text like the original or consensus sequences, while conveying more information about the underlying profiles. This makes them more accurate than consensus sequences, and more dense than sequence logos (see figure 1). To make this representation intuitive, we want the discretization not only to minimize information loss, but also to reflect biologically meaningful categories by forming a superset of the standard 20-character amino acid alphabet. For example, we use "A" and "a" for strongly- and weakly-conserved Alanine. This formulation demands two types of constraints: similarities of the centroids to predefined values, and specific structural similarities between strongly- and weakly-conserved variants. We show below how these constraints can be added to the original IB formalism.

In this paper, we present a new discrete representation of proteins that takes into account information from homologues. The main idea behind our approach is to compress the space of probabilistic profiles in a data-dependent manner by clustering the actual profiles and representing them by a small alphabet of distributions. Since this discretization removes some of the information carried by the full profiles, we cluster the distribution in a way that is directly targeted at minimizing the information loss. This is achieved using a variant of Information Bottleneck (IB)[9], a distributional clustering approach for informationally optimal discretization.

We apply our algorithm to a subset of MEROPS[4], a database of peptidases or- ganized structurally by family and clan, and analyze the results in terms of both information loss and alignment quality. We show that multivariate IB in particular preserves much of the information in the original profiles using a small number of classes. Furthermore, optimal alignments for profile sequences encoded with these classes are much closer to the original profile-profile alignments than are alignments between the seed proteins. IB discretization is therefore an attractive way to gain some of the additional sensitivity of profiles with less computational cost.

0.0 0.0 0.0 0.09 0.34 0.23 0.12 0.0 0.0 0.0 0.0 0.0 0.0 0.04 0.01 0.01 0.03 0.0 0.0 0.0 0.0 0.0 1.0 0.01 0.05 0.14 0.09 0.0 1.0 0.0 0.0 0.0 0.0 0.38 0.04 0.00 0.04 0.0 0.0 0.0 0.0 0.0 0.0 0.06 0.00 0.08 0.04 0.0 0.0 1.0 0.0 0.0 0.0 0.00 0.06 0.01 0.03 1.0 0.0 0.0 0.0 0.0 0.0 0.02 0.00 0.04 0.00 0.0 0.0 0.0 N 0.0 0.0 0.0 0.00 0.00 0.03 0.00 0.0 0.0 0.0 ND GDF 0.0 0.0 0.0 0.04 0.01 0.01 0.00 0.0 0.0 0.0 S EAAS S V PT S A A T D F F D N G S L K D Q

```
E
T R N H
F
Y
V 0.0 0.0 0.0 0.01 0.01 0.00 0.09 0.0 0.0 0.0 Q Y
E
A A A A A A
```

0.0 0.0 0.0 0.00 0.00 0.03 0.00 0.0 0.0 0.0 (b) 0.5 1.0 0.0 0.05 0.05 0.01 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.02 0.00 0.23 0.00 0.0 0.0 0.0 P00790 Seq.: ---EAPT--- 0.0 0.0 0.0 0.04 0.05 0.00 0.00 0.0 0.0 0.0 Consensus Seq.: NNDEAASGDF 0.0 0.0 0.0 0.04 0.01 0.00 0.00 0.0 0.0 0.0 0.5 0.0 0.0 0.16 0.10 0.06 0.29 0.0 0.0 0.0 IB Seq.: NNDeaptGDF 0.0 0.0 0.0 0.02 0.10 0.05 0.20 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.14 0.03 0.04 0.0 0.0 0.0 (c) 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.01 0.00 0.04 0.04 0.0 0.0 0.0

```
(a)
```

Figure 1: (a) Profile, (b) sequence logo[2], and (c) textual representations for part of an alignment of Pepsin A precursor P00790, showing IB's concision compared to profiles and logos, and its precision compared to single sequences.

2 Information Bottleneck

Information Bottleneck [9] is an information theoretic approach for distributional clustering. Given a joint distribution p(X, Y ) of two random variables X and Y , the goal is to obtain a compressed representation C of X, while preserving the informa- tion about Y . The two goals of compression and information preservation are quan- tified by the same measure of mutual information I(X; Y ) = p(x, y) log p(x,y) x,y p(x)p(y) and the problem is therefore defined as the constrained optimization problem minp(c|x):I(C;Y )>K I(C; X) where K is a constraint on the level of information preserved about Y , and the problem should also obey the constraints p(y|c) = p(y|x)p(x|c) and p(y) = p(y|x)p(x). This constrained optimization can be x x reformulated using Lagrange multipliers, and turned into a tradeoff optimization function with Lagrange multiplier :

```
min L def = I(C; X) - I(C; Y ) (1) p(c|x)
```

As an unsupervised learning technique, IB aims to characterize the set of solutions for the complete spectrum of constraint values K. This set of solutions is identical to the set of solutions of the tradeoff optimization problem obtained for the spectrum of values.

When X is discrete, its natural compression is fuzzy clustering. In this case, the problem is not convex and cannot be guaranteed to contain a single global minimum. Fortunately, its solutions can be characterized analytically by a set of self consistent equations. These self consistent equations can then be used in an iterative algorithm that is guaranteed to converge to a local minimum. While the optimal solutions of the IB functional are in general soft clusters, in practice, hard cluster solutions are sometimes more easily interpreted. A series of algorithms was developed for hard IB, including an algorithm that can be viewed as a one-step look-ahead sequential version of K-Means [7].

To apply IB to the problem of profiles discretization discussed here, X is a given set of probabilistic profiles obtained from a set of aligned sequences and Y is the set of 20 amino acids.

2.1 Constraints on centroids' semantics

The application studied in this paper differs from standard IB applications in that we are interested in obtaining a representation that is both efficient and biologi- cally meaningful. This requires that we add two kinds of constraints on clusters' distributions, discussed below.

First, some clusters' meanings are naturally determined by limiting them to corre- spond to the common 20-letter alphabet used to describe amino acids. From the point of view of distributions over amino acids, each of these symbols is used today as the delta function distribution which is fully concentrated on a single amino acid. For the goal of finding an efficient representation, we require the centroids to be close to these delta distributions. More generally, we require the centroids to be close to some predefined values ^ ci, thus adding constraints to the IB target function of the form DKL[p(y|^ ci)||p(y|ci)] < Ki for each constrained centroid. While solving the constrained optimization problem is difficult, the corresponding tradeoff opti- mization problem can be made very similar to standard IB. With the additional constraints, the IB trade-off optimization problem becomes

```
min L I(C; X) - I(C; Y ) + (ci)DKL[p(y|^ ci)||p(y|ci)] . (2) p(c|x) ciC
```

We now use the following identity

```
p(x, c)DKL[p(y|x)||p(y|c)] x,c
= p(x) p(y|x) log p(y|x) - p(c) log p(y|c) p(y|x)p(x|c)
x y c y x
= -H(Y |X) + H(Y |C) = I(X; Y ) - I(Y ; C)
```

to rewrite the IB functional of Eq. (1) as

```
L = I(C; X) + p(x, c)DKL[p(y|x)||p(y|c)] - I(X; Y ) cC xX
```

When (ci) 1 we can similarly rewrite Eq. (2) as

```
L = I(C; X) + p(x) p(ci|x)DKL[p(y|x)||p(y|ci)] (3) xX ciC
+ (ci)DKL[p(y|^ ci)||p(y|ci)] - I(X; Y ) ciC
= I(C; X) + p(x ) p(ci|x )DKL[p(y|x )||p(y|ci)] - I(X; Y ) x X ciC
```

The optimization problem therefore becomes equivalent to the original IB problem, but with a modified set of samples x X , containing X plus additional "pseudo- counts" or biases. This is similar to the inclusion of priors in Bayesian estimation. Formulated this way, the biases can be easily incorporated in standard IB algorithms by adding additional pseudo-counts x with prior probability p(x ) = i(c).

2.2 Constraints on relations between centroids

We want our discretization to capture correlations between strongly- and weakly- conserved variants of the same symbol. This can be done with standard IB using

separate classes for the alternatives. However, since the distributions of other amino acids in these two variants are likely to be related, it is preferable to define a single shared prior for both variants, and to learn a model capturing their correlation.

Friedman et al.[1] describe multivariate information bottleneck (mIB), an extension of information bottleneck to joint distributions over several correlated input and cluster variables. For profile discretization, we define two compression variables connected as in Friedman's "parallel IB": an amino acid class C {A, C, . . .} with an associated prior, and a strength S {0, 1}. Since this model correlates strong and weak variants of each category, it requires fewer priors than simple IB. It also has fewer parameters: a multivariate model with ns strengths and nc classes has as many categories as a univariate one with nc = nsnc classes, but has only ns +nc -2 free parameters for each x, instead of nsnc - 1.

Do not remove: This comment is monitored to verify that the site is working properly