Alignment Independent Classification of Proteins

School of Medicine @ University of Pittsburgh

1. Introduction

Conventional approaches for protein sequence classification usually employ sequence alignment methods. However, alignment-based methods have limitations. They are based on the assumption that contiguity is conserved between homologous segments, which may not be true in genetic recombination or horizontal transfer. Alignments also become ambiguous when sequence similarity drops below 40% and become unusable when this level reaches 20-25%.

We propose alignment-independent classification of proteins (AICP) based on N-gram patterns. N-gram patterns in protein sequences are combinations of n residues and m gaps in a window of size m+n (NP{n,m}). Features of interest in NP{4,2} patterns include: (1) the inclusion of all possible n-gram combinations for n = {1-4}; (2) a window wide enough to capture the n+2 and n+4 sequence periodicities associated with alpha helices and beta sheets; (3) an implied scoring matrix due to the presence of gaps at variable positions; (4) a low probability of finding redundant n-gram patterns in the same sequence; (5) a high probability of family membership for two sequences that contain the same pair of non-overlapping NP{4,2} patterns; and (6) the existence of all theoretically possible NP{4,2} patterns in nature. These features reflect a mixture of statistical properties related to local propensity as well as evolution.
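To make the NP{n,m} definition concrete, the following minimal Python sketch enumerates gapped patterns from a sequence. It rests on assumptions not stated in the text: every choice of n residue positions within the window is treated as a valid pattern (15 masks for NP{4,2}), gaps are written as "_", and the function name gapped_patterns is hypothetical. The published method may restrict or encode the patterns differently.

```python
from itertools import combinations


def gapped_patterns(sequence, n=4, m=2, gap="_"):
    """Yield (offset, pattern) for each window of size n+m in `sequence`.

    Assumption: every way of keeping n residues and masking m positions
    inside the window counts as one pattern.
    """
    width = n + m
    # Each mask is a tuple of the window positions that keep their residue.
    masks = list(combinations(range(width), n))
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        for keep in masks:
            pattern = "".join(
                window[i] if i in keep else gap for i in range(width)
            )
            yield offset, pattern


if __name__ == "__main__":
    for offset, pattern in gapped_patterns("ACDEFG"):
        print(offset, pattern)
```

Under these assumptions a sequence of length L yields 15 * (L - 5) NP{4,2} patterns, which can be used as keys in a lookup table for comparing sequences without alignment.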

In 2004, we demonstrated that non-alignment-based protein classification algorithms using the NP{4,2} pattern could successfully identify homologous relationships. Since N-gram patterns (n=4, m=2) are statistical units well suited for correlation with other parameters derived from sequence and 3-dimensional coordinate data, we are extending the previous work into other areas, including (1) correlating n-gram patterns with protein secondary structure prediction; (2) correlating n-gram patterns with the slow modes (functional motions) predicted by the Gaussian Network Model (GNM); and (3) a new method for constructing conservation profiles.

2. N-gram patterns and secondary structure prediction

Since NP{4,2} patterns capture all combinations of n-grams for n = {1-4} as well as n+2 and n+4 periodicities, they should also correlate with secondary structure propensity. In recent years, improved performance has been achieved in secondary structure prediction algorithms by combining evolutionary and local propensity information. Because NP{4,2} patterns reflect both types of information, they might be useful in secondary structure classification algorithms. To quantify the relationship between NP{4,2} patterns and secondary structure, studies were carried out on the coordinate files of the protein chains in the Protein Data Bank (PDB).

3. N-gram conservation profile

Conservation of certain regions of the sequence across members of a protein family points to important roles for these regions, which are often part of the structural core or of the functional sites of the protein. In essence, the main motivation for building and analyzing phylogenetic data is to identify such regions.

Current methods usually require construction of multiple sequence alignments. Once the alignment has been constructed, one can inspect the amino acid composition of each column. Various methods can then be applied to estimate the degree of conservation of each site, using entropy or substitution matrices. Constructing a multiple sequence alignment for a protein family is, however, not a trivial task. An additional limitation is that, for optimal results, multiple alignments should be constructed separately for each domain in the protein.
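For reference, the entropy-based column scoring used by alignment-dependent methods can be sketched as follows. This is a generic Shannon-entropy illustration, not a specific published scoring scheme; the function names are hypothetical, and gap characters are treated as ordinary symbols for simplicity.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def alignment_entropies(aligned_sequences):
    """Entropy of every column; all sequences must have the same length."""
    return [column_entropy(col) for col in zip(*aligned_sequences)]


if __name__ == "__main__":
    toy_alignment = ["ACD", "ACE", "AGE", "ACF"]
    print(alignment_entropies(toy_alignment))
```

Low entropy marks a conserved column, and a fully conserved column scores zero. The scores are only as good as the alignment they are computed from, which is the dependency the approach described next tries to avoid.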

We are currently working on a method based on N-gram distributions as an alternative way to construct conservation profiles. The basic idea is to find sequences that possess two N-grams in common with a query sequence, at the same offset as in the query sequence. The conservation profile is then built by counting the number of times each position in the query sequence is part of such N-grams. With this strategy there is no need to find family members by sequence alignment or to construct explicit multiple alignments. In addition, there is no need to treat different domains separately: a reliable conservation profile for the entire protein can potentially be obtained at once.
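The counting idea can be illustrated with a short Python sketch. It is one possible reading of the description, not the authors' implementation, and it makes several assumptions: contiguous k-mers (k = 4) stand in for gapped NP{4,2} patterns, "same offset" is interpreted as two shared k-mers separated by the same distance in both sequences, the two k-mers must not overlap, and each database sequence contributes at most one count per query position. The function name conservation_profile is hypothetical.

```python
from collections import defaultdict


def conservation_profile(query, database, k=4):
    """Count, per query position, the database sequences supporting it."""
    # Index every contiguous k-mer of the query by its start position.
    query_index = defaultdict(list)
    for i in range(len(query) - k + 1):
        query_index[query[i:i + k]].append(i)

    profile = [0] * len(query)
    for seq in database:
        # Group shared k-mers by shift (subject start minus query start).
        by_shift = defaultdict(set)
        for j in range(len(seq) - k + 1):
            for i in query_index.get(seq[j:j + k], ()):
                by_shift[j - i].add(i)

        covered = set()
        for starts in by_shift.values():
            # Assumption: require two non-overlapping shared k-mers
            # that sit at the same shift relative to the query.
            if max(starts) - min(starts) >= k:
                for i in starts:
                    covered.update(range(i, i + k))

        # Each sequence adds at most one count to a query position.
        for pos in covered:
            profile[pos] += 1
    return profile


if __name__ == "__main__":
    query = "MKTAYIAKQRQISFVK"
    database = ["GGMKTAWWWWQRQIGG", "MKTAWWQRQI"]
    print(conservation_profile(query, database))
```

In this toy run only the first database sequence contributes, because its two shared k-mers keep the same spacing as in the query; the second shares the same k-mers at a different spacing and is ignored. Positions with high counts would be read as conserved.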

Conservation Profile Service (Please check back later)

 

References

"A Sequence Alignment-Independent Method for Protein Classification." John K. Vries, Rajan Munshi, Dror Tobi, Judith Klein-Seetharaman, Panayiotis V. Benos, and Ivet Bahar. Applied Bioinformatics, 2004; 3(2-3): 137-148.
"The Relationship between N-gram Patterns and Secondary Structure." John K. Vries, Xiong Liu, and Ivet Bahar. In submission.

 

