Alignment-Independent Classification of Proteins
School of Medicine @ University of Pittsburgh
1. Introduction
Conventional approaches for protein sequence classification usually
employ sequence alignment methods. However, alignment-based methods
have limitations. They assume that contiguity is conserved between
homologous segments, which may not hold after genetic recombination
or horizontal transfer. Alignments also become ambiguous when
sequence similarity drops below 40% and become unusable when it
reaches 20-25%.
We propose alignment-independent classification of proteins (AICP)
based on N-gram patterns. N-gram patterns in protein sequences are
combinations of n residues and m gaps in a window of size m+n
(NP{n,m}).
Features of interest in NP{4,2} patterns include: (1) the inclusion
of all possible n-gram combinations for n = {1-4}; (2) a window wide
enough to capture the n+2 and n+4 sequence periodicities associated
with alpha helices and beta sheets; (3) an implied scoring matrix
due to the presence of gaps at variable positions; (4) a low
probability for finding redundant n-gram patterns in the same
sequence; (5) a high probability of family membership for two
sequences that contain the same pair of non-overlapping NP{4,2}
patterns; and (6) the existence of all theoretically possible
NP{4,2} patterns in nature. These features reflect a mixture of
statistical properties related to local propensity as well as
evolution.
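The NP{n,m} definition above can be illustrated with a minimal Python
sketch. It assumes that every choice of n retained positions inside
the window of size n+m counts as a distinct pattern, including masks
that begin or end with a gap; the text does not state this rule, so
treat it as an assumption. The function name np_patterns and the "-"
gap symbol are hypothetical.

```python
from itertools import combinations

def np_patterns(seq, n=4, m=2, gap="-"):
    """Yield (start, pattern) for every NP{n,m} pattern found in seq.

    Assumption: all C(n+m, n) ways of keeping n residues in the window
    are used as masks. The source does not specify which masks apply.
    """
    width = n + m
    masks = list(combinations(range(width), n))  # 15 masks for NP{4,2}
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        for keep in masks:
            # Keep residues at the chosen positions, mask the rest as gaps.
            pattern = "".join(
                window[i] if i in keep else gap for i in range(width)
            )
            yield start, pattern

patterns = list(np_patterns("MKVLAAGIV"))
print(len(patterns))  # 4 windows x 15 masks = 60
print(patterns[0])    # (0, 'MKVL--')
```

Under this assumption each window of six residues contributes 15
gapped patterns, which is how gaps end up at variable positions.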
In 2004, we demonstrated that non-alignment based protein
classification algorithms using the NP{4,2} pattern could
successfully identify homologous relationships. Since N-gram
patterns (n=4, m=2) are statistical units well suited for
correlation with other parameters derived from sequence and
3-dimensional coordinate data, we are extending the previous work
into other areas, including (1) applying n-gram patterns to protein
secondary structure prediction; (2) correlating n-gram patterns with
the slow modes (functional motions) predicted by the Gaussian
Network Model (GNM); and (3) a new method for constructing
conservation profiles, among others.
2. N-gram patterns and secondary structure prediction
Since NP{4,2} patterns capture all combinations of n-grams for n =
{1-4} as well as n+2 and n+4 periodicities, they should also
correlate with secondary structure propensity. In recent years,
improved performance has been achieved in secondary structure
prediction algorithms by combining evolutionary and local propensity
information. Because NP{4,2} patterns reflect both types of
information, they might be useful in secondary structure
classification algorithms. To quantify the relationship between
NP{4,2} patterns and secondary structure, studies were carried out
on the coordinate files of the protein chains in the Protein Data
Bank (PDB).
3. N-gram conservation profile
Conservation of certain sequence regions across the members of a
protein family points to important roles for these regions, which
are often part of the structural core or of the functional sites of
the protein. In essence, the main motivation for building and
analyzing phylogenetic data is to identify such regions.
Current methods usually require the construction of multiple
sequence alignments. Once the alignment is built, one can inspect
the amino acid composition of each column. Various methods can then
be applied to estimate the degree of conservation of each site,
using entropy or substitution matrices.
Constructing a multiple sequence alignment for a protein family is,
however, neither a trivial nor a simple task. An additional
limitation is that, for optimal results, multiple alignments should
be constructed separately for each domain in the protein.
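For comparison, the entropy-based column score used by conventional
alignment methods can be sketched as follows. This is a generic
Shannon entropy calculation, not code from this project. It assumes
the alignment is already available as equal-length strings and treats
a gap character as just another symbol, which real tools often handle
differently. The name column_entropy and the toy alignment are
illustrative only.

```python
from collections import Counter
from math import log2

def column_entropy(alignment):
    """Return the Shannon entropy (in bits) of each alignment column.

    Low entropy means a conserved column; 0.0 means fully conserved.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    entropies = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment)
        total = sum(counts.values())
        # H = sum over symbols of p * log2(1 / p)
        h = sum((c / total) * log2(total / c) for c in counts.values())
        entropies.append(h)
    return entropies

toy_alignment = ["ACDE", "ACDF", "ACEG", "ACFH"]
print(column_entropy(toy_alignment))  # [0.0, 0.0, 1.5, 2.0]
```

The first two columns are fully conserved and score 0.0, while the
last column, where every sequence differs, reaches the maximum of
2 bits for four sequences.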
We are currently working on an alternative method for constructing
conservation profiles, based on N-gram distributions. The basic idea
is to find sequences that possess two N-grams in common with a query
sequence, at the same offset as in the query. The conservation
profile is built by counting the number of times each position in
the query sequence is part of such N-grams. With this strategy there
is no need to find family members by sequence alignment or to
construct explicit multiple alignments. In addition, there is no
need to deal with different domains independently. A reliable
conservation profile for the entire protein can potentially be
obtained at once.
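The counting idea can be sketched in Python under several stated
assumptions, since the method is still being developed and its
details are not given here. The sketch uses contiguous 4-grams
instead of gapped NP{4,2} patterns, reads "same offset" as "the two
shared N-grams are separated by the same distance in both sequences",
requires the two N-grams not to overlap, and counts each supported
query N-gram once per database sequence. The function name and the
toy sequences are hypothetical.

```python
from collections import defaultdict

def ngram_conservation_profile(query, database, n=4):
    """Count, per query position, how often it lies in a shared N-gram pair."""
    # Index every contiguous n-gram of the query by its start positions.
    index = defaultdict(list)
    for q in range(len(query) - n + 1):
        index[query[q:q + n]].append(q)

    profile = [0] * len(query)
    for seq in database:
        # Group matches by diagonal (database start - query start): two
        # matches on one diagonal have the same spacing in both sequences.
        diagonals = defaultdict(set)
        for d in range(len(seq) - n + 1):
            for q in index.get(seq[d:d + n], ()):
                diagonals[d - q].add(q)
        supported = set()
        for starts in diagonals.values():
            lo, hi = min(starts), max(starts)
            if hi - lo < n:
                continue  # no non-overlapping pair on this diagonal
            # Keep each match that has a non-overlapping partner.
            supported.update(q for q in starts if q - lo >= n or hi - q >= n)
        # Each supported query n-gram counts once per database sequence.
        for q in supported:
            for pos in range(q, q + n):
                profile[pos] += 1
    return profile

query = "ACDEFGHIKLMN"
database = ["ACDEWWHIKLYY", "WWACDEFGHIWW", "ACDEPPPPPP"]
print(ngram_conservation_profile(query, database))
# [2, 2, 2, 2, 1, 1, 2, 2, 1, 1, 0, 0]
```

The third toy sequence shares only one N-gram with the query and so
contributes nothing, which mirrors the requirement of two common
N-grams before a sequence is counted.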
Conservation Profile Service (Please check back later)
References
"A Sequence Alignment-Independent Method for Protein Classification"
John K. Vries, Rajan Munshi, Dror Tobi, Judith Klein-Seetharaman,
Panayiotis V. Benos and Ivet Bahar. Applied Bioinformatics,
2004;3(2-3):137-48.
"The Relationship between N-gram Patterns and Secondary Structure"
John K. Vries, Xiong Liu and Ivet Bahar, in submission.