Homepage

A.N.Belozersky Institute

GeneBee

Russian EMBnet Node

Putative phylogenetic trees prediction for sequence multiple alignment by cluster and topological algorithms.


REFERENCES:


ALGORITHM:

Because of possible long duration of the treatment the result will be sent to you by E-mail along with it's WWW demonstration.

The source sequence multiple alignment could be formed from low and upper case letters.
Upper case letters will show that there is a homology between correspondent fragments in alignment.

Building of probable phylogenetic trees is bases on the matrix of pair distances between sequences. These distances are calculated from counting residual change weights in aligned sequences according to the following formulas:


d(i,j)=1-(S(i,j) - Sr(i,j))/(Smax(i,j) - Sr(i,j)),

where

S is the sum of weights for residue exchange of a given pair of sequences i and j; Sr - analogous sum for the same randomized sequences; Smax - maximum possible sum (reached when sequences coincide).

This is mostly used formula. It take into account that similarity of the sequences can't be less than some definite (random) level.

Another variant of the formula (Sr = 0):


d(i,j) = 1 - S(i,j)/Smax(i,j),

The third variant of the formula smooths the distinction of the small similarities.


d(i,j) = -ln (S(i,j)/(Smax(i,j)).


We use the following weight MATRICES for aminoacids exchange:

  • Dayhoff matrix (modified 250 PAM matrix from Atlas of Protein sequence and structure,v.5, suppl. 3, pp.345-358):
         A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
    A   12
    C    8 22
    D   10  5 14
    E   10  5 13 14
    F    6  6  4  5 19
    G   11  7 11 10  5 15
    H    9  7 11 11  8  8 16
    I    9  8  8  8 11  7  8 15
    K    9  5 10 10  5  8 10  8 15
    L    8  4  6  7 12  6  8 12  7 16
    M    9  5  7  8 10  7  8 12 10 14 16
    N   10  6 12 11  6 10 12  8 11  7  8 12
    P   11  7  9  9  5  9 10  8  9  7  8  9 16
    Q   10  5 12 12  5  9 13  8 11  8  9 11 10 14
    R    8  6  9  9  6  7 12  8 13  7 10 10 10 11 16
    S   11 10 10 10  7 11  9  9 10  7  8 11 11  9 10 12
    T   11  8 10 10  7 10  9 10 10  8  9 10 10  9  9 11 13
    V   10  8  8  8  9  9  8 14  8 12 12  8  9  8  8  9 10 14
    W    4  2  3  3 10  3  7  5  7  8  6  6  4  5 12  8  5  4 27
    Y    7 10  6  6 17  5 10  9  6  9  8  8  5  6  6  7  7  8 10 20
    
    
  • BLOSUM62 matrix constructed from 2000 multiple aligned blocks (equivalent to 160 PAM, Henikoff S. et al, Proc. Nat. Acad. Sci. USA, v.89, pp.10915 -10919):
         A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
    A    8
    C    4 13
    D    2  1 10
    E    3  0  6  9
    F    2  2  1  1 10
    G    4  1  3  2  1 10
    H    2  1  3  4  3  2 12
    I    3  3  1  1  4  0  1  8
    K    3  1  3  5  1  2  3  1  9
    L    3  3  0  1  4  0  1  6  2  8
    M    3  3  1  2  4  1  2  5  3  6  9
    N    2  1  5  4  1  4  5  1  4  1  2 10
    P    3  1  3  3  0  2  2  1  3  1  2  2 11
    Q    3  1  4  6  1  2  4  1  5  2  4  4  3  9
    R    3  1  2  4  1  2  4  1  6  2  3  4  2  5  9
    S    5  3  4  4  2  4  3  2  4  2  3  5  3  4  3  8
    T    4  3  3  3  2  2  2  3  3  3  3  4  3  3  3  5  9
    V    4  3  1  2  3  1  1  7  2  5  5  1  2  2  1  2  4  8
    W    1  2  0  1  5  2  2  1  1  2  3  0  0  2  1  1  2  1 15
    Y    2  2  1  2  7  1  6  3  2  3  3  2  1  3  2  2  2  3  6 11
    
    
  • Johnson matrix based on AA replacements in 65 sets structure aligned 3D structures (M.S.Johnson et al., J.Mol.Biol. (1993), v.233, pp. 716 - 738) :
        A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
    A  16
    C   6 26
    D   8  0 18
    E   9  3 12 18
    F   7  5  3  3 20
    G   9  2  8  7  1 18
    H   7  2  9  7  8  7 22
    I   8  2  5  5 10  4  5 18
    K   9  1  8 11  4  6 10  5 17
    L   6  1  2  4 12  3  5 12  6 17
    M   8  5  4  7  9  5  7 12  8 14 21
    N   8  2 12  9  6  8 11  5 10  5  6 18
    P   9  1  9  8  5  7  5  4  9  7  0  7 20
    Q   9  3  9 12  3  7 11  3 11  5  9  9  6 19
    R   8  4  6 10  4  7 10  4 13  6  6  8  6 12 20
    S  10  2 10  8  5  8  7  5  8  5  5 11  9  9  9 16
    T   9  4  8  9  5  6  7  7 10  5  7 10  8  9  8 12 17
    V   9  5  5  6  8  4  6 14  6 12 10  4  5  6  5  5  8 17
    W   4  1  4  2 13  3  6  6  4  9  9  4  2  2  6  4  0  5 25
    Y   6  2  6  6 13  4  9  7  6  7  8  7  3  5  8  6  7  8 12 20
    
    
  • and the unitary matrix for DNA/RNA:
         A  C  G  T  U
    A   10
    C    0 10
    G    0  0 10
    T    0  0  0 10
    U    0  0  0 10 10
    
    

In the described module there are two algorithms for building phylogenetic trees: topological and cluster.

The main feature of the topological algorithms is the fact that they optimize the tree structure (i.e. the way the tree nodes are connected) first without consideration of branch lengths, that are reconstituted once the topological structure have been established.

The topological algorithms employed in the module are based on the so-called topological similarity principle. This approach is tailored for the most precise representation of the internal structure of the analyzed distance matrix. In order to do that the number of sequence quartets {i,j,k,l} is calculated, for which the inequality


d(i,j) + d(k,l) 

holds in the matrix, but does not hold in the tree, and vice versa, i.e. the topological deviation value. The algorithm is aimed at construction of the tree, for which this number is minimal, i.e the maximum topological similarity tree. Although in general the algorithm does not guarantee that the global minimum of the topological deviation is obtained, the constructed trees corresponding to local minima approximate the maximum topological similarity tree reasonably well.

In cluster algorithm the notion of distance between groups of sequences is used for the setting of the branching order. This distance is defined as the arithmetic mean of pairwise distances between elements of the two groups.

Contary to topological algorithm, in the cluster one the order of node connections is reconstituted together with the corresponding branch lengths. The root is also determined in the natural way as a point on one of the branches such that the distances from it to all hanging nodes (corresponding to sequences) are equal. This property of cluster trees allows to introduce the distance from the root to each node and to draw the tree using this distance as a node abscissa.

The query alignment should be written in one-letter code (low or upper case) and can be divided to several strings. It have to be blank string between sequental batches and at the begining of every batch all sequence names should be repeated:

KAG2_CAVPO    -----------------------------CGGVLVDPQWVLTAAHCIND--S---N
KAG_PIG       -----------------------------CGGVLVNPKWVLTAAHCKND--N---Y
PLMN_PIG      parvvggcvsiphswpwqislryryrgHFCGGTLISPEWVLTAKHC----------
TRYP_PIG      -----------------------nsgsHFCGGSLINSQWVVSAAHCYKS--R---I
UROK_PIG      -----------------------------CGGSLISPCWVVSATHCFINYQQKEDY

KAG2_CAVPO    QVKLGRHNLFEDEDTA-----QHFLVSQSVPHPDFN
KAG_PIG       EVGWLRHNLFENENTA-----QFFGVTADFPHPGFN
PLMN_PIG      ------------lekssspssykvilgaheeyhlge
TRYP_PIG      QVRLGEHNIDVLEGNE-----QFINAAKIITHPNFN
UROK_PIG      IVYLGRQTLHSSTHGEMKFEVEKLILHEDYSADSLA

Phylogenetic trees can be presented in one or several graphical forms (picture types):

- slanted cladogram, two versions;
- rectangular cladogram, two versions;
- phylogram, that is a rectangular cladogram with branches scaled by their length (weight);
- unrooted, two versions (unscaled and scaled braches).
All images have same size. Width and height of the images may be set from 320 to 2000 and 240 to 1500 pixels, respectively. Defaults are 640 and 480.

Unrooted tree with scaled branches (Unrooted 2) has special option: max/min factor. Scaled unrooted tree looks not so good when branches (edges) have very different lengths. This option restrains the difference so that very short branches are plotted with length only at factor times less than maximum plotted. Such branches are dispayed in orange color. Also very long branches (three at most) are plotted with shorter, partly dashed, green lines.
Max/min factor can be set different for cluster and topological algorithms.

In the bootstrap a multiple alignment is resampled 100 times. That is 100 trees are generated. Bootstrap values are expressed in percentages and placed at nodes for all grahpical forms except unrooted trees.