ClustalW 1.83 Help

Reference

Thompson J.D., Higgins D.G., Gibson T.J.;
"CLUSTAL W: improving the sensitivity of progressive
multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice.";
Nucleic Acids Res. 22:4673-4680(1994).

Your Sequences

The sequences (cut & paste) must all be in ONE of the following formats:

FASTA (Pearson), NBRF/PIR, EMBL/Swiss Prot, GDE, CLUSTAL, GCG/MSF, GCG9/RSF.

The program tries to "guess" which format is being used and whether the sequences are nucleic acid (DNA/RNA) or amino acid (proteins). The format is recognised from the first characters in the file. This is a crude test, but it works most of the time, and it is difficult to do reliably in any other way.

Format           First non blank word or character in the file.
...............................................................
FASTA            >
NBRF             >P1;  or >D1;
EMBL/SWISS       ID
GDE protein      %
GDE nucleotide   #
CLUSTAL          CLUSTAL (blocked multiple alignments)
GCG/MSF          PILEUP  or !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT
                 or MSF on the first line, and '..' at the end of line
GCG9/RSF         !!RICH_SEQUENCE
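
The table above can be read as a simple lookup on the first non-blank word. The following Python sketch illustrates that idea only; it is not the program's actual input routine, and the function name guess_format is made up for this example.

```python
def guess_format(text):
    """Guess the sequence format from the first non-blank word of the input.

    Illustrative only: follows the table in this help page, not the real
    ClustalW source code.
    """
    words = text.split()
    if not words:
        return None
    first = words[0]
    first_line = text.lstrip().splitlines()[0].rstrip()

    # NBRF entries also begin with ">", so they are tested before FASTA.
    if first.startswith(">P1;") or first.startswith(">D1;"):
        return "NBRF"
    if first.startswith(">"):
        return "FASTA"
    if first == "ID":
        return "EMBL/SWISS"
    if first.startswith("%"):
        return "GDE protein"
    if first.startswith("#"):
        return "GDE nucleotide"
    if first.startswith("CLUSTAL"):
        return "CLUSTAL"
    if first.startswith("!!RICH_SEQUENCE"):
        return "GCG9/RSF"
    if (first == "PILEUP"
            or first.startswith("!!AA_MULTIPLE_ALIGNMENT")
            or first.startswith("!!NA_MULTIPLE_ALIGNMENT")
            or ("MSF" in first_line and first_line.endswith(".."))):
        return "GCG/MSF"
    return None
```

For instance, guess_format(">FOSB_HUMAN P53539") gives "FASTA", while input that matches none of the rows gives None.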

Example in FASTA format:

>FOSB_HUMAN P53539 homo sapiens (human). fosb protein
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGS
GGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSAPAKEDGFSWLLPPPPPPPLPFQTSQDAPPNLTASLFTHSEVQVLGDPFPVVNPSY
TSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL
>FOSB_MOUSE P13346 mus musculus (mouse). fosb protein.
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

Note that a file is recognised as MSF format only if one of the keywords listed above, such as the word PILEUP, appears at the very beginning of the file. If you produce this format from software other than the GCG pileup program, then you will have to insert the word PILEUP at the start of the file. Similarly, if you use CLUSTAL format, the word CLUSTAL must appear first.

All of these formats can be used to read in AN EXISTING FULL ALIGNMENT. With CLUSTAL format, this is just the same as the output format of this program and Clustal V. If you use PILEUP or CLUSTAL format, all sequences must be the same length, INCLUDING GAPS ("-" in clustal format; "." in MSF). With the other formats, sequences can be gapped with "-" characters. If you read in any gaps these are kept during any later alignments. You can use this facility to read in an alignment in order to calculate a phylogenetic tree OR to output the same alignment in a different format (from the output format options menu of the multiple alignment menu) e.g. read in a GCG/MSF format alignment and output a PHYLIP format alignment. This is also useful to read in one reference alignment and to add one or more new sequences to it using the "profile alignment" facilities.

DNA vs. PROTEIN: The program will count the number of A,C,G,T,U and N characters. If 85% or more of the characters in a sequence are as above, then DNA / RNA is assumed, protein otherwise.
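
As a rough illustration of the 85% rule, here is a small Python sketch. The function name is invented for this example, and the choice to leave gap characters and other non-letters out of the count is an assumption, not something this help page states.

```python
def guess_sequence_type(sequence, threshold_percent=85):
    """Return "DNA" or "PROTEIN" using the 85% rule described above.

    Assumption: only letters are counted, so gap characters are ignored.
    """
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return None
    nucleotide_like = sum(1 for c in residues if c in "ACGTUN")
    # Integer arithmetic avoids rounding trouble exactly at the threshold.
    if nucleotide_like * 100 >= threshold_percent * len(residues):
        return "DNA"
    return "PROTEIN"
```

With this sketch, "ACGTNNACGU" is classed as DNA and the start of the FOSB_HUMAN example, "MFQAFPGDYDSGSRCSSSPS", is classed as PROTEIN.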

Alignment title

Type in a title for this alignment session for you to remember.

Your E-Mail

A valid internet email address in the form somebody@somewhere.domain.country. Type your email address in this text box if you want the results sent to you. You don't have to fill in this box if you want to run your search interactively.

Alignment

You may choose to run either a full alignment, which uses a stringent algorithm for generating the guide tree, or a fast algorithm.

By default, the initial pairwise alignments are now carried out using a full dynamic programming algorithm. This is more accurate than the older hash/k-tuple based alignments (Wilbur and Lipman) but is MUCH slower. On a fast workstation you may not notice, but on a slow machine the difference is extreme.

Show dendrogram

To do a complete multiple alignment, we need to know the approximate relationships of the sequences to each other (which ones are most similar to each other). We do this by calculating a crude phylogenetic tree which we call a dendrogram (to distinguish it from the more sensitive trees available under the phylogenetic tree option). This dendrogram is used as a guide to align bigger and bigger groups of sequences during the multiple alignment. The dendrogram is calculated in 2 stages: 1) all pairs of sequences are compared using the fast/approximate method of Wilbur and Lipman (1983); the result of each comparison is a similarity score. 2) the similarity scores are used to construct the dendrogram using the UPGMA cluster analysis method of Sneath and Sokal (1973).

The construction of the dendrogram can be very time consuming if you wish to align many sequences (e.g. for 100 sequences you need to carry out 100x99/2 sequence comparisons = 4950). During every multiple alignment, a dendrogram is constructed and saved to a file (something.dnd).
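
The count of comparisons grows roughly with the square of the number of sequences. A one-line Python sketch (the function name is made up for this example) shows the arithmetic:

```python
def pairwise_comparisons(n_sequences):
    """Number of distinct sequence pairs among n_sequences: n * (n - 1) / 2."""
    return n_sequences * (n_sequences - 1) // 2
```

For 100 sequences this gives 4950, as stated above; for 200 sequences it is already 19900.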

This option lets you choose whether or not to show the dendrogram in the results.

Output Format

Here you decide which output format you want your multiple sequence alignment in. The options are CLUSTAL, GCG, PHYLIP, PIR and GDE.

CLUSTAL FORMAT: This is a self-explanatory alignment. The alignment is written out in blocks. Identities are highlighted and (if you use a PAM 250 matrix) positions in the alignment where all of the residues are "similar" to each other (PAM 250 score of 8 or more) are indicated.

GCG FORMAT: In version 7 of the Wisconsin GCG package, a new multiple sequence format was introduced. This is the MSF (Multiple Sequence Format) format. It can be used as input to the GCG sequence editor or any of the GCG programs that make use of multiple alignments.
This format is only supported in version 7 of the GCG package or later.

PHYLIP FORMAT: This format can be used by the Phylip package of Joe Felsenstein. Phylip allows you to do a huge range of phylogenetic analyses (we just offer one method in this program) and is probably the most widely used set of programs for drawing trees. It also works on just about every computer you can think of, providing you have a decent Pascal compiler.

NBRF/PIR FORMAT: This is the usual NBRF/PIR format with gaps indicated by hyphens ("-"). This format is exactly compatible with the sequence input format. Therefore you can read in these alignments again for profile alignments or for calculating phylogenetic trees.

GDE FORMAT: This format is used by Steven Smith's GDE package.

Output Order

This option is used to control the order of the sequences in the output alignments. By default, the order corresponds to the order in which the sequences were aligned (from the guide tree/dendrogram), thus automatically grouping closely related sequences. This switch can be used to set the order to the same as the input file.

The consensus line:

"*" = identical or conserved residues in all sequences in the alignment
":" = conserved substitutions
"." = semi-conserved substitutions

Pairwise alignment options:

A distance is calculated between every pair of sequences and these are used to construct the dendrogram which guides the final multiple alignment. The scores are calculated from separate pairwise alignments. These can be calculated using 2 methods: dynamic programming (slow but accurate) or by the method of Wilbur and Lipman (extremely fast but approximate).

Fast alignment options:

These similarity scores are calculated from fast, approximate, global alignments, which are controlled by 4 parameters. 2 techniques are used to make these alignments very fast: 1) only exactly matching fragments (k-tuples) are considered; 2) only the 'best' diagonals (the ones with most k-tuple matches) are used.
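
To make the two techniques concrete, the Python sketch below counts exact k-tuple matches on each diagonal of the imaginary dot-matrix plot and keeps the diagonals with the most matches. It is a simplified illustration, not the Wilbur and Lipman implementation used by the program; the function name and the default values are chosen for this example only.

```python
from collections import Counter, defaultdict


def top_diagonals(seq_a, seq_b, ktuple=2, topdiag=5):
    """Return up to `topdiag` (diagonal, hits) pairs with the most k-tuple matches.

    A diagonal is identified by i - j, where i and j are the start positions
    of a matching k-tuple in seq_a and seq_b.
    """
    # Index every k-tuple of seq_b by its start positions.
    positions = defaultdict(list)
    for j in range(len(seq_b) - ktuple + 1):
        positions[seq_b[j:j + ktuple]].append(j)

    # Technique 1: only exactly matching k-tuples are counted.
    diagonal_hits = Counter()
    for i in range(len(seq_a) - ktuple + 1):
        for j in positions.get(seq_a[i:i + ktuple], ()):
            diagonal_hits[i - j] += 1

    # Technique 2: only the best diagonals are kept.
    return diagonal_hits.most_common(topdiag)
```

A larger ktuple gives fewer matches to process, and a smaller topdiag gives fewer diagonals to follow up, which is why both settings trade sensitivity for speed.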

K-Tuple (Word size)

This option allows you to choose which 'word-length' to use when calculating fast pairwise alignments. It can be 1 or 2 for proteins and 1 to 4 for DNA. Increase this to increase speed; decrease it to improve sensitivity.

Window length

This is the number of diagonals around each "top" diagonal that are considered. Decrease for speed; increase for greater sensitivity. The allowed range is 1 to 50.

TOPDIAG

This is the number of best diagonals in the imaginary dot-matrix plot that are considered. Decrease (must be greater than zero) to increase speed; increase to improve sensitivity. The allowed range is 1 to 50.

PAIRGAP

Here you can set the gap penalty. This is the number of matching residues that must be found in order to introduce a gap. This should be larger than K-Tuple size. This has little effect on speed or sensitivity. The allowed range is 1 to 500.

Score type

The similarity scores may be expressed as raw scores (number of identical residues minus a "gap penalty" for each gap) or as percentage scores. If the sequences are of very different lengths, percentage scores make more sense.
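
The two score types can be sketched in Python as follows. The raw score follows the description above. The percentage is taken here relative to the length of the shorter sequence, which is an assumption made for this example; this help page does not say which length the program uses. The function name is also made up.

```python
def similarity_scores(identical, gaps, pairgap, len_a, len_b):
    """Return (raw, percent) similarity scores for one pairwise comparison.

    raw     = identical residues minus the gap penalty for each gap
    percent = raw score relative to the shorter sequence (assumed denominator)
    """
    raw = identical - pairgap * gaps
    percent = 100.0 * raw / min(len_a, len_b)
    return raw, percent
```

With 50 identities, 2 gaps and a gap penalty of 3, a 100-residue sequence compared against a 400-residue one gets a raw score of 44 and, under the stated assumption, a percentage score of 44.0.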

Slow alignment options:

These parameters do not have any effect on the speed of the alignments. They are used to give initial alignments which are then rescored to give percent identity scores. These % scores are the ones which are displayed on the screen. The scores are converted to distances for the trees.
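
A minimal Python sketch of the rescoring step is given below. Two details are assumptions made for illustration: positions where either sequence has a gap are left out of the percent identity, and the distance is taken as 1 minus the identity fraction. Neither is spelled out in this help page, and the function names are invented.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent identity of two aligned rows, ignoring positions with a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def distance_from_identity(pid):
    """Convert a percent identity into a distance between 0 and 1 (assumed form)."""
    return 1.0 - pid / 100.0
```

For the aligned pair "ACGT-A" and "ACCTGA", four of the five ungapped positions match, so the percent identity is 80.0 and the distance is about 0.2.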

Protein weight matrix

Here you can select the scoring table which describes the similarity of each amino acid to each other.

DNA weight matrix

Here you can select the matrix with the scores assigned to matches and mismatches (including IUB ambiguity codes).

Gap open

Here you can set the penalty for opening a gap in the alignment. The allowed range is 0.0 to 100.0.

Gap extension

Here you can set the penalty for extending a gap by 1 residue. The allowed range is 0.0 to 10.0.

Multiple sequence alignment options:

These parameters control the final multiple alignment. This is the core of the program and the details are complicated. To fully understand the use of the parameters and the scoring system, you will have to refer to the documentation.

Type

It is critically important for the program to know whether it is aligning DNA or protein sequences. The input routines attempt to guess which type of sequence is being used by counting the number of A, C, G, T or U's in the sequences. If the total is more than 85% of the sequence length then DNA is assumed. If you use very bizarre sequences (proteins with really strange compositions or DNA sequences with loads of strange ambiguity codes) you might confuse the program. Here you can define the sequence type.

Protein weight matrix

This option allows you to choose which matrix series to use when generating the multiple sequence alignment. The program goes through the chosen matrix series, spanning the full range of amino acid distances.

BLOSUM (Henikoff): These matrices appear to be the best available for carrying out database similarity (homology) searches. The matrices used are: Blosum80, 62, 40 and 30.

PAM (Dayhoff): These have been extremely widely used since the late '70s. We use the PAM 120, 160, 250 and 350 matrices.

GONNET: These matrices were derived using almost the same procedure as the Dayhoff one (above) but are much more up to date and are based on a far larger data set. They appear to be more sensitive than the Dayhoff series. We use the GONNET 40, 80, 120, 160, 250 and 350 matrices.

ID: We also supply an identity matrix which gives a score of 10 to two identical amino acids and a score of zero otherwise.
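
The identity matrix is simple enough to write out directly. The Python sketch below builds it as a dictionary keyed by residue pairs; the 20-letter alphabet and the function name are choices made for this example, not taken from the program.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def identity_matrix(alphabet=AMINO_ACIDS, match=10, mismatch=0):
    """Score `match` for two identical residues and `mismatch` otherwise."""
    return {
        (a, b): (match if a == b else mismatch)
        for a in alphabet
        for b in alphabet
    }
```

Looking up ("A", "A") in the result gives 10, and any pair of different residues, such as ("A", "C"), gives 0.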

DNA weight matrix

For DNA, a single matrix (not a series) is used. Two hard-coded matrices are available:

IUB: This is the default scoring matrix used by BESTFIT for the comparison of nucleic acid sequences. X's and N's are treated as matches to any IUB ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.

CLUSTALW (1.6): The previous system used by ClustalW, in which matches score 1.0 and mismatches score 0. All matches for IUB symbols also score 0.

Gap open

This option controls the cost of opening up every new gap. Increasing the gap opening penalty will make gaps less frequent. The allowed range is 0.0 to 100.0.

Gap extension

This option controls the cost of every item in a gap. Increasing the gap extension penalty will make gaps shorter. The allowed range is 0.0 to 10.0.
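
Taken together, the two penalties give a basic cost for a gap of a given length. The Python sketch below assumes the simplest reading of the descriptions above: one opening penalty per gap plus one extension penalty per gap position. The program also adjusts penalties by position, which this sketch ignores, and the function name is invented for the example.

```python
def gap_cost(length, gap_open, gap_extension):
    """Basic cost of one gap: opening penalty plus extension penalty per position.

    The range checks mirror the allowed ranges given in this help page.
    """
    if not 0.0 <= gap_open <= 100.0:
        raise ValueError("gap open penalty must be between 0.0 and 100.0")
    if not 0.0 <= gap_extension <= 10.0:
        raise ValueError("gap extension penalty must be between 0.0 and 10.0")
    if length < 1:
        raise ValueError("a gap has at least one position")
    return gap_open + gap_extension * length
```

With an opening penalty of 10.0 and an extension penalty of 0.5, a single gap of 5 positions costs 12.5, whereas five separate gaps of 1 position cost 52.5 in total. This is why raising the opening penalty makes gaps rarer and raising the extension penalty makes them shorter.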

Phylogenetic tree

The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First you calculate distances (percent divergence) between all pairs of sequences from a multiple alignment; second you apply the NJ method to the distance matrix.

Tree type

Three output formats are offered: Clustal, Phylip and Just the distances.

Each of these formats can be presented graphically.

Clustal format: This format is verbose and lists all of the distances between the sequences and the number of alignment positions used for each. The tree is described at the end of the file. It lists the sequences that are joined at each alignment step and the branch lengths. Once two sequences are joined, the pair is referred to later as a NODE. The number of a NODE is the number of the lowest sequence in that NODE.

Phylip format: This format is the New Hampshire format, used by many phylogenetic analysis packages. It consists of a series of nested parentheses, describing the branching order, with the sequence names and branch lengths. It can be used by the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP package to see the trees graphically. This is the same format used during multiple alignment for the guide trees. Some other packages that can read and display New Hampshire format are TreeTool, TreeView, Phylowin and NJPlot.
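
To show what the nested parentheses look like, here is a small Python sketch that writes a tree in New Hampshire format. The tree representation (nested tuples), the function names, the branch lengths and the five-decimal formatting are all invented for this example.

```python
def _format_node(node):
    """Format one node given as (label_or_children, branch_length)."""
    label, length = node
    if isinstance(label, str):
        text = label  # a leaf: the sequence name
    else:
        # an internal node: its children inside one pair of parentheses
        text = "(" + ",".join(_format_node(child) for child in label) + ")"
    return text if length is None else f"{text}:{length:.5f}"


def to_newick(root):
    """Return the New Hampshire (Newick) string for a nested-tuple tree."""
    return _format_node(root) + ";"


# Hypothetical two-sequence tree with made-up branch lengths.
example = ([("FOSB_HUMAN", 0.01), ("FOSB_MOUSE", 0.02)], None)
print(to_newick(example))
```

The example prints (FOSB_HUMAN:0.01000,FOSB_MOUSE:0.02000); where each name is followed by a colon and its branch length, and the semicolon ends the tree.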

The distances only: This format just outputs a matrix of all the pairwise distances in a format that can be used by the Phylip package. It used to be useful when one could not produce distances from protein sequences in the Phylip package but is now redundant (Protdist of Phylip 3.5 now does this).

Kimura's correction of distances

For small divergence (less than 10%) this option makes no difference. For greater divergence, this option corrects for the fact that observed distances underestimate actual evolutionary distances. This is because, as sequences diverge, more than one substitution will happen at many sites. However, you only see one difference when you look at the present day sequences. Therefore, this option has the effect of stretching branch lengths in trees (especially long branches). The corrections used here (for DNA or proteins) are both due to Motoo Kimura. See the documentation for details.
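
The stretching effect can be illustrated with the approximation for proteins that is commonly attributed to Kimura (1983), D = -ln(1 - p - 0.2 p^2), where p is the observed fraction of differing sites. Whether the program uses exactly this form over the whole range of p is an assumption here; see the documentation for the corrections actually applied. The function name is made up.

```python
import math


def kimura_protein_distance(p):
    """Corrected distance from an observed fraction of differences p (0 <= p < 1).

    Uses D = -ln(1 - p - 0.2 * p**2); assumed form, for illustration only.
    """
    argument = 1.0 - p - 0.2 * p * p
    if p < 0.0 or argument <= 0.0:
        raise ValueError("observed divergence is outside the usable range")
    return -math.log(argument)
```

An observed divergence of 0.05 is corrected to about 0.052, which is almost unchanged, while an observed divergence of 0.5 is corrected to about 0.80. The longer the branch, the more it is stretched.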

Ignore gaps in alignment

With this option, any alignment positions where ANY of the sequences have a gap will be ignored. This means that 'like' will be compared to 'like' in all distances. It also automatically throws away the most ambiguous parts of the alignment, which are (usually) concentrated around gaps. The disadvantage is that you may throw away much of the data if there are many gaps.
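
The effect of this option can be sketched in Python by dropping every alignment column that contains a gap in any row. The function name and the set of gap characters ("-" and ".") are choices made for this illustration.

```python
def strip_gapped_columns(rows, gap_chars="-."):
    """Remove every column in which ANY row has a gap character."""
    kept = [
        column for column in zip(*rows)
        if not any(c in gap_chars for c in column)
    ]
    if not kept:
        return ["" for _ in rows]
    # Transpose the surviving columns back into rows.
    return ["".join(chars) for chars in zip(*kept)]
```

Applied to the rows "AC-GT", "ACAGT" and "A-AGT", only three of the five columns survive and every row becomes "AGT". This shows how much data can be lost when gaps are frequent.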

Picture formats

Phylogenetic trees can be presented in one or several graphical forms (picture types):
- slanted cladogram, two versions;
- rectangular cladogram, two versions;
- phylogram, that is a rectangular cladogram with branches scaled by their length (weight);
- unrooted, two versions (unscaled and scaled branches).
All images have the same size. The width and height of the images may be set from 320 to 2000 and from 240 to 1500 pixels, respectively. The defaults are 640 and 480.

The unrooted tree with scaled branches (Unrooted 2) has a special option: the max/min factor. A scaled unrooted tree does not look good when its branches (edges) have very different lengths. This option limits the difference, so that very short branches are plotted no shorter than the maximum plotted length divided by the factor. Such branches are displayed in orange. In addition, very long branches (three at most) are plotted as shorter, partly dashed, green lines.

Other advanced options

General settings:
-case=       LOWER or UPPER (for GDE output only)
-seqnos=     OFF or ON (for Clustal output only)
-negative    protein alignment with negative values in matrix

Multiple Alignments:
-endgaps        no end gap separation penalty
-gapdist=n      gap separation penalty range
-nopgap         residue-specific gaps off
-nohgap         hydrophilic gaps off
-hgapresidues=  list of hydrophilic residues
-maxdiv=n       % identity for delay
-transweight=f  transitions weighting

Structure Alignments:
-secstrout=     STRUCTURE or MASK or BOTH or NONE  output in alignment file
-helixgap=n     gap penalty for helix core residues
-strandgap=n    gap penalty for strand core residues
-loopgap=n      gap penalty for loop regions
-terminalgap=n  gap penalty for structure termini
-helixendin=n   number of residues inside helix to be treated as terminal
-helixendout=n  number of residues outside helix to be treated as terminal
-strandendin=n  number of residues inside strand to be treated as terminal
-strandendout=n number of residues outside strand to be treated as terminal
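
The options above come in two shapes: bare switches such as -endgaps, and options that take a value after "=" such as -gapdist=n. The Python sketch below turns a dictionary of settings into option strings of both shapes. It is a hypothetical helper for illustration, not part of the program, and the values shown are arbitrary.

```python
def format_options(settings):
    """Turn {name: value} settings into a list of option strings.

    True        -> bare switch, e.g. "-endgaps"
    False/None  -> option left out
    other value -> "-name=value", e.g. "-gapdist=8"
    """
    options = []
    for name, value in settings.items():
        if value is True:
            options.append(f"-{name}")
        elif value is False or value is None:
            continue
        else:
            options.append(f"-{name}={value}")
    return options


# Arbitrary example values, chosen only to show both shapes.
print(format_options({"case": "UPPER", "seqnos": "ON", "endgaps": True,
                      "nohgap": False, "gapdist": 8}))
```

The example prints ['-case=UPPER', '-seqnos=ON', '-endgaps', '-gapdist=8'].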
Last updated: July 29, 1999.