GeneBee Predicted Annotation Help

References

1. A.M. Leontovich, L.I. Brodsky, V.A. Drachev, and V.K. Nikolaev, Adaptive algorithm of automated annotation, Bioinformatics, 2002, 18, 838-844.

2. A.M. Leontovich, L.I. Brodsky, and A.E. Gorbalenya, Construction of the full local similarity map for two biopolymers, Biosystems, 1993, 30, 57-63.

3. L.I. Brodsky, A.V. Vasiliev, Ya.L. Kalaidzidis, Yu.S. Osipov, R.L. Tatuzov, and S.I. Feranchuk, GeneBee: the program package for biopolymer structure analysis, Dimacs, 1992, 8, 127-139.

4. A.M. Leontovich, K.Y. Tokmachev, and H.C. van Houwelingen, The comparative analysis of statistics, based on the likelihood ratio criterion, in the automated annotation problem, BMC Bioinformatics, 2008 Jan 22; 9:31.

Predicting Annotation:

Automatic annotation of a sequence is based on statistics assembled from the results of a homology search; these statistics are used to predict the description elements (DEL) of the given sequence. The theoretical approach and the algorithms underlying automatic sequence annotation are described in detail in the report of A.M. Leontovich [?].

Your Sequence

The sequence (cut & paste) must be in FASTA format.

Example in FASTA format:

>FOSB_HUMAN P53539 homo sapiens (human). fosb protein
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGS
GGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSAPAKEDGFSWLLPPPPPPPLPFQTSQDAPPNLTASLFTHSEVQVLGDPFPVVNPSY
TSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL
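
The layout above (one header line starting with ">", followed by sequence lines) can be read with a few lines of code. The following is a minimal sketch, not GeneBee's own parser; the function name parse_fasta is hypothetical, and it assumes that blank lines may be ignored and that sequence lines are simply concatenated.

```python
def parse_fasta(text):
    """Split FASTA text into a list of (header, sequence) pairs.

    Assumption: blank lines are ignored and wrapped sequence lines
    are joined without separators.
    """
    records = []
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before a '>' header line")
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>FOSB_HUMAN P53539 homo sapiens (human). fosb protein
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGS
"""
for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```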

DNA vs. PROTEIN: The program counts the A, C, G, T, U and N characters in the sequence. If they make up 80% or more of the characters, the sequence is assumed to be DNA / RNA; otherwise it is assumed to be protein.
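
The 80% rule can be sketched as follows. This is an illustration of the rule as stated, not the program's actual code; the function name guess_sequence_type is hypothetical, and it is an assumption here that case is ignored and that only letters are counted.

```python
def guess_sequence_type(sequence, threshold=0.8):
    """Return "nucleotide" if at least `threshold` of the letters are
    A, C, G, T, U or N, and "protein" otherwise.

    Assumption: case is ignored and non-letter characters are skipped.
    """
    nucleotide_letters = set("ACGTUN")
    letters = [ch for ch in sequence.upper() if ch.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    hits = sum(1 for ch in letters if ch in nucleotide_letters)
    return "nucleotide" if hits / len(letters) >= threshold else "protein"


print(guess_sequence_type("ACGTACGTNN"))
print(guess_sequence_type("MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA"))
```

Note that a short protein fragment rich in A, C, G and T residues would be classified as nucleotide under this rule.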

General options:

Annotation title

Type in a title for this session so that you can identify it later.

Start/End position of the sequence

Set the borders (Start & End) of the zone of the query sequence for which the homology search will be done.
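
As a small illustration, selecting the query zone amounts to taking a slice of the sequence. The sketch below assumes that positions are counted from 1 and that both borders are included; the help text does not state this, and the function name query_zone is hypothetical.

```python
def query_zone(sequence, start=1, end=None):
    """Return the part of `sequence` between `start` and `end`.

    Assumption: positions are 1-based and both borders are inclusive.
    When `end` is omitted, the zone runs to the end of the sequence.
    """
    if end is None:
        end = len(sequence)
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("Start/End must satisfy 1 <= Start <= End <= length")
    return sequence[start - 1:end]


print(query_zone("MFQAFPGDYD", 3, 6))
```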

Additional output

In addition to the text form of the prediction, the following outputs are available:
Graphical alignment - graphical image of the found sequences with the most
                      frequent KW, DE, FT keywords;

FT fragments & DR keywords - graphical image of the found sequences with FT
                             fragments and DR keywords;

Cross-reference map - cross-reference map of the found sequences;

Alignments description - brief description of the found sequences and
                         the first 10 supermotifs;

Statistics - statistics for the most frequent keywords.

DotHelix options:

Motif's length threshold

Enter a minimum length of motifs (7 is recommended).

Motif's power threshold

Enter a minimum power of motifs (3 - 5 is recommended).

Accurate DotHelix

The DotHelix procedure is not much less efficient than the "window method": as a rule it requires about N * ln(N) operations instead of N for the window method, but in complicated cases the number can reach N * N. To avoid such cases, set the "Accurate DotHelix" parameter to "Off", which decreases the number of operations.
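
To give a feeling for these growth rates, the sketch below evaluates the three expressions for a given N. It only does the arithmetic for N, N * ln(N) and N * N; it does not measure DotHelix itself, and the function name operation_estimates is hypothetical.

```python
import math


def operation_estimates(n):
    """Return the three operation counts mentioned in the text for length n."""
    return {
        "window method (N)": n,
        "DotHelix, typical (N * ln N)": n * math.log(n),
        "DotHelix, worst case (N * N)": n * n,
    }


for label, value in operation_estimates(1000).items():
    print(f"{label}: {value:,.0f}")
```

For N = 1000 the typical case is roughly 7 times the window method, while the worst case is 1000 times larger.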

Supermotif's options:

Alignment's power threshold

Enter a minimum power of selected local alignments (6 - 9 is recommended).

Gap penalty

Enter a gap penalty in the units of standard deviation (1 is recommended).

Best shifts

Enter a number of best shifts (5-10 is recommended).

Reported alignments

Restricts the number of matching sequences reported to the number specified. Default limit is 100 sequences.

FT fragment picture options:

Max number of FT keywords

Restricts the number of FT keys to be displayed. The most frequent keys are displayed first. The smaller this value, the better the color distinction. The maximum value is 20.

Excluded fragments

Some fragments can be excluded to make the picture of the others clearer:
- Secondary structure: HELIX, TURN, STRAND;
- CHAIN;
- DOMAIN.

DOMAIN extension to qualifier

The qualifier is added to the DOMAIN key to specify the key in more detail.

Other options:

Coincidence ratio

The desired share of coinciding "one-color" pair-patterns on the selected shift, relative to the maximum reached in the case of complete similarity. The value is given in hundredths of 1 (0.02 is recommended for a protein query sequence and 0.08 for a nucleotide query sequence).

Min. homology ratio

Min. homology ratio may be set in the range of 0 to 1, and the value 0.01 is recommended.

Motif frequencies recalc

The power of a motif depends strongly on the frequencies of letters in the sequences being compared. If the letter frequencies in the selected matching stretches deviate significantly from the values used at the beginning of the calculations (for example, the stretch is polyA), the power must be recalculated with the new frequencies. This decreases the power of insignificant motifs such as a match of a polyA stretch in the query sequence with a polyA stretch in a databank sequence.

So, "On" is recommended.
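
The effect of letter frequencies can be illustrated with the probability that two randomly drawn letters coincide. The sketch below is not GeneBee's power formula, which this help page does not give; it only shows that a polyA match looks very unlikely under uniform frequencies and entirely expected once the frequencies of the stretch itself are used. The function names are hypothetical.

```python
from collections import Counter


def letter_frequencies(sequence):
    """Return the relative frequency of each letter in `sequence`."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return {letter: count / total for letter, count in counts.items()}


def chance_match_probability(freq_a, freq_b):
    """Probability that one letter drawn from each distribution coincides."""
    return sum(p * freq_b.get(letter, 0.0) for letter, p in freq_a.items())


uniform = {letter: 0.25 for letter in "ACGT"}
poly_a = letter_frequencies("AAAAAAAAAA")

length = 10
# Chance of a 10-letter exact match under the initial (uniform) frequencies.
print(chance_match_probability(uniform, uniform) ** length)
# The same chance after recalculating frequencies on the polyA stretches.
print(chance_match_probability(poly_a, poly_a) ** length)
```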

Strand

This option sets which frames will be processed. If 'Only forward' is chosen, the three forward frames are processed in the case of protein against nucleotide databanks, and the single forward strand is processed in the case of nucleotide against nucleotide databanks. If 'Both' is chosen, all six frames are processed in the case of protein against nucleotide databanks, and both the forward and backward strands are processed in the case of nucleotide against nucleotide databanks.
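
The three forward and three backward frames of a nucleotide sequence can be enumerated as in the sketch below. It is an illustration under stated assumptions, not GeneBee's implementation: the backward strand is taken to be the reverse complement, trailing letters that do not fill a whole codon are dropped, and the names reading_frames, "+1" ... "-3" are hypothetical.

```python
# Complement table for DNA/RNA letters; U is complemented to A, N stays N.
COMPLEMENT = str.maketrans("ACGTUN", "TGCAAN")


def reverse_complement(sequence):
    """Return the backward strand of a nucleotide sequence."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def reading_frames(sequence, strand="both"):
    """Return a dict of frame label -> nucleotides read in that frame.

    strand="forward" gives frames +1, +2, +3; strand="both" adds -1, -2, -3.
    """
    sequence = sequence.upper()
    if strand == "forward":
        strands = [("+", sequence)]
    elif strand == "both":
        strands = [("+", sequence), ("-", reverse_complement(sequence))]
    else:
        raise ValueError("strand must be 'forward' or 'both'")
    frames = {}
    for sign, text in strands:
        for offset in range(3):
            usable = len(text) - offset
            usable -= usable % 3  # keep whole codons only
            frames[f"{sign}{offset + 1}"] = text[offset:offset + usable]
    return frames


for label, frame in reading_frames("ATGGCCTAA", "both").items():
    print(label, frame)
```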

Clusterization type

This option is meaningful only in the case of nucleotide against protein databanks or protein against nucleotide databanks. If 'Each frame separately' is chosen, the found motifs are clustered (into supermotifs) separately for each frame. If 'Codirectional jointly' is chosen, motifs found on codirectional frames (forward or backward) are clustered jointly, so it is possible to obtain a supermotif containing motifs from 2 or 3 forward (or backward) frames. The reason for this option is probable errors in the query or databank sequences.

Weight Matrices:

There are 3 matrices implemented in GeneBee. You may choose any of them - Dayhoff, Blosum62, or Johnson - at the prompt in the full query page. The default matrix is Dayhoff.

Dayhoff Matrix

(modified 250 PAM matrix from Atlas of Protein Sequence and Structure, v. 5, suppl. 3, pp. 345-358):
     A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A   12
C    8 22
D   10  5 14
E   10  5 13 14
F    6  6  4  5 19
G   11  7 11 10  5 15
H    9  7 11 11  8  8 16
I    9  8  8  8 11  7  8 15
K    9  5 10 10  5  8 10  8 15
L    8  4  6  7 12  6  8 12  7 16
M    9  5  7  8 10  7  8 12 10 14 16
N   10  6 12 11  6 10 12  8 11  7  8 12
P   11  7  9  9  5  9 10  8  9  7  8  9 16
Q   10  5 12 12  5  9 13  8 11  8  9 11 10 14
R    8  6  9  9  6  7 12  8 13  7 10 10 10 11 16
S   11 10 10 10  7 11  9  9 10  7  8 11 11  9 10 12
T   11  8 10 10  7 10  9 10 10  8  9 10 10  9  9 11 13
V   10  8  8  8  9  9  8 14  8 12 12  8  9  8  8  9 10 14
W    4  2  3  3 10  3  7  5  7  8  6  6  4  5 12  8  5  4 27
Y    7 10  6  6 17  5 10  9  6  9  8  8  5  6  6  7  7  8 10 20
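
The matrices on this page are printed as lower triangles because the scores are symmetric: the score for (D, E) is read from row E, column D. The following sketch shows one way to read such a table and look up a pair in either order. It uses only the first four rows of the Dayhoff table above as an excerpt; the function names are hypothetical and this is not GeneBee's own code.

```python
# First four rows of the Dayhoff table above, copied as an excerpt.
DAYHOFF_EXCERPT = """\
     A  C  D  E
A   12
C    8 22
D   10  5 14
E   10  5 13 14
"""


def parse_triangular_matrix(text):
    """Read a lower-triangular score table into a symmetric lookup dict."""
    lines = [line for line in text.splitlines() if line.strip()]
    columns = lines[0].split()
    scores = {}
    for line in lines[1:]:
        row, *values = line.split()
        for column, value in zip(columns, values):
            # A frozenset key makes (row, column) and (column, row) identical.
            scores[frozenset((row, column))] = int(value)
    return scores


def pair_score(scores, a, b):
    """Return the score for residues a and b, in either order."""
    return scores[frozenset((a.upper(), b.upper()))]


scores = parse_triangular_matrix(DAYHOFF_EXCERPT)
print(pair_score(scores, "D", "E"), pair_score(scores, "E", "D"))
```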

Blosum62 Matrix

Unique Identifier: 93066354 (MEDLINE)
Authors: Henikoff S. Henikoff J. G.
Institution: Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, WA 98104.
Title: Amino acid substitution matrices from protein blocks.
Source: Proceedings of the National Academy of Sciences of the United States of America. 89(22):10915-9, 1992 Nov 15.
Abstract:
Methods for alignment of protein sequences typically measure similarity by using a substitution matrix with scores for all possible exchanges of one amino acid with another. The most widely used matrices are based on the Dayhoff model of evolutionary rates. Using a different approach, we have derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins. This led to marked improvements in alignments and in searches using queries from each of the groups.

     A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A    8
C    4 13
D    2  1 10
E    3  0  6  9
F    2  2  1  1 10
G    4  1  3  2  1 10
H    2  1  3  4  3  2 12
I    3  3  1  1  4  0  1  8
K    3  1  3  5  1  2  3  1  9
L    3  3  0  1  4  0  1  6  2  8
M    3  3  1  2  4  1  2  5  3  6  9
N    2  1  5  4  1  4  5  1  4  1  2 10
P    3  1  3  3  0  2  2  1  3  1  2  2 11
Q    3  1  4  6  1  2  4  1  5  2  4  4  3  9
R    3  1  2  4  1  2  4  1  6  2  3  4  2  5  9
S    5  3  4  4  2  4  3  2  4  2  3  5  3  4  3  8
T    4  3  3  3  2  2  2  3  3  3  3  4  3  3  3  5  9
V    4  3  1  2  3  1  1  7  2  5  5  1  2  2  1  2  4  8
W    1  2  0  1  5  2  2  1  1  2  3  0  0  2  1  1  2  1 15
Y    2  2  1  2  7  1  6  3  2  3  3  2  1  3  2  2  2  3  6 11

Johnson Matrix

Unique Identifier: 94016587 (MEDLINE)
Authors: Johnson M. S. Overington J. P.
Institution: Department of Crystallography, Birkbeck College, University of London, U.K.
Title: A structural basis for sequence comparisons. An evaluation of scoring methodologies.
Source: Journal of Molecular Biology. 233(4):716-38, 1993 Oct 20.
Abstract:
A residue-exchange matrix has been derived that is suitable for comparison of amino acid sequences. This matrix is based on the tabulation of 207,795 amino acid replacements observed in 65 homologous sets of structurally aligned three-dimensional structures (235 proteins). The majority of the data is from structural comparisons where there is between 15 and 40% sequence identity. As a result, a scoring matrix such as the one devised here should provide a sensitive basis for the comparison of amino acid sequences and the search for homologous sequences in amino acid databases. In order to assess the value of this matrix we have made a comparative analysis with 12 other published scoring matrices that have been used for the alignment of protein amino acid sequences. We find that the matrix derived here is among the better performers in terms of alignment significance, detection of homologous sequences and the accuracy of alignments.
    A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A  16
C   6 26
D   8  0 18
E   9  3 12 18
F   7  5  3  3 20
G   9  2  8  7  1 18
H   7  2  9  7  8  7 22
I   8  2  5  5 10  4  5 18
K   9  1  8 11  4  6 10  5 17
L   6  1  2  4 12  3  5 12  6 17
M   8  5  4  7  9  5  7 12  8 14 21
N   8  2 12  9  6  8 11  5 10  5  6 18
P   9  1  9  8  5  7  5  4  9  7  0  7 20
Q   9  3  9 12  3  7 11  3 11  5  9  9  6 19
R   8  4  6 10  4  7 10  4 13  6  6  8  6 12 20
S  10  2 10  8  5  8  7  5  8  5  5 11  9  9  9 16
T   9  4  8  9  5  6  7  7 10  5  7 10  8  9  8 12 17
V   9  5  5  6  8  4  6 14  6 12 10  4  5  6  5  5  8 17
W   4  1  4  2 13  3  6  6  4  9  9  4  2  2  6  4  0  5 25
Y   6  2  6  6 13  4  9  7  6  7  8  7  3  5  8  6  7  8 12 20

Unitary Matrix for DNA/RNA:

     A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A   10
C    0 10
D    0  0 10
E    0  0  0 10
F    0  0  0  0 10
G    0  0  0  0  0 10
H    0  0  0  0  0  0 10
I    0  0  0  0  0  0  0 10
K    0  0  0  0  0  0  0  0 10
L    0  0  0  0  0  0  0  0  0 10
M    0  0  0  0  0  0  0  0  0  0 10
N    0  0  0  0  0  0  0  0  0  0  0 10
P    0  0  0  0  0  0  0  0  0  0  0  0 10
Q    0  0  0  0  0  0  0  0  0  0  0  0  0 10
R    0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
S    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
T    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
V    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
W    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
Y    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
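
With the unitary matrix, every identical pair scores 10 and every other pair scores 0, so the score of an ungapped stretch is simply 10 times the number of identities. A minimal sketch, assuming two stretches of equal length compared position by position (the function names are hypothetical):

```python
def unitary_score(a, b, match=10, mismatch=0):
    """Score one pair of letters with the unitary matrix."""
    return match if a.upper() == b.upper() else mismatch


def ungapped_score(first, second):
    """Sum the unitary scores of two equal-length stretches, position by position."""
    if len(first) != len(second):
        raise ValueError("stretches must have the same length")
    return sum(unitary_score(a, b) for a, b in zip(first, second))


print(ungapped_score("ACGT", "ACGA"))
```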

Last updated: August 20, 2001.