Genebee BLAST 2.2.8 Services Help
BLAST Frequently Asked Questions (FAQ)
Q: Causes for "No significant similarity found"
Below are several reasons that a BLAST search can result in the "No significant similarity found"
message. Note: You may need to use more than one of these options at the same time (example: increase
the Expect value AND turn off filtering).
Short Sequences: Depending on sequence composition, a sequence under 20 residues is considered
short.
- Try increasing the Expect value.
- You may also need to decrease the Word Size from the default
(11 for nucleotides or 3 for proteins). You can decrease the word size using the -W option in the
Other Advanced Options Box. Example: -W 2
You should also consult the "How do I perform a similarity search with a short
peptide/nucleotide sequence?" section below.
Filtering: BLAST filters regions of low complexity (for a description of low complexity see
"What is low-complexity sequence?" below). If your sequence contains large regions of
low complexity, it may not produce significant hits to the database. You can turn off filtering by
unchecking all flags in the "Filter" options section.
Query Format: Another reason you may see the "No Significant Similarity found" message is using
the wrong type of sequence in your search.
- FASTA. Check that you have the Input Data set to the correct format for your Query. For more
information on FASTA format, click here.
- Sequence type and Program combination. You can search with an amino acid query sequence using
the blastp and tblastn programs. With nucleotide query sequences you can use blastn,
blastx, and tblastx.
For more information on the BLAST programs, click here.
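The query/program combinations listed above can be written down as a small lookup. This is only a sketch: the function names and the crude 90% alphabet check used to guess the query type are assumptions made for illustration and are not part of BLAST.

```python
# Programs that accept each query type, as listed in this FAQ.
PROGRAMS_BY_QUERY_TYPE = {
    "protein": ["blastp", "tblastn"],
    "nucleotide": ["blastn", "blastx", "tblastx"],
}

NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_query_type(sequence: str) -> str:
    """Crude heuristic (an assumption of this sketch): treat the query as
    nucleotide if at least 90% of its letters are A, C, G, T, U or N."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("query contains no sequence letters")
    fraction = sum(c in NUCLEOTIDE_LETTERS for c in letters) / len(letters)
    return "nucleotide" if fraction >= 0.9 else "protein"


def programs_for_query(sequence: str) -> list[str]:
    """Return the BLAST programs that accept this kind of query."""
    return PROGRAMS_BY_QUERY_TYPE[guess_query_type(sequence)]


print(programs_for_query("ATGCGTACGTTAGC"))   # ['blastn', 'blastx', 'tblastx']
print(programs_for_query("MKVLAAGIVGLLLAQ"))  # ['blastp', 'tblastn']
```

A check like this can catch the common mistake of submitting a protein query to blastn, or a nucleotide query to blastp, before the search is run.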
Q: Why does my search time out on the BLAST servers?
Certain combinations of BLAST searches with large sequences against large databases can cause the
BLAST servers to time out. This is due to a CPU limit on the servers that prevents sequences
that generate many HSPs from hoarding server resources.
However, there are some things you can do to prevent a timeout and still get results from large sequences.
- Increase the Word Size to 20 - 25. With a default Word Size of 11, the BLAST algorithm finds initial
HSPs of 11 bases in length and begins extension of these from either end. In a large sequence this can
generate hundreds of initial HSPs between the query sequence and even a single large genomic sequence in
the databases. Increasing the Word Size to 25 makes the initial HSP longer, limiting the number of small
initial fragments to be extended.
- Decrease the Expect value to 1.0 or lower. Many hits from large sequences are to small
fragments in the database, and these hits tend to have high Expect values. Decreasing the Expect
value eliminates them and concentrates the results on hits that are more likely to contain large
coding regions and genomic fragments.
If you are still seeing a "timeout" error message after making the above changes, please contact
nik@genebee.msu.su with the description of your search.
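The effect of Word Size on the number of initial matches can be illustrated with a toy count of exact word hits between two sequences. This is a simplified sketch that assumes exact-match seeding only; it is not the BLAST implementation, and the function name is made up for the example. The tiny sequences and word sizes are chosen for readability.

```python
def count_word_hits(query: str, subject: str, word_size: int) -> int:
    """Count (query position, subject position) pairs at which an exact
    word of length word_size matches. Each pair is a potential seed that
    would have to be extended."""
    positions = {}
    for i in range(len(subject) - word_size + 1):
        positions.setdefault(subject[i:i + word_size], []).append(i)
    hits = 0
    for j in range(len(query) - word_size + 1):
        hits += len(positions.get(query[j:j + word_size], ()))
    return hits


repetitive = "ACGTACGTACGT"
print(count_word_hits(repetitive, repetitive, 4))  # 21
print(count_word_hits(repetitive, repetitive, 8))  # 7
```

Every match of a longer word contains a match of the shorter word at the same position, so raising the word size can only reduce the number of seeds to extend.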
Q: Why do I get the message "ERROR: BLASTSetUpSearch: Unable to calculate Karlin-Altschul params,
check query sequence"?
This will happen if your entire query sequence has been masked by low-complexity filtering. You will
need to turn filtering off to get hits. For further information on filtering, please read the sections
of the BLAST FAQs on "Q: What is low-complexity sequence?" and "Q: After running a search why do I
see a string of "X"s (or "N"s) in my query sequence that I did not put there?"
Q: Why do I get the message "ERROR: Blast: No valid letters to be indexed"?
You can see this error message if too many ambiguity codes (R,Y,K,W,N, etc. for nucleotides) are
present in your query sequence. Although BLAST allows ambiguity codes, be aware that these will always
contribute a negative score in nucleic acid searches. Thus, sequences such as degenerate PCR primers
with ambiguity codes may not find any significant hits even though they may be designed from sequences
that are present in the database.
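Before submitting a degenerate primer, it can help to see how much of it consists of ambiguity codes. The sketch below counts them using the IUPAC nucleotide codes; the function name is hypothetical, and this FAQ does not give a cutoff, so any threshold you apply to the result is your own assumption.

```python
UNAMBIGUOUS = set("ACGT")
# IUPAC nucleotide ambiguity codes.
AMBIGUITY_CODES = set("RYKMSWBDHVN")


def ambiguity_fraction(sequence: str) -> float:
    """Fraction of nucleotide letters in the query that are ambiguity codes.
    Characters that are neither A/C/G/T nor an ambiguity code are ignored."""
    letters = [c for c in sequence.upper()
               if c in UNAMBIGUOUS or c in AMBIGUITY_CODES]
    if not letters:
        return 0.0
    return sum(c in AMBIGUITY_CODES for c in letters) / len(letters)


print(ambiguity_fraction("ACRYNNGT"))  # 0.5
```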
Q: After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did
not put there?
You are seeing the result of automatic filtering of your query for low-complexity sequence that is
performed to prevent artifactual hits. The filter substitutes any low-complexity sequence that it
finds with the letter "N" in nucleotide sequences (e.g., "NNNNNNNNNNNNN") or the letter "X" in protein
sequences (e.g., "XXXXXXXXX"). Low-complexity regions can result in high scores that reflect
compositional bias rather than significant position-by-position alignment (Wootton & Federhen, 1996).
Filter programs can eliminate these potentially confounding matches from the BLAST reports, leaving
regions whose BLAST statistics reflect the specificity of their pairwise alignment. Queries searched
with the blastn program are filtered with DUST. The other BLAST programs use SEG.
You can change the default and remove these filters if you like. On the BLAST Web interface there is
a checkbox that removes the filter.
You can search with short query sequences using BLAST after changing a few parameters (see
"Q: How do I perform a similarity search with a short peptide/nucleotide sequence?" below). You may
also be interested in checking out other molecular biology web sites, such as those mentioned in the
Other Resources section at the end of this FAQ, for motif searching software.
Q: How do I perform a similarity search with a short peptide/nucleotide sequence?
First, you will probably need to increase the Expect (E) value in your search. A short query is more
likely to occur by chance in the database. Therefore, even a perfect match can have low statistical
significance and may not be reported. Increasing the E value allows you to look farther down in the
hit list and see matches that would normally be discarded because of low statistical significance.
For most searches, an Expect value up to 1000 is enough to see results. However, you can raise the E
value further by typing, for example, -e 10000 in the Other Advanced Options Box.
If you still do not get results after increasing the E value, you may want to try decreasing the Word
size (W), another parameter that becomes important with a short query. The BLAST algorithm uses "words"
to nucleate regions of similarity. The default Word size for a protein sequence is 3 residues and for
nucleotide sequences it is 11 bp. A blastn search will not work with a Word size of less than 7. A
good rule of thumb is that the query length must be at least twice the Word size. For example, if your
query is a protein sequence of 4 residues, then the Word size should be reduced to 2. Please note that
the smaller the Word size, the slower your search will be.
To lower the default word size, type -W some_number (for example, -W 9) in the Other Advanced
Options Box.
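The rule of thumb above (query length at least twice the Word size, and no less than 7 for blastn) can be turned into a small helper that builds a string for the Other Advanced Options Box. This is a sketch only: the function names are invented for the example, the defaults are the ones quoted in this FAQ, and it does not check whether a particular BLAST program accepts the resulting value.

```python
# Default word sizes quoted in this FAQ.
DEFAULT_WORD_SIZE = {"protein": 3, "nucleotide": 11}
# Per this FAQ, a blastn search will not work with a Word size below 7.
MIN_BLASTN_WORD_SIZE = 7


def suggest_word_size(query_length: int, query_type: str) -> int:
    """Largest word size, up to the default, for which the query is at
    least twice as long as the word."""
    word_size = min(DEFAULT_WORD_SIZE[query_type], query_length // 2)
    if query_type == "nucleotide":
        word_size = max(word_size, MIN_BLASTN_WORD_SIZE)
    return max(word_size, 1)  # guard against zero for very short queries


def advanced_options(query_length: int, query_type: str, expect: int = 1000) -> str:
    """Build an option string using the -e and -W flags described above."""
    return f"-e {expect} -W {suggest_word_size(query_length, query_type)}"


print(advanced_options(4, "protein"))      # -e 1000 -W 2
print(advanced_options(18, "nucleotide"))  # -e 1000 -W 9
```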
Sometimes a short query does not produce results because it contains low-complexity sequence. Often
this type of sequence can be recognized by the human eye because it looks very redundant, for example
the protein sequence PADPPPDPPPP or the nucleotide sequence AAATTTAAAAAT. A filter for low complexity
sequence is applied by default to BLAST nucleotide and protein searches. If your query has regions of
low-complexity sequence, then large portions of your query may be filtered out, essentially making your
query shorter than you might have expected. Removing the filter will help in these cases.
Finally, you can change the matrix to optimize for searching with short protein sequences. For information
on query length and the matrix see
this document.
You can use the BLAST 2.2.8 Sequences service
to compare two nucleotide or two protein sequences against each other using the BLAST 2.2.8 algorithm.
The BLAST 2.2.8 algorithm performs a Gapped BLAST search between the two sequences allowing for the
introduction of gaps (deletions and insertions) in the resulting alignment. At this time only the
blastn and blastp programs are available. Using sequences greater than 150 Kb is not recommended.
To compare one sequence against a specific sequence or set of sequences, you can also use a separate
multiple sequence alignment program. There are many such software tools available to do this. NCBI has
developed a tool, MACAW, which will do multiple sequence alignments on PC or Mac platforms. The
latest version of MACAW is available on the NCBI
anonymous FTP site (ftp://ncbi.nlm.nih.gov) under /pub/macaw/. The instructions are included with the
program. You may also be interested in checking out other molecular biology web sites, such as those
mentioned in the Other Resources section at the end of this FAQ.
Q: What is low-complexity sequence?
Regions with low-complexity sequence have an unusual composition and this can create problems in
sequence similarity searching (Wootton & Federhen, 1996). Low-complexity sequence can often be
recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low
complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove
low-complexity sequence because it can cause artifactual hits (please also see "Q: After running a
search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?").
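To give a rough feel for what "low complexity" means numerically, the sketch below scores windows of a sequence by Shannon entropy and masks the windows that fall under a threshold. Please note that this is NOT the SEG or DUST algorithm used by BLAST; it is a simplified stand-in, and the window length and thresholds are arbitrary values picked for the example.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy of the letter composition, in bits per letter."""
    total = len(window)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(window).values())


def mask_low_complexity(sequence: str, window: int = 12,
                        threshold: float = 2.2, mask_char: str = "X") -> str:
    """Replace every window whose entropy is below threshold with mask_char.
    The default threshold only makes sense for proteins; nucleotide
    sequences cannot exceed 2 bits, so pass a lower value for them."""
    seq = sequence.upper()
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


# The protein example from this FAQ scores below 2 bits per residue.
print(shannon_entropy("PPCDPPPPPKDKKKKDDGPP") < 2.0)  # True
# The nucleotide example is masked completely at a 1.5-bit threshold.
print(mask_low_complexity("AAATAAAAAAAATAAAAAAT", threshold=1.5, mask_char="N"))
# NNNNNNNNNNNNNNNNNNNN
```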
In BLAST searches performed without a filter, often certain hits will be reported with high scores only
because of the presence of a low-complexity region. Most often, this type of match cannot be thought of
as the result of homology shared by the sequences. Rather, it is as if the low-complexity region is
"sticky" and is pulling out many sequences that are not truly related.
The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by
chance when searching a database of a particular size. It decreases exponentially with the Score (S)
that is assigned to a match between two sequences. Essentially, the E value describes the random
background noise that exists for matches between sequences.
The Expect value is used as a convenient way to create a significance threshold for reporting results.
When the Expect value is increased from the default value of 10, a larger list with more low-scoring
hits can be reported.
In BLAST 2.2.8, the Expect value is also used instead of the P value (probability) to report the significance
of matches. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a
database of the current size one might expect to see 1 match with a similar score simply by chance.
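The exponential relationship between E and S is usually written with the Karlin-Altschul formula, E = K * m * n * exp(-lambda * S), where m is the query length and n the database length. The sketch below evaluates it and also converts E to the corresponding P value, P = 1 - exp(-E). The lambda and K values in the usage lines are illustrative placeholders only; the real values depend on the scoring matrix and gap penalties, and BLAST applies length corrections that are not modeled here.

```python
import math


def expect_value(score: float, query_length: float, database_length: float,
                 lam: float, k: float) -> float:
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return k * query_length * database_length * math.exp(-lam * score)


def p_value(expect: float) -> float:
    """Probability of seeing at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-expect)


# Placeholder statistical parameters, assumed for illustration only.
LAM, K = 0.267, 0.041

for score in (40, 50, 60):
    e = expect_value(score, query_length=100, database_length=1e6, lam=LAM, k=K)
    print(score, e, p_value(e))
```

Each extra 10 points of score divides E by the same factor, exp(10 * lambda), which is why a modest increase in the Expect threshold lets through many more low-scoring hits.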
You have many choices to make between different BLAST programs and how to access them. Please see the
Overview for more information on this topic. The easiest way to search
is to use the BLAST Web pages. There are many additional parameters that can be controlled, but for a
basic search, the default options work well.
Q: How can I see low-similarity matches when there are many strong hits to my query sequence?
Often, when the query is a member of a large sequence family, the summary hit list and the alignments
returned only contain very high scoring hits. To look at low-similarity matches, you must increase the
maximum number of results returned. Often it is sufficient to increase the size of the summary hit
list and the number of alignments shown using the menus on the BLAST Web pages.
However, it is possible to increase the lists even further using the
Other Advanced Options box. For BLAST 2.2.8, "-v 2000", for example, will
increase the number of descriptions returned in the summary hit list to 2000. The option "-b 2000"
will similarly increase the number of alignments returned.
The option to limit a search to an organism or even a taxonomic classification is now available in
BLAST 2.2.8. There is an edit field for entering the species name or classification (example: "eubacteria").
Q: How can I search a batch of sequences with BLAST?
There are basically two ways to run Batch BLAST searches.
- Install the BLAST 2.0 Server Locally:
You can install the Standalone BLAST 2.0 server on your own machine if you have a Windows or a UNIX
platform. Installing this executable allows you to search local databases as well as public ones that
you have downloaded and installed. Standalone BLAST 2.2.8 also allows you to do gapped BLAST and
PSI-BLAST searches. The Standalone BLAST 2.2.8 server and its new capabilities are described in
Altschul et al., 1997. There is information about Standalone BLAST in the "Overview" available from
the sidebar of the main BLAST page and also
here.
There is also some information on setting up the programs at the NHGRI site at:
http://genome.nhgri.nih.gov/blastall/blast_install/
The Standalone executables are available at the anonymous FTP location.
- Install the BLAST 2.0 Network client software locally:
The BLAST 2.0 Network client will allow you to submit a file of FASTA sequences over an internet
connection to the NCBI BLAST databases. The BLAST Network client executables are located
here. There are executables for Mac, PC,
and various UNIX platforms.
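The Network client described above takes a file of FASTA sequences. The sketch below shows one way such a file could be read and checked before it is submitted. `read_fasta` is a hypothetical helper written for this example and is not part of any BLAST distribution; it assumes the usual FASTA layout of a ">" header line followed by one or more sequence lines.

```python
import io
from typing import Iterator, TextIO


def read_fasta(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from a multi-sequence FASTA file."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


batch = io.StringIO(">seq1 test\nACGT\nACGT\n\n>seq2\nMKV\n")
for name, sequence in read_fasta(batch):
    print(name, len(sequence))
# seq1 test 8
# seq2 3
```

With a real file, open it with `open("queries.fasta")` (a placeholder name) and pass the handle in the same way.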
Other Resources:
The on-line BLAST Course was
written by Dr. Stephen Altschul and discusses the basics of the Gapped BLAST algorithm. The full text
of the 1997 Nucleic Acids Research paper "Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs" is also available on-line.
Chapter 7 of Cold Spring Harbor Genome Analysis Laboratory Manual also provides helpful introductory
information for users of molecular biology databases and software. This chapter is available over the
WWW or from the Cold Spring Harbor Laboratory
WWW home page under CSHL Press.
There are many sites which offer software tools for molecular biologists and for manipulating sequence
data. Some of the larger of these are listed below: