Bioinformatics Tools Training Program

BLAST

BLAST identifies homologous sequences using a heuristic method that first finds short matches between two sequences; the method therefore does not take the entire sequence space into account. After finding these initial matches, BLAST attempts to build local alignments starting from them. As a consequence, BLAST does not guarantee the optimal alignment, and some sequence hits may be missed. To find optimal alignments, the Smith-Waterman algorithm should be used (see below). In the following, the BLAST algorithm is described in more detail.

Seeding – During the initial seeding step, BLAST finds all words that the query sequence and the hit sequence(s) have in common. Alignments are built only from regions that contain a word hit.
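As a minimal sketch of the seeding idea, the following Python snippet indexes every word of the query and then scans a subject sequence for the same words. The function name find_seeds and the word_size parameter are illustrative assumptions, not part of BLAST itself; real BLAST also accepts similar, non-identical words for proteins, which this sketch does not attempt.

```python
from collections import defaultdict


def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos, word) for every exact word shared by both sequences.

    Hypothetical helper for illustration only; it looks for exact matches.
    """
    # Index every word of length word_size in the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)

    # Scan the subject and report each word that also occurs in the query.
    seeds = []
    for j in range(len(subject) - word_size + 1):
        word = subject[j:j + word_size]
        for i in index.get(word, []):
            seeds.append((i, j, word))
    return seeds


if __name__ == "__main__":
    # The only shared 4-letter word is ACGT (query position 0, subject position 2).
    print(find_seeds("ACGTTGCA", "TTACGTAG", word_size=4))
```

Each seed marks a starting point from which a local alignment could be extended; regions without any seed are never examined, which is why the method is fast but not exhaustive.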

E-value

The expect value (E-value) threshold can be changed in order to limit the number of hits to the most significant ones. The lower the E-value, the better the hit. The E-value depends on the length of the query sequence and the size of the database. For example, an alignment with an E-value of 0.05 means that 0.05 hits with an equally good score are expected by chance alone, which for such small values corresponds to roughly a 5 in 100 chance that the hit arose by chance.
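The dependence on query length and database size can be illustrated with the standard relation between a bit score S' and the E-value, E = m * n * 2^(-S'), where m is the query length and n the total database length. The sketch below is only an illustration under that assumption; the function names are hypothetical, and real BLAST uses corrected (effective) lengths rather than the raw ones.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    # expm1(x) computes exp(x) - 1 accurately for small x.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Same bit score, growing database: the E-value grows in proportion.
    for db_length in (1_000_000, 1_000_000_000):
        e = expect_value(bit_score=40, query_length=100, database_length=db_length)
        print(db_length, e, p_value(e))
```

For small E-values the probability and the E-value are almost equal, which is why an E-value of 0.05 is commonly read as a 5 in 100 chance.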

E-values depend strongly on the query sequence length and the database size. Short identical sequences may receive a high E-value and may therefore be regarded as “false positive” hits. This is often seen when searching for short primer regions, small domain regions, etc. The default threshold for the E-value on the BLAST web page is 10.

Increasing this value will most likely generate more hits. Below are some rules of thumb that can be used as a guide, but they should be applied with common sense.

•E-value < 10e-100: Identical sequences. You will get long alignments across the entire query and hit sequence.
•10e-100 < E-value < 10e-50: Almost identical sequences. A long stretch of the query protein is matched to the database.
•10e-50 < E-value < 10e-10: Closely related sequences; could be a domain match or similar.
•10e-6 < E-value < 1: Could be a true homologue, but it is a gray area.
•E-value > 1: Proteins are most likely not related.
•E-value > 10: Hits are most likely junk unless the query sequence is very short.
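The rules of thumb above can be written as a small lookup function. This is only a sketch: the function name describe_evalue is hypothetical, each range is read with the smaller bound on the left, boundary values are assigned arbitrarily, and the range between 10e-10 and 10e-6 is reported as not covered because the list does not mention it.

```python
def describe_evalue(e_value):
    """Map an E-value to the rule-of-thumb interpretation from the list above."""
    # Thresholds are written exactly as in the text (10e-100, 10e-50, ...).
    if e_value < 10e-100:
        return "identical sequences"
    if e_value < 10e-50:
        return "almost identical sequences"
    if e_value < 10e-10:
        return "closely related sequences"
    if e_value < 10e-6:
        # Assumption: the list gives no guidance for this range.
        return "not covered by the rules of thumb"
    if e_value < 1:
        return "possible true homologue (gray area)"
    if e_value <= 10:
        return "most likely not related"
    return "most likely junk unless the query is very short"


if __name__ == "__main__":
    for value in (0.0, 1e-60, 1e-20, 1e-3, 5, 50):
        print(value, "->", describe_evalue(value))
```

Such a function is only a first filter; as noted above, short queries can produce high E-values even for genuine matches.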

Step 1: Select the BLAST program
Users have to specify which BLAST program to run, for example BLASTp, BLASTn, BLASTx, tBLASTn, or tBLASTx.
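The choice of program follows from the type of the query and the type of the database. The table below is a sketch of that mapping in Python; the names BLAST_PROGRAMS and candidate_programs are illustrative assumptions, not part of any BLAST interface.

```python
# program: (query molecule, database molecule, what is actually compared)
BLAST_PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide", "nucleotide query vs nucleotide database"),
    "blastp": ("protein", "protein", "protein query vs protein database"),
    "blastx": ("nucleotide", "protein", "translated nucleotide query vs protein database"),
    "tblastn": ("protein", "nucleotide", "protein query vs translated nucleotide database"),
    "tblastx": ("nucleotide", "nucleotide", "translated query vs translated database"),
}


def candidate_programs(query_molecule, database_molecule):
    """List the programs that accept this combination of query and database types."""
    return [
        name
        for name, (query, database, _) in BLAST_PROGRAMS.items()
        if query == query_molecule and database == database_molecule
    ]


if __name__ == "__main__":
    print(candidate_programs("nucleotide", "protein"))
    print(candidate_programs("nucleotide", "nucleotide"))
```

A nucleotide query against a nucleotide database leaves two candidates, because the comparison can be made either directly or after translating both sides into protein.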

Step 2: Enter a query sequence or upload a file containing sequence
Enter a query sequence by pasting it into the query box or by uploading a FASTA file that contains the sequence for the similarity search. This step is the same for all BLAST programs. The user can give an accession number, a gi number, or a sequence in FASTA format. Go to the simulator tab to learn more about how to retrieve a query sequence.
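Before pasting or uploading, it can help to check what a FASTA file actually contains. The following is a minimal sketch of a FASTA reader in plain Python; parse_fasta is a hypothetical helper written for this illustration, and it treats text without a ">" header line as a single unnamed sequence.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            if header is None:
                header = ""  # sequence given without a header line
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">seq1 example record\nACGTAC\nGGTA\n>seq2\nTTGACC\n"
    for name, sequence in parse_fasta(example):
        print(name, len(sequence), sequence)
```

Sequence lines belonging to one record are joined, so a sequence wrapped over several lines is returned as a single string.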

Step 3: Select database to search
The user first needs to know which databases are available and what types of sequences they contain. A sequence similarity search looks for sequences similar to the query sequence in the selected databases (Figure 2).

Step 4: Select the algorithm and the parameters of the algorithm for the search

Step 5: Run the BLAST program

The BLAST search is submitted by clicking the BLAST button at the end of the page.

BLAST Result
After the query sequence has been submitted for the similarity search, the result page appears with information such as the query ID, description, molecule type, sequence length, database name, and BLAST program. It also shows the putative conserved domains that were detected during the sequence similarity search.

References

The BLAST web page hosted at NCBI: http://www.ncbi.nlm.nih.gov/BLAST
Download pages for the BLAST programs: http://www.ncbi.nlm.nih.gov/BLAST/download.shtml
