Bioinformatics Tools Training Program

Pairwise and Global Alignment

There are two types of pairwise alignments: local and global alignments.

A Local Alignment. A local alignment is an alignment of two sub-regions of a pair of sequences.
This type of alignment is appropriate when aligning two segments of genomic DNA that may have
local regions of similarity embedded in a background of non-homologous sequence.
A Global Alignment. A global alignment is a sequence alignment over the entire length of two or
more nucleic acid or protein sequences. In a global alignment, the sequences are assumed to be
homologous along their entire length.

Scoring systems in pairwise alignments
In order to align a pair of sequences, a scoring system is required to score matches and mismatches.
The scoring system can be as simple as “+1” for a match and “-1” for a mismatch between the pair
of sequences at any given site of comparison. However, substitutions, insertions and deletions occur
at different rates over evolutionary time. This variation in rates is the result of a large number of
factors, including the mutation process, genetic drift and natural selection. For protein sequences,
the relative rates of different substitutions can be empirically determined by comparing a large
number of related sequences. These empirical measurements can then form the basis of a scoring
system for aligning subsequent sequences. Many scoring systems have been developed in this way.
The resulting scoring matrices incorporate the evolutionary preferences for certain substitutions over
others in the form of log-odds scores. Popular matrices used for protein alignments are the
BLOSUM and PAM matrices.
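
As a rough illustration of what a log-odds score means, the entry for a pair of residues a and b can be written as below. The symbols are notation assumed here, not taken from the text above: p_ab is the frequency with which a and b are observed aligned in related sequences, q_a and q_b are the background frequencies of each residue, and lambda is a scaling factor.

```latex
s(a, b) = \frac{1}{\lambda} \, \log \frac{p_{ab}}{q_a \, q_b}
```

Under this form, a substitution seen more often than expected by chance gets a positive score, and one seen less often gets a negative score.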

Algorithms for pairwise alignments
Once a scoring system has been chosen, we need an algorithm to find the optimal alignment of two
sequences. This is done by inserting gaps in order to maximize the alignment score. If the sequences
are related along their entire sequence, a global alignment is appropriate. However, if the
relatedness of the sequences is unknown or they are expected to share only small regions of
similarity (such as a common domain), then a local alignment is more appropriate.
An efficient algorithm for global alignment was described by Needleman and Wunsch, and their
algorithm was later extended by Gotoh (1982) to model gaps more accurately. For local alignments, the
Smith-Waterman algorithm is the most commonly used. See the references at the links provided for
further information on these algorithms.
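
The global alignment idea can be sketched in a few lines of Python. This is a minimal illustration, not the EMBOSS implementation: it assumes the simple +1/-1 scoring mentioned earlier and a single linear gap penalty of -1 (not Gotoh's gap model), and the function name needleman_wunsch is chosen here for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap penalty (illustrative values).

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)   # 2
print(top)     # ACGT
print(bottom)  # A-GT
```

Every position of both sequences appears in the output, which is what makes the alignment global.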

Local Alignment: Smith-Waterman
Real life is often complicated, and we observe that genes, and the proteins they encode, have
undergone exon-shuffling, recombination, insertions, deletions, and even fusions. Many proteins
exhibit modular architecture. In searching databases for similar sequences, it is useful to find
sequences that have similar domains or functional motifs. Smith & Waterman (1981) published an
application of dynamic programming to find optimal local alignments. The algorithm is similar to
Needleman-Wunsch, but negative cell values are reset to zero, and the traceback procedure starts
from the highest scoring cell, anywhere in the matrix, and ends when the path encounters a cell with
a value of zero.
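
The two differences described above (resetting negative cells to zero, and tracing back from the highest-scoring cell until a zero is reached) can be sketched as follows. The scoring values (+2 match, -1 mismatch, -1 linear gap) and the function name smith_waterman are assumptions made for this example only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment with a linear gap penalty (illustrative values).

    Returns (score, aligned_a, aligned_b) for one best-scoring local alignment.
    """
    n, m = len(a), len(b)
    # First row and column stay at zero: an alignment may start anywhere.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                      # negative values are reset to zero
                H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Two sequences that share only an internal region.
print(smith_waterman("CCCGATTACACCC", "TTGATTACATT"))
```

Only the shared region is reported; the unrelated flanking bases are left out of the alignment.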

Scoring Matrices
The Needleman-Wunsch and Smith-Waterman algorithms require a scoring matrix. The scoring
matrix assigns a positive score for a match, and a penalty for a mismatch. For nucleotide sequence
alignments, the simplest scoring matrix awards +1 for a match, and -1 for a mismatch. The blastn
algorithm at NCBI scores +5 for a match and -4 for a mismatch. These scoring matrices treat all
mutations (mismatches) equally. In reality, transitions (pyrimidine -> pyrimidine and purine ->
purine) occur much more frequently than transversions (pyrimidine -> purine and vice versa). For
aligning non-protein coding DNA sequences, a transition/transversion scoring matrix may be more
appropriate. For aligning DNA sequences that encode proteins, alignment of the protein amino acid
sequences will almost always be more reliable.
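
A transition/transversion scoring scheme could look like the sketch below. The penalty values (-1 for a transition, -2 for a transversion) are placeholders chosen for illustration, not values from any published matrix, and the function name ts_tv_score is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ts_tv_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one pair of DNA bases, penalising transversions more heavily."""
    if x == y:
        return match
    # A transition keeps the base within its chemical class.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


# Build the full 4 x 4 matrix as a dictionary keyed by base pairs.
bases = "ACGT"
matrix = {(x, y): ts_tv_score(x, y) for x in bases for y in bases}

print(matrix[("A", "A")])  # 1  (match)
print(matrix[("A", "G")])  # -1 (transition: purine to purine)
print(matrix[("A", "C")])  # -2 (transversion: purine to pyrimidine)
```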

PROCEDURE –
Two sequences can be aligned globally using different algorithms. The Needleman-Wunsch
algorithm is one of the best-known algorithms for global alignment, and it can be run using the
online tool EMBOSS Needle (EMBOSS: European Molecular Biology Open Software Suite).

1. Go to NCBI and download the sequences to be aligned.
2. Open the EMBOSS Needle tool.
3. Copy and paste the first nucleotide sequence in FASTA format (a plain-text representation of a
sequence: a header line beginning with ">" followed by the sequence itself) into the sequence input
box of the tool opened in step 2.
4. Alternatively, upload a sequence file using the “Choose File” option.
5. Similarly, copy and paste or upload the second sequence for the alignment.
6. EMBOSS Needle comes with predefined scoring matrices: by default, DNAfull for nucleotide
sequences and BLOSUM62 for protein sequences.
7. The gap open and gap extend penalties can be changed to user-defined values. In this example they
are kept at their default values.
8. To be notified of the results by email, check the notification checkbox and enter an email
address.
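
Before pasting a sequence in step 3, it can help to check that the downloaded file really is in FASTA format. The following is a minimal sketch of a FASTA reader; the function name parse_fasta and the example record are made up for illustration and are not real NCBI entries.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]      # header line, without the ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


# Hypothetical two-record example, not real database entries.
example = """>seq1 example sequence one
ACGTACGT
ACGT
>seq2 example sequence two
ACGTTCGT
"""

for name, seq in parse_fasta(example).items():
    print(name, len(seq))
```

Each record gives one header and one continuous sequence string, which is the form expected in the input boxes used in steps 3 and 5.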
