Next-Generation Sequencing

Indexing and Alignment process

Getting Reference Sequence

Obtaining a genome version with proper annotations (exons, transcripts, gene symbols, gene names,
ontology terms, etc.) is crucial for reference-based RNA-Seq data analysis. The annotation
information of a genome is stored in a separate file in a specific format. Two file formats are
widely used for annotation: (1) GFF and (2) GTF; most applications used in RNA-Seq analysis accept
both of these formats. The Ensembl database provides well-annotated genomes for most sequenced
organisms, but for some organisms dedicated web resources are available that are updated more
frequently than the Ensembl database. For example, the most widely used rice genome
(Oryza sativa japonica) is maintained by the MSU Rice Genome Annotation Project team
(http://rice.plantbiology.msu.edu/). Similarly, PlasmoDB maintains well-annotated genomes of
different Plasmodium strains. A literature search before downloading a reference genome helps
in obtaining a well-annotated genome.
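
To make the annotation format concrete, here is a minimal Python sketch of reading one GTF record. It assumes the standard GTF layout of nine tab-separated columns, with the last column holding `key "value";` pairs. The function name `parse_gtf_line` and the sample record (including its gene identifiers) are hypothetical and made up for illustration; they do not come from any real annotation file. Splitting attributes on `;` is a simplification that would break on values containing semicolons.

```python
# Names of the nine tab-separated GTF columns.
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]


def parse_gtf_line(line):
    """Parse one GTF record into a dict; attributes become a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])

    # The attribute column looks like: gene_id "X"; transcript_id "Y";
    attributes = {}
    for item in record["attributes"].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    record["attributes"] = attributes
    return record


# Hypothetical record, invented for illustration only.
example = "\t".join([
    "chr1", "example_source", "exon", "1000", "1200", ".", "+", ".",
    'gene_id "GENE0001"; transcript_id "GENE0001.1"; gene_name "demoA";',
])
parsed = parse_gtf_line(example)
print(parsed["feature"], parsed["start"], parsed["end"])
print(parsed["attributes"]["gene_id"])
```

GFF files use the same nine columns, but the attribute column is typically written as `key=value` pairs, so the attribute-splitting step would need to change for that format.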

Indexing the Genome

RNA-Seq reads can be mapped to the transcriptome; nonetheless, even for well-studied species such
as human, not all transcripts are known, so mapping the reads to the genome enables
identification of novel transcripts and estimation of their expression levels. Aligning reads to
the genome means comparing the reads to the reference genome and finding the best match for each
read. However, NGS data can contain millions to billions of reads, and mapping each read to the
genome in a conventional manner (e.g. with BLAST) requires enormous computational power and time.
To address this issue, most NGS read mapping algorithms use the Burrows–Wheeler transform (BWT),
also known as block-sorting compression. In this approach the reference sequence is transformed
once into a compact, quickly searchable format, which enables faster alignment of reads to the
genome; this preprocessing step is known as “indexing”.
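
The idea of indexing can be illustrated with a toy Python sketch. It builds the BWT of a short reference by sorting all rotations, then counts exact occurrences of a read using backward search over the transformed string. The function names (`bwt`, `build_index`, `count_matches`) and the reference sequence are assumptions chosen for illustration. The rotation sort is a deliberately naive construction that is only practical for tiny inputs; real aligners build their indexes with much more memory-efficient methods and also handle mismatches, which this sketch does not.

```python
def bwt(text):
    """Return the Burrows-Wheeler transform of text (naive rotation sort)."""
    text = text + "$"  # sentinel; sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_index(bwt_string):
    """Build the two lookup tables used by backward search."""
    # first_occurrence[c]: number of characters in the text smaller than c.
    first_occurrence = {}
    total = 0
    for char in sorted(set(bwt_string)):
        first_occurrence[char] = total
        total += bwt_string.count(char)

    # occ[c][i]: number of occurrences of c in bwt_string[:i].
    occ = {char: [0] * (len(bwt_string) + 1) for char in first_occurrence}
    for i, char in enumerate(bwt_string):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == char else 0)
    return first_occurrence, occ


def count_matches(read, first_occurrence, occ, n):
    """Count exact occurrences of read in the indexed reference."""
    low, high = 0, n  # half-open range of sorted rotations still matching
    for char in reversed(read):  # extend the match one base to the left
        if char not in first_occurrence:
            return 0
        low = first_occurrence[char] + occ[char][low]
        high = first_occurrence[char] + occ[char][high]
        if low >= high:
            return 0
    return high - low


# Index a toy reference once, then query several reads against it.
reference = "ACGTACGTTACG"
transformed = bwt(reference)
first_occurrence, occ = build_index(transformed)
for read in ["ACG", "CGT", "TTA", "GGG"]:
    hits = count_matches(read, first_occurrence, occ, len(transformed))
    print(read, hits)
```

The index is built once per reference, and each read is then matched in a number of steps proportional to the read length rather than the genome length, which is why the one-time indexing cost pays off when many reads are aligned.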