Next - Generation Sequencing

Practice

We now have quality-filtered raw sequence data and a well-annotated genome, so we are ready to map
(align) the sequence reads to the genome. Rename the genome and annotation files to
“human_grch37.fasta” and “human_grch37.gtf”, respectively, and copy them to the analysis folder
(the folder where the QC-filtered fastq files were saved). Open a terminal window and change to the
analysis folder using the cd command. A read mapping algorithm for RNA-Seq should be selected
based on its ability to map reads generated from different splice forms of genes. In this
chapter we use HISAT2 [17] because of its speed and low memory requirement.

Index the genome by typing the following command in the terminal.

hisat2-build -p 2 human_grch37.fasta human_grch37

-p specifies the number of processor cores (threads) to use. This command generates multiple files
with the “.ht2” extension. Indexing takes considerable time (~1–2 h) on a laptop, depending on its
configuration. The reference index can be generated on any high-end system and then copied to
another computer. You may also try indexing the genome in Galaxy by installing the HISAT2 tool.
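
Because indexing is slow and the index can be copied from another machine, it can help to check for an existing index before rebuilding it. The following is a minimal sketch, not part of HISAT2 itself: index_exists is a hypothetical helper, and it assumes that the index files follow the numbered naming pattern human_grch37.1.ht2, human_grch37.2.ht2, and so on.

```shell
#!/bin/sh
# Sketch: skip the slow hisat2-build step when an index is already present.
# index_exists is a hypothetical helper, not a HISAT2 command.
index_exists() {
    prefix=$1
    # Assumption: the index is written as numbered files such as PREFIX.1.ht2
    [ -f "${prefix}.1.ht2" ]
}

if index_exists human_grch37; then
    echo "index found, skipping build"
elif command -v hisat2-build >/dev/null 2>&1; then
    # -p sets the number of threads used for building the index
    hisat2-build -p 2 human_grch37.fasta human_grch37
else
    echo "hisat2-build not found on PATH" >&2
fi
```

Checking only the first numbered file is a shortcut; a stricter check could test every index file before skipping the build.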

Aligning Reads

Once the genome is indexed, read mapping is straightforward, as most algorithms work well with
their default parameters. However, depending on the genome and the NGS data, fine-tuning read
mapping by altering a few parameters may improve the overall results. Comparing the output of
different algorithms can also provide better insight into the results.

Practice

Rename the quality-filtered fastq files to “trimmed_normal_1.fastq”, “trimmed_normal_2.fastq”,
“trimmed_tumor_1.fastq”, and “trimmed_tumor_2.fastq” so that they are easier to identify and
track. To start the alignment, type the following command in the terminal window.

hisat2 -p 4 --rna-strandness RF --dta -x human_grch37 -1 trimmed_normal_1.fastq -2 trimmed_normal_2.fastq -S normal.sam

-p specifies the number of processors to use.
--rna-strandness specifies the strand-specific library type. Since our data was generated using a
stranded paired-end library, we use RF.
--dta reports alignments tailored for transcript assemblers, which can then be used to identify
novel transcripts.
-x specifies the base name of the index files.
-1 forward (left) reads file.
-2 reverse (right) reads file.
-S output alignment file in SAM format.
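
The command above aligns only the normal sample; the tumor sample needs the same options with different file names. The following is a minimal sketch: hisat2_cmd is a hypothetical helper that only prints the command for a given sample, and the output name tumor.sam is an assumption chosen to mirror normal.sam.

```shell
#!/bin/sh
# Sketch: print the HISAT2 alignment command for each sample.
# hisat2_cmd is a hypothetical helper; it prints the command, it does not run it.
hisat2_cmd() {
    sample=$1
    # Same options as the command for the normal sample; only file names change.
    echo "hisat2 -p 4 --rna-strandness RF --dta -x human_grch37" \
         "-1 trimmed_${sample}_1.fastq -2 trimmed_${sample}_2.fastq" \
         "-S ${sample}.sam"
}

# Assumption: tumor.sam is named by analogy with normal.sam
for sample in normal tumor; do
    hisat2_cmd "$sample"
done
```

Printing the commands first makes it easy to check the file names before starting a long alignment. Once they look right, each printed line can be pasted into the terminal and run.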
