Next-Generation Sequencing


Complete Pipeline

Data Analysis

The sequence data should be filtered based on quality and further analyzed to generate interpretable results. Many biological researchers are unfamiliar, or only vaguely familiar, with NGS data analysis workflows, which delays the interpretation of results. Since most analysis programs are command-line applications, many researchers find them difficult to use; hence, in this chapter we discuss the data analysis process in a step-by-step manner (Fig. 1) using example data sets. To keep the analysis simple, we start with the Galaxy platform [11] and gradually migrate to command-line tools.

Data Quality Check

Due to the vast amount of data, it is practically impossible to check the quality of each base present in the fastq file(s). Summarized quality parameters such as quality value, read length, base distribution across the reads, and the presence of adapter sequences and duplicated sequences provide an overall picture of data quality. To use these parameters, we need to understand what they represent. Quality value: the scaled logarithmic probability of a base-calling error (Q = −10 log10 P) [12]. To put it in perspective, a Q value of 30 means the probability of the base (nucleotide) being wrongly called is 0.001, and a Q value of 20 means the probability of the base being wrongly called is 0.01.
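As a quick shell illustration (assuming the Phred+33 encoding used by current Illumina data, where the quality character's ASCII code minus 33 gives Q):

# Convert the quality character 'I' to Q and to the error probability P = 10^(-Q/10).
q_char='I'
q=$(( $(printf '%d' "'$q_char") - 33 ))
awk -v q="$q" 'BEGIN { printf "Q=%d P=%g\n", q, 10^(-q/10) }'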

Read length distribution: the percentage of reads at each read length. Nucleotide distribution: visualizes how A, T, G, and C are distributed across all reads at each nucleotide position. If all the reads have the same nucleotide at a given position, it could be a sequencing artifact. If all the reads share the same sequence towards their 3′ end, it could be adapter sequence. Read duplication: a few sequences representing the majority of the data indicate rRNA contamination or PCR duplicates. Now that we know the data format and which quality parameters to check, let us analyze an example data set containing four fastq files. This data was generated by sequencing a paired-end stranded library on the Illumina HiSeq platform.
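Before running the full QC, a quick sanity check of the example files can be done in the shell (a sketch, assuming gzipped fastq files in the current directory; each read occupies exactly four lines):

# Count the reads in each gzipped fastq file (4 lines per read).
for f in *.gz; do
  echo "$f: $(( $(gunzip -c "$f" | wc -l) / 4 )) reads"
done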

Practice

To practice this data analysis, a computer with 4–8 GB RAM, 100 GB disk space, and a quad-core processor is required. All the analyses mentioned in this chapter were performed on a MacBook Air. These analyses can also be performed on any computer with a Unix-based operating system (such as Linux, BioLinux [13], Ubuntu, RHEL, Fedora, or Linux Mint) or a virtual machine running any Linux distribution. It is essential to know the administrator/root password of the computer to install a few of the software packages used in this chapter. We use Linuxbrew to install most of the applications utilized in this chapter. An Internet connection is a must for installing applications. Read every message shown at the command prompt (terminal), as it will help in understanding the cause of an error. Most of the time, errors are caused by spelling mistakes.

Install a Few Helpful Software Packages
Install Conda or Miniconda first, then install the following tools, which can be driven from a Jupyter notebook (a conda-based installation sketch follows the list):
1. FastQC
2. MultiQC
3. fastp
4. BWA
5. Sambamba
6. featureCounts (part of the Subread package)
7. edgeR (for differential gene expression)
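As a minimal installation sketch (assuming conda is already set up; the package names below are as published on the conda-forge and bioconda channels):

# Install the QC, trimming, alignment, deduplication, and counting tools plus edgeR.
conda install -c conda-forge -c bioconda fastqc multiqc fastp bwa sambamba subread bioconductor-edger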

Installing SoS Workflow and Notebook

If you are using conda, you can install SoS with the command
conda install sos -c conda-forge
You can also install a majority of the SoS suite of tools using commands such as

conda install sos sos-pbs sos-notebook jupyterlab-sos sos-bash sos-python sos-r -c conda-forge

This will install SoS Workflow, SoS Notebook and its JupyterLab extension, language modules for Bash, Python, and R, R itself (r-base, r-irkernel, etc.), and the required libraries (e.g., feather).

If you are not using conda, you will have to install all the pieces one by one, but you will also have the freedom to use non-conda installations of R, Julia, etc. With a working Python 3.6 (or later) installation, you can install SoS Workflow with the command pip install sos.
Alternatively, the components can be installed step by step with conda:
conda install jupyterlab-sos -c conda-forge
conda install nodejs==15.12.0 -c conda-forge
conda install sos-r sos-python sos-pbs sos-bash markdown-kernel -c conda-forge
jupyter labextension update --all
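To confirm the installation, list the kernels registered with Jupyter (jupyter kernelspec list is a standard Jupyter command; entries such as sos, ir, and python3 should appear):

# List the installed Jupyter kernels.
jupyter kernelspec list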

Run FastQC for the quality check (the -o output directory must already exist, so create it first):

mkdir -p QC
fastqc -t 2 -o QC *.gz

Run MultiQC to merge all the FastQC outputs into a single compiled report (run it in the directory containing the FastQC results):
multiqc -p .

Trimming and Filtering using fastp

fastp -w 10 -a AGATCGGAAGAGC -i $i"R1" -I $i"R2" -o $i"_Trimmed_R1.fq.gz" -O $i"_Trimmed_R2.fq.gz"
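Here $i holds the per-sample file-name prefix. A minimal sketch of running fastp over every sample pair (assuming files are named like sample1R1.fq.gz and sample1R2.fq.gz; adjust the pattern to your data):

# Loop over all R1 files, derive the sample prefix, and trim each pair.
for f in *R1.fq.gz; do
  i=${f%R1.fq.gz}
  fastp -w 10 -a AGATCGGAAGAGC -i "$i"R1.fq.gz -I "$i"R2.fq.gz -o "$i"_Trimmed_R1.fq.gz -O "$i"_Trimmed_R2.fq.gz
done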

Download and index the reference genome

Search the NCBI Genome or UCSC Genome Browser, download the genome FASTA (.fna) and annotation (.gff) files, and keep them in the same working directory.
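For example, the E. coli K-12 (ASM584v2) reference used in the commands below can be fetched from the NCBI FTP site (a sketch; the URLs follow NCBI's current layout and should be verified before use):

# Download the genome FASTA and GFF annotation, then decompress the FASTA for bwa.
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz
gunzip GCF_000005845.2_ASM584v2_genomic.fna.gz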

Aligning the Reads to reference genome

For indexing, use the command

bwa index GCF_000005845.2_ASM584v2_genomic.fna

Align the reads against the indexed genome and pipe through samtools to convert the SAM output into a sorted BAM file (this runs inside the same per-sample loop as the fastp step, hence the closing done):

bwa mem -t 6 -R '@RG\tID:'$i'\tSM:Illumina' GCF_000005845.2_ASM584v2_genomic.fna $i"R1" $i"R2" | samtools view -@ 2 -b | samtools sort -@ 2 -o $i"_sorted.bam"; done
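After alignment, a quick look at the mapping statistics helps catch problems early (samtools flagstat is part of the samtools suite already used above):

# Summarize the alignment statistics for each sorted BAM file.
for b in *_sorted.bam; do echo "== $b =="; samtools flagstat "$b"; done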

Removal of PCR duplicates using Sambamba markdup

sambamba markdup -r -t 10 $i”_sorted.bam” $i”_rmdup.bam”

By default, the tool expects a BAM file as input. To work with a SAM file, specify the -S|--sam-input command-line option. Note, however, that only syntax is checked; use --valid for full validation.
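Downstream tools often require a BAM index; one can be built for each deduplicated file (sambamba index is part of the same tool):

# Index each deduplicated BAM for fast random access.
for b in *_rmdup.bam; do sambamba index -t 4 "$b"; done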

Generate the gene expression (counts) file for DGE analysis using the featureCounts program

featureCounts -T 10 -p -g ID -t gene -a GCF_000005845.2_ASM584v2_genomic.gff.gz -o counts *_rmdup.bam

# Extract the .gff file into the working folder
# Keep the genome folder outside the fastq files folder
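The resulting counts file begins with a comment line, followed by a header row with the columns Geneid, Chr, Start, End, Strand, and Length, and then one count column per BAM file; this is why columns 1 and 7–12 are selected in R below. A quick peek from the shell:

# Show the comment line, the header, and the first gene row of the counts table.
head -n 3 counts | cut -f 1-8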
Open RStudio (or R in a Jupyter notebook) and run the following commands:

Set the working directory and read in the counts file (featureCounts writes a comment line first, which is skipped here):

setwd("/path/to/working/directory")
getwd()
counts <- read.delim("counts", header = TRUE, skip = 1)
head(counts)

Install the edgeR package (edgeR is distributed through Bioconductor, not CRAN):

install.packages("BiocManager")
BiocManager::install("edgeR")

To load edgeR:

library(edgeR)
dim(counts)

Extract the required columns (column 1 is Geneid; columns 7–12 hold the six sample counts):

x <- counts[c(1, 7:12)]
head(x)
row.names(x) <- x$Geneid
x <- x[, -1]
Define the sample groups for the two conditions, build the design matrix, fit the model, and test for differential expression:

group <- factor(c(1, 1, 1, 2, 2, 2))
design <- model.matrix(~group)
y <- DGEList(counts = x, group = group)
head(y)
y <- estimateDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef = 2)
lrt$table
DEresults <- data.frame(topTags(lrt, p.value = 1, n = 4419))
DEresults <- lrt$table
with(DEresults, plot(logFC, -log10(PValue), pch = 20, main = "Volcano plot", xlim = c(-2.5, 2)))

# Add colored points: red if PValue<0.05, orange if abs(log2FC)>1, green if both
with(subset(DEresults, PValue < .05), points(logFC, -log10(PValue), pch = 20, col = "red"))
with(subset(DEresults, abs(logFC) > 1), points(logFC, -log10(PValue), pch = 20, col = "orange"))
with(subset(DEresults, PValue < .05 & abs(logFC) > 1), points(logFC, -log10(PValue), pch = 20, col = "green"))

# Label the significant points with the textxy function from the calibrate package
library(calibrate)
sig <- subset(DEresults, PValue < .05 & abs(logFC) > 1)
with(sig, textxy(logFC, -log10(PValue), labs = rownames(sig), cex = .8))

head(DEresults)
# Install pheatmap to draw a heatmap for easier identification
install.packages('pheatmap')
require(pheatmap)
normalizedReadCounts <- cpm(y, normalized.lib.sizes = TRUE)
pheatmap(na.omit(normalizedReadCounts[1:20, ]), scale = 'row')
NormforHeatmap <- normalizedReadCounts[!rowSums(normalizedReadCounts) <= 0, ]
pheatmap(NormforHeatmap[1:1000, ], scale = 'row')