Next-Generation Sequencing

File Format

VCF

VCF (Variant Call Format) contains information about genetic variants found at specific positions relative to a reference genome. The VCF header includes the VCF file format version and the variant caller version, and it lists the annotations used in the rest of the file. The header also records the reference genome file and the BAM file. The last line of the header contains the column headings for the data lines (Danecek et al., 2011).

VCF data lines: each data line contains information about a single variant. VCFtools is a program package designed for working with VCF files. It can perform the following operations on VCF files:

•Filter out specific variants.
•Compare files.
•Summarize variants.
•Convert to different file types.
•Validate and merge files.
•Create intersections and subsets of variants.
For example, each data line contains information about a particular position in the genome, such as:

1. A SNP (G→A) with a quality score of 29.
2. A possible SNP (T→A) that has been filtered out because its quality is lower than 10.
3. A site at which two alternate alleles are called.
4. A site that is called monomorphic reference (i.e., with no alternate alleles).
5. A microsatellite with two alternative alleles, one a deletion of two bases (TC) and the other an insertion of one base (T).
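
As a minimal sketch of how such data lines can be read, the Python snippet below splits one tab-delimited VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The helper name `parse_vcf_line` is hypothetical, and the sample records are illustrative values modeled on the example in the VCF specification with a simplified INFO column; they are not output from a real variant caller.

```python
# Fixed columns that every VCF data line starts with.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one VCF data line into a dict of its eight fixed columns.

    Hypothetical helper for illustration; genotype columns are ignored.
    """
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    # ALT may hold several comma-separated alleles; "." means none.
    record["ALT"] = [] if record["ALT"] == "." else record["ALT"].split(",")
    # QUAL is a number, or "." when missing.
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    return record


# Illustrative records (assumed values, simplified INFO column).
lines = [
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14",
    "20\t17330\t.\tT\tA\t3\tq10\tDP=11",
    "20\t1110696\trs6040355\tA\tG,T\t67\tPASS\tDP=10",
    "20\t1230237\t.\tT\t.\t47\tPASS\tDP=13",
]

for line in lines:
    rec = parse_vcf_line(line)
    if rec["FILTER"] == "PASS":
        status = "kept"
    else:
        status = "filtered (" + rec["FILTER"] + ")"
    print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["QUAL"], status)
```

For real analyses a dedicated tool such as VCFtools is usually the safer choice, since it also handles the header and genotype columns.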

BED
The BedGraph (*.bg) format is a variation of the BED format that allows continuous-valued data to be displayed in track format. This display type is useful for probability scores and transcriptome data. The data are preceded by a track definition line, which adds a number of options for controlling the default display of the track. The fourth column of the file provides information about regions of the genome with sufficient read coverage. Thus, after this format is converted into bigWig format (a binary indexed version), it is suitable for visualizing sequencing data in the UCSC Genome Browser.
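
The layout described above can be sketched in a few lines of Python: skip the track definition line, read the four columns, and use the fourth column to find well-covered regions. The function names, the threshold of 10, and the example values are assumptions made for illustration only.

```python
def read_bedgraph(lines):
    """Yield (chrom, start, end, value) from bedGraph text lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        columns = line.split()
        # Track and browser definition lines carry display options, not data.
        if columns[0] in ("track", "browser"):
            continue
        chrom, start, end, value = columns[:4]
        yield chrom, int(start), int(end), float(value)


def covered_bases(records, min_value):
    """Sum the lengths of intervals whose value is at least min_value."""
    return sum(end - start for _, start, end, value in records if value >= min_value)


# Hypothetical coverage track.
example = [
    'track type=bedGraph name="hypothetical coverage"',
    "chr1\t0\t100\t2.0",
    "chr1\t100\t250\t12.5",
    "chr1\t250\t300\t30.0",
]
print(covered_bases(read_bedgraph(example), min_value=10))
```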

The BED format provides a simpler way of representing the features of a sequence. Each line represents one feature and has only three required fields: name (of chromosome or scaffold), start, and end. The BED format uses 0-based coordinates for starts and 1-based coordinates for ends. Header lines are also allowed; they must be preceded by # and are ignored. The first three columns in a BED file are required, and further columns are optional (Quinlan, 2014).
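
Because the start is 0-based and the end is 1-based, the length of a BED feature is simply end minus start. The sketch below, with hypothetical helper names and made-up intervals, reads the three required columns and converts an interval to 1-based inclusive coordinates such as those used by GFF.

```python
def parse_bed_line(line):
    """Return (chrom, start, end) from a BED line, or None for header lines."""
    if line.startswith("#") or not line.strip():
        return None
    fields = line.rstrip("\n").split("\t")
    # Only the first three columns are required; extra columns are ignored here.
    return fields[0], int(fields[1]), int(fields[2])


def bed_to_one_based(start, end):
    """Convert a BED interval (0-based start, 1-based end) to 1-based inclusive."""
    return start + 1, end


# Hypothetical BED content with a header line and an optional fourth column.
bed_text = "# hypothetical example\nchr1\t0\t100\tfeatureA\nchr1\t150\t200\n"
for raw in bed_text.splitlines():
    parsed = parse_bed_line(raw)
    if parsed is None:
        continue
    chrom, start, end = parsed
    print(chrom, "length:", end - start, "1-based:", bed_to_one_based(start, end))
```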

The GFF/GTF/BED formats are the so-called interval formats, which retain only the coordinate positions of a region in a genome. A genome interval format can describe more or less all genetic structures, alterations, variants, etc.:

•Genes: exons, introns, UTRs, promoters
•Conservation
•Genetic variation
•Transposons
•Origins of replication
•TF binding sites
•CpG islands
•Segmental duplications
•Sequence alignments
•Chromatin annotations
•Gene expression information

Because we are dealing with intervals, many complex analyses can be reduced to genome arithmetic. Most of these analyses need only basic mathematical operations such as addition, subtraction, multiplication, and division. For this reason, bioinformaticians developed a tool for genome “calculations”: BEDTools (https://bedtools.readthedocs.io/en/latest/).
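
To illustrate what genome arithmetic means in practice, the following sketch implements two interval operations, intersection and subtraction, using nothing more than comparisons, `max`, and `min`. It is a toy illustration under the assumption of BED-style coordinates and is not how BEDTools itself is implemented; the function names are hypothetical.

```python
def intersect(a, b):
    """Return the overlap of two (chrom, start, end) intervals, or None.

    Coordinates are assumed to be BED-style: 0-based start, exclusive end.
    """
    if a[0] != b[0]:
        return None  # different chromosomes never overlap
    start = max(a[1], b[1])
    end = min(a[2], b[2])
    if start >= end:
        return None  # intervals are disjoint or merely adjacent
    return (a[0], start, end)


def subtract(a, b):
    """Return the parts of interval a that are not covered by interval b."""
    overlap = intersect(a, b)
    if overlap is None:
        return [a]
    pieces = []
    if a[1] < overlap[1]:
        pieces.append((a[0], a[1], overlap[1]))  # piece left of the overlap
    if overlap[2] < a[2]:
        pieces.append((a[0], overlap[2], a[2]))  # piece right of the overlap
    return pieces


# Made-up intervals: a gene and a repeat element inside it.
gene = ("chr1", 100, 200)
repeat = ("chr1", 120, 150)
print(intersect(gene, repeat))
print(subtract(gene, repeat))
```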

GFF

GFF stands for General Feature Format, and GTF stands for Gene Transfer Format. Both are annotation files. An annotation can be thought of as a label applied to a particular region of a sequence. The GFF/GTF formats are nine-column, tab-delimited formats. Every line represents a region on the annotated sequence, and these regions are called features.

Features can be functional elements (e.g., genes), genetic polymorphisms (e.g., SNPs, INDELs, or structural variants), or any other annotations. Every feature should have a type associated with it; examples are SNPs, introns, ORFs, UTRs, etc. In the GFF format both the start and the end of a feature are 1-based. The GTF format is identical to version 2 of the GFF format. The feature field is the same as in GFF, except that it also accepts the following optional values: 5’UTR, 3’UTR, inter, inter_CNS, and intron_CNS. The group field has been expanded into a list of attributes. Each attribute consists of a type/value pair. Attributes must end in a semicolon and be separated from any following attribute by exactly one space.
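
The nine-column layout and the attribute list can be sketched as follows. The column names follow the common GTF convention, while the helper names and the example line (including the `gene_id` and `transcript_id` values) are hypothetical. The parser assumes that no semicolon appears inside a quoted value.

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def parse_gtf_attributes(attribute_field):
    """Parse the attribute column into a dict of type -> value.

    Sketch only: assumes no semicolons occur inside quoted values.
    """
    attributes = {}
    for chunk in attribute_field.strip().split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        tag, _, value = chunk.partition(" ")
        attributes[tag] = value.strip().strip('"')
    return attributes


def parse_gtf_line(line):
    """Split one GTF line into its nine tab-delimited columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-delimited columns, got %d" % len(fields))
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based
    record["end"] = int(record["end"])      # 1-based, inclusive
    record["attributes"] = parse_gtf_attributes(record.pop("attribute"))
    return record


# Hypothetical annotation line.
line = 'chr1\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";'
record = parse_gtf_line(line)
print(record["feature"], record["start"], record["end"], record["attributes"])
```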

A BAM file (*.bam) is the compressed binary version of a SAM file. Because BAM files are binary, they cannot be opened like text files; they are compressed and can be sorted and/or indexed. They consist of a header section and an alignment section. The header contains information about the entire file, such as sample name and length. Alignments contain the name, sequence, and quality of a read. Alignment information and custom tags are also found in the alignment section.
Currently, the GFF/GTF and BED formats are among the most commonly used data formats for supplying genomic annotation information in next-generation sequencing data analysis. The GFF (General Feature Format) format was created by Richard Durbin and David Haussler at the Wellcome Trust Sanger Institute in England. The GTF (Gene Transfer Format) format is similar to GFF version 2. GFF files use the file extension “.gff” and GTF files use the extension “.gtf”.

FASTA
FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. The simplicity of FASTA format makes it easy to manipulate and parse with text-processing tools and scripting languages such as R, Python, Ruby, and Perl. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line starts with a “>” symbol and can thus be distinguished from the sequence data.

FASTA files commonly use the file extension “.fa” or “.fasta”. FASTA has been a standard format for nucleotide sequences since first-generation sequencing. For next-generation sequencing, it is the standard format for reference genome sequences used by mapping/alignment tools.

Sequence lines consist of characters representing the nucleotide bases in the sequence. They are usually no more than eighty characters long, but there is no limit on the total number of lines. As with sequences in FASTQ files, every nucleotide base is encoded as one character (one of A, T, G, C, or N if undetermined; case-insensitive). FASTA format can be used to represent a single sequence, such as one gene, or multiple sequences.
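
A minimal FASTA reader, sketched in Python under the assumptions stated above (a “>” description line followed by any number of sequence lines, case-insensitive bases), might look like this. The function name `read_fasta` and the example sequences are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (description, sequence) pairs from a FASTA-formatted text stream."""
    description, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        else:
            # Bases are case-insensitive, so normalize to upper case.
            chunks.append(line.upper())
    if description is not None:
        yield description, "".join(chunks)


# Hypothetical two-record file held in memory.
example = io.StringIO(">seq1 hypothetical example\nACGTN\nacgt\n>seq2\nGGCC\n")
for name, seq in read_fasta(example):
    print(name, len(seq), seq)
```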

FASTQ

The FASTQ format is a text-based standard format for storing both a DNA sequence and its corresponding quality scores from NGS (Cock et al., 2010). There are four lines per sequencing read in the format.

FASTQ format example:

@M02605:65:000000000-JMM64:1:1101:18168:1561 2:N:0:64
TCTTTCTACTTCTTCTTTTACCTTTTTCTTTCCCTTGCTTCTTTCCCGTTCCTTTCTTTTTTGACCTTTTTTTTTCTTTCACCTTTTTTTTTTTCCTTTTCTTCGCTCTTTTCCCCCTTCCATGTTTCTTTTCC>
+
—,8,

FASTQ files are plain text files with the extension “.fq” or “.fastq” and can be viewed directly from the command line on computers running a Unix/Linux operating system. In a FASTQ file, every sequence (short read) from NGS is defined by four lines of text:

The first line starts with a “@” character, followed by a sequence identifier and an optional description.
The second line holds the raw sequence letters: A, T, G, C, and N (unknown).
The third line begins with a “+” symbol and is optionally followed by the same sequence identifier and description again. The “+” sign is a marker indicating the end of the sequence.
The fourth line holds the quality values for the sequence in the second line and must contain the same number of symbols as there are letters in the sequence.
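
The four-line structure can be checked with a short reader such as the sketch below. It enforces the “@” and “+” markers and the equal-length rule for sequence and quality. The function name `read_fastq` and the example read are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) for each four-line record."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        header = header.rstrip("\n")
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("record does not start with '@': " + header)
        if not plus.startswith("+"):
            raise ValueError("third line does not start with '+'")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


# Hypothetical single-read file held in memory.
example = io.StringIO("@read1 hypothetical\nACGTN\n+\nIIII#\n")
for identifier, sequence, quality in read_fastq(example):
    print(identifier, sequence, quality)
```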

The FASTQ format places no restriction on the total number of sequences in a file, but because the number of short reads from a next-generation sequencing experiment is very large, each sample may, for convenience, use several FASTQ files to hold all of its reads. In most cases, the FASTQ files are compressed to GNU zip format (with the .gz file extension) to reduce file size. Because of the high volume of sequence reads in a FASTQ file and the lack of genomic position information for each read, it is not practical to view or check individual reads manually. During analysis, the quality assessment of short reads is usually performed with software tools. The most commonly used one is FastQC, developed by Simon Andrews at the Babraham Institute, which takes FASTQ files as input and generates summary graphs and tables for a quick overview of raw read quality.
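
Compressed FASTQ files do not need to be unpacked before a quick check. As a rough sketch, Python's standard `gzip` module can read a .gz file as text, and dividing the line count by four gives the number of reads. The function name `count_fastq_reads`, the file name, and the reads are made up for the demonstration.

```python
import gzip
import os
import tempfile


def count_fastq_reads(path):
    """Count reads in a FASTQ file, handling .gz compression transparently."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        line_count = sum(1 for line in handle if line.strip())
    if line_count % 4 != 0:
        raise ValueError("FASTQ line count is not a multiple of four")
    return line_count // 4


# Build a tiny compressed file to demonstrate; the reads are made up.
records = "@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.fastq.gz")
    with gzip.open(path, "wt") as out:
        out.write(records)
    print(count_fastq_reads(path))
```

A count like this says nothing about read quality; for that, a dedicated tool such as FastQC remains the usual choice.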

BAM and SAM

BAM/SAM files play a key role in the analysis of next-generation sequencing data because the raw sequence reads from sequencers have no genomic position information associated with them and must be mapped/aligned to a known reference genome. The mapping/alignment outputs are BAM/SAM files, which can serve as inputs for various downstream analyses such as feature counting and variant calling.

SAM stands for Sequence Alignment/Map format. It is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. SAM files usually use the file extension “.sam” and are tab-delimited text files that contain sequence alignment data. BAM is the binary version of a SAM file and usually has the file extension “.bam”. BAM and SAM files contain the same information, and with the SAMtools software a BAM file can easily be converted to a SAM file (Li et al., 2009).

A BAM/SAM file has two sections: a header section and an alignment section. Details of the BAM/SAM format are described on the SAMtools website (http://samtools.sourceforge.net); a short description of the header section and the alignment section follows.

In the alignment section of a SAM file, CIGAR (Concise Idiosyncratic Gapped Alignment Report) is one of the eleven mandatory fields of every alignment line. This string is a sequence of base lengths and associated operations. The CIGAR string is used to indicate whether there is a match or mismatch between the bases of the read and the reference sequence. It is quite useful for finding insertions or deletions (indels). The string is read-based, and two different versions are available.
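
To make the "base lengths and associated operations" concrete, the sketch below splits a CIGAR string into (length, operation) pairs and derives how many reference bases and read bases the alignment covers. The operation letters follow the SAM specification; the function names are hypothetical, and the example string is used purely for illustration.

```python
import re

# One CIGAR element: a length followed by a single operation letter.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_FULL = re.compile(r"(\d+[MIDNSHP=X])+")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) tuples."""
    if cigar == "*":
        return []  # "*" means the CIGAR is unavailable
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError("malformed CIGAR string: " + cigar)
    return [(int(n), op) for n, op in CIGAR_ELEMENT.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered (operations M, D, N, =, X)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar):
    """Number of read bases described (operations M, I, S, =, X)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


# 8 aligned bases, a 2-base insertion, 4 aligned, a 1-base deletion, 3 aligned.
cigar = "8M2I4M1D3M"
print(parse_cigar(cigar))
print("reference span:", reference_span(cigar), "read length:", read_length(cigar))
```

Note that the M operation marks an aligned position, which may be either a match or a mismatch; insertions (I) and deletions (D) are the elements that reveal indels.
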
As the name Sequence Alignment/Map suggests, a SAM file is what you get after mapping FASTQ files to a reference genome. It is a tab-delimited text format consisting of a header section, which is optional but recommended, and an alignment section. Header lines begin with “@”, while alignment lines do not.
Each alignment line has eleven mandatory fields for essential alignment information such as mapping position, and a variable number of optional fields for flexible or aligner-specific information.
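
The distinction between header lines and alignment lines, and the split into eleven mandatory fields plus optional tags, can be sketched as follows. The field names are those of the SAM specification; the helper name `parse_sam_line` and the example records (including the `NM` tag value) are illustrative assumptions. Reading BAM files requires a dedicated tool such as SAMtools, which this sketch does not replace.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Parse one SAM line; return None for header lines starting with '@'."""
    if line.startswith("@"):
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 11:
        raise ValueError("alignment line has fewer than 11 fields")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    # Optional fields follow the TAG:TYPE:VALUE layout.
    tags = {}
    for item in columns[11:]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    record["tags"] = tags
    return record


# Hypothetical header line and alignment line.
sam_lines = [
    "@SQ\tSN:ref\tLN:45",
    "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\tNM:i:1",
]
for line in sam_lines:
    record = parse_sam_line(line)
    if record is not None:
        print(record["QNAME"], record["RNAME"], record["POS"], record["CIGAR"], record["tags"])
```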