Common File Formats in Bioinformatics


CD Genomics offers in-depth and comprehensive bioinformatics analysis services and supports the output and analysis of data in multiple file formats. The following are some of the most common file formats used in bioinformatics:

FASTQ: The FASTQ format is the de facto standard for raw or lightly processed data coming off an Illumina machine. When performing whole-genome sequencing, the Illumina processing pipeline typically separates reads by barcode into different files. .fq and .fastq are common file extensions for FASTQ files. To save space, almost all FASTQ files obtained from a sequencer are gzipped, so the file must be decompressed before it can be viewed. When only a few lines need to be visually examined, it's best to use gzcat (zcat on most Linux systems) and pipe the output to head.
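
The same "look at the first few records" idea can be sketched in Python. This is only a minimal illustration: it assumes the usual four-line record layout (header, sequence, '+' separator, quality) and Phred+33 quality encoding, and the function names read_fastq and head_fastq_gz are hypothetical, not part of any library.

```python
import gzip
import io
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a text handle.

    Assumes strict four-line records; multi-line FASTQ is not handled.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'


def head_fastq_gz(path, n_records=2):
    """Return the first n_records of a gzipped FASTQ file without fully decompressing it."""
    with gzip.open(path, "rt") as handle:
        return list(islice(read_fastq(handle), n_records))


# In-memory demo data (made up for illustration).
demo = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n"
records = list(read_fastq(io.StringIO(demo)))

# Decode the quality string of the first read, assuming Phred+33.
phred = [ord(c) - 33 for c in records[0][2]]
print(records)
print(phred)
```

For real data, head_fastq_gz("sample.fastq.gz") would be called with the path to an actual file; the file name here is only a placeholder.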

FASTA: Whereas the FASTQ format is designed to represent short DNA sequences generated by high-throughput sequencing machines together with their quality scores, longer DNA sequences that have usually been assembled from many sequencing reads and no longer carry quality scores are represented in a simpler, leaner format. This is the FASTA format, which is used, for example, to store the DNA sequence of reference genomes. Each record consists of a header line beginning with '>' followed by one or more lines of sequence. The file extensions .fa, .fasta, or .fna are commonly used for FASTA files, with the latter indicating a nucleotide file.
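
Because a FASTA record is just a '>' header line followed by sequence lines, a reader is short to write. The sketch below is a minimal example under that assumption; read_fasta is a hypothetical helper name and the records are invented for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs; sequence lines are joined into one string."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)  # emit the final record


# In-memory demo data (made up for illustration).
demo = ">chr_demo hypothetical record\nACGTAC\nGTTA\n>second\nNNNN\n"
sequences = dict(read_fasta(io.StringIO(demo)))
for header, seq in sequences.items():
    print(header, len(seq))
```

For large reference genomes an indexed reader is usually preferable, since this sketch holds each full sequence in memory.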

GenBank: The GenBank format, which is commonly used by public databases such as NCBI, is arguably the standard sequence file format. It is very flexible, allowing you to include annotations, comments, and references. Because the file is plain text, it can be viewed using a text editor. The file extension '.gb' or '.genbank' is commonly used for GenBank files.

EMBL: The EMBL format, which is similar to the GenBank format in appearance, is used by public databases such as those of the European Molecular Biology Laboratory. Like GenBank, the EMBL file format is very flexible, allowing you to include annotations, comments, and references. Because the file is plain text, it can be viewed using a text editor. The file extension '.embl' is commonly used for EMBL files.

ABI: ABI is a binary file format that stores Sanger sequencing sequences as well as trace data. The format is used by sequencing facilities, and viewing the trace data or extracting the sequence requires special readers that understand the file format. Because of its binary nature and the intricacy of the spec, the format is hard to parse.

PDB: The PDB file format is used to store both sequence data and, more importantly, three-dimensional structure data. This information can be used to visualize the crystal structure of a molecule (typically a protein). PDB files are plain text files that can be viewed with a text editor and usually have the extension '.pdb'.
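
Since PDB files are plain text with fixed-width columns, the coordinates of an atom can be sliced directly out of an ATOM line. The following is a minimal sketch that assumes the standard ATOM column layout (atom name in columns 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54). The helper names are hypothetical, and the demo lines are generated by a small formatter rather than taken from a real structure.

```python
import math


def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a demo ATOM line with the fixed-width columns the parser expects."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom_line(line):
    """Slice the main fields out of an ATOM record (0-based Python slices)."""
    return {
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


# Two invented atoms, 5 units apart, purely for illustration.
demo_lines = [
    format_atom(1, "N", "MET", "A", 1, 0.0, 0.0, 0.0),
    format_atom(2, "CA", "MET", "A", 1, 3.0, 4.0, 0.0),
]
atoms = [parse_atom_line(l) for l in demo_lines if l.startswith("ATOM")]

# Euclidean distance between the two atoms.
distance = math.dist(
    (atoms[0]["x"], atoms[0]["y"], atoms[0]["z"]),
    (atoms[1]["x"], atoms[1]["y"], atoms[1]["z"]),
)
print(atoms[1]["name"], distance)
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in real PDB files can run together without a separating space.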

MDL: While the MDL file format does not strictly contain sequence data, it is worth including in this list. The MDL mol file format contains information about small molecules, with a spec that is very similar to the PDB file format. An MDL mol file describes the two-dimensional (and possibly three-dimensional) structure of a molecule, including atom types and atom connectivity.
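
To make "atom type and atom connectivity" concrete, here is a minimal sketch of reading the atom and bond blocks from a mol file. It assumes the fixed-column V2000 layout (three header lines, a counts line, then one line per atom and one per bond) and ignores everything else in the file. parse_molfile is a hypothetical name, and the two-atom demo molecule is hand-built for illustration.

```python
def parse_molfile(text):
    """Return (atoms, bonds) from a V2000 mol file string.

    atoms: list of (symbol, x, y, z); bonds: list of (atom1, atom2, bond_type)
    with 1-based atom indices, as stored in the file.
    """
    lines = text.splitlines()
    counts = lines[3]  # fourth line holds the atom and bond counts
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        symbol = line[31:34].strip()
        atoms.append((symbol, float(line[0:10]), float(line[10:20]), float(line[20:30])))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds


# Hand-built demo: a carbon and an oxygen joined by a single bond.
demo_lines = [
    "demo molecule (hypothetical)",
    "  hand-written",
    "",
    f"{2:3d}{1:3d}  0  0  0  0  0  0  0  0999 V2000",
    f"{0.0:10.4f}{0.0:10.4f}{0.0:10.4f} {'C':<3} 0  0",
    f"{1.43:10.4f}{0.0:10.4f}{0.0:10.4f} {'O':<3} 0  0",
    f"{1:3d}{2:3d}{1:3d}  0",
    "M  END",
]
atoms, bonds = parse_molfile("\n".join(demo_lines))
print(atoms)
print(bonds)
```

For production work a cheminformatics toolkit is the safer choice, since real mol files carry charges, stereo flags, and property blocks that this sketch skips.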

BAM/SAM: Next-generation sequencing data is stored in the BAM/SAM formats. The BAM file format is binary, whereas the SAM file format is text-based but contains the same information. SAMtools, an open-source command-line tool, and IGV, a graphical viewer, can both read and display these files. Both the BAM and SAM formats can store not only the sequence data for next-generation sequencing reads but also the alignment of those reads to a reference sequence.
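
Because SAM is tab-delimited text, one alignment line can be split into its eleven mandatory fields with very little code. The sketch below handles only SAM text (BAM is compressed binary and needs a tool such as SAMtools to convert it first). The function names and the demo read are hypothetical.

```python
import io

# The eleven mandatory SAM columns, in file order.
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")
INT_FIELDS = ("flag", "pos", "mapq", "pnext", "tlen")


def parse_sam_line(line):
    """Turn one alignment line into a dict; optional tags are kept as raw strings."""
    cols = line.rstrip("\n").split("\t")
    rec = dict(zip(SAM_FIELDS, cols[:11]))
    for key in INT_FIELDS:
        rec[key] = int(rec[key])
    rec["tags"] = cols[11:]
    rec["is_unmapped"] = bool(rec["flag"] & 0x4)   # flag bit 0x4: read unmapped
    rec["is_reverse"] = bool(rec["flag"] & 0x10)   # flag bit 0x10: reverse strand
    return rec


def read_sam(handle):
    """Yield parsed alignments, skipping '@' header lines."""
    for line in handle:
        if line.startswith("@") or not line.strip():
            continue
        yield parse_sam_line(line)


# In-memory demo data (made up for illustration).
demo = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "read1\t16\tchr1\t101\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\n"
)
alignments = list(read_sam(io.StringIO(demo)))
print(alignments[0]["rname"], alignments[0]["pos"], alignments[0]["is_reverse"])
```

The pos field is 1-based in SAM, so it is kept as stored rather than converted to a 0-based offset.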

SFF: The SFF file format specifies a binary file that stores next-generation sequencing reads. The name stands for Standard Flowgram Format; the file includes the actual flow data produced by a number of next-generation DNA sequencers, such as Ion Torrent and Roche's '454'.

