DNA methylation at the fifth carbon of cytosine (5-methylcytosine, 5-mC) is a stable epigenetic modification involved in a variety of biological processes, including gene silencing, transposable element suppression, genomic imprinting, and X chromosome inactivation. Identifying and quantifying methylation is important for understanding gene expression and other forms of epigenetic regulation.
By treating DNA with sodium bisulfite before sequencing, whole-genome bisulfite sequencing (WGBS) can identify methylated cytosines: bisulfite converts unmethylated cytosines to uracil, which is read as thymine after amplification, while methylated cytosines are left unchanged. WGBS has become the gold standard for studying genome-wide methylation at single-base resolution.
Figure 1. Workflow for analysis of DNA methylation using data from bisulfite sequencing experiments. (Wreczycka, 2017)
Many computational tools have been created to make WGBS data analysis easier since the method was first used to measure genome-wide patterns of DNA methylation. In general, assessing WGBS data involves several steps.
Sequencing reads must first be preprocessed. Second, reads are mapped to a reference genome in a way that allows for the differences between read and reference sequences caused by bisulfite conversion. This can be accomplished by aligning to a three-base genome in which all cytosines have been replaced by thymines. Third, the reads mapped to each cytosine must be used to calculate DNA methylation levels across the genome. Finally, a more in-depth analysis of the biological question of interest is required, which usually entails identifying regions of differential DNA methylation between samples or between regions of the genome.
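The three-base idea can be illustrated with a minimal Python sketch. The function names and the example sequences are hypothetical, and real bisulfite aligners perform this conversion internally with far more bookkeeping; the sketch only shows why a converted read can match a converted reference regardless of methylation state.

```python
def c_to_t(seq: str) -> str:
    """Replace every cytosine with thymine (three-base alphabet, forward strand)."""
    return seq.upper().replace("C", "T")


def g_to_a(seq: str) -> str:
    """Replace every guanine with adenine (three-base alphabet for reverse-strand reads)."""
    return seq.upper().replace("G", "A")


# Hypothetical example: the reference has cytosines at indices 1, 4 and 5.
reference = "ACGTCCGA"
# In the read, the cytosines at indices 1 and 4 were unmethylated (now T),
# while the cytosine at index 5 was methylated and stayed C.
read = "ATGTTCGA"

# After in silico conversion, read and reference are identical, so the read
# can be placed on the genome; methylation is then called by comparing the
# original read with the original reference at each cytosine.
print(c_to_t(read) == c_to_t(reference))
```

Converting both the read and the reference removes the methylation-dependent mismatches during alignment, at the cost of reduced sequence complexity.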
Trim Reads: Before aligning, remove adapter bases and low-quality bases from reads. The sequence of the adapters used in the experiment must be known in order to trim them properly. The most commonly used adapter is generally the Illumina TruSeq adapter, which has the following sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC.
Compress Reads: Because DNA sequencing data can be very large, FASTQ files should be gzip-compressed at this point to save disk space.
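As a rough illustration of 3' adapter trimming, the sketch below removes an exact adapter match, or a read suffix that matches the start of the adapter. The function name and the `min_overlap` parameter are assumptions made for this example; dedicated trimming tools additionally tolerate sequencing errors and trim low-quality bases, which this sketch does not attempt.

```python
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"  # TruSeq adapter sequence given above


def trim_adapter(read: str, adapter: str = ADAPTER, min_overlap: int = 3) -> str:
    """Return the read with any exact 3' adapter sequence removed."""
    # Case 1: the whole adapter occurs somewhere inside the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends partway through the adapter. Try the longest
    # possible overlap first (smallest start index).
    first = max(0, len(read) - len(adapter) + 1)
    for start in range(first, len(read) - min_overlap + 1):
        if adapter.startswith(read[start:]):
            return read[:start]
    return read


print(trim_adapter("ACGTACGT" + ADAPTER[:10]))
```

When trimming real FASTQ records, the quality string has to be shortened to the same length as the trimmed sequence.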
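Compression is usually done with the gzip command-line tool, but the same step can be sketched in Python with the standard library. The function name and the `keep_original` parameter are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gzip_fastq(path: str, keep_original: bool = False) -> str:
    """Compress a FASTQ file to <path>.gz and return the new file name."""
    gz_path = path + ".gz"
    # Stream the file in chunks so large FASTQ files never sit fully in memory.
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    if not keep_original:
        Path(path).unlink()  # remove the uncompressed copy to save disk space
    return gz_path
```

A call such as `gzip_fastq("sample.fastq")` would leave `sample.fastq.gz` in place of the original file.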
Sequencing Quality Control: Run FastQC on each FASTQ file to assess the quality of your sequencing reads. This produces an HTML report summarizing each quality check. When examining the report, the base quality scores and overrepresented sequences are the most important elements. Overrepresented sequences can result from inadequate read trimming. Poor quality scores indicate a bad sequencing run, which could be caused by a variety of factors.
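To make the per-base quality check concrete, the sketch below computes the mean Phred score at each read position from FASTQ quality strings. It assumes Phred+33 encoding and uses a hypothetical function name; it is a simplified stand-in for the idea behind a per-base quality plot, not a replacement for FastQC.

```python
from typing import Iterable, List


def mean_quality_per_position(quality_strings: Iterable[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position (assumes Phred+33 encoding)."""
    totals: List[int] = []
    counts: List[int] = []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# 'I' encodes Phred 40 and '5' encodes Phred 20 under Phred+33.
print(mean_quality_per_position(["II", "I5"]))  # [40.0, 30.0]
```

A steady drop in the mean toward the 3' end of reads is common; a sharp or early drop is a reason to look more closely at the run.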
Alignment: The reads must then be aligned against a reference genome. After alignment, the mapped reads should be sorted by position. At the genome coverage typical of WGBS experiments, PCR duplicates can be identified computationally after alignment to the reference genome with fairly high accuracy.
Quantifying DNA Methylation: Next, quantify DNA methylation at each cytosine from the aligned reads.
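Position sorting and duplicate identification can be sketched as follows. The `Alignment` record and both function names are hypothetical, and the rule used here (same chromosome, start position, and strand) is a deliberate simplification for single-end reads; established duplicate-marking tools also consider mate positions and keep the best-quality read.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Alignment(NamedTuple):
    name: str
    chrom: str
    start: int   # leftmost mapped position
    strand: str  # "+" or "-"


def sort_by_position(alignments: Iterable[Alignment]) -> List[Alignment]:
    """Sort alignments by chromosome name, then by start coordinate."""
    return sorted(alignments, key=lambda a: (a.chrom, a.start))


def remove_duplicates(alignments: Iterable[Alignment]) -> List[Alignment]:
    """Keep the first read seen for each (chromosome, start, strand) combination."""
    seen: Set[Tuple[str, int, str]] = set()
    unique: List[Alignment] = []
    for aln in sort_by_position(alignments):
        key = (aln.chrom, aln.start, aln.strand)
        if key not in seen:
            seen.add(key)
            unique.append(aln)
    return unique


reads = [
    Alignment("r1", "chr1", 100, "+"),
    Alignment("r2", "chr1", 100, "+"),  # same position and strand as r1
    Alignment("r3", "chr1", 100, "-"),
    Alignment("r4", "chr1", 50, "+"),
]
print([a.name for a in remove_duplicates(reads)])  # ['r4', 'r1', 'r3']
```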
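At a single cytosine, the methylation level is commonly estimated as the fraction of covering reads that still show a C (methylated) rather than a T (converted, unmethylated). The sketch below shows that calculation; the function name and the `min_coverage` parameter are assumptions for illustration.

```python
from typing import Optional


def methylation_level(c_count: int, t_count: int, min_coverage: int = 1) -> Optional[float]:
    """Fraction of reads supporting methylation at one cytosine.

    c_count: reads showing C at this position (methylated)
    t_count: reads showing T at this position (unmethylated, converted)
    Returns None when coverage is too low to report a level.
    """
    coverage = c_count + t_count
    if coverage == 0 or coverage < min_coverage:
        return None
    return c_count / coverage


print(methylation_level(8, 2))  # 0.8
```

Requiring a minimum coverage avoids reporting levels such as 0 or 1 that rest on a single read.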
Post Alignment Quality Control: The position of a cytosine within a sequencing read should have no bearing on its measured DNA methylation level. Methylation bias along the length of a read suggests a technical problem, such as adapter sequence that was not properly trimmed prior to alignment. Checking the mapped reads for methylation bias is therefore an essential quality check. Because BS-Seeker2 records methylation information in the XM read tag, methylation bias along reads can be evaluated efficiently.
Differential DNA Methylation: When comparing multiple samples, as is almost always done in WGBS experiments, the first step should be to identify cytosines that are differentially methylated. For WGBS experiments, DSS is the suggested tool for identifying differential methylation.
Interpretation/Data Visualization: Data visualization is an essential step in any genomic data analysis. Genome browsers are well suited to this job because various data types, such as gene annotations, differentially methylated region (DMR) positions, ChIP-seq data, and RNA-seq data, can be loaded as separate browser tracks.
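A methylation bias check can be sketched by tallying methylation calls at each read position. The sketch assumes that each XM string has one character per read base, that uppercase letters mark methylated cytosines, lowercase letters mark unmethylated cytosines, and any other character marks a non-cytosine position; it also assumes the strings are given in sequencing orientation. These are assumptions about the tag encoding that should be verified against the aligner's documentation, and the function name is hypothetical.

```python
from typing import Iterable, List, Optional


def methylation_by_read_position(xm_strings: Iterable[str]) -> List[Optional[float]]:
    """Fraction of methylated calls at each read position (None if no calls)."""
    methylated: List[int] = []
    total: List[int] = []
    for xm in xm_strings:
        for i, ch in enumerate(xm):
            if i == len(total):  # first read that reaches this position
                methylated.append(0)
                total.append(0)
            if ch.isalpha():  # a cytosine call, methylated or not
                total[i] += 1
                if ch.isupper():  # assumed: uppercase means methylated
                    methylated[i] += 1
    return [m / t if t else None for m, t in zip(methylated, total)]


# Hypothetical XM strings for three 3-base reads.
print(methylation_by_read_position(["X-x", "x-X", "X--"]))
```

In unbiased data the per-position fractions should be roughly flat; a systematic rise or fall near either end of the read points to positions worth trimming or ignoring.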
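DSS is an R package built on a beta-binomial model, so it is not reproduced here. As a much simpler stand-in, and only to show what testing a single cytosine involves, the sketch below applies a two-sided Fisher's exact test to the methylated and unmethylated read counts of two samples. The function name is hypothetical, and this test ignores biological replicates, which is one reason dedicated tools are preferred in practice.

```python
from math import comb


def fisher_exact_two_sided(meth1: int, unmeth1: int, meth2: int, unmeth2: int) -> float:
    """Two-sided Fisher's exact test p-value for a 2x2 table of read counts.

    Rows are the two samples; columns are methylated / unmethylated reads.
    """
    row1, row2 = meth1 + unmeth1, meth2 + unmeth2
    col1 = meth1 + meth2
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x methylated reads in sample 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(meth1)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Fully methylated in sample 1, fully unmethylated in sample 2 (10 reads each).
print(fisher_exact_two_sided(10, 0, 0, 10))
```

Because every cytosine is tested separately, the resulting p-values need multiple-testing correction before any site is called differentially methylated.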
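One common way to load per-cytosine methylation levels into a genome browser is a bedGraph track. The sketch below formats such a track, assuming the input is a list of (chromosome, 1-based position, methylation level) tuples; the function name and the default track name are hypothetical.

```python
from typing import Iterable, Tuple


def to_bedgraph(sites: Iterable[Tuple[str, int, float]], track_name: str = "methylation") -> str:
    """Format (chrom, 1-based position, level) tuples as a bedGraph track."""
    lines = [f'track type=bedGraph name="{track_name}"']
    for chrom, pos, level in sorted(sites):
        # bedGraph intervals are 0-based and half-open, so a single base at
        # 1-based position `pos` becomes the interval [pos - 1, pos).
        lines.append(f"{chrom}\t{pos - 1}\t{pos}\t{level:.3f}")
    return "\n".join(lines) + "\n"


print(to_bedgraph([("chr1", 101, 0.8), ("chr1", 11, 0.25)]), end="")
```

The resulting text can be saved to a file and added as a custom track alongside gene annotations and other data.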
References