Epigenetics Data Analysis: The Steps in Analyzing ChIP-Seq Datasets

Chromatin immunoprecipitation (ChIP), also referred to as binding site analysis, is an effective tool for studying protein-DNA interactions in vivo. It is mostly used to investigate transcription factor binding sites or sites of specific histone modifications. ChIP-Seq combines ChIP with next-generation sequencing to identify DNA regions that interact with histones and transcription factors across the entire genome. ChIP sequencing (ChIP-Seq) works by first using ChIP to enrich the DNA fragments bound by the target protein, then purifying those fragments and constructing a high-throughput sequencing library from them before sequencing.

Figure 1. ChIP-seq workflow and data analysis. (Orlov et al., 2012)

Steps in Analyzing ChIP-Seq Datasets

Map the reads back to the reference genome

Mapping the reads to a reference genome is almost always the first step in a ChIP-seq data analysis. The aim of this step is to find all the regions in a reference genome that are perfect or close matches to each short read in the dataset.

Several programs can map short reads to a reference genome. Mapping is a well-defined task: given the raw reads, the reference genome assembly, and the number of mismatches allowed, there is in principle only one right outcome. As a result, the choice of short-read mapping software matters little for the mapping result itself. The main distinctions between the programs are their algorithms and computational efficiency. Bowtie is one of the fastest short-read mapping programs, Maq can take advantage of read quality scores, and SeqMap takes insertions and deletions (indels) into account.
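The idea that reads, reference, and allowed mismatches together determine the result can be illustrated with a deliberately naive sketch. It is not how Bowtie, Maq, or SeqMap work internally (those use indexed algorithms); the function names, the toy reference, and the toy reads below are all hypothetical, and only substitutions are considered, not indels.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches=1):
    """Return every 0-based reference position where `read` aligns
    with at most `max_mismatches` substitutions (indels are ignored)."""
    n = len(read)
    hits = []
    for pos in range(len(reference) - n + 1):
        if hamming(read, reference[pos:pos + n]) <= max_mismatches:
            hits.append(pos)
    return hits


# Toy data (hypothetical): a 24-base "genome" and three short reads.
reference = "ACGTACGTTAGCCGATAGCTAGGA"
reads = ["TAGCCG", "TAGCTA", "GGGGGG"]

# Exhaustive scan: the same inputs always give the same set of positions.
mapping = {r: map_read(r, reference, max_mismatches=1) for r in reads}

for read, positions in mapping.items():
    print(read, positions)
```

Because the scan is exhaustive, any correct mapper given the same three inputs would report the same positions; real tools differ mainly in how quickly they get there and in what extra information (quality scores, indels) they use.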

Background estimation

According to the ChIP-seq protocol, all reads should ideally be sequenced from the ends of the ChIP fragments that were bound by the target transcription factor (TF). Even so, a significant portion of reads in any ChIP-seq dataset may not have come from these ChIP fragments. Library contamination, PCR amplification bias, linker/adapter contamination, and image processing errors are all possible sources of extraneous reads. Furthermore, sequencing errors may cause a read that originated in one part of the genome to be uniquely mapped to a different part of the genome with a similar sequence.
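One simple way to reason about extraneous reads is to ask how many reads a window would collect by chance alone. The sketch below assumes, purely for illustration, that background reads fall uniformly across the genome and follow a Poisson distribution; this model is an assumption added here, not something the text above prescribes, and the read count, window size, and genome size are made-up numbers.

```python
import math


def expected_background(total_reads, window_size, genome_size):
    """Expected reads per window if reads were spread uniformly
    (an illustrative assumption, not a measured background)."""
    return total_reads * window_size / genome_size


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


# Hypothetical experiment: 1 million mapped reads, 200 bp windows,
# 100 Mb mappable genome.
lam = expected_background(1_000_000, 200, 100_000_000)
p_value = poisson_sf(10, lam)  # chance of seeing >= 10 reads in one window

print(f"expected background reads per window: {lam}")
print(f"P(count >= 10 by chance): {p_value:.2e}")
```

In practice a matched control sample (for example input DNA) usually gives a better background estimate than a uniform model, because real background is not evenly distributed.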

Peak Calling

Peak calling is the most important task in the ChIP-seq data analysis pipeline. Its purpose is to find the genomic regions that are enriched in ChIP signal. To put it another way, where did the TF bind? The procedure is called peak calling because the count of reads from the same DNA strand in a TF-bound fragment should show a peak near the binding site.

If a protein factor has a sharply focused binding site, a successful experiment should show a pair of peaks (a bimodal, "two-horned" profile). Both the Watson and the Crick strands will produce bell-shaped peaks. This is because a fragment is always sequenced from its ends toward the middle. The 5′ end of a ChIP fragment is represented by a Watson read, while the 3′ end is represented by a Crick read. As a result, the Watson and Crick peaks lie on opposite sides of the TF-binding site (TFBS), and the two peaks together can be used to identify a potential binding region. If the ChIP-seq experiment fails because of the antibody's low affinity, the result will be either extremely high, block-shaped peaks at repetitive regions or a relatively flat signal across the whole genome.
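The geometry of the two strand-specific peaks can be sketched in a few lines. This is a minimal illustration, assuming reads have already been mapped and reduced to start coordinates per strand; the binning approach, the function names, and the coordinates are hypothetical and do not reproduce any particular peak caller.

```python
from collections import Counter


def strand_summit(starts, bin_size=10):
    """Centre of the bin that holds the most read starts.
    Ties are broken in favour of the leftmost bin."""
    bins = Counter(p // bin_size for p in starts)
    best_bin = max(bins, key=lambda b: (bins[b], -b))
    return best_bin * bin_size + bin_size // 2


def estimate_binding_site(watson_starts, crick_starts, bin_size=10):
    """Place the putative binding site midway between the Watson
    summit (upstream) and the Crick summit (downstream)."""
    w = strand_summit(watson_starts, bin_size)
    c = strand_summit(crick_starts, bin_size)
    return {
        "watson_summit": w,
        "crick_summit": c,
        "site": (w + c) // 2,
        "shift": c - w,  # rough estimate of the fragment length
    }


# Toy read-start coordinates (hypothetical) around one binding site.
watson = [880, 905, 910, 912, 915, 918, 921, 930]
crick = [1070, 1085, 1090, 1092, 1095, 1101, 1110]

result = estimate_binding_site(watson, crick)
print(result)
```

The distance between the two summits approximates the ChIP fragment length, which is why many peak callers shift or extend reads by that amount before combining the strands.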

Gene assignment and peak annotation

After obtaining a set of peak coordinates, it is critical to investigate the biological implications of the protein–DNA binding events. Specific questions typically arise, such as: what are the genomic annotations of these peak regions, and what are their features?
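A common first annotation is to assign each peak to the gene with the nearest transcription start site (TSS). The sketch below assumes a single chromosome, ignores gene strand, and uses made-up gene names and coordinates; `nearest_gene` is a hypothetical helper, not part of any annotation package.

```python
from bisect import bisect_left


def nearest_gene(peak_pos, tss_sorted):
    """Assign a peak to the closest TSS.

    tss_sorted: non-empty list of (tss_position, gene_name) tuples,
    sorted by position, all on the same chromosome.
    Returns (gene_name, signed offset of the peak from the TSS).
    """
    positions = [p for p, _ in tss_sorted]
    i = bisect_left(positions, peak_pos)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = tss_sorted[max(0, i - 1): i + 1]
    tss, gene = min(candidates, key=lambda t: abs(t[0] - peak_pos))
    return gene, peak_pos - tss


# Hypothetical annotation and peak summits.
tss_table = [(1000, "geneA"), (5000, "geneB"), (12000, "geneC")]
peaks = [500, 1200, 4900, 9000, 20000]

assignments = {p: nearest_gene(p, tss_table) for p in peaks}
for peak, (gene, offset) in assignments.items():
    print(peak, gene, offset)
```

The signed offset makes it easy to ask follow-up questions, such as what fraction of peaks fall within a chosen distance of a TSS; a strand-aware version would flip the sign for genes on the reverse strand.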

De novo motif assessment

De novo motif discovery is a crucial step in the analysis of predicted peak regions. In some studies the actual sequence to which the TF binds is already known or, even better, a set of verified binding sites is available.
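When no binding sequence is known in advance, one crude way to look for a candidate motif is to ask which short words (k-mers) are over-represented in peak sequences relative to background sequences. The sketch below is only an illustration of that idea: the sequences are invented, the pseudocount-based ratio is an arbitrary choice, and dedicated motif-discovery tools rely on more rigorous statistical models.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def enriched_kmers(peak_seqs, background_seqs, k=4, pseudocount=1.0):
    """Rank k-mers by (peak frequency / background frequency).
    The pseudocount avoids division by zero for unseen k-mers."""
    fg = kmer_counts(peak_seqs, k)
    bg = kmer_counts(background_seqs, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    scores = {}
    for kmer, n in fg.items():
        fg_freq = (n + pseudocount) / (fg_total + pseudocount)
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount)
        scores[kmer] = fg_freq / bg_freq
    # Highest ratio first; alphabetical order breaks ties.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sequences: each peak sequence contains "TGAC".
peak_seqs = ["AATGACTT", "CCTGACGG", "GTTGACAA"]
background_seqs = ["AACCGGTT", "GGTTAACC", "CATGCATG"]

ranking = enriched_kmers(peak_seqs, background_seqs, k=4)
print(ranking[:3])
```

A k-mer shared by many peaks but rare in the background rises to the top of the ranking; with real data the candidate would then be compared against known motifs or validated binding sites where those exist.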

About CD Genomics Bioinformatics Analysis

The bioinformatics analysis department of CD Genomics provides novel solutions for data-driven innovation aimed at discovering the hidden potential in biological data, tapping new insights related to life science research, and predicting new prospects.


  1. Stark R, Hadfield J. Characterization of DNA-protein interactions: design and analysis of ChIP-seq experiments. In: Field Guidelines for Genetic Experimental Designs in High-Throughput Sequencing. Springer, Cham. 2016, pp. 223-260.
  2. Bailey T, Krajewski P, Ladunga I, et al. Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Comput Biol. 2013, 9(11).
  3. Orlov Y, Xu H, Afonnikov D, et al. Computer and statistical analysis of transcription factor binding and chromatin modifications by ChIP-seq data in embryonic stem cell. Journal of Integrative Bioinformatics. 2012, 9(2).
* For Research Use Only. Not for use in diagnostic procedures.