Hi-C Sequencing Data Analysis: Introduction, Methods, and Protocol

Introduction to Hi-C Sequencing

Hi-C, a genome-wide (all-to-all) version of 3C, was published by Lieberman-Aiden et al. in 2009. Up to the restriction enzyme digestion step, the procedure is similar to 3C. In Hi-C, however, the overhangs are filled in with biotinylated residues and blunt-end ligated (as opposed to the sticky-end ligation used in 3C). To enrich for DNA fragments that include a ligation junction, the DNA is sheared by sonication and the biotinylated fragments are pulled down with streptavidin beads. The fragments are then PCR-amplified using generic primers and sequenced on a next-generation sequencing platform. The main stages of the Hi-C protocol are depicted in Figure 1.

Figure 1. Overview of the Hi-C Protocol (Lieberman-Aiden, 2009)

Hi-C sequencing generates a genome-wide interaction map that covers all restriction sites on all chromosomes. Unlike ChIA-PET, Hi-C is site-neutral, meaning it can detect interactions between any two loci in the genome. Although this provides the most comprehensive signal, the immense complexity of Hi-C libraries limits the sequencing depth available for any given pair of restriction sites. Because the number of potential interactions scales quadratically with the number of restriction sites, a mammalian genome could have hundreds of billions of possible fragment-to-fragment interactions. As a result, with current sequencing technologies Hi-C can produce a high-resolution (2-4 kb) interaction matrix for smaller genomes (such as yeast), but in mammalian cells it is better suited to identifying interactions between larger (>100 kb) genomic regions. Like all 'C'-type assays, Hi-C is also affected by noisy ligation events and coverage biases.
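The quadratic scaling can be made concrete with a rough back-of-the-envelope calculation. The sketch below is illustrative only: the genome sizes, the 6 bp recognition site, and the assumption that a site occurs about once every 4^6 bp in random sequence are simplifications chosen for the example, and the function names are hypothetical.

```python
def expected_fragment_count(genome_size_bp: int, site_length_bp: int) -> int:
    """Rough fragment count, assuming random sequence with equal base frequencies.

    Under that assumption a recognition site of length k occurs about
    once every 4**k bp.
    """
    return genome_size_bp // (4 ** site_length_bp)


def possible_pairs(n_fragments: int) -> int:
    """Number of distinct unordered fragment pairs: n * (n - 1) / 2."""
    return n_fragments * (n_fragments - 1) // 2


# Assumed, approximate values for illustration: a ~3.1 Gb mammalian genome
# and a ~12 Mb yeast genome, both cut with a 6 bp recognition site.
human_fragments = expected_fragment_count(3_100_000_000, 6)
human_pairs = possible_pairs(human_fragments)

yeast_fragments = expected_fragment_count(12_000_000, 6)
yeast_pairs = possible_pairs(yeast_fragments)

print(f"mammal: {human_fragments:,} fragments, {human_pairs:.2e} possible pairs")
print(f"yeast:  {yeast_fragments:,} fragments, {yeast_pairs:.2e} possible pairs")
```

Under these assumptions the mammalian genome yields on the order of 10^11 possible fragment pairs, several orders of magnitude more than yeast, which is why the same number of sequenced reads supports a much finer matrix for the smaller genome.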

Figure 2. A schematic representation of Hi-C data analysis (Pal, 2019)

Availability of Hi-C Data: Increasing Size and Resolution

Hi-C data enables multi-scale analysis of the genome's 3D organization. On a large scale, the genome is divided into distinct "compartments". Analysis of Hi-C contact maps defined active (“A”) and inactive (“B”) compartments, which coincide with active and inactive chromatin domains, respectively. The active compartment is made up of genomic regions with transcriptional activity or epigenetic marks associated with open chromatin. The inactive compartment, in contrast, covers regions with dense heterochromatin and epigenetic marks that silence gene expression. At a more local scale, the contact matrix reveals topologically associating domains (TADs): regions with a high intradomain contact frequency but few interdomain contacts. On an even finer scale, Hi-C data has been used to identify specific points of contact between distant chromatin regions. When these contacts are intrachromosomal (cis), they are sometimes referred to as chromatin loops. The resolution limit of Hi-C data makes this level of analysis particularly difficult.
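The definition of a TAD (many contacts within a domain, few across its edges) can be illustrated with a simple insulation-style score. This is a minimal sketch on a toy matrix, not a method described in this document: the scoring scheme, the window size, and the function name are assumptions chosen for illustration, and real TAD callers are considerably more elaborate.

```python
def insulation_scores(matrix, window):
    """Score each potential boundary between bin i-1 and bin i.

    The score is the sum of contacts between the `window` bins just
    upstream of the boundary and the `window` bins just downstream.
    A low score means few contacts cross the boundary.
    Positions too close to the matrix edge are left as None.
    """
    n = len(matrix)
    scores = [None] * n
    for i in range(window, n - window + 1):
        total = 0
        for a in range(i - window, i):      # upstream bins
            for b in range(i, i + window):  # downstream bins
                total += matrix[a][b]
        scores[i] = total
    return scores


# Toy contact matrix (assumed values): two domains of four bins each,
# with 10 contacts inside a domain and 1 contact between domains.
domain_of = [0, 0, 0, 0, 1, 1, 1, 1]
toy = [[10 if domain_of[a] == domain_of[b] else 1 for b in range(8)]
       for a in range(8)]

scores = insulation_scores(toy, window=2)
valid = [i for i, s in enumerate(scores) if s is not None]
boundary = min(valid, key=lambda i: scores[i])

print(scores)
print("lowest-scoring boundary is before bin", boundary)
```

In this toy case the score reaches its minimum at the position where the second domain starts, because the window on either side of that position contains only interdomain contacts.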

The restriction enzyme used in the experiment and the sequencing depth are the main determinants of Hi-C data resolution. Over the years, efforts to improve resolution by adjusting these parameters have produced datasets of increasing size and resolution, with extremely high numbers of sequenced reads, particularly for mammalian genomes. In addition, protocol variants designed to enhance resolution have been proposed.

Analysis of Hi-C Data: FASTQ to Interaction Maps

Hi-C data analysis is a multi-step process that can be divided into two parts: preprocessing (from raw data to the Hi-C contact matrix) and downstream analyses.

Preprocessing begins with FASTQ files of paired-end reads obtained from high-throughput sequencing. The reads are (1) aligned to the reference genome and (2) filtered to remove spurious signal; the read counts are then (3) binned and (4) normalized. The last two steps are frequently performed together, but they involve different choices that affect the properties of the normalized contact matrix obtained as the final output.

Because of the peculiarity of these data, the two mates of a Hi-C read pair are aligned individually, as they are expected to map to different, unrelated regions of the genome. The alignment can be performed with standard tools such as bowtie or bwa, with the goal of aligning the entire read, as was done in earlier analysis pipelines. The aligned reads are then filtered to remove spurious signal caused by experimental artifacts. While read filtering is a common step in many high-throughput sequencing applications, it is especially important for Hi-C data because several steps in the experimental procedure can introduce biases into the sequencing results.
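As a rough illustration of the filtering step, the sketch below applies three commonly used kinds of filter to already-aligned read pairs: low mapping quality, both mates falling on the same restriction fragment, and exact positional duplicates. The specific filters, the MAPQ threshold of 30, the `ReadPair` record, and the assumption that each mate has already been assigned a per-chromosome fragment index are all illustrative choices, not the procedure of any particular pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Hypothetical record for one aligned read pair (both mates mapped)."""
    chrom1: str
    pos1: int
    frag1: int  # index of the restriction fragment containing mate 1
    mapq1: int
    chrom2: str
    pos2: int
    frag2: int  # index of the restriction fragment containing mate 2
    mapq2: int


def filter_pairs(pairs, min_mapq=30):
    """Keep pairs that pass three simple, assumed filters."""
    kept = []
    seen = set()
    for p in pairs:
        # 1. Drop pairs where either mate is ambiguously aligned.
        if p.mapq1 < min_mapq or p.mapq2 < min_mapq:
            continue
        # 2. Drop pairs whose mates lie on the same fragment: they do not
        #    report a ligation between two different fragments.
        if p.chrom1 == p.chrom2 and p.frag1 == p.frag2:
            continue
        # 3. Drop exact positional duplicates, which are likely PCR copies.
        #    (Simplified: strand is ignored here.)
        key = (p.chrom1, p.pos1, p.chrom2, p.pos2)
        if key in seen:
            continue
        seen.add(key)
        kept.append(p)
    return kept


pairs = [
    ReadPair("chr1", 1000, 0, 60, "chr1", 500000, 120, 60),  # valid
    ReadPair("chr1", 1000, 0, 60, "chr1", 500000, 120, 60),  # duplicate
    ReadPair("chr1", 2000, 0, 60, "chr1", 2300, 0, 60),      # same fragment
    ReadPair("chr1", 1000, 0, 5, "chr2", 8000, 1, 60),       # low MAPQ
    ReadPair("chr1", 9000, 2, 60, "chr2", 7000, 1, 60),      # valid
]
kept = filter_pairs(pairs)
print(f"kept {len(kept)} of {len(pairs)} pairs")
```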

Although reads are mapped and counted on individual restriction fragment ends, Hi-C data is rarely analyzed at this level. Rather, read counts are typically summed at the level of genomic bins, which are a contiguous partitioning of the genome into fixed-size intervals.
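Binning can be sketched in a few lines: each mate position is mapped to a fixed-size bin by integer division, and the pair is counted once for the resulting bin pair. The input layout (a tuple of chromosome and position for each mate), the 100 kb bin size, and the function name are assumptions made for this example.

```python
from collections import Counter


def bin_contacts(pairs, bin_size):
    """Count read pairs per pair of fixed-size genomic bins.

    `pairs` is an iterable of (chrom1, pos1, chrom2, pos2) tuples.
    The two bins are sorted so that (A, B) and (B, A) share one key,
    which keeps the resulting contact map symmetric.
    """
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in pairs:
        bin_a = (chrom1, pos1 // bin_size)
        bin_b = (chrom2, pos2 // bin_size)
        counts[tuple(sorted((bin_a, bin_b)))] += 1
    return counts


# Toy read pairs (assumed values), binned at 100 kb.
read_pairs = [
    ("chr1", 1_000, "chr1", 500_000),
    ("chr1", 520_000, "chr1", 90_000),  # same two bins, mates swapped
    ("chr1", 9_000, "chr2", 7_000),     # interchromosomal (trans) contact
]
contacts = bin_contacts(read_pairs, bin_size=100_000)

for (bin_a, bin_b), count in sorted(contacts.items()):
    print(bin_a, bin_b, count)
```

The bin size is the main design choice here: smaller bins give finer resolution but spread the same reads over many more bin pairs, so each count becomes sparser and noisier.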

Normalization is the last step of preprocessing. Binning and normalization of read counts are frequently combined and performed by the same tools. There are two types of Hi-C normalization strategies: explicit and implicit (or matrix-balancing) normalization. Explicit approaches model a predefined set of known bias sources. Implicit, or matrix-balancing, approaches instead make no assumptions about the sources of bias in Hi-C read counts.
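The idea behind matrix balancing can be sketched with a simple iterative correction: each bin is assumed to carry one multiplicative bias, and rows and columns are repeatedly rescaled until every bin has the same total number of contacts. The code below is a minimal sketch under that assumption, written in plain Python on a toy matrix; the function name, iteration limit, and tolerance are illustrative, and it is not the implementation used by any specific tool.

```python
def balance(matrix, max_iter=200, tol=1e-9):
    """Iteratively rescale a symmetric contact matrix to equal row sums.

    Returns the balanced matrix and the accumulated per-bin bias, so that
    balanced[i][j] == matrix[i][j] / (bias[i] * bias[j]).
    """
    n = len(matrix)
    work = [[float(v) for v in row] for row in matrix]
    bias = [1.0] * n
    for _ in range(max_iter):
        row_sums = [sum(row) for row in work]
        nonzero = [s for s in row_sums if s > 0]
        if not nonzero:
            break
        mean_sum = sum(nonzero) / len(nonzero)
        # Bins with above-average coverage get a factor > 1, and vice versa.
        factors = [s / mean_sum if s > 0 else 1.0 for s in row_sums]
        for i in range(n):
            for j in range(n):
                work[i][j] /= factors[i] * factors[j]
            bias[i] *= factors[i]
        if max(abs(f - 1.0) for f in factors) < tol:
            break
    return work, bias


# Toy example (assumed values): an unbiased matrix with equal row sums,
# distorted by a known per-bin bias.
true_bias = [1.0, 2.0, 4.0]
unbiased = [[4, 1, 1], [1, 4, 1], [1, 1, 4]]
raw = [[unbiased[i][j] * true_bias[i] * true_bias[j] for j in range(3)]
       for i in range(3)]

balanced, bias = balance(raw)
row_sums = [sum(row) for row in balanced]
print("row sums after balancing:", [round(s, 6) for s in row_sums])
print("estimated bias ratios:", [round(b / bias[0], 6) for b in bias])
```

Note that the procedure never needs to know what caused the bias. It only relies on the assumption that all bins should be equally visible, which is what distinguishes it from the explicit approaches.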

References

Zhou Y, Cheng X, Yang Y, et al. Modeling and analysis of Hi-C data by HiSIF identifies characteristic promoter-distal loops. Genome Medicine. 2020, 12(1).
Pal K, Forcato M, Ferrari F. Hi-C analysis: from data generation to integration. Biophysical Reviews. 2019, 11(1):67-78.
Lieberman-Aiden E, Van Berkum NL, Williams L, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009, 326(5950):289-293.
