Copy Number Variations (CNVs) are genomic alterations that involve changes in the number of copies of specific DNA segments. They play a significant role in human health and disease, making CNV analysis an important field of genomic research. In this beginner's guide, we will introduce you to the fundamental concepts and steps involved in CNV analysis, empowering you to understand and explore this exciting area of study.
Data Preprocessing
The first step in CNV analysis is to preprocess the raw data. This typically involves quality control measures to ensure the reliability of the data, such as filtering out low-quality reads or probes, removing batch effects, and correcting for biases introduced during data generation or sequencing.
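As a minimal sketch of one such quality control step, the snippet below drops reads whose mean Phred base quality falls below a cutoff. The read representation, the function name `filter_low_quality_reads`, and the cutoff of 20 are assumptions made for illustration; real pipelines usually rely on dedicated QC tools.

```python
from statistics import mean

def filter_low_quality_reads(reads, min_mean_quality=20):
    """Keep reads whose mean Phred base quality meets the cutoff.

    `reads` is a list of dicts with hypothetical keys "name" and "qualities";
    the default cutoff of 20 is an illustrative assumption.
    """
    return [r for r in reads if mean(r["qualities"]) >= min_mean_quality]

# Toy reads with made-up per-base quality scores.
reads = [
    {"name": "read1", "qualities": [35, 36, 34, 30]},
    {"name": "read2", "qualities": [10, 12, 8, 15]},
    {"name": "read3", "qualities": [25, 22, 19, 20]},
]

kept = filter_low_quality_reads(reads)
print([r["name"] for r in kept])
```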
Read Alignment/Probe Mapping
In CNV analysis using next-generation sequencing (NGS) data, the short sequence reads are aligned (mapped) to a reference genome using alignment tools such as the Burrows-Wheeler Aligner (BWA) or Bowtie. In microarray-based approaches, the probe intensities are mapped to specific genomic locations.
CNV Detection
Once the data is preprocessed and mapped, the next step is to detect CNVs. There are various algorithms and tools available for CNV detection, and the choice of method depends on the data type (NGS or microarray) and the specific goals of the analysis. Commonly used algorithms include segmentation-based methods (e.g., circular binary segmentation) and statistical tests (e.g., Fisher's exact test, t-test) to identify regions with significant copy number changes.
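To give a feel for segmentation, here is a deliberately simplified sketch that finds a single change point in a series of log2 copy-number ratios. It is plain binary segmentation with a weighted difference-of-means score, not circular binary segmentation itself, and the function name, the `min_size` parameter, and the data are illustrative assumptions.

```python
from math import sqrt
from statistics import mean

def best_breakpoint(log2_ratios, min_size=2):
    """Return (index, score) of the split that best separates two mean levels.

    The score is |mean(left) - mean(right)| scaled by sqrt(n_l * n_r / n),
    which favours balanced, well-supported splits. This is a simplified
    stand-in for segmentation, not an implementation of CBS.
    """
    n = len(log2_ratios)
    best_idx, best_score = None, 0.0
    for i in range(min_size, n - min_size + 1):
        left, right = log2_ratios[:i], log2_ratios[i:]
        score = abs(mean(left) - mean(right)) * sqrt(len(left) * len(right) / n)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score

# Made-up log2 ratios: four roughly neutral bins followed by four gained bins.
ratios = [0.0, 0.1, -0.1, 0.05, 0.6, 0.55, 0.65, 0.6]
idx, score = best_breakpoint(ratios)
print(idx, round(score, 3))
```

A full segmentation method would apply such a search recursively and test each candidate split for statistical significance before accepting it.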
CNV Characterization
After CNVs are detected, they need to be characterized to determine their boundaries and functional impact. This step involves defining the breakpoints of the CNVs and annotating them with relevant genomic features such as genes, exons, regulatory elements, or repetitive elements. Genome browsers such as the University of California Santa Cruz (UCSC) Genome Browser can be used to visualize and annotate CNVs against the reference assemblies maintained by the Genome Reference Consortium (GRC).
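At its core, annotation is an interval-overlap lookup. The sketch below assumes half-open coordinates and a small, made-up gene table; the gene names, coordinates, and the function name `annotate_cnv` are hypothetical and serve only to show the overlap test.

```python
def annotate_cnv(cnv, genes):
    """Return names of genes that overlap a CNV.

    `cnv` is (chrom, start, end); `genes` is a list of
    (chrom, start, end, name). Intervals are treated as half-open.
    """
    chrom, start, end = cnv
    return [
        name
        for g_chrom, g_start, g_end, name in genes
        if g_chrom == chrom and g_start < end and start < g_end
    ]

# Hypothetical gene table; names and coordinates are invented.
genes = [
    ("chr1", 1_000, 5_000, "GENE_A"),
    ("chr1", 8_000, 12_000, "GENE_B"),
    ("chr2", 1_000, 5_000, "GENE_C"),
]

hits = annotate_cnv(("chr1", 4_500, 9_000), genes)
print(hits)
```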
CNV Visualization
Visualization complements annotation in CNV calling pipelines. Plotting CNV calls alongside genomic features such as genes, exons, regulatory regions, or known functional elements makes it easier to check breakpoints and to judge which elements a CNV affects. The UCSC Genome Browser is often used to display CNVs in the context of these annotations, typically against a reference assembly maintained by the Genome Reference Consortium (GRC).
CNV Interpretation
The final step is to interpret the detected CNVs in the context of the research question or clinical application. This involves assessing the functional impact of CNVs on gene expression, protein function, or regulatory elements. Functional enrichment analysis can be performed to identify biological pathways or gene sets that are enriched for CNVs. Additionally, association studies can be conducted to determine if CNVs are associated with specific phenotypes or diseases.
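One common way to frame functional enrichment is a one-sided hypergeometric test: given N genes in total, K of them in a pathway, and n genes affected by CNVs, how likely is it to see at least k pathway genes among the affected ones by chance? The sketch below is one possible formulation; the function name and the numbers are illustrative assumptions.

```python
from math import comb

def hypergeom_enrichment_p(total_genes, pathway_genes, cnv_genes, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    total_genes   : N, size of the gene universe
    pathway_genes : K, genes annotated to the pathway
    cnv_genes     : n, genes affected by CNVs
    overlap       : k, CNV-affected genes that are in the pathway
    """
    upper = min(pathway_genes, cnv_genes)
    favourable = sum(
        comb(pathway_genes, i) * comb(total_genes - pathway_genes, cnv_genes - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(total_genes, cnv_genes)

# Toy numbers: 10 genes, 3 in the pathway, 3 hit by CNVs, all 3 overlapping.
p_value = hypergeom_enrichment_p(10, 3, 3, 3)
print(p_value)
```

In practice such p-values are computed for many pathways at once and then corrected for multiple testing.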
Computational approaches play a crucial role in CNV analysis by leveraging various algorithms and tools to detect, analyze, and interpret CNV events. Here are some common computational approaches used in CNV analysis:
Read Depth Methods
These methods utilize the depth of sequencing coverage to infer copy number changes. By comparing the observed read depth at a given genomic location to the expected read depth, CNV events can be detected. Examples of read depth-based algorithms include CNVnator, FREEC, and CoNIFER.
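The comparison of observed and expected depth can be sketched as a per-bin log2 ratio. In this minimal example the expected depth is taken to be the sample median, a pseudocount avoids taking the log of zero, and the gain and loss thresholds are illustrative assumptions rather than values used by any of the tools named above.

```python
from math import log2
from statistics import median

def call_bins(depths, gain_threshold=0.3, loss_threshold=-0.4, pseudocount=0.5):
    """Label each bin as gain, loss, or neutral from its log2 depth ratio.

    Expected depth is approximated by the median across bins; the
    thresholds and the pseudocount are assumptions for this sketch.
    """
    expected = median(depths)
    calls = []
    for depth in depths:
        ratio = log2((depth + pseudocount) / (expected + pseudocount))
        if ratio >= gain_threshold:
            calls.append("gain")
        elif ratio <= loss_threshold:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls

# Made-up read counts per fixed-size genomic bin.
depths = [30, 31, 29, 60, 62, 30, 14, 15, 30]
calls = call_bins(depths)
print(calls)
```

Real read-depth callers also correct for GC content and mappability and merge adjacent bins into segments before reporting CNVs.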
Paired-End Mapping
Paired-end sequencing reads provide information about the distance between the ends of DNA fragments. CNV detection algorithms can use the abnormal insert size or discordant read pairs to identify structural variations. Popular tools in this category include BreakDancer, DELLY, and GASVPro.
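A rough sketch of the insert-size idea: estimate the typical insert size from the library itself and flag pairs that deviate strongly from it. The use of the median and median absolute deviation (MAD), the multiplier of 6, and the function name are assumptions for illustration, not the approach of any specific tool.

```python
from statistics import median

def flag_discordant_pairs(insert_sizes, n_mads=6):
    """Return (index, label) for pairs with abnormal insert sizes.

    A pair is flagged when its insert size lies more than n_mads median
    absolute deviations from the library median. Larger-than-expected
    inserts may point to a deletion in the sample, smaller ones to an
    insertion; both need further evidence before calling a variant.
    """
    centre = median(insert_sizes)
    mad = median(abs(size - centre) for size in insert_sizes)
    cutoff = n_mads * mad
    flagged = []
    for index, size in enumerate(insert_sizes):
        if size - centre > cutoff:
            flagged.append((index, "larger_than_expected"))
        elif centre - size > cutoff:
            flagged.append((index, "smaller_than_expected"))
    return flagged

# Made-up insert sizes in base pairs for eight read pairs.
insert_sizes = [300, 310, 295, 305, 900, 298, 302, 120]
flagged = flag_discordant_pairs(insert_sizes)
print(flagged)
```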
Figure: Four main methods for detecting CNVs with NGS data (Pirooznia et al., 2015).
Split-Read Methods
Split-read methods identify CNVs by examining reads that span a genomic breakpoint. These algorithms map the split-read segments to different genomic locations, allowing the detection of structural variations. Notable tools include Pindel, MoDIL, and SoftSV.
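The toy example below illustrates the split-read idea with exact string matching: a read that does not match the reference contiguously is cut at each possible position, and the two pieces are placed separately. The sequences, the `min_anchor` parameter, and the function name are invented for illustration; real tools work on aligner output and tolerate mismatches.

```python
def find_deletion_from_split_read(reference, read, min_anchor=5):
    """Return (deletion_start, deletion_end) implied by a split read, or None.

    The read is split into a prefix and suffix of at least min_anchor bases.
    If the prefix matches the reference and the suffix matches further
    downstream with a gap in between, the gap is reported as a candidate
    deletion (half-open reference coordinates). Exact matching only.
    """
    if reference.find(read) != -1:
        return None  # read maps contiguously, no breakpoint implied
    for split in range(min_anchor, len(read) - min_anchor + 1):
        prefix, suffix = read[:split], read[split:]
        prefix_pos = reference.find(prefix)
        if prefix_pos == -1:
            continue
        prefix_end = prefix_pos + len(prefix)
        suffix_pos = reference.find(suffix, prefix_end)
        if suffix_pos > prefix_end:
            return prefix_end, suffix_pos
    return None

# Invented 30-base reference; the read skips reference positions 10 to 19.
reference = "GATTACAGGCTTCAAGTCCGATAGCTTGGA"
read = reference[2:10] + reference[20:28]

deletion = find_deletion_from_split_read(reference, read)
print(deletion)
```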
Assembly-Based Methods
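These approaches leverage de novo assembly of sequencing reads to detect CNVs. Reads are first assembled into contigs without relying on a reference; the contigs are then compared against the reference genome, and differences between the two indicate potential CNV events. Cortex is one example. SVDetect and VariationHunter are sometimes listed alongside it, although both rely mainly on paired-end mapping signals.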
Hybrid Approaches
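Some CNV analysis methods combine multiple types of evidence, such as read depth, paired-end mapping, and split-read analysis, to improve sensitivity and accuracy. LUMPY, for example, integrates paired-end and split-read signals in a single probabilistic framework. CNVkit and Control-FREEC are built mainly on read depth: CNVkit combines on-target and off-target reads, and Control-FREEC can additionally use B-allele frequencies.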
Machine Learning and Statistical Methods
Computational approaches also employ machine learning algorithms and statistical models to classify and interpret CNV events. These methods can utilize various features, including read depth, mapping quality, sequence composition, and more. Examples include ADTEx, ExomeCNV, and XHMM.
Visualization tools allow researchers to visually explore CNV data, compare different samples, and identify potential patterns or recurrent CNVs. They facilitate the interpretation of CNV findings and aid in identifying potentially relevant genomic regions or genes affected by CNVs. Popular visualization tools include Integrative Genomics Viewer (IGV), UCSC Genome Browser, and NCBI Genome Data Viewer (GDV).
Gene Expression Integration
Comparing CNV data with gene expression profiles allows researchers to identify genes whose expression levels may be influenced by CNVs. Correlating CNV status with gene expression changes can provide insights into the functional consequences of CNVs on gene regulation. Tools like GenePattern and R packages (e.g., limma) offer functionalities for integrating CNV and gene expression data, and public resources such as The Cancer Genome Atlas (TCGA) provide matched CNV and expression datasets.
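A simple way to correlate CNV status with expression is a per-gene Pearson correlation across samples. The sketch below computes it with the standard library only; the copy-number and expression values are made up, and a real analysis would also test significance and correct for multiple genes.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Invented values for one gene across five samples.
copy_number = [1, 2, 2, 3, 4]
expression = [5.1, 9.8, 10.4, 15.2, 19.9]

r = pearson(copy_number, expression)
print(round(r, 3))
```

A strongly positive coefficient is consistent with a dosage effect, but it does not by itself show that the CNV causes the expression change.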
DNA Methylation Integration
DNA methylation is an important epigenetic modification that can impact gene expression. Integrating CNV data with DNA methylation profiles enables researchers to explore the relationship between CNVs and epigenetic regulation. Analysis tools such as methylKit, DMAP, or R packages (e.g., minfi) can be used for DNA methylation analysis and integration with CNV data.
Proteomic Data Integration
CNVs can influence protein expression levels and protein-protein interactions. Integrating CNV data with proteomic data allows researchers to explore the impact of CNVs on protein abundance, post-translational modifications, and protein interactions. Tools such as STRING, Cytoscape, or pathway analysis tools (e.g., DAVID, Reactome) can aid in integrating CNV and proteomic data.
Several databases and resources are available to support CNV analysis and interpretation:
Database of Genomic Variants (DGV)
DGV is a comprehensive repository of curated CNV data from various studies and populations. It provides a valuable resource for comparing CNV findings and assessing their frequency in different populations.
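When comparing a detected CNV with a catalogued one, a frequently used criterion is reciprocal overlap: the shared region must cover a minimum fraction of both intervals. The sketch below assumes both CNVs are on the same chromosome and use half-open coordinates; the function name and the example coordinates are illustrative, and the cutoff to apply is a choice left to the analyst.

```python
def reciprocal_overlap(cnv_a, cnv_b):
    """Return the reciprocal overlap fraction of two (start, end) intervals.

    The value is the smaller of the two fractions overlap/len(a) and
    overlap/len(b), so it is high only when both CNVs largely coincide.
    Both intervals are assumed to lie on the same chromosome.
    """
    (a_start, a_end), (b_start, b_end) = cnv_a, cnv_b
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))

# Invented coordinates for a detected CNV and a catalogued CNV.
detected = (100, 200)
catalogued = (150, 250)

fraction = reciprocal_overlap(detected, catalogued)
print(fraction)
```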
ClinVar
ClinVar is a public database that collects and curates genetic variants, including CNVs, along with associated clinical information. It helps researchers interpret the clinical significance of CNVs and their association with diseases or phenotypes.
DECIPHER
DECIPHER is a database and platform that allows researchers to share and analyze clinically relevant CNVs associated with developmental disorders. It facilitates collaboration and the sharing of anonymized patient data.
Reference: