Why Perform GWAS Data Mining?

A genome-wide association study (GWAS) is a statistical approach for identifying genetic loci associated with specific phenotypes by testing the relationship between single nucleotide polymorphisms (SNPs) across the genome and a trait of interest. GWAS data mining involves collecting genotype and corresponding phenotype data from large numbers of individuals to identify genetic variants significantly associated with specific phenotypes, such as disease states. By analyzing large datasets, GWAS can uncover subtle genetic risk factors that traditional methods may overlook, supporting discoveries that can lead to improved disease prevention and treatment strategies. In agricultural contexts, GWAS data mining helps identify variants underlying traits that enhance crop yields and livestock productivity, contributing to food security and sustainability. Overall, this service supports a comprehensive approach to genetics, offering insights for both healthcare and agriculture.

What Can We Offer?

We offer a comprehensive suite of GWAS data mining services, encompassing public data collection, advanced data analysis, and result interpretation. Our team specializes in the extensive gathering of publicly available datasets tailored to the client's specific interests, employing sophisticated and efficient bioinformatics analytical methods. This approach enables clients to thoroughly understand and extract valuable insights from the intricacies of GWAS data.

Workflow for GWAS Data Mining

GWAS Data Download

Public databases such as the European Variation Archive (EVA), hosted by the European Bioinformatics Institute (EMBL-EBI), and the database of Genotypes and Phenotypes (dbGaP), hosted by the National Center for Biotechnology Information (NCBI), provide Variant Call Format (VCF) files containing results from GWAS. EVA files can be downloaded from its FTP server, while dbGaP files can be accessed through its data download page. In addition, two key databases specialize in curating GWAS data: the GWAS Catalog and OpenGWAS. Both provide GWAS summary data that can be downloaded for meta-analyses and post-GWAS analyses.

Bioinformatics Analysis Content

Analysis Content	Tools and Methods
Data Preprocessing	PLINK, VCFtools, bcftools
Association Analysis	Logistic regression (binary traits), linear regression (continuous traits)
Fine Mapping	PAINTOR, FINEMAP
Polygenic Risk Score (PRS) Calculation	PLINK, PRSice
Functional Annotation and Pathway Analysis	Pathway databases (e.g., GO, KEGG, Reactome); ANNOVAR, SnpEff, or MAGMA
Mendelian Randomization Analysis	GSMR, TwoSampleMR
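
As a rough illustration of the association step for a continuous trait, the sketch below regresses a phenotype on allele dosage (0, 1, or 2 copies of the counted allele) for a single SNP. It is a minimal Python sketch under stated assumptions, not how PLINK or any other listed tool is implemented: the function name `linear_association` is hypothetical, covariates are omitted, and the P-value relies on a normal approximation that is only reasonable for large samples.

```python
import math
from statistics import NormalDist


def linear_association(dosages, phenotypes):
    """Regress a continuous phenotype on allele dosage for one SNP.

    Returns (beta, standard_error, z, p_value). The two-sided P-value uses
    a normal approximation to the t distribution (an assumption that is
    acceptable only when the sample size is large).
    """
    n = len(dosages)
    if n != len(phenotypes) or n < 3:
        raise ValueError("need matched inputs with at least 3 samples")
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    beta = sxy / sxx                      # least-squares slope
    intercept = mean_y - beta * mean_x
    rss = sum((y - intercept - beta * x) ** 2
              for x, y in zip(dosages, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of the slope
    if se == 0:
        # Perfect fit: the test statistic is unbounded.
        return beta, 0.0, math.copysign(math.inf, beta), 0.0
    z = beta / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return beta, se, z, p_value


# Invented toy data: six individuals, one SNP.
toy_dosages = [0, 1, 2, 0, 1, 2]
toy_phenotypes = [1.0, 2.1, 2.9, 1.1, 1.9, 3.0]
beta, se, z, p_value = linear_association(toy_dosages, toy_phenotypes)
```

In a real analysis the same test is repeated for every SNP, usually with covariates such as age, sex, and ancestry principal components, and binary traits are handled with logistic regression instead.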

What Are the Advantages of Our Service?

Comprehensive Data Access

Our GWAS Data Mining Service offers comprehensive data integration by providing access to a diverse array of publicly available GWAS datasets from reputable sources, enhancing research robustness. We incorporate multifaceted phenotypic information alongside standardized genetic data, facilitating cross-study comparisons and increasing statistical power for analyses.

Advanced Analytical Tools

We utilize advanced bioinformatics tools and algorithms to process and analyze GWAS data. This ensures high accuracy in data interpretation and helps uncover significant biological insights that can drive research forward.

Customizable Solutions

We understand that each research project has unique requirements. Our service offers customizable data mining solutions tailored to specific research questions, allowing clients to focus on their particular areas of interest.

Expert Support and Consultation

Our team of biological specialists provides expert guidance throughout the data analysis process. We assist clients in interpreting results and translating them into actionable insights, enhancing their research outcomes.

Integration of Multi-Omics Data

Our service integrates GWAS data with other omics data (e.g., transcriptomics and metabolomics), offering a comprehensive view of biological systems. This integration helps to link associated variants to the genes and pathways through which they may act.

High-Quality Data Processing

We adhere to rigorous quality control standards throughout the data processing pipeline, ensuring that the results are reliable and reproducible. This high-quality data processing is crucial for making informed decisions based on the analysis.

Scalable Solutions

Whether working with small pilot studies or large-scale projects, our service can scale to meet the demands of any research endeavor. We have the infrastructure and expertise to handle diverse datasets effectively.

Efficient Data Management

Our streamlined processes and robust data management systems ensure quick access to relevant data and results, saving time and resources for researchers.

Commitment to Ethical Practices

We are dedicated to conducting research with integrity and ethical considerations. Our service adheres to the highest standards of data privacy and confidentiality, ensuring that client data is handled securely.

By capitalizing on these advantages, CD Genomics' GWAS Data Mining Service enables researchers to tap into the full potential of GWAS data, assisting them in identifying genetic variations associated with specific phenotypes and diseases, thereby illuminating the biological mechanisms and processes underlying these conditions.

What Does GWAS Data Mining Reveal?

GWAS Data Download and Standardization Process

The formats of GWAS data downloaded from different databases are not entirely consistent. Common GWAS genotype data formats include PLINK binary format (.bed/.bim/.fam), VCF format, HapMap format, and simple CSV/TSV tables. We organize the data into a unified format. For studies that provide raw data, complete phenotype and genotype data are organized to facilitate further analysis by clients. The table below shows a simplified .ped file, the text genotype format used by PLINK. Such files normally have no header row; column names are shown here only to explain each column. A full .ped file also contains paternal and maternal ID columns and lists two alleles per SNP.

Table 1. GWAS Data in PLINK Text Format: Simplified .ped File

Family ID	Individual ID	Sex	Phenotype	SNP1	SNP2	SNP3	...
FAM001	IND001	1	1	A	A	G	...
FAM002	IND002	2	2	A	G	G	...
FAM003	IND003	1	1	A	G	G	...
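
The sketch below reads one line of a full PLINK .ped file (six leading columns followed by one allele pair per SNP) and converts a genotype to an allele dosage, which is one way such data might be brought into a unified numeric form. It is a minimal Python sketch: `PedRecord`, `parse_ped_line`, and `dosage` are hypothetical helper names, and the example line is invented.

```python
from typing import NamedTuple


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: int          # 1 = male, 2 = female, other values = unknown
    phenotype: str    # kept as text; -9 or 0 conventionally mean missing
    genotypes: list   # one (allele1, allele2) tuple per SNP


def parse_ped_line(line):
    """Split one whitespace-delimited .ped line into a PedRecord."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns followed by allele pairs")
    alleles = fields[6:]
    genotypes = [(alleles[i], alleles[i + 1])
                 for i in range(0, len(alleles), 2)]
    return PedRecord(fields[0], fields[1], fields[2], fields[3],
                     int(fields[4]), fields[5], genotypes)


def dosage(genotype, counted_allele):
    """Count copies of counted_allele; None if an allele is missing ('0')."""
    if "0" in genotype:
        return None
    return sum(1 for allele in genotype if allele == counted_allele)


# Invented example: three SNPs for one individual.
record = parse_ped_line("FAM001 IND001 0 0 1 1 A A G G A G")
snp3_dosage_of_g = dosage(record.genotypes[2], "G")
```

SNP names and positions are not stored in the .ped file itself; they come from the accompanying .map file, which this sketch does not read.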

Association Analysis Reveals the Relationship Between SNPs and Phenotypes

In this example, the authors conducted a multi-ancestry GWAS meta-analysis of birth weight (BW) in 153,781 individuals and identified 60 loci at which fetal genotype was associated with BW, as shown in the figure below. The larger panel is the Manhattan plot, and the smaller one is the quantile-quantile (QQ) plot. GWAS results are typically evaluated first with the QQ plot, which is an important diagnostic for false positives and false negatives in the model. In an ideal QQ plot, as illustrated in the figure below, the initial points lie on the diagonal line, and only the later points, which represent the most significant SNPs, deviate upwards from it.

Figure 1. Manhattan and QQ plots of the association analysis for BW. The plot displays association P-values (−log10 scale) for 22,434,434 SNPs against genomic position. Significant signals (P < 5×10⁻⁸) are highlighted in green for novel findings and pink for previously reported ones. The QQ plot shows observed P-values (black/red) and expected P-values (grey). (Horikoshi, 2016)
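
To make the QQ diagnostic concrete, the sketch below computes the points of a QQ plot and the genomic inflation factor (lambda), a common one-number summary of how far the observed test statistics depart from the null expectation. It is a minimal Python sketch and not taken from the cited study: the function names `qq_points` and `genomic_inflation` are hypothetical, and P-values are assumed to be two-sided, from 1-degree-of-freedom tests, and strictly greater than zero.

```python
import math
from statistics import NormalDist, median


def qq_points(p_values):
    """Return (expected, observed) -log10(P) pairs, most significant first."""
    observed = sorted(p_values)
    n = len(observed)
    # Under the null, the i-th smallest P-value is expected near (i - 0.5) / n.
    return [(-math.log10((i - 0.5) / n), -math.log10(p))
            for i, p in enumerate(observed, start=1)]


def genomic_inflation(p_values):
    """Median observed 1-df chi-square divided by its expected null median."""
    norm = NormalDist()
    # A two-sided P-value corresponds to chi-square = (normal quantile at P/2)^2.
    chi2 = [norm.inv_cdf(p / 2) ** 2 for p in p_values]
    expected_median = norm.inv_cdf(0.75) ** 2   # about 0.455
    return median(chi2) / expected_median


# Invented example: P-values spread evenly over (0, 1), as expected under the null.
null_p = [(i - 0.5) / 1000 for i in range(1, 1001)]
null_lambda = genomic_inflation(null_p)
null_points = qq_points(null_p)
```

A lambda close to 1 suggests little overall inflation, whereas values well above 1 can point to confounding such as population stratification, although polygenic traits in very large samples can also raise lambda.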

Shared Genetic Contribution of Birth Weight and Other Health-Related Traits Revealed by Genetic Association Analysis

This study employed the LD score regression method to calculate the genetic correlation between BW and other health traits. The results indicated a strong positive genetic correlation between BW and anthropometric and obesity-related traits, while showing a negative genetic correlation with adverse metabolic and cardiovascular health indicators.

Figure 2. Genetic correlations across the genome between birth weight and various traits and diseases in later life. The genetic correlation estimates (rg) are represented by color, with red indicating positive correlations and blue indicating negative correlations. (Horikoshi, 2016)

Cluster Analysis of the Overlap and Similarity of BW-Related Genes with Other Health-Related Traits

This analysis shows that many genome-wide significant signals for BW are also established association signals for various cardiometabolic traits, supporting the view that shared genetic factors contribute substantially to the negative correlation between BW and cardiometabolic risk in adulthood. The results are presented as a heatmap, and we can perform similar analyses for clients.

Figure 3. Hierarchical clustering of birth weight loci according to the similarity of their overlap with adult diseases, metabolic traits, and anthropometric traits. A positive z-score (red) signifies a positive association between the BW-increasing allele and the outcome trait, while a negative z-score (blue) reflects an inverse relationship. (Horikoshi, 2016)

Protein-Protein Interaction Network Analysis

This analysis showed that proteins in the canonical Wnt signaling pathway were found exclusively in the point-of-contact (PoC) protein-protein interaction (PPI) network for blood pressure (BP) traits. By examining these PPI overlaps, specific transcripts within BW GWAS loci that may drive mechanistic connections can be identified. For instance, the overlap between the Wnt signaling pathway and the PoC PPI network for BW and BP traits suggests FZD9 as the probable effector gene at the MLXIPL locus associated with BW.

Figure 4. Point-of-contact (PoC) PPI for BW and BP phenotypes. The canonical Wnt signaling pathway is enriched for PoC PPI connections between BW and BP-related phenotypes. (Horikoshi, 2016)

Title: Genetic underpinning of the comorbidity between type 2 diabetes and osteoarthritis

Publication: Am J Hum Genet.

Main Methods: GWAS data mining

Abstract: This study investigated the genetic basis of the comorbidity between type 2 diabetes and osteoarthritis by comparing publicly available data from two GWAS projects. The authors first estimated the genome-wide genetic correlation between the two diseases and conducted colocalization analysis. They then integrated multi-omics and functional information to interpret the colocalized signals and identify high-confidence effector genes, providing support for the epidemiological link between obesity and the two diseases.

Research Results:

Genetic Correlation Between Type 2 Diabetes and Osteoarthritis

The authors performed linkage disequilibrium (LD) score regression analysis using LDSC software to estimate the genetic correlation between each osteoarthritis phenotype and type 2 diabetes. The results indicated that, across the genome, the genetic correlation between type 2 diabetes and knee osteoarthritis was stronger than that with hip osteoarthritis.

Figure 5. Genetic correlation between type 2 diabetes and osteoarthritis. (A) Genetic correlation (rg) between type 2 diabetes (T2D) and knee or hip osteoarthritis (OA), with error bars indicating the standard error of the estimated genetic correlations. (B) Permutation-based testing results for knee OA and hip OA, with the red line representing the observed correlation for each. (Arruda, 2023)

Colocalization Analysis of Type 2 Diabetes and Osteoarthritis

The authors defined a 2 Mb (±1 Mb) region surrounding each established independent association signal for both diseases. For each osteoarthritis phenotype, regional pairwise statistical colocalization analysis with diabetes was performed using the coloc R package. The results identified 18 unique colocalized genomic loci. Among these, 10 loci colocalized with type 2 diabetes in both hip and knee osteoarthritis, 2 loci colocalized only with hip osteoarthritis, and 6 loci colocalized only with knee osteoarthritis.

Figure 6. Overview of colocalization regions between type 2 diabetes and osteoarthritis. The y-axis shows the posterior probability of a shared causal variant (PP4), while the x-axis indicates the number of variants in the 95% credible set for the causal variant. Each point represents a colocalized signal between type 2 diabetes and an osteoarthritis (OA) phenotype, with point size reflecting the number of variants in the 95% credible set from the colocalization analysis. (Arruda, 2023)
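
To show where a quantity such as PP4 comes from, the sketch below computes the five colocalization posterior probabilities from per-variant effect estimates for two traits in one region, following the published approximate Bayes factor approach that assumes at most one causal variant per trait. It is a minimal Python re-implementation for illustration only, not the coloc R package the authors used: all function names are hypothetical, and the prior effect-size SD (`prior_sd`) and the prior probabilities (`p1`, `p2`, `p12`) are assumed values that should be chosen for the study at hand.

```python
import math


def log_abf(beta, se, prior_sd=0.2):
    """Wakefield log approximate Bayes factor for one variant (prior_sd assumed)."""
    z = beta / se
    r = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return 0.5 * (math.log(1 - r) + r * z * z)


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_diff_exp(a, b):
    """log(exp(a) - exp(b)); -inf when the difference is below float resolution."""
    if b >= a:
        return -math.inf
    ratio = math.exp(b - a)
    if ratio >= 1.0:
        return -math.inf
    return a + math.log1p(-ratio)


def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Return [PP0, PP1, PP2, PP3, PP4] for two traits over the same variants."""
    if len(labf1) != len(labf2) or len(labf1) < 2:
        raise ValueError("need the same variants (at least 2) for both traits")
    sum1 = log_sum_exp(labf1)
    sum2 = log_sum_exp(labf2)
    sum12 = log_sum_exp([a + b for a, b in zip(labf1, labf2)])
    log_h = [
        0.0,                                    # H0: no association
        math.log(p1) + sum1,                    # H1: trait 1 only
        math.log(p2) + sum2,                    # H2: trait 2 only
        math.log(p1) + math.log(p2)
        + log_diff_exp(sum1 + sum2, sum12),     # H3: two distinct causal variants
        math.log(p12) + sum12,                  # H4: one shared causal variant
    ]
    total = log_sum_exp(log_h)
    return [math.exp(h - total) for h in log_h]


# Invented example: both traits peak at the third of five variants.
se = 0.02
trait1 = [log_abf(b, se) for b in [0.0, 0.01, 0.2, 0.01, 0.0]]
trait2 = [log_abf(b, se) for b in [0.0, 0.01, 0.2, 0.01, 0.0]]
posteriors = coloc_posteriors(trait1, trait2)   # posteriors[4] is PP4
```

When the two traits peak at different variants in the region, the same calculation shifts the posterior mass from PP4 to PP3 (distinct causal variants).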

Differential Gene Expression Analyses

The authors integrated multi-omics data and functional information to define 72 genes as potential effector genes for the comorbidity of type 2 diabetes and osteoarthritis, each with at least one line of evidence supporting its involvement in both diseases. Among these, 19 genes had at least three lines of evidence and were defined as high-confidence effector genes. Of the 19 high-confidence genes, 17 showed phenotypes associated with type 2 diabetes and osteoarthritis in knockout mouse models, supporting their role in the comorbidity.

Figure 7. Summary of the 19 high-confidence effector genes associated with the comorbidity of type 2 diabetes and osteoarthritis. Genes are categorized according to the joint affected by osteoarthritis. OA = osteoarthritis; T2D = type 2 diabetes; molQTLs = molecular quantitative trait loci; DEG = differentially expressed genes; KO mice = knockout mice; OMIM = Online Mendelian Inheritance in Man; HC = previously defined high-confidence effector genes; missense = missense variant. (Arruda, 2023)
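
The tiering of effector genes by number of supporting evidence lines can be sketched as a simple tally. This is a simplified Python illustration, not the authors' scoring pipeline: the function name `classify_effector_genes`, the gene labels, and the evidence assignments are all invented, and only the thresholds (at least one line for a potential gene, at least three for a high-confidence gene) come from the description above.

```python
def classify_effector_genes(evidence, min_lines=1, high_confidence_lines=3):
    """Tier genes by how many distinct lines of evidence support them.

    `evidence` maps a gene label to a set of evidence-line names.
    Returns (potential_genes, high_confidence_genes) as sorted lists.
    """
    potential = sorted(gene for gene, lines in evidence.items()
                       if len(lines) >= min_lines)
    high_confidence = sorted(gene for gene, lines in evidence.items()
                             if len(lines) >= high_confidence_lines)
    return potential, high_confidence


# Invented placeholder genes and evidence; not results from the study.
toy_evidence = {
    "GENE_A": {"molQTL", "DEG", "KO mice"},
    "GENE_B": {"missense"},
    "GENE_C": {"OMIM", "DEG", "KO mice", "molQTL"},
    "GENE_D": set(),
}
potential_genes, high_confidence_genes = classify_effector_genes(toy_evidence)
```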

Multi-Trait Statistical Colocalization Analysis with Adiposity Measures

The authors used GWAS data for type 2 diabetes, GWAS data for the osteoarthritis phenotypes, and molecular QTL datasets from disease-relevant tissues. They performed colocalization analyses at the gene or protein level and conducted regional analyses using R packages to evaluate colocalization across all traits. HyPrColoc, which takes multiple traits as input, was used to estimate posterior probabilities and identify candidate effector genes. Among the 18 genomic regions colocalizing between type 2 diabetes and osteoarthritis, 16 displayed evidence of association or colocalization with at least one obesity-related trait. At the obesity-associated FTO locus, FTO and IRX3 colocalized, with a posterior probability of over 92% for a shared causal variant.

Figure 8. Regional association plots of the FTO and IRX3 region for type 2 diabetes and osteoarthritis. The plots are color-coded according to the linkage disequilibrium between the lead causal variant in the colocalization analysis and all other variants in the region. The red dashed line indicates the genome-wide significance threshold (p = 5×10⁻⁸), while the blue dashed line marks the suggestive association threshold (p = 10⁻⁶). (Arruda, 2023)

Conclusion

This study presents a GWAS data mining approach aimed at revealing the shared genetic etiology of two chronic diseases, type 2 diabetes and osteoarthritis. By analyzing large-scale GWAS data, the authors identified 18 colocalized genomic loci and, in conjunction with multi-omics and functional genomics information, defined 19 high-confidence effector genes. The findings indicate that the genetic correlation between type 2 diabetes and knee osteoarthritis is stronger than that with hip osteoarthritis, and that biological pathways related to obesity, skeletal development, and other factors play significant roles in the comorbidity of these diseases.

1. What is GWAS data mining?

GWAS data mining involves analyzing genetic data from genome-wide association studies to identify associations between genetic variants and traits or diseases. It utilizes statistical methods to extract meaningful patterns and insights from large datasets.

2. What types of data are commonly used in GWAS?

GWAS typically uses genotype data (SNPs), phenotype data (traits or diseases), and sometimes additional data such as environmental factors, clinical information, and demographic data. We document the download and organization process for all data.

3. How is GWAS data typically formatted?

Common formats for GWAS data include PLINK (.bed, .bim, .fam), VCF (Variant Call Format), and delimited text files (CSV/TSV), which include information on SNPs, individual samples, and their respective genotypes.

4. How is data quality ensured during analysis?

Our service implements stringent quality control measures, including preprocessing of raw data and statistical assessments, to ensure the accuracy and reproducibility of results. This includes filtering out low-quality samples and variants, for example by call rate, minor allele frequency, and deviation from Hardy-Weinberg equilibrium.

5. How can GWAS data mining contribute to understanding complex diseases?

By revealing associations between genetic variants and complex diseases, GWAS data mining helps elucidate the genetic basis of these conditions, contributing to knowledge of their pathophysiology and potential interventions.

References

Horikoshi, M., et al. Genome-wide associations for birth weight and correlations with adult disease. Nature. 2016, 538(7624), 248–252.
Arruda, A. L., et al. Genetic underpinning of the comorbidity between type 2 diabetes and osteoarthritis. American Journal of Human Genetics. 2023, 110(8), 1304–1318.