Genome-wide association studies (GWAS) test hundreds of thousands to millions of genetic variants across the genomes of many individuals to find variants that are statistically associated with specific traits or diseases. The number of variants found to be associated with a given trait is expected to grow steadily as GWAS sample sizes increase. GWAS results have a range of applications, such as providing insight into the underlying biology of phenotypes, estimating their heritability, calculating genetic correlations, making clinical risk predictions, informing drug development programs, and inferring potential cause-and-effect relationships between risk factors and health outcomes.
GWAS typically require very large sample sizes to identify reproducible genome-wide significant associations, and the required sample size can be estimated with power calculations in software tools such as CATS or GPC. Phenotypes can be binary (e.g., case-control status) or quantitative. In addition, a choice must be made between a population-based design and a family-based design.
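The idea behind such a sample size calculation can be sketched in a few lines. The example below is only an illustration, not the method used by CATS or GPC: it assumes a standardized quantitative trait, an additive single-variant test with one degree of freedom, and the conventional genome-wide significance level of 5e-8. The function names and default values are hypothetical.

```python
from statistics import NormalDist


def gwas_power(n, maf, beta, alpha=5e-8):
    """Approximate power of a 1-df additive test for a standardized quantitative trait."""
    norm = NormalDist()
    # Fraction of phenotypic variance explained by the variant (trait variance assumed to be 1).
    q2 = 2.0 * maf * (1.0 - maf) * beta ** 2
    # Non-centrality parameter of the association chi-square statistic.
    ncp = n * q2 / (1.0 - q2)
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    shift = ncp ** 0.5
    # Two-sided power: probability that |Z| exceeds the critical value.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


def required_sample_size(maf, beta, target_power=0.8, alpha=5e-8):
    """Smallest n that reaches the target power (doubling, then bisection)."""
    lo, hi = 0, 1  # power at n = 0 equals alpha, which is below any sensible target
    while gwas_power(hi, maf, beta, alpha) < target_power:
        hi *= 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if gwas_power(mid, maf, beta, alpha) >= target_power:
            hi = mid
        else:
            lo = mid
    return hi
```

Calling required_sample_size(maf=0.3, beta=0.05) returns the smallest cohort size at which a variant with that frequency and effect size would reach 80% power under these assumptions. Smaller effects or rarer variants push the number up quickly, which is why GWAS cohorts need to be large.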
Genotyping of individuals is usually performed using microarrays that capture common variants, or using next-generation sequencing methods that also capture rare variants, such as whole-exome sequencing (WES) or whole-genome sequencing (WGS). Because sequencing is still expensive, microarray-based genotyping remains the most commonly used method. In principle, WGS, which can determine almost every genotype in the genome, is preferable to WES and microarrays, and it is expected to become the method of choice in the coming years as low-cost WGS technology becomes increasingly available.
Workflow of GWAS
Once sample and variant quality control has been performed on the GWAS array data, genotypes are typically phased and imputed using a reference panel of sequenced haplotypes. Imputation is the statistical inference of genotypes that have not been directly assayed.
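As a rough illustration of what variant-level quality control can involve, the sketch below filters a single biallelic variant on call rate, minor allele frequency, and a chi-square test for Hardy-Weinberg equilibrium. The three filters and their thresholds are assumptions chosen for illustration (the text above does not specify them), and the function names are hypothetical. Dedicated tools are normally used for this step in practice.

```python
import math


def hwe_chi2_p(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    q = (2 * n_bb + n_ab) / (2 * n)  # frequency of allele B
    p = 1.0 - q
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def variant_qc(n_aa, n_ab, n_bb, n_missing=0,
               min_call_rate=0.98, min_maf=0.01, min_hwe_p=1e-6):
    """Return QC metrics and a pass/fail flag from genotype counts (thresholds are illustrative)."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return {"call_rate": 0.0, "maf": 0.0, "hwe_p": 1.0, "passes": False}
    call_rate = called / (called + n_missing)
    freq_b = (2 * n_bb + n_ab) / (2 * called)
    maf = min(freq_b, 1.0 - freq_b)
    hwe_p = hwe_chi2_p(n_aa, n_ab, n_bb)
    passes = call_rate >= min_call_rate and maf >= min_maf and hwe_p >= min_hwe_p
    return {"call_rate": call_rate, "maf": maf, "hwe_p": hwe_p, "passes": passes}
```

Variants that fail any of the filters would be dropped before phasing and imputation.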
In GWAS, linear or logistic regression models are usually used to test for associations, depending on whether the phenotype is continuous (e.g., height, blood pressure, or body mass index) or binary (e.g., presence or absence of disease). Covariates such as age, sex, and ancestry are included to account for population stratification and to avoid confounding by demographic factors.
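For a continuous phenotype, the test applied to each variant can be sketched as an ordinary least-squares regression of the phenotype on the genotype dosage (0, 1, or 2 copies of the effect allele). The minimal example below makes two simplifying assumptions that a real analysis would not: it omits the covariates described above, and it uses a normal approximation for the p-value, which is reasonable only for large samples. The function name is hypothetical.

```python
import math
from statistics import NormalDist, fmean


def linear_assoc(dosages, phenotypes):
    """Single-variant linear regression: returns (beta, standard error, two-sided p-value)."""
    n = len(dosages)
    mean_g = fmean(dosages)
    mean_y = fmean(phenotypes)
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, phenotypes))
    beta = sxy / sxx                    # effect size per copy of the allele
    intercept = mean_y - beta * mean_g
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(dosages, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of beta
    z = beta / se
    p_value = 2.0 * NormalDist().cdf(-abs(z))  # large-sample normal approximation
    return beta, se, p_value
```

Running this function once per variant yields the effect sizes, standard errors, and p-values that make up GWAS summary statistics. A binary phenotype would use logistic regression instead, which needs an iterative fit and is not shown here.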
The main output of a GWAS analysis is a set of summary statistics: p-values, effect sizes, and effect directions from the association tests of all tested genetic variants with the phenotype of interest. These data are typically visualized using Manhattan plots and quantile-quantile (Q-Q) plots, generated with software such as R or with online tools such as FUMA or LocusZoom.
a. Manhattan plot; b. quantile–quantile plot
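The coordinates of a quantile-quantile plot are simple to derive from the p-values alone. The sketch below computes them and also reports the genomic inflation factor, a commonly used one-number summary of the same comparison that is added here for illustration and is not discussed in the text above. The function names are hypothetical, and p-values are assumed to lie in the interval (0, 1].

```python
import math
from statistics import NormalDist, median

# Median of a chi-square distribution with one degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364231195724


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, most significant first."""
    n = len(p_values)
    observed = sorted(p_values)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected_p = (rank - 0.5) / n  # quantile expected under the null hypothesis
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points


def genomic_inflation(p_values):
    """Ratio of the median observed chi-square statistic to its null expectation."""
    norm = NormalDist()
    # Convert each two-sided p-value back to a 1-df chi-square statistic.
    chi2 = [norm.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN
```

Points that follow the diagonal indicate p-values behaving as expected under the null hypothesis. An early, genome-wide departure from the diagonal, or an inflation factor well above 1, can point to residual confounding such as population stratification, although strongly polygenic traits can also show inflation.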
Fine-mapping is an in silico process that aims to prioritize the variants most likely to be causal for the target phenotype within each genetic locus identified by GWAS, based on observed patterns of linkage disequilibrium and association statistics. (i) The simplest fine-mapping analysis is a conditional association analysis of regional variants, which conditions the regional association signal on the lead variant(s) at the locus by including them as covariates in the genotype-phenotype regression model. (ii) Several other approaches are based on Bayesian models, including CAVIAR, FINEMAP, PAINTOR, and SuSiE. (iii) The prioritization of variants can be improved by integrating functional annotations of SNPs (e.g., expression quantitative trait loci or epigenomic marks) into the priors of Bayesian fine-mapping models.
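To give a feel for the Bayesian approach, the sketch below computes posterior inclusion probabilities and a credible set from summary statistics using approximate Bayes factors. It is deliberately much simpler than the tools named above: it assumes exactly one causal variant per locus, an equal prior for every variant, and a normal prior on the effect size whose standard deviation (0.2 here) is an arbitrary illustrative choice. It also ignores linkage disequilibrium between variants. All function names are hypothetical.

```python
import math


def log_abf(beta, se, prior_sd=0.2):
    """Log approximate Bayes factor in favor of association for one variant."""
    v = se * se              # sampling variance of the effect estimate
    w = prior_sd * prior_sd  # prior variance of the true effect (assumed)
    z2 = (beta / se) ** 2
    return 0.5 * math.log(v / (v + w)) + z2 * w / (2.0 * (v + w))


def posterior_inclusion_probs(betas, ses, prior_sd=0.2):
    """Posterior probability that each variant is the single causal one."""
    logs = [log_abf(b, s, prior_sd) for b, s in zip(betas, ses)]
    top = max(logs)
    weights = [math.exp(x - top) for x in logs]  # subtract the max to avoid overflow
    total = sum(weights)
    return [w / total for w in weights]


def credible_set(variant_ids, pips, coverage=0.95):
    """Smallest set of top-ranked variants whose probabilities sum to the coverage."""
    ranked = sorted(zip(variant_ids, pips), key=lambda pair: pair[1], reverse=True)
    chosen, cumulative = [], 0.0
    for variant_id, pip in ranked:
        chosen.append(variant_id)
        cumulative += pip
        if cumulative >= coverage:
            break
    return chosen
```

Functional annotations, as described in point (iii), would enter this scheme by replacing the equal prior with variant-specific prior weights before normalization.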
(i) For the 2-3% of GWAS signals that fine-map to coding variants, tools such as ANNOVAR or VEP can be used to predict their potential impact on genes.
(ii) The vast majority of fine-mapped SNPs are located outside coding regions, do not affect protein structure, and have unknown regulatory functions. Molecular quantitative trait locus (molQTL) analysis correlates genetic variants with specific molecular phenotypes, such as gene expression, and thereby identifies variants that regulate target genes. Fine-mapped GWAS variants can also be linked to genes using approaches based on chromatin conformation capture (3C), which can detect the enhancer-promoter loops that control proximal or distal genes.
The highly polygenic signals that GWAS detect for a given trait tend to converge on a limited number of biological processes, so pathway-level effects of genetic variation can be identified and linked to cellular and physiological functions. Software tools for this type of analysis include MAGMA and DEPICT.
CD Genomics provides comprehensive GWAS services to help you identify genetic variants associated with specific phenotypes or diseases. Learn more about how our GWAS service can help advance your research.