GWAS - Introduction, Features, and Workflow


What are genome-wide association studies?

Genome-wide association studies (GWAS) test hundreds of thousands to millions of genetic variants across the genomes of many individuals to find variants that are statistically associated with specific traits or diseases. The number of variants known to be associated with disease is expected to grow steadily as GWAS sample sizes increase. GWAS have a range of applications, such as providing insight into the underlying biology of phenotypes, estimating their heritability, calculating genetic correlations, making clinical risk predictions, informing drug development programs, and inferring potential cause-and-effect relationships between risk factors and health outcomes.

How are genome-wide association studies conducted?

1. Collect DNA and phenotypic information on individuals (e.g. disease status and demographic information such as age and sex)

GWAS typically require very large sample sizes to identify replicable genome-wide significant loci. The required sample size can be estimated with power calculations in software tools such as CATS or GPC. Phenotypes can be binary (for example, case or control status) or quantitative (for example, height). In addition, a choice must be made between a population-based design and a family-based design.
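The kind of power calculation mentioned above can be sketched in a few lines. This is a minimal illustration, not a reimplementation of CATS or GPC: it assumes a standardized quantitative trait, an additive model, Hardy-Weinberg proportions, unrelated individuals, and a normal approximation to the test statistic. The function names and default values are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def gwas_power(n, maf, beta, alpha=5e-8):
    """Approximate power of a 1-df additive test for a standardized quantitative trait.

    Assumptions (for illustration only): Hardy-Weinberg proportions, unrelated
    individuals, per-allele effect `beta` in phenotypic standard deviation units.
    """
    # Fraction of phenotypic variance explained by the variant
    var_explained = 2.0 * maf * (1.0 - maf) * beta ** 2
    if not 0.0 < var_explained < 1.0:
        raise ValueError("effect size and allele frequency must give 0 < variance explained < 1")
    # Non-centrality parameter of the chi-square association statistic
    ncp = n * var_explained / (1.0 - var_explained)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = ncp ** 0.5
    # Two-sided power under the normal approximation
    return (1.0 - _STD_NORMAL.cdf(z_crit - shift)) + _STD_NORMAL.cdf(-z_crit - shift)

def required_sample_size(maf, beta, power=0.8, alpha=5e-8):
    """Smallest n that reaches the target power, found by doubling then bisection."""
    lo, hi = 1, 2
    while gwas_power(hi, maf, beta, alpha) < power:
        hi *= 2
        if hi > 10 ** 12:
            raise ValueError("target power not reachable for a realistic sample size")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if gwas_power(mid, maf, beta, alpha) < power:
            lo = mid
        else:
            hi = mid
    return hi
```

For example, `required_sample_size(maf=0.3, beta=0.1)` returns the number of individuals needed for 80% power at the conventional genome-wide threshold of 5e-8 under these assumptions. Binary traits and family-based designs need different formulas.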

2. Genotyping of each individual using available GWAS arrays or sequencing strategies

Individuals are usually genotyped using microarrays, which capture common variants, or next-generation sequencing methods that also capture rare variants, such as whole-exome sequencing (WES) or whole-genome sequencing (WGS). Because sequencing is still expensive, microarray-based genotyping is the most commonly used method. Ideally, WGS, which can determine almost every genotype in the entire genome, is preferred over WES and microarrays, and it is expected to become the method of choice in the coming years as low-cost WGS technology becomes increasingly available.

Workflow of GWAS

3. Quality control

  • Removing rare variants and variants that fail the Hardy-Weinberg equilibrium test.
  • Filtering out SNPs whose genotypes are missing in more than a small fraction of individuals in the cohort.
  • Identifying and removing genotyping errors and ensuring that phenotypes are correctly matched to the genetic data.
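The variant-level filters listed above can be sketched as follows. The thresholds (minor allele frequency, call rate, Hardy-Weinberg p-value) are illustrative assumptions rather than values from this article, and the function name is hypothetical. Production pipelines normally run such filters with dedicated tools, and many use an exact Hardy-Weinberg test rather than the chi-square approximation shown here.

```python
import math

def variant_qc(genotypes, maf_min=0.01, call_rate_min=0.98, hwe_p_min=1e-6):
    """Return (passes, stats) for one biallelic variant.

    `genotypes` is a non-empty list of alternate-allele counts per individual:
    0, 1, 2, or None when the genotype call is missing.
    All thresholds are illustrative defaults, not recommendations.
    """
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    n = len(called)
    n_het = called.count(1)
    n_hom_alt = called.count(2)
    n_hom_ref = n - n_het - n_hom_alt
    alt_freq = (n_het + 2 * n_hom_alt) / (2 * n) if n else 0.0
    maf = min(alt_freq, 1.0 - alt_freq)

    if 0.0 < alt_freq < 1.0:
        # 1-df chi-square goodness-of-fit test against Hardy-Weinberg proportions
        p, q = 1.0 - alt_freq, alt_freq
        expected = (n * p * p, 2 * n * p * q, n * q * q)
        observed = (n_hom_ref, n_het, n_hom_alt)
        chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        # Survival function of a chi-square distribution with 1 degree of freedom
        hwe_p = math.erfc(math.sqrt(chi2 / 2.0))
    else:
        hwe_p = 1.0  # monomorphic variant: the test is not informative

    passes = maf >= maf_min and call_rate >= call_rate_min and hwe_p >= hwe_p_min
    return passes, {"maf": maf, "call_rate": call_rate, "hwe_p": hwe_p}
```

Calling `variant_qc` on each variant's genotype vector yields a pass/fail flag plus the statistics behind it, which makes it easy to report how many variants each filter removed.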

4. Haplotype phasing and imputation

Once sample and variant quality control has been performed on the GWAS array data, variants are typically phased and then imputed using a reference panel of sequenced haplotypes. Imputation is the statistical inference of genotypes that were not directly assayed.

5. Performing statistical tests of association

In GWAS, linear or logistic regression models are usually used to test for associations, depending on whether the phenotype is continuous (e.g., height, blood pressure, or body mass index) or binary (e.g., presence or absence of disease). Covariates such as age, sex, and ancestry are included to account for population stratification and to avoid confounding by demographic factors.
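As a rough illustration of the per-variant test for a continuous phenotype, the sketch below regresses the phenotype on allele dosage for a single variant. It deliberately omits covariates (age, sex, ancestry components), which a real analysis would include, and it uses a normal approximation for the p-value, which is reasonable only for large samples. The function name is hypothetical.

```python
import math
from statistics import NormalDist, fmean

def linear_association(dosages, phenotypes):
    """Simple linear regression of a continuous phenotype on allele dosage.

    Returns (beta, se, p). No covariates are modeled here; this is a
    simplification for illustration. Requires a polymorphic variant and
    at least three individuals with non-identical residuals.
    """
    n = len(dosages)
    g_mean, y_mean = fmean(dosages), fmean(phenotypes)
    sxx = sum((g - g_mean) ** 2 for g in dosages)
    sxy = sum((g - g_mean) * (y - y_mean) for g, y in zip(dosages, phenotypes))
    beta = sxy / sxx                      # per-allele effect estimate
    intercept = y_mean - beta * g_mean
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(dosages, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of beta
    z = beta / se
    p = 2.0 * NormalDist().cdf(-abs(z))   # two-sided, large-sample approximation
    return beta, se, p
```

Running this function once per variant produces the effect sizes, standard errors, and p-values that make up GWAS summary statistics. For a binary phenotype, logistic regression would replace the linear model.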

6. Conduct meta-analysis (optional)
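One common way to combine results across cohorts is an inverse-variance-weighted fixed-effect meta-analysis, sketched below for a single variant. The article does not say which meta-analysis method is intended, so treat this choice, and the hypothetical function name, as assumptions. The sketch also assumes every cohort reports the effect for the same allele on the same scale.

```python
import math
from statistics import NormalDist

def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis for one variant.

    `betas` and `ses` hold per-cohort effect estimates and standard errors,
    aligned to the same effect allele. Returns (beta, se, p).
    """
    weights = [1.0 / se ** 2 for se in ses]          # precision of each cohort
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    p = 2.0 * NormalDist().cdf(-abs(beta / se))      # two-sided p-value
    return beta, se, p
```

Cohorts with smaller standard errors, usually the larger ones, contribute more to the pooled estimate.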

7. Seek independent replication and interpret the results by performing multiple post-GWAS analyses

Guidelines for Analysis Using GWAS Data

1. GWAS data

The main output of a GWAS analysis is a set of summary statistics: the p-value, effect size, and direction of effect from the association test of each genetic variant with the phenotype of interest. These data are typically visualized with Manhattan plots and quantile-quantile (QQ) plots, generated using software such as R or online tools such as FUMA or LocusZoom.

a. Manhattan plot; b. quantile–quantile plot
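The numbers behind a QQ plot can be computed directly from the p-values, as in the sketch below. It also computes the genomic inflation factor, a common companion statistic that this article does not mention, and flags hits at the conventional genome-wide threshold of 5e-8. Function names are hypothetical, and all p-values are assumed to lie in (0, 1].

```python
import math
from statistics import NormalDist, median

GENOME_WIDE_ALPHA = 5e-8          # conventional genome-wide significance threshold
MEDIAN_CHI2_1DF = 0.4549364       # median of a chi-square distribution with 1 df

def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, sorted ascending, for a QQ plot."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    # Expected values under the null: quantiles of a uniform distribution
    expected = sorted(-math.log10((i + 0.5) / n) for i in range(n))
    return list(zip(expected, observed))

def genomic_inflation(p_values):
    """Median observed 1-df chi-square statistic divided by its null median."""
    std_normal = NormalDist()
    chi2 = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / MEDIAN_CHI2_1DF

def genome_wide_hits(p_values):
    """Indices of variants passing the genome-wide significance threshold."""
    return [i for i, p in enumerate(p_values) if p < GENOME_WIDE_ALPHA]
```

Points that follow the diagonal in the QQ plot, with an inflation factor near 1, suggest well-calibrated test statistics. Early, systematic departure from the diagonal can indicate confounding such as population stratification.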


2. Statistical fine-mapping

Fine-mapping is an in silico process that aims to prioritize the variants most likely to be causal for the target phenotype within each genetic locus identified by GWAS, based on observed patterns of linkage disequilibrium and association statistics. (i) The simplest fine-mapping analysis is a conditional association analysis of the variants in a region, which conditions the regional association signal on the lead variants at the locus by including them as covariates in the genotype-phenotype regression model. (ii) Several other approaches are based on Bayesian models, including CAVIAR, FINEMAP, PAINTOR, and SuSiE. (iii) The prioritization of variants can be improved by integrating functional annotations of SNPs (e.g., expression quantitative trait loci or epigenomic marks) into the priors of Bayesian fine-mapping models.
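To make the Bayesian idea concrete, the sketch below uses a much simpler method than the tools named above: Wakefield approximate Bayes factors combined with the assumption of exactly one causal variant per locus. It turns per-variant effect sizes and standard errors into posterior inclusion probabilities and a credible set. The prior standard deviation of 0.2, the function names, and the variant IDs in the usage note are illustrative assumptions. Unlike CAVIAR, FINEMAP, PAINTOR, or SuSiE, it does not model linkage disequilibrium explicitly or allow multiple causal variants.

```python
import math

def wakefield_log_abf(beta, se, prior_sd=0.2):
    """Log approximate Bayes factor in favor of association (Wakefield's approximation)."""
    v, w = se ** 2, prior_sd ** 2
    z2 = (beta / se) ** 2
    return 0.5 * math.log(v / (v + w)) + 0.5 * z2 * w / (v + w)

def credible_set(variants, coverage=0.95, prior_sd=0.2):
    """Posterior inclusion probabilities and credible set for one locus.

    `variants` is a list of (variant_id, beta, se) tuples. Assumes exactly one
    causal variant in the locus and equal prior probability for every variant.
    """
    log_abfs = [wakefield_log_abf(b, s, prior_sd) for _, b, s in variants]
    max_log = max(log_abfs)                      # subtract the max for numerical stability
    weights = [math.exp(x - max_log) for x in log_abfs]
    total = sum(weights)
    pips = {vid: w / total for (vid, _, _), w in zip(variants, weights)}

    # Add variants in order of decreasing probability until coverage is reached
    chosen, cumulative = [], 0.0
    for vid, pip in sorted(pips.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(vid)
        cumulative += pip
        if cumulative >= coverage:
            break
    return pips, chosen
```

With hypothetical input such as `[("rs_a", 0.30, 0.03), ("rs_b", 0.10, 0.03), ("rs_c", 0.02, 0.03)]`, nearly all posterior probability falls on the first variant, so the 95% credible set contains only that variant.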

3. Identification of affected genes

(i) For the 2-3% of GWAS loci that fine-map to coding variants, tools such as ANNOVAR or VEP can be used to infer the potential impact of those variants on genes.

(ii) The vast majority of fine-mapped GWAS SNPs lie outside coding regions, do not alter protein structure, and have unknown regulatory functions. Molecular quantitative trait locus (molQTL) analysis, which correlates genetic variants with specific molecular phenotypes such as gene expression, can identify the target genes that these variants regulate. Fine-mapped GWAS variants can also be linked to genes using chromatin conformation capture (3C)-based approaches, which can reveal the enhancer-promoter loops that control proximal or distal genes.

4. Identification of regulatory pathways and cellular effects

The highly polygenic signals that GWAS detect for a given trait tend to converge on a limited number of biological processes, so the pathway-level effects of genetic variation can be identified and linked to cellular and physiological functions. Software tools for this purpose include MAGMA and DEPICT.
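The idea of testing whether GWAS-implicated genes concentrate in a pathway can be illustrated with a one-sided hypergeometric over-representation test. This is a deliberately simple stand-in, not the method used by MAGMA or DEPICT, which rely on more elaborate gene-level models. The function name and the numbers in the usage note are hypothetical.

```python
from math import comb

def enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """One-sided hypergeometric p-value for pathway over-representation.

    n_background: genes tested overall
    n_pathway:    background genes annotated to the pathway
    n_selected:   genes prioritized from GWAS loci
    n_overlap:    prioritized genes that fall in the pathway
    Returns P(overlap >= n_overlap) if genes were prioritized at random.
    """
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_background, n_selected)
```

For instance, `enrichment_p(20000, 200, 100, 8)` asks how surprising it would be for 8 of 100 prioritized genes to fall in a 200-gene pathway drawn from 20,000 background genes. This test ignores gene size and linkage disequilibrium between neighboring genes, both of which can inflate enrichment results.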

Partner with CD Genomics

CD Genomics provides comprehensive GWAS services to help you identify genetic variants associated with specific phenotypes or diseases. Learn more about how our GWAS service can advance your research.


  1. Uffelmann, E., Huang, Q.Q., Munung, N.S. et al. Genome-wide association studies. Nat Rev Methods Primers 1, 59 (2021).
* For Research Use Only. Not for use in diagnostic procedures.