Bioinformatics 101: Amplicon Sequencing and Beta Diversity Analysis


What is Beta Diversity Analysis?

Beta Diversity, a key metric in microbial ecology, enables a nuanced comparison of microbial community biodiversity across various samples. The analysis commences with the computation of a distance matrix among environmental samples, revealing the inherent structure of community data. Subsequently, differences between samples are observed through a process known as ordination.

Please refer to our article Bioinformatics 101: Microbiome Diversity Analysis for more information.

Together with alpha diversity, beta diversity describes the overall biological heterogeneity of an environmental community. Beta diversity can be computed with various algorithms, including bray_curtis, euclidean, abund_jaccard, unweighted_unifrac, and weighted_unifrac. These algorithms fall into two primary categories: weighted (e.g., Bray-Curtis and Weighted Unifrac) and unweighted (e.g., Jaccard and Unweighted Unifrac).

Unweighted analyses focus only on the presence or absence of species. A smaller beta diversity between two populations indicates a higher similarity in the types of species present. Weighted methods, on the other hand, consider both the presence or absence of species and their abundance.

Alpha and Beta Diversity. (Nah et al., 2019)

The Bray-Curtis distance metric, a commonly employed measure in ecology, calculates dissimilarity based on species abundance information, effectively addressing variability among communities. Weighted Unifrac distance, in contrast, factors in both the evolutionary relationship of microorganisms and the relative abundance of species in each sample. Meanwhile, Unweighted Unifrac solely considers species' presence or absence, disregarding relative abundance differences.

Each distance metric has its unique sensitivity: Unweighted Unifrac is particularly attuned to rare species, while Bray-Curtis and Weighted Unifrac distances are more responsive to species with higher abundance.
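
To make the weighted versus unweighted distinction concrete, here is a minimal sketch in plain Python. The two count vectors are hypothetical, and the helper names bray_curtis and jaccard_presence_absence are chosen for illustration only; they are not the implementation of any particular pipeline. Unifrac is left out because it additionally requires a phylogenetic tree.

```python
def bray_curtis(a, b):
    """Weighted dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # assumption: two empty samples are treated as identical
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def jaccard_presence_absence(a, b):
    """Unweighted distance: 1 - (shared taxa / taxa found in either sample)."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1.0 - len(present_a & present_b) / len(union)


# Hypothetical counts for four taxa in two samples.
sample_a = [10, 0, 5, 1]
sample_b = [1, 0, 5, 10]

bc = bray_curtis(sample_a, sample_b)
jc = jaccard_presence_absence(sample_a, sample_b)
print(f"Bray-Curtis: {bc:.4f}")
print(f"Jaccard:     {jc:.4f}")
```

Both samples contain the same three taxa, so the presence/absence distance is 0, while Bray-Curtis reports 0.5625 because the abundances of two of those taxa differ strongly.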

The analysis culminates in the examination of results using multivariate statistical methods such as Principal Coordinates Analysis (PCoA) and the Unweighted Pair-Group Method with Arithmetic Means (UPGMA). These methods provide insights into the differences in microbial community structure among samples and the varying contributions of different classifications to the samples.
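
As a rough illustration of what PCoA does with a distance matrix, the sketch below applies classical multidimensional scaling with NumPy: the squared distances are double-centered and then eigendecomposed. The function name pcoa and the 3 x 3 distance matrix are hypothetical, and this is a simplified version that sets negative eigenvalues to zero rather than applying any correction.

```python
import numpy as np


def pcoa(dist, n_axes=2):
    """Classical PCoA: return sample coordinates and explained fractions."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Gower double-centering of the squared distances.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Simplification: negative eigenvalues are clipped to zero.
    kept = np.clip(eigvals[:n_axes], 0.0, None)
    coords = eigvecs[:, :n_axes] * np.sqrt(kept)
    explained = kept / eigvals[eigvals > 0].sum()
    return coords, explained


# Hypothetical dissimilarities among three samples.
dist = [
    [0.00, 0.10, 0.80],
    [0.10, 0.00, 0.75],
    [0.80, 0.75, 0.00],
]
coords, explained = pcoa(dist)
print(coords)
print(explained)
```

Each row of coords is one sample's position on the first two principal coordinates, and explained gives the fraction of the positive eigenvalue sum carried by each axis.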

CD Genomics integrates microbial amplicon sequencing and metagenome sequencing data analysis to assist clients in classifying species, determining species abundance, and analyzing the relationships between environmental factors and microbial communities. We also provide insight into environmental species composition and abundance, conduct gene prediction and functional annotation, and facilitate the comparison of gene abundance and metabolic network differences across samples.

Generating a Sample Distance Heatmap

The first step in beta diversity analysis is to calculate the distance between every pair of samples and assemble the results into a sample distance matrix. This involves supplying a flat_out_table as input, selecting an appropriate distance algorithm (commonly Bray-Curtis), and using usearch software to compute the dissimilarity coefficient for each pair of samples. The resulting distance matrix is then hierarchically clustered and displayed as a heatmap.
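
The pairwise step can be sketched in plain Python as follows. This does not reproduce usearch or its input format; the otu_table dictionary, its counts, and the function names are hypothetical stand-ins used only to show how a symmetric Bray-Curtis matrix is assembled.

```python
from itertools import combinations


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # assumption: two empty samples are treated as identical
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def distance_matrix(table):
    """table maps sample name -> counts (same taxon order in every sample)."""
    names = sorted(table)
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = bray_curtis(table[names[i]], table[names[j]])
        matrix[i][j] = matrix[j][i] = d  # the matrix is symmetric
    return names, matrix


# Hypothetical counts for four taxa in three samples.
otu_table = {
    "S1": [10, 0, 5, 1],
    "S2": [9, 1, 5, 1],
    "S3": [0, 12, 1, 3],
}
names, dist = distance_matrix(otu_table)
for name, row in zip(names, dist):
    print(name, " ".join(f"{d:.4f}" for d in row))
```

In this toy table S1 and S2 are close (0.0625), while S3 is far from both (0.8750 and 0.8125), which is the pattern a clustered heatmap would show as one tight pair and one outlier.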

The heatmap coloration serves as a key indicator of sample proximity. Cooler, bluer tones signify closer distances, indicating higher similarity between samples. Conversely, warmer, redder hues denote greater distances, highlighting dissimilarity. The clustering tree within the heatmap visually organizes the samples, providing a clear depiction of the relationships and distances between them. This comprehensive visualization enhances our understanding of the sample distribution and facilitates the identification of clusters within the dataset.

PCA Analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique commonly employed to simplify complex datasets. Grounded in Euclidean distance, PCA applies variance decomposition to identify the principal components (the eigenvectors of the data's covariance matrix) that underlie sample differences, along with their respective contribution rates (the share of variance each component explains, given by its eigenvalue).

Please refer to our article Overview of Principal Component Analysis for more information.

By extracting essential features from the original data, PCA places samples in a new, lower-dimensional coordinate system that preserves as much of the actual difference between them as possible. The distances between samples in the new coordinate system therefore approximate their true dissimilarities. The coordinate axes are ordered so that each successive axis explains a decreasing proportion of the sample differences in the original data.

Typically, the first two dimensions (PC1 and PC2) or three dimensions (PC1, PC2, and PC3) obtained from PCA analysis are selected for mapping. This visual representation effectively captures the primary distributional characteristics of community samples, allowing for the quantification of differences and similarities among them.
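
A minimal sketch of this procedure with NumPy is shown below, using a singular value decomposition of the centered abundance table. The function name pca and the small relative-abundance table are hypothetical; a real OTU table would usually be transformed or scaled first, which this sketch deliberately skips.

```python
import numpy as np


def pca(table, n_components=2):
    """Return sample coordinates and the fraction of variance per axis."""
    x = np.asarray(table, dtype=float)
    centered = x - x.mean(axis=0)  # center each taxon (column)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    coords = centered @ vt[:n_components].T
    return coords, explained[:n_components]


# Hypothetical relative abundances: 4 samples (rows) x 3 taxa (columns).
abundances = [
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.10, 0.30, 0.60],
    [0.15, 0.25, 0.60],
]
coords, explained = pca(abundances)
print(coords)      # PC1 and PC2 coordinates for each sample
print(explained)   # fraction of variance explained by PC1 and PC2
```

Plotting the two columns of coords against each other gives the usual PC1 versus PC2 scatter, and the values in explained are the percentages typically printed on the axis labels.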

PCoA analysis based on weighted unifrac – CD Genomics

R software facilitated PCA analysis of the community compositional structure at the Operational Taxonomic Unit (OTU) level. Leveraging the results of Euclidean coefficient of dissimilarity calculations, two- or three-dimensional images were generated, depicting the natural distributional features among samples. Notably, on the PCA map, closer proximity indicates a higher similarity in species composition, offering a visually intuitive means to assess the community relationships and overall structural patterns.

Cluster Analysis

Cluster analysis, primarily utilizing the Hierarchical Clustering method, visually represents sample similarities through hierarchical trees, gauging the effectiveness of clustering by assessing the length of the branches in the clustering tree. Similar to Multidimensional Scaling (MDS) analysis, any distance metric can be employed in cluster analysis to evaluate sample similarities.

Various linkage methods are available for cluster analysis, including single-linkage clustering, complete-linkage clustering, and average-linkage clustering, the last of which is also known as the Unweighted Pair-Group Method with Arithmetic Means (UPGMA).

To gain deeper insight into the outcomes of Principal Coordinates Analysis (PCoA), we conducted cluster analysis using the Unweighted Pair-Group Method with Arithmetic Means (UPGMA). The samples were clustered on both the Weighted Unifrac distance matrix and the Unweighted Unifrac distance matrix. The clustering results were then combined with the relative abundance of species at the operational taxonomic unit (OTU) level for a comprehensive presentation.
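
The UPGMA procedure itself can be sketched in a few lines of plain Python: repeatedly merge the two closest clusters, where the distance between clusters is the mean of all pairwise distances between their members. The function name upgma, the sample labels, and the distance values are hypothetical, and the input here is a generic dissimilarity matrix rather than an actual Unifrac result.

```python
def upgma(names, dist):
    """Average-linkage (UPGMA) clustering on a symmetric distance matrix.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of sample labels.
    """
    index = {name: i for i, name in enumerate(names)}
    clusters = {i: (name,) for i, name in enumerate(names)}

    def average_distance(c1, c2):
        # Mean of all pairwise distances between members of the two clusters.
        total = sum(dist[index[a]][index[b]] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        ids = sorted(clusters)
        # Find the closest pair among the current clusters.
        d, i, j = min(
            (average_distance(clusters[i], clusters[j]), i, j)
            for k, i in enumerate(ids)
            for j in ids[k + 1:]
        )
        merges.append((clusters[i], clusters[j], d))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges


# Hypothetical dissimilarities among three samples.
names = ["S1", "S2", "S3"]
dist = [
    [0.0, 0.0625, 0.875],
    [0.0625, 0.0, 0.8125],
    [0.875, 0.8125, 0.0],
]
merges = upgma(names, dist)
for a, b, d in merges:
    print(a, "+", b, f"at distance {d:.5f}")
```

Here S1 and S2 join first at 0.0625, and S3 joins the pair at the mean of its two distances, 0.84375. The merge history is what a dendrogram draws as branch heights.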


  1. Nah, G., Park, SC., Kim, K. et al. Type-2 Diabetics Reduces Spatial Variation of Microbiome Based on Extracellular Vesicles from Gut Microbes across Human Body. Sci Rep 9, 20136 (2019).