UMAP Plot: Dimensionality Reduction and Visualization Analysis of High-Dimensional Data

UMAP (Uniform Manifold Approximation and Projection) is a technique for reducing the dimensionality of data by mapping high-dimensional data into a lower-dimensional space. This method is commonly used for data visualization and simplifying complex datasets for further analysis. In this article, we will delve into feature dimensionality reduction, its importance, and the core principles behind UMAP. We will compare UMAP with other dimensionality reduction methods like PCA and t-SNE, discussing its advantages. Furthermore, the article will offer practical coding examples to showcase how UMAP can be used for reducing dimensions and visualizing data, helping to uncover underlying patterns and structures in the data.

What is UMAP

UMAP is a nonlinear dimensionality reduction technique for analyzing and visualizing high-dimensional data. By preserving the relationships between neighboring points, it builds a low-dimensional representation that retains important structure in the data. Unlike linear methods such as PCA, UMAP can capture nonlinear patterns in complex datasets. UMAP and t-SNE serve similar purposes, but UMAP is typically faster, scales better to large datasets, and often preserves more of the global structure. t-SNE remains useful in specific cases, particularly when local structure is the main concern. UMAP builds on ideas similar to those behind t-SNE, with a different mathematical foundation that improves computational efficiency and scalability. The choice among these dimensionality reduction methods should be guided by the goals of your analysis.

UMAP rests on three assumptions about the data: the data are uniformly distributed on a Riemannian manifold, the Riemannian metric is locally constant (or approximately so), and the manifold is locally connected. The method aims to preserve both local and global relationships in the high-dimensional data by constructing a graph that captures connectivity patterns. It proceeds in two main stages: first, a weighted nearest-neighbor graph is built from the high-dimensional data; second, the nodes of that graph are embedded in a lower-dimensional space, with the layout optimized by stochastic gradient descent. Its theoretical framework draws mainly on Riemannian geometry and algebraic topology.
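The graph-building stage can be illustrated with a short base-R sketch. This is a simplified illustration under stated assumptions, not the code used by any UMAP package: the function name fuzzy_knn_graph is hypothetical, distances are Euclidean, and the input is assumed to contain no duplicate points. For each point it finds the k nearest neighbors, converts their distances to weights between 0 and 1 (the closest neighbor gets weight 1), and then symmetrizes the weights. The second stage, the gradient-descent layout, is not shown.

```r
# Simplified sketch of UMAP-style graph construction (illustration only).
# Assumes Euclidean distances and no duplicate points in x.
fuzzy_knn_graph <- function(x, k = 5) {
  d <- as.matrix(dist(x))              # pairwise Euclidean distances
  n <- nrow(d)
  target <- log2(k)                    # weights of each point should sum to log2(k)
  w <- matrix(0, n, n)
  for (i in seq_len(n)) {
    nn <- order(d[i, ])[2:(k + 1)]     # k nearest neighbours, skipping the point itself
    d_nn <- d[i, nn]
    rho <- min(d_nn)                   # distance to the closest neighbour
    # Sum of neighbour weights for a given bandwidth sigma, minus the target
    f <- function(sigma) sum(exp(-pmax(d_nn - rho, 0) / sigma)) - target
    sigma <- uniroot(f, interval = c(1e-6, 1e6), tol = 1e-8)$root
    w[i, nn] <- exp(-pmax(d_nn - rho, 0) / sigma)
  }
  # Fuzzy union: combine the two directed weights into one symmetric weight
  w + t(w) - w * t(w)
}

set.seed(1)
x <- matrix(rnorm(200), ncol = 4)      # 50 random points with 4 features
g <- fuzzy_knn_graph(x, k = 5)
dim(g)                                 # one row and one column per point
```

The resulting matrix plays the role of the connectivity graph: entries near 1 link points that are close neighbors, and entries of 0 mean the two points are not among each other's nearest neighbors.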

Overview of UMAP (Sainburg et al., 2021).

The Advantages of UMAP

Capable of handling extremely large datasets while generating embeddings in a relatively short time.

Preserves local structure well and often retains more of the global structure than t-SNE, which supports informative visualizations and can help downstream classification, clustering, and other data analysis tasks.

Supports a wide range of data types, including numerical, categorical, and mixed data, provided a suitable distance metric is chosen.

Requires little preprocessing. Because UMAP relies on distances between points, however, features measured on very different scales should usually be standardized first, as is done in the penguin example below.
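Since UMAP works from distances between points, it is worth seeing how feature scales affect those distances. The base-R sketch below uses three made-up measurements (the values and column names are purely illustrative) and compares the nearest neighbor of point A before and after standardizing with scale().

```r
# Hypothetical measurements: height in metres, weight in grams
x <- data.frame(
  height_m = c(1.60, 1.95, 1.61),
  weight_g = c(60000, 60100, 60150)
)
rownames(x) <- c("A", "B", "C")

# Raw Euclidean distances are dominated by the gram-scale column
d_raw <- as.matrix(dist(x))

# After scale(), each column has mean 0 and standard deviation 1
d_scaled <- as.matrix(dist(scale(x)))

# Nearest neighbour of A before and after standardization
nn_raw <- names(which.min(d_raw["A", c("B", "C")]))
nn_scaled <- names(which.min(d_scaled["A", c("B", "C")]))
nn_raw      # "B": closest in weight, despite a very different height
nn_scaled   # "C": closest once both features count equally
```

A neighbor graph built on the raw values would therefore link different points than one built on standardized values, which is why standardizing is a sensible default when features use different units.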

How to Read a UMAP Plot

A UMAP plot is usually a two-dimensional or three-dimensional scatter plot, with each point representing a sample in the raw data. The following are the key elements for interpreting the UMAP diagram:

(1) Position of the point

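Close points: Points that lie close together usually represent samples that are similar in the high-dimensional space.

Distant points: Points that lie far apart usually represent samples that differ substantially in the high-dimensional space. For example, different cell types often occupy different regions of the plot. Distances between well-separated clusters should be interpreted with caution, since UMAP preserves local neighborhoods more faithfully than large distances.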

(2) Color and marking

Color coding: Colors are commonly used in UMAP plots to represent category labels (such as cell type or sample condition). For example, differently colored points may represent different cell types.

UMAP of tumor cells and thyrocytes (Wang et al., 2023).

(3) Density and distribution

Density of points: High-density areas may indicate a concentration of samples from a certain category or with certain features, although UMAP does not preserve density exactly.

Distribution shapes: Certain shapes (such as rings or chains) may reveal inherent structure in the data. For example, points arranged in a chain may indicate a sequential relationship between samples.

(4) Global and local structure

Global structure: A UMAP plot can show the overall distribution pattern of the data, which may reveal broad relationships between groups of samples.

Local structure: The distribution of points in a local area can reveal more detailed data characteristics.

How to Draw a UMAP Plot in R

So, how do we implement UMAP in R? We walk through two examples, starting with simple random data.

First, we need to install and load the umap package:

install.packages("umap")
library(umap)

Next, we generate a random dataset containing 100 observations and 10 features.

set.seed(123)
data <- matrix(rnorm(1000), ncol = 10)

We then use the umap() function to reduce the data to two-dimensional space and visualize the results.

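# Embed into two dimensions (n_components overrides the default setting)
embedding <- umap(data, n_components = 2)
# The embedded coordinates are stored in embedding$layout
plot(embedding$layout[, 1], embedding$layout[, 2], pch = 20, xlab = "UMAP1", ylab = "UMAP2")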

UMAP result for the random dataset.

We can also use ggplot2 to draw UMAP plots ourselves.
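The embedding also depends on UMAP's tuning settings. As a minimal sketch, the umap package stores its defaults in the umap.defaults object, which can be copied and modified before being passed to umap() through the config argument. The specific values below are arbitrary choices for illustration, not recommendations.

```r
library(umap)

set.seed(123)
data <- matrix(rnorm(1000), ncol = 10)   # 100 observations, 10 features

# Start from the package defaults and change a few settings
custom <- umap.defaults
custom$n_neighbors <- 30     # larger values emphasise broader structure
custom$min_dist <- 0.3       # larger values spread points out more
custom$random_state <- 123   # fixed seed so the layout is reproducible

embedding <- umap(data, config = custom)
dim(embedding$layout)        # 100 rows, 2 columns
```

Rerunning with a few different values of n_neighbors and min_dist is a quick way to see which features of a plot are stable and which depend on the settings.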

Load and preprocess the data:

Use the Palmer Penguins dataset.

Remove rows with missing values and drop the year column.

Add a unique row ID.

Create a metadata data frame containing the categorical variables and the unique row ID.

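library(tidyverse)
library(palmerpenguins)
library(umap)
library(ggplot2)

# Drop rows with missing values, remove the year column, and add a unique row ID
penguins <- penguins %>% drop_na() %>% select(-year) %>% mutate(ID = row_number())

# Metadata: the unique row ID plus the categorical variables
penguins_meta <- penguins %>% select(ID, species, island, sex)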

Perform UMAP dimensionality reduction:

Select the numeric columns.

Standardize the data.

Use the umap() function to perform the dimensionality reduction.

Extract the UMAP components and merge the metadata:

Extract the UMAP components.

Merge the UMAP components with the metadata.

set.seed(142)
umap_fit <- penguins %>% select(where(is.numeric)) %>%
  column_to_rownames("ID") %>% scale() %>% umap()

umap_df <- umap_fit$layout %>% as.data.frame() %>%
  rename(UMAP1 = "V1", UMAP2 = "V2") %>% mutate(ID = row_number()) %>%
  inner_join(penguins_meta, by = "ID")

Draw the UMAP plot:

Use ggplot2 to draw a scatter plot.

Points are colored by species and shaped by sex.

umap_df %>% ggplot(aes(x = UMAP1, y = UMAP2, color = species, shape = sex)) +
 geom_point() +
 labs(x = "UMAP1", y = "UMAP2", subtitle = "UMAP plot")

UMAP result for the Palmer Penguins dataset.
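A plot alone does not show how faithfully local neighborhoods were kept. One rough check, sketched below, is to compare each point's nearest neighbors before and after the reduction. The helper knn_overlap is a hypothetical name introduced here for illustration, and the built-in iris data stands in for any numeric dataset. The function assumes k is at least 2 and uses Euclidean distances in both spaces.

```r
library(umap)

# Average fraction of each point's k nearest neighbours in the original
# space that are also among its k nearest neighbours in the embedding
knn_overlap <- function(high, low, k = 10) {
  nn <- function(m) {
    d <- as.matrix(dist(m))
    diag(d) <- Inf                       # a point is not its own neighbour
    t(apply(d, 1, function(r) order(r)[seq_len(k)]))
  }
  nn_high <- nn(high)
  nn_low <- nn(low)
  mean(sapply(seq_len(nrow(nn_high)), function(i) {
    length(intersect(nn_high[i, ], nn_low[i, ])) / k
  }))
}

set.seed(42)
x <- scale(iris[, 1:4])                  # standardize the four measurements
emb <- umap(x)
overlap <- knn_overlap(x, emb$layout, k = 10)
overlap                                  # between 0 (none kept) and 1 (all kept)
```

Values close to 1 suggest that small-scale groupings in the plot reflect the original data, while low values are a reason to read fine detail in the plot cautiously.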

UMAP offers a powerful and efficient approach to reducing the complexity of high-dimensional data while preserving both local and global structures. Its ability to handle large datasets and various data types makes it an invaluable tool for data visualization and analysis. By incorporating UMAP into your workflow, you can gain deeper insights into complex data patterns and relationships.

Stay tuned for our upcoming bioinformatics series, where CD Genomics will explore more advanced techniques and visualization tools that can further enhance your data analysis capabilities. Don't miss out on these insightful articles to elevate your understanding and skills in bioinformatics!

References

Sainburg, T., McInnes, L., & Gentner, T. Q. (2021). Parametric UMAP Embeddings for Representation and Semisupervised Learning. Neural Computation, 33(11), 2881–2907. https://doi.org/10.1162/neco_a_01434
Wang, Y., Song, W., Li, Y., Liu, Z., et al. (2023). Integrated analysis of tumor microenvironment features to establish a diagnostic model for papillary thyroid cancer using bulk and single-cell RNA sequencing technology. Journal of Cancer Research and Clinical Oncology, 149(18), 16837–16850. https://doi.org/10.1007/s00432-023-05420-8
