UMAP (Uniform Manifold Approximation and Projection) is a technique for reducing the dimensionality of data by mapping high-dimensional data into a lower-dimensional space. This method is commonly used for data visualization and simplifying complex datasets for further analysis. In this article, we will delve into feature dimensionality reduction, its importance, and the core principles behind UMAP. We will compare UMAP with other dimensionality reduction methods like PCA and t-SNE, discussing its advantages. Furthermore, the article will offer practical coding examples to showcase how UMAP can be used for reducing dimensions and visualizing data, helping to uncover underlying patterns and structures in the data.
UMAP is a nonlinear dimensionality reduction technique for analyzing and visualizing high-dimensional data. By preserving the relationships between neighboring points, it builds a low-dimensional representation that retains the important structure of the data. Unlike linear methods such as PCA, UMAP can capture nonlinear patterns in complex datasets. UMAP and t-SNE serve similar purposes, but UMAP is typically faster, scales better to large datasets, and tends to preserve more of the global structure. t-SNE remains useful in specific cases, particularly when local structure is the main concern. UMAP can be seen as building on ideas from t-SNE, with a different mathematical foundation that improves computational efficiency and scalability. The choice among these dimensionality reduction methods should be guided by the requirements of your analysis.
UMAP rests on three assumptions about the data: that it is uniformly distributed on a Riemannian manifold, that the Riemannian metric is locally constant (or can be approximated as such), and that the manifold is locally connected. Under these assumptions, the algorithm aims to preserve both fine-grained and large-scale relationships in the high-dimensional data by building a graph that captures its connectivity. The implementation proceeds in two main stages: first, a weighted k-nearest-neighbor graph is constructed in the original space; second, the nodes of that graph are embedded in a lower-dimensional space, with the layout optimized by stochastic gradient descent so that it matches the graph as closely as possible. The theoretical framework draws on Riemannian geometry and algebraic topology.
Overview of UMAP (Sainburg, T., et al., 2021).
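To make the graph-construction idea more concrete, here is a minimal base R sketch of a k-nearest-neighbor graph on toy data. It is only an illustration and not the internal implementation of the umap package: the toy matrix `x`, the choice `k = 5`, and all variable names are assumptions made for this example. Real UMAP goes further by converting these distances into fuzzy edge weights and then optimizing a low-dimensional layout.

```r
# Toy data: 50 points with 4 features (illustrative values only)
set.seed(1)
x <- matrix(rnorm(200), ncol = 4)
k <- 5  # number of neighbors; plays the role of UMAP's n_neighbors

# Pairwise Euclidean distances; exclude each point from its own neighbor list
d <- as.matrix(dist(x))
diag(d) <- Inf

# For every point, the indices and distances of its k nearest neighbors
knn_idx  <- t(apply(d, 1, function(row) order(row)[1:k]))
knn_dist <- t(apply(d, 1, function(row) sort(row)[1:k]))

# Edge list of the directed k-NN graph: one row per (point, neighbor) pair
edges <- data.frame(
  from = rep(seq_len(nrow(x)), times = k),
  to   = as.vector(knn_idx),
  dist = as.vector(knn_dist)
)
head(edges)
```

A small `k` makes the graph describe only very local neighborhoods, while a larger `k` links each point to a broader region; this is the same trade-off that the n_neighbors setting controls in UMAP.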
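The main advantages of UMAP are as follows:
Handles very large datasets and generates embeddings in a relatively short time.
Preserves local structure well and retains much of the global structure of the original data, which enables more informative visualizations and can improve performance in classification, clustering, and other downstream analyses.
Supports a wide range of data types, including numerical, categorical, and mixed data, provided a suitable distance metric is chosen.
Makes no assumption about how the features are distributed. However, with distance-based metrics such as Euclidean distance, features on very different scales can dominate the result, so standardizing the data first is usually advisable (as in the penguin example later in this article).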
A UMAP plot is usually a two-dimensional or three-dimensional scatter plot in which each point represents one sample from the original data. The key elements for interpreting a UMAP plot are as follows:
(1) Position of the points
Close points: Represent samples that are similar in the high-dimensional space. For example, in single-cell RNA sequencing, adjacent points often belong to the same cell type.
Distant points: Represent samples that differ substantially in the high-dimensional space. For example, different cell types usually occupy different regions of the plot. Note, however, that UMAP does not guarantee that distances between well-separated clusters are preserved, so the exact spacing between clusters should be interpreted with caution.
(2) Color and marking
Color coding: Colors are commonly used in UMAP plots to represent category labels (such as cell types or sample conditions). For example, points of different colors may represent different cell types.
UMAP of tumor cells and thyrocytes (Wang, Y., et al., 2023).
(3) Density and distribution
Density of points: High-density areas may indicate that samples of a certain category, or with certain features, are concentrated there. For example, in single-cell RNA sequencing, a dense region may correspond to a cluster of a specific cell type.
Distribution shapes: Certain shapes (such as rings or chains) may reveal the inherent structure of the data. For example, a chain-like arrangement can indicate a sequential relationship between data points.
(4) Global and local structure
Global structure: A UMAP plot can show the overall distribution pattern of the data, for example how major groups of samples are arranged relative to one another.
Local structure: The arrangement of points within a local area can reveal finer details. For example, in single-cell data, local areas may correspond to subpopulations of a specific cell type.
So, how do we implement UMAP in R? Here we present two examples, starting with simple random data.
First, we need to install and load the umap package:
install.packages("umap")
library(umap)
Next, we generate a random dataset containing 100 observations and 10 features.
set.seed(123)
# 1000 random normal values arranged as 100 rows (observations) x 10 columns (features)
data <- matrix(rnorm(1000), ncol = 10)
We then use the umap() function to reduce the data to two-dimensional space and visualize the results.
# Reduce to two dimensions; the coordinates are stored in embedding$layout
embedding <- umap(data, n_components = 2)
plot(embedding$layout[, 1], embedding$layout[, 2], pch = 20)
The random dataset UMAP result.
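The appearance of a UMAP plot depends strongly on a few settings, notably the neighborhood size and the minimum distance between embedded points. The sketch below shows one way to adjust them, assuming the umap package's `umap.defaults` configuration object; the specific values (10, 0.3, and the seed) are arbitrary choices for illustration, not recommendations.

```r
library(umap)

set.seed(123)
data <- matrix(rnorm(1000), ncol = 10)  # same toy data as above

# Start from the package defaults and override selected settings
config <- umap.defaults
config$n_neighbors  <- 10    # smaller values emphasize local structure
config$min_dist     <- 0.3   # larger values spread points out more
config$random_state <- 123   # fix the seed so the layout is reproducible

embedding_tuned <- umap(data, config = config)

plot(embedding_tuned$layout[, 1], embedding_tuned$layout[, 2],
     pch = 20, xlab = "UMAP1", ylab = "UMAP2",
     main = "n_neighbors = 10, min_dist = 0.3")
```

Because this dataset is pure random noise, no setting should reveal meaningful clusters; any apparent clumps are artifacts of the embedding rather than structure in the data.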
In addition, we can use ggplot2 to draw the UMAP plot ourselves.
Load data and preprocess:
Use the Palmer Penguins dataset.
Remove rows with missing values and drop the year column.
Add a unique row ID.
Create a metadata data frame: Contains all categorical variables and the unique row ID.
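library(tidyverse)       # loads dplyr, tidyr, tibble, and ggplot2
library(palmerpenguins)
library(umap)

# Drop incomplete rows, remove the year column, and add a unique row ID
penguins <- penguins %>%
  drop_na() %>%
  select(-year) %>%
  mutate(ID = row_number())

# Metadata: the categorical variables plus the row ID used for joining later
penguins_meta <- penguins %>%
  select(ID, species, island, sex)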
Perform UMAP dimensionality reduction:
Select the numeric columns.
Standardize the data.
Use the umap() function to compute the embedding.
Extract UMAP components and merge metadata:
Extract the UMAP components.
Merge the UMAP components with the metadata.
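set.seed(142)

# Numeric columns only; ID becomes the row names so it is not used as a feature
umap_fit <- penguins %>%
  select(where(is.numeric)) %>%
  column_to_rownames("ID") %>%
  scale() %>%
  umap()

# Name the two UMAP coordinates and join them back to the metadata by ID
umap_df <- umap_fit$layout %>%
  as.data.frame() %>%
  rename(UMAP1 = "V1", UMAP2 = "V2") %>%
  mutate(ID = row_number()) %>%
  inner_join(penguins_meta, by = "ID")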
Draw the UMAP plot:
Use ggplot2 to draw a scatter plot.
Points are colored by species and shaped by sex.
umap_df %>%
  ggplot(aes(x = UMAP1, y = UMAP2, color = species, shape = sex)) +
  geom_point() +
  labs(x = "UMAP1", y = "UMAP2", subtitle = "UMAP plot")
UMAP result for the Palmer Penguins dataset.
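For comparison with the linear method mentioned earlier, the same four penguin measurements can be projected with PCA. This is a minimal sketch using base R's prcomp(); the object names (`peng`, `num_cols`, `pca_fit`) are chosen for this example, and the plot is meant only as a side-by-side reference for the UMAP result, not as a benchmark.

```r
library(palmerpenguins)

# Keep complete rows and use the four numeric measurements
peng <- na.omit(penguins)
num_cols <- c("bill_length_mm", "bill_depth_mm",
              "flipper_length_mm", "body_mass_g")

# PCA on centered, unit-variance features (same standardization as for UMAP)
pca_fit <- prcomp(peng[, num_cols], center = TRUE, scale. = TRUE)

# Share of the total variance captured by the first two components
summary(pca_fit)$importance["Proportion of Variance", 1:2]

# Scatter plot of the first two principal components, colored by species
plot(pca_fit$x[, 1], pca_fit$x[, 2],
     col = as.integer(peng$species), pch = 20,
     xlab = "PC1", ylab = "PC2", main = "PCA of Palmer Penguins")
legend("topright", legend = levels(peng$species),
       col = seq_along(levels(peng$species)), pch = 20)
```

Unlike UMAP, the PCA axes have a direct interpretation as linear combinations of the original features, which can make the two views complementary when exploring a dataset.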
UMAP offers a powerful and efficient approach to reducing the complexity of high-dimensional data while preserving both local and global structures. Its ability to handle large datasets and various data types makes it an invaluable tool for data visualization and analysis. By incorporating UMAP into your workflow, you can gain deeper insights into complex data patterns and relationships.
Stay tuned for our upcoming bioinformatics series, where CD Genomics will explore more advanced techniques and visualization tools that can further enhance your data analysis capabilities. Don't miss out on these insightful articles to elevate your understanding and skills in bioinformatics!
References: