PCA Plot: The Principle and How to Draw It


This article begins by presenting the fundamental concepts and underlying principles of Principal Component Analysis (PCA). It then explains how to extract the principal components of a dataset by computing the covariance matrix and its eigenvalues and eigenvectors, ultimately achieving dimensionality reduction. Finally, it details how to perform PCA in R, with step-by-step instructions for generating both 2D and 3D PCA plots. These visualizations provide a clearer, more intuitive view of the data's structure and patterns.

What is Principal Component Analysis?

Principal Component Analysis (PCA) is a commonly used data analysis method. Through a linear transformation, PCA converts the original data into a set of linearly independent components, each capturing a different dimension of variation. It extracts the main feature components of the data and is often used to reduce the dimensionality of high-dimensional data.

In many fields of research and application, it is often necessary to observe multiple variables that characterize the objects of study and to collect large amounts of data for analysis. Large multivariate samples undoubtedly provide rich information, but they also increase the workload of data collection. More importantly, in most cases many of the variables are correlated with one another, which increases the complexity of the analysis. Analyzing each indicator separately yields isolated rather than comprehensive conclusions, while blindly discarding indicators loses a great deal of information and can easily lead to wrong conclusions.

It is therefore necessary to find a reasonable way to reduce the number of indicators to be analyzed while minimizing the loss of information contained in the original indicators, so that the collected data can still be analyzed comprehensively. Because the variables are correlated, a smaller number of composite indicators can summarize the information spread across them. PCA and factor analysis both belong to this class of dimensionality-reduction methods.

The Principle of PCA

PCA fundamentally transforms high-dimensional data into a reduced set of key features. This dimensional reduction process generates new orthogonal vectors, known as principal components, which capture essential data characteristics. These components represent a reconstruction of the original n-dimensional space using fewer dimensions while preserving critical information patterns. The methodology systematically identifies independent coordinate systems that optimally describe data variance. What distinguishes this approach is how the newly constructed axes emerge directly from inherent data properties and relationships.

As the figure below shows, the original dimensions (Gene) have been projected onto two orthogonal features, which makes the results much easier to observe. Note that the newly constructed features are built from the original features and retain their main information.

The schematic diagram of PCA (Chen, L., et al., 2024).

Through dimensional transformation, PCA identifies and extracts key data features, converting them into a new representational framework. This technique effectively maps multidimensional information onto an alternative coordinate system. The resulting space consists of orthogonally arranged principal components, with the dimensionality reduction achieved through systematic projection methods. These components represent the fundamental patterns within the original dataset while maintaining essential variability.

The question, then, is how to extract the principal components and how to measure the information retained after projection. The PCA algorithm uses variance to measure the amount of information: to ensure that the low-dimensional data retain as much of the original information as possible, the projected data should be as dispersed as possible, that is, the retained variance should be as large as possible. So how do we obtain the components that carry the greatest variance? In practice, we compute the covariance matrix of the data matrix, obtain its eigenvalues and eigenvectors, select the eigenvectors corresponding to the k largest eigenvalues to form a projection matrix, and project the original data onto the new k-dimensional feature space. This accomplishes the dimensionality reduction.

The calculation principle of PCA.
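
To make these steps concrete, here is a minimal base-R sketch of the eigen-decomposition route just described, applied to the iris dataset used later in this article (the names X, C, eig, W, and k are illustrative, not part of any package API):

# Manual PCA: covariance matrix -> eigenvalues/eigenvectors -> projection
X <- scale(iris[, 1:4])        # center and scale the original variables
C <- cov(X)                    # covariance matrix of the standardized data
eig <- eigen(C)                # eigenvalues and eigenvectors, sorted by eigenvalue
k <- 2                         # keep the k components with the largest eigenvalues
W <- eig$vectors[, 1:k]        # projection matrix built from the top-k eigenvectors
scores <- X %*% W              # project the data onto the new k-dimensional space
eig$values / sum(eig$values)   # proportion of total variance carried by each component

For iris, the first two components capture most of the total variance, which is why the 2D plots drawn later in this article already summarize the data well.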

How to Draw a PCA Plot in R

1. Install and load necessary packages
First, install and load some commonly used packages: ggplot2, factoextra, scatterplot3d, rgl, and plotly.

install.packages("ggplot2")
install.packages("factoextra")
install.packages("scatterplot3d")
install.packages("rgl")
install.packages("plotly")

library(ggplot2)
library(factoextra)
library(scatterplot3d)
library(rgl)
library(plotly)

2. Perform PCA
Perform PCA with the prcomp() function, taking the built-in iris dataset as an example.

data(iris)                                       # load the built-in iris dataset
iris_pca <- prcomp(iris[, 1:4], scale. = TRUE)   # PCA on the four numeric columns, standardized
summary(iris_pca)                                # variance explained by each component

The iris_pca results.
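
Since factoextra is already loaded, the proportion of variance explained by each component can also be visualized as a scree plot; a minimal sketch using fviz_eig():

# Scree plot: percentage of variance explained by each principal component
fviz_eig(iris_pca, addlabels = TRUE)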

3. Draw a 2D PCA plot using ggplot2

# Combine the PC scores with the species labels for plotting
iris_pca_df <- data.frame(iris_pca$x, Species = iris$Species)

ggplot(iris_pca_df, aes(x = PC1, y = PC2, color = Species)) +
  geom_point() +
  stat_ellipse(level = 0.95, show.legend = TRUE) +
  annotate('text', label = 'setosa', x = -2, y = -1.25, size = 5, colour = 'red') +
  annotate('text', label = 'versicolor', x = 0, y = -0.5, size = 5, colour = '#00ba38') +
  annotate('text', label = 'virginica', x = 3, y = 0.5, size = 5, colour = '#619cff')

The iris_pca 2D plot result.
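
As an alternative to assembling the plot by hand, the factoextra package loaded in step 1 produces a comparable 2D score plot in a single call; a minimal sketch (the grouping and ellipse options shown here are one reasonable choice, not the only one):

# One-call alternative to the ggplot2 code above
fviz_pca_ind(iris_pca,
             habillage = iris$Species,  # color points by species
             addEllipses = TRUE,        # add a concentration ellipse per group
             label = "none")            # suppress individual point labels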

4. Draw a 3D PCA plot

scores <- as.data.frame(iris_pca$x)   # PC scores for every sample

# Static 3D scatter plot of the first three components
scatterplot3d(scores$PC1, scores$PC2, scores$PC3,
              color = as.numeric(iris$Species), pch = 16,
              main = "3D PCA Plot", xlab = "PC1", ylab = "PC2", zlab = "PC3")

# Match the legend colors to the species coding used above
legend("topright", legend = levels(iris$Species),
       col = 1:length(levels(iris$Species)), pch = 16)

The iris_pca 3D plot result.
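
The plotly package loaded in step 1 can also render the same scores interactively; here is a minimal sketch that reuses the scores data frame from above (the marker size is an arbitrary choice):

# Interactive 3D PCA plot: drag to rotate, scroll to zoom in the viewer
scores$Species <- iris$Species   # attach group labels for coloring
plot_ly(scores,
        x = ~PC1, y = ~PC2, z = ~PC3,
        color = ~Species,
        type = "scatter3d", mode = "markers",
        marker = list(size = 4))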

In conclusion, PCA is a powerful tool for simplifying complex datasets by reducing dimensionality while preserving essential information. Through the process of dimensionality reduction, PCA helps reveal hidden patterns and structures in the data, making it easier to analyze and interpret. In this article, we've walked through the fundamental concepts of PCA, its practical implementation, and how to visualize the results with both 2D and 3D plots using R.

Stay tuned for our upcoming articles, where CD Genomics will continue to explore various plotting techniques to enhance your data visualization skills. Don't miss the next installment in the series. Whether you're a beginner or an experienced analyst, we'll help you master the tools that can unlock new insights from your data.

Reference

  1. Chen, L., Li, T., Chen, Y., Chen, X., Wozniak, M., Xiong, N., & Liang, W. (2024). Design and analysis of quantum machine learning: a survey. Connection Science, 36(1). https://doi.org/10.1080/09540091.2024.2312121