From Start to End: Single Cell RNA Sequencing Data Analysis

Developments and innovations in sequencing technology have made it possible to measure gene expression in thousands of cells in a single experiment. Single-cell RNA sequencing (scRNA-seq) methods have revolutionized genomics by creating unprecedented opportunities to address cellular heterogeneity through gene expression profiling at single-cell resolution. The rapidly growing field of scRNA-seq also places new demands on data analysis tools. Unlike bulk RNA sequencing, scRNA-seq requires comprehensive computational tools that can handle high data complexity and keep pace with emerging single-cell-specific challenges.

Laboratory Workflow of Single Cell RNA Sequencing

Currently, scRNA-seq laboratory methods rely on six main steps: (1) preparation of viable single-cell suspensions; (2) assessment of cell viability; (3) removal of lysed cells; (4) individual transcriptome barcoding; (5) cDNA generation; and (6) sequencing library generation. For sequencing, Illumina platforms are commonly used because they balance cost-effectiveness with high-quality output. BGI sequencing platforms are also used in single-cell studies.

Workflow of single-cell RNA sequencing analysis. (Fresia R et al., 2021)

Workflow of Single-Cell RNA Sequencing Data Analysis

Unlike earlier bulk transcriptome analyses, scRNA-seq requires innovative analytical tools to address single-cell-specific challenges, including large data volumes and high levels of technical noise caused by dropout events. A standardized workflow guides the analyst through the key steps of scRNA-seq data analysis, regardless of the specific tools and biological data types involved.

scRNA-seq analysis consists of seven basic steps: raw data pre-processing, filtering by QC covariates, normalization, feature selection, linear dimensionality reduction, visualization, and clustering.

Workflow for various steps in scRNA-seq data analysis. (Malhotra A et al., 2022)

Quality control and cell filtering

The limitations of scRNA-seq are mainly related to low capture efficiency, which increases the level of technical noise. Cell barcodes that do not correspond to viable cells must therefore be filtered out before downstream analysis. Such barcodes are typically identified as outliers in the distributions of QC covariates and removed by thresholding. This step is common to all scRNA-seq pipelines and is based on three QC covariates: (1) the number of genes detected per cell barcode; (2) the fraction of mitochondrial reads per barcode, which is used to identify dying cells; and (3) the number of unique molecular identifiers (UMIs) per barcode (i.e., the count depth).
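The thresholding described above can be sketched in plain Python. This is a minimal illustration on a toy count matrix, not a recommended pipeline: the gene names, the "MT-" prefix convention for mitochondrial genes, and the three cutoff values are all assumptions chosen for the example, and real cutoffs should be set by inspecting the QC distributions of each dataset.

```python
# Toy count matrix: one row per cell barcode, one column per gene.
# Gene names and cutoffs below are illustrative assumptions only.
genes = ["MT-CO1", "MT-ND1", "ACTB", "GAPDH", "CD3E"]
counts = {
    "cell_A": [2, 1, 30, 25, 12],  # plausible viable cell
    "cell_B": [40, 35, 3, 2, 0],   # high mitochondrial fraction (dying cell)
    "cell_C": [0, 0, 1, 0, 0],     # nearly empty barcode
}


def qc_metrics(row):
    """Return (genes detected, total count depth, mitochondrial fraction)."""
    n_genes = sum(1 for c in row if c > 0)
    total = sum(row)
    mito = sum(c for g, c in zip(genes, row) if g.startswith("MT-"))
    mito_frac = mito / total if total else 0.0
    return n_genes, total, mito_frac


# Hypothetical thresholds for the three QC covariates.
MIN_GENES, MIN_COUNTS, MAX_MITO_FRAC = 3, 10, 0.2

kept = []
for barcode, row in counts.items():
    n_genes, total, mito_frac = qc_metrics(row)
    if n_genes >= MIN_GENES and total >= MIN_COUNTS and mito_frac <= MAX_MITO_FRAC:
        kept.append(barcode)

print(kept)
```

In this toy example only cell_A passes all three filters; cell_B fails on the mitochondrial fraction, and cell_C fails on the number of detected genes and on count depth.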

Gene filtering to remove noise

scRNA-seq datasets typically include more than 25,000 genes measured in thousands of cells. Many of these genes are uninformative because they contain mostly zero counts, and they should be filtered out before downstream analysis. Gene filtering speeds up data processing by reducing dimensionality and the number of excess zeros, which benefits the normalization step and all downstream analyses. Usually, a fixed threshold is defined so that genes detected in fewer than a minimum number of cells are removed.
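A fixed-threshold gene filter of this kind might look as follows. The matrix, the gene names, and the MIN_CELLS value are hypothetical and serve only to show the mechanics.

```python
# Rows are cells, columns are genes (toy values).
genes = ["geneA", "geneB", "geneC", "geneD"]
counts = [
    [5, 0, 0, 2],
    [3, 0, 1, 0],
    [4, 0, 0, 7],
]

MIN_CELLS = 2  # hypothetical cutoff: keep genes detected in at least 2 cells

# Number of cells in which each gene has a non-zero count.
cells_detected = [
    sum(1 for row in counts if row[j] > 0) for j in range(len(genes))
]

# Column indices of the genes that pass the filter.
keep = [j for j, n in enumerate(cells_detected) if n >= MIN_CELLS]

filtered_genes = [genes[j] for j in keep]
filtered_counts = [[row[j] for j in keep] for row in counts]

print(filtered_genes)
```

Here geneB (detected in no cell) and geneC (detected in one cell) are dropped, and the matrix shrinks from four columns to two.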

Data normalization

Data normalization removes unwanted bias caused by variability in count depth while preserving true biological differences. Normalization rescales the expression of each gene so that expression is comparable between individual cells, taking into account the total number of mRNA molecules captured from each cell. A commonly used method for normalizing scRNA-seq data is counts per million (CPM), also known as reads per million (RPM), a linear global scaling method inherited from bulk RNA-seq.
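As a rough sketch of global scaling, CPM divides each count by the cell's total count and multiplies by one million. The log transform shown afterwards is a common follow-up step but is an addition for illustration, not something prescribed by the text above; the function names and toy cells are likewise assumptions.

```python
import math


def cpm(row, scale=1_000_000):
    """Scale one cell's counts so that they sum to `scale`."""
    total = sum(row)
    if total == 0:
        return [0.0 for _ in row]
    return [c / total * scale for c in row]


def log1p_cpm(row):
    """Natural log of (1 + CPM), often applied to compress the dynamic range."""
    return [math.log1p(v) for v in cpm(row)]


# Two toy cells with the same expression proportions but a 10x depth difference.
cell_shallow = [1, 3, 6]     # 10 counts in total
cell_deep = [10, 30, 60]     # 100 counts in total

print(cpm(cell_shallow))
print(cpm(cell_deep))
```

After scaling, the two cells have essentially identical profiles, which is the point of the method: differences that come only from count depth are removed.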

Please read our article RPM, RPKM, FPKM, and TPM: Normalization Methods in RNA Sequencing to learn more about normalization methods of RNA-Seq.

Feature selection

Feature selection is designed to detect biologically relevant genes while excluding uninformative ones. The dimensionality of scRNA-seq data can remain quite high, with a large number of genes (>10,000) left even after gene filtering. Feature selection can greatly speed up processing because it reduces the data dimensionality further. This is usually achieved by selecting a limited number of highly variable genes (HVGs) to guide further analysis.
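One simple way to illustrate HVG selection is to rank genes by dispersion (variance divided by mean) and keep the top few. This is a deliberately simplified sketch: the toy values, the dispersion score, and N_TOP are assumptions, and established tools typically model the mean-variance relationship more carefully than this.

```python
from statistics import mean, pvariance

# Normalized expression of four genes across five cells (toy values).
expr = {
    "geneA": [1.0, 1.1, 0.9, 1.0, 1.0],  # stable expression
    "geneB": [0.0, 4.0, 0.1, 3.9, 0.0],  # strongly variable
    "geneC": [2.0, 2.0, 2.1, 1.9, 2.0],  # stable expression
    "geneD": [0.0, 0.0, 3.0, 0.0, 2.5],  # variable
}


def dispersion(values):
    """Variance-to-mean ratio; 0 for genes with no expression."""
    m = mean(values)
    return pvariance(values) / m if m > 0 else 0.0


N_TOP = 2  # hypothetical number of highly variable genes to keep

ranked = sorted(expr, key=lambda g: dispersion(expr[g]), reverse=True)
hvgs = ranked[:N_TOP]

print(hvgs)
```

With these values geneB and geneD are selected, while the two stably expressed genes are excluded from downstream analysis.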

Dimensionality reduction and visualization of scRNA-seq data

Dimensionality reduction aims to compress the data into a lower-dimensional space while retaining as much of its key structure as possible. Dimensionality reduction methods are essential for clustering, visualization, and summarization of scRNA-seq data.

Linear dimensionality reduction methods are often used as a pre-processing step for nonlinear dimensionality reduction methods. The most popular linear dimensionality reduction algorithm is principal component analysis (PCA). Nonlinear dimensionality reduction methods are powerful tools for data visualization rather than summarization; the two most commonly used are t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP).
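To make the linear step concrete, the sketch below finds the first principal component of a tiny matrix by power iteration on the covariance matrix and projects the cells onto it. Power iteration is used here only because it fits in a few lines of standard-library Python; it is an assumption of this example, not the method any particular package uses, and the nonlinear methods (t-SNE, UMAP) are not shown.

```python
import math


def center(data):
    """Subtract the per-gene (column) mean from every cell (row)."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]


def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_pc(cov, iterations=200):
    """Approximate the leading eigenvector of `cov` by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Four cells x three genes (toy values); genes 1 and 2 vary together.
data = [
    [2.0, 1.9, 0.5],
    [4.0, 4.1, 0.4],
    [6.0, 6.2, 0.6],
    [8.0, 7.8, 0.5],
]

centered = center(data)
pc1 = first_pc(covariance(centered))

# Coordinate of each cell along the first principal component.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]

print(pc1)
print(scores)
```

The first component loads almost equally on the two co-varying genes and barely on the third, so a single coordinate per cell captures most of the variation in this toy dataset.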

Cluster analysis to identify cellular subpopulations

A key goal of scRNA-seq is to identify cellular subpopulations based on their transcriptional similarity. Clustering aims to determine the intrinsic grouping of a set of unlabeled objects from their pairwise similarity scores (i.e., distances).

Several unsupervised clustering methods have been applied to delineate single-cell data. They fall into three groups: (1) k-means, (2) hierarchical clustering, and (3) community detection methods. For single-cell data analysis, all of these methods are applied after feature selection and dimensionality reduction, in the reduced principal component (PC) space. The identified cell clusters are then superimposed on the visualization space.
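Of the three groups, k-means is the easiest to sketch. The example below runs a basic k-means on hypothetical two-dimensional PC coordinates; the points, the choice of k = 2, and the fixed initial centroids are assumptions made so that the result is deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
        for p in points
    ]


def kmeans(points, init_centroids, iterations=10):
    """Basic k-means: alternate assignment and centroid update."""
    centroids = [list(c) for c in init_centroids]  # copy, do not mutate input
    for _ in range(iterations):
        labels = assign(points, centroids)
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    return assign(points, centroids), centroids


# Hypothetical coordinates of six cells in the first two PCs.
pc_coords = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
]

labels, centroids = kmeans(pc_coords, init_centroids=[pc_coords[0], pc_coords[3]])

print(labels)
```

The six cells separate into two groups of three. In practice the number of clusters is not known in advance, which is one reason community detection on a cell neighborhood graph is a popular alternative.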

Differential expression and annotation of cell clusters

Identified cell populations are characterized and annotated by comparing each cluster with all other cells to identify marker genes. Several differential expression tests have been developed specifically to deal with dropout events (excess zeros) in scRNA-seq data, including Bayesian methods and MAST.
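The one-versus-rest comparison can be illustrated with a plain log2 fold change between a cluster and all remaining cells. This is only an effect-size ranking on toy data, not a statistical test, and it is not how MAST or the Bayesian methods work; the gene names, cluster labels, pseudocount, and cutoff are all assumptions for the example.

```python
import math
from statistics import mean

# Normalized expression of three genes across six cells (toy values).
expr = {
    "geneA": [5.0, 6.0, 5.5, 0.5, 0.4, 0.6],
    "geneB": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "geneC": [0.2, 0.1, 0.3, 4.0, 4.2, 3.8],
}
cluster_labels = [0, 0, 0, 1, 1, 1]  # hypothetical cluster of each cell


def log2_fold_change(values, labels, cluster, pseudocount=1.0):
    """log2 ratio of mean expression inside a cluster versus all other cells."""
    inside = [v for v, lab in zip(values, labels) if lab == cluster]
    outside = [v for v, lab in zip(values, labels) if lab != cluster]
    return math.log2((mean(inside) + pseudocount) / (mean(outside) + pseudocount))


def markers(cluster, min_lfc=1.0):
    """Genes whose fold change in `cluster` exceeds a hypothetical cutoff."""
    return [
        g for g, values in expr.items()
        if log2_fold_change(values, cluster_labels, cluster) >= min_lfc
    ]


print(markers(0))
print(markers(1))
```

In this example geneA is flagged as a marker of cluster 0 and geneC of cluster 1, while the uniformly expressed geneB is a marker of neither. A real analysis would add a test that accounts for the excess zeros and correct for multiple testing.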

Conclusions

Single-cell sequencing is no longer limited to transcriptome experiments; it can also capture other data types, including DNA, ChIP, and ATAC. Future bioinformatics analysis pipelines should therefore be able to cope with multi-omics data integration. Obtaining information simultaneously at all levels of a living cell, including DNA, RNA, protein, and epigenetic modifications, would allow a more comprehensive understanding of cell fate regulation and phenotype.

In addition, high-throughput scRNA-seq technology has been used in recent years for personalized medicine, screening previously unexamined cell types and tissues in order to tailor appropriate drugs to patient characteristics. The development of single-cell sequencing data analysis tools combined with machine learning can therefore drive progress in the field of precision medicine.

References

Fresia R, Marangoni P, Burstyn-Cohen T, et al. From bite to byte: Dental structures resolved at a single-cell resolution[J]. Journal of dental research, 2021, 100(9): 897-905.
Malhotra A, Das S, Rai S N. Analysis of Single-Cell RNA-Sequencing Data: A Step-by-Step Guide[J]. BioMedInformatics, 2022, 2(1): 43-61.
Slovin S, Carissimo A, Panariello F, et al. Single-cell RNA sequencing analysis: a step-by-step overview[J]. RNA Bioinformatics, 2021: 343-365.
