Bioinformatics for Small RNA analysis: Introduction, Workflow, and Data Processing


Small RNA sequencing enables quantification of the expression of different non-coding small RNAs in the cells of a sample. Small RNAs are important regulators of gene expression. For example, microRNA (miRNA), one of the most abundant classes of small RNA in cells, reduces the expression of a target RNA by binding to a target site on it. miRNAs are produced by consecutive cleavage by two enzymes, Drosha and Dicer. In the final step, one strand of the mature miRNA duplex is loaded into the microRNA-induced silencing complex (miRISC), which facilitates binding of the miRNA to the target sequence.

Next-generation sequencing (NGS) platforms have made high-throughput sequencing of small RNA feasible, enabling the study of miRNA expression, identification, and characterization. Small RNA sequencing involves total RNA extraction, fractionation, and size selection. Adapters are then ligated to the selected fragments, which are amplified by reverse transcription-polymerase chain reaction (RT-PCR). The adapter-ligated amplicons are then sequenced on an NGS instrument; because of the earlier size selection, the sequenced library is enriched for small RNAs.

Bioinformatics Basics of small RNA Analysis

Small RNA data analysis typically requires familiarity with the Linux command-line environment, which can be hard and time-consuming to acquire. Web-based tools such as miRanalyzer and DSAP are continuously being developed to make analysis and interpretation easier, with features such as file compression. The general workflow of small RNA data analysis is illustrated in the figure below.

Figure 1. General workflow for small RNA analysis (Mehta, 2014)

The sequencer produces its data as a file in FASTQ format, which is plain text and can be opened in any text editor. FASTQ is similar to the FASTA format, but it also records the quality of each sequenced base. Each record contains the sequence ID, the raw sequence output (which may include adapter sequence at its 3' end), and the quality value of each base encoded as an ASCII character. Adapter sequences should be removed before further analysis. Quality control is a vital, recurring step in all sequencing workflows: the total number of reads, the alignment percentage, and the number of reads matching known small RNAs should be considered. Examining the frequency of “N” (“no base call”) and the per-base sequence quality allows one to filter out reads with missing or unreliable base calls. Certain sequences may be overrepresented, which can indicate contamination from adapter dimers; this is sometimes unavoidable. Software such as FASTX-Toolkit can be used to remove adapters and reduce contamination.
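The record layout and the filtering steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a substitute for a dedicated tool such as FASTX-Toolkit. It assumes four-line FASTQ records, Phred+33 quality encoding, and exact (mismatch-free) adapter matching. The function names, the thresholds (`min_len`, `max_n`, `min_mean_q`, `min_overlap`), and the adapter passed in by the caller are all hypothetical choices made for this example.

```python
from statistics import mean


def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from FASTQ lines.

    Assumes each record is exactly four lines: '@id', sequence, '+', quality.
    """
    records = [ln.strip() for ln in lines if ln.strip()]
    for i in range(0, len(records) - 3, 4):
        header, seq, _plus, qual = records[i:i + 4]
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode an ASCII quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


def trim_adapter(seq, qual, adapter, min_overlap=8):
    """Remove a 3' adapter by exact matching.

    First look for the full adapter anywhere in the read; otherwise look for
    a prefix of the adapter (at least min_overlap bases) at the end of the read.
    """
    cut = seq.find(adapter)
    if cut == -1:
        for k in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                cut = len(seq) - k
                break
    if cut == -1:
        return seq, qual
    return seq[:cut], qual[:cut]


def passes_qc(seq, qual, min_len=15, max_n=0, min_mean_q=20):
    """Reject reads that are too short, contain 'N' calls, or have low mean quality."""
    if len(seq) < min_len or seq.count("N") > max_n:
        return False
    return mean(phred_scores(qual)) >= min_mean_q


def clean_reads(lines, adapter):
    """Trim the adapter from every read, then keep only reads that pass QC."""
    for read_id, seq, qual in parse_fastq(lines):
        seq, qual = trim_adapter(seq, qual, adapter)
        if passes_qc(seq, qual):
            yield read_id, seq
```

With this sketch, a read consisting only of adapter sequence (an adapter dimer) is trimmed to length zero and discarded by the length check, which mirrors the contamination filtering mentioned above. Real adapter trimmers also tolerate sequencing errors in the adapter, which this exact-match version does not.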

Output reads are then aligned to a reference sequence with an aligner such as Bowtie or BWA. The number of reads overlapping a small RNA is taken as a measure of its expression. Since different samples produce different total numbers of sequences, counts are normalized, for example as “reads per million” or “reads per million mapped.” For miRNA studies, a preferred approach is to align reads against known small RNA sequences. Prediction algorithms are applied to the sequences to identify novel miRNAs; these algorithms take cleavage sites and folding patterns into consideration. Most studies identify differential expression of miRNAs among samples subjected to different environmental parameters, which can be assessed from read counts with software such as DESeq. Most miRNA profiling studies aim to dissect the mechanism by which miRNAs affect cellular physiology, motivated by the ability of miRNAs to silence or reduce the expression of target sequences. Differential binding, in terms of the frequency of binding sites, binding strength, miRNA polymorphisms, and mismatches, is examined through sequencing and pattern-recognition algorithms.
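As a rough illustration of counting and “reads per million” normalization, the sketch below assigns reads to known small RNAs and scales the counts by library size. It deliberately swaps real alignment (Bowtie, BWA) for exact sequence lookup to stay short, so it is a simplification rather than the procedure any particular pipeline uses. The function names and the miRNA labels in the usage note are hypothetical.

```python
from collections import Counter


def count_known(reads, known):
    """Count reads whose sequence exactly matches a known small RNA.

    `known` maps sequence -> name. Exact lookup stands in for alignment here;
    reads with no exact match are ignored.
    """
    counts = Counter(known[seq] for seq in reads if seq in known)
    return dict(counts)


def reads_per_million(counts, total_reads=None):
    """Scale raw counts to reads per million.

    If total_reads is omitted, the sum of the supplied counts is used as the
    library size (i.e. "reads per million mapped" under this sketch's assumptions).
    """
    total = sum(counts.values()) if total_reads is None else total_reads
    if total <= 0:
        raise ValueError("total read count must be positive")
    return {name: n * 1_000_000 / total for name, n in counts.items()}
```

For example, 500 reads of a hypothetical `miR-A` in a library of 2,000,000 reads and 1,000 reads of the same miRNA in a library of 4,000,000 reads both normalize to 250 reads per million, which is why normalized values, not raw counts, are compared across samples.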

About CD Genomics Bioinformatics Analysis

The bioinformatics analysis department of CD Genomics provides novel solutions for data-driven innovation aimed at discovering the hidden potential in biological data, tapping new insights related to life science research, and predicting new prospects.


  1. Mehta JP. Sequencing small RNA: introduction and data analysis fundamentals. In: RNA Mapping. 2014 (pp. 93-103). Humana Press, New York, NY.
  2. Raabe CA, Tang TH, Brosius J, Rozhdestvensky TS. Biases in small RNA deep sequencing data. Nucleic Acids Research. 2014, 42(3).
* For Research Use Only. Not for use in diagnostic procedures.