CD Genomics is a bioinformatics data analysis provider. Our team is experienced in Ribo-seq data analysis, and our data analysis platform delivers high-quality results with a short analysis cycle.

Introduction

Ribosome profiling sequencing (Ribo-seq) is a high-throughput sequencing technology for detecting RNA translation genome-wide (Ingolia, Ghaemmaghami, Newman, & Weissman, 2009). It is currently the mainstream method for studying the translation of RNA into protein. Ribosome-nascent peptide complexes are treated with a low concentration of RNase, which degrades the RNA that is not covered by ribosomes, and the ribosomes are then removed. The remaining short fragments (about 30 nt) of translating RNA that were protected by ribosomes are detected by next-generation sequencing. These protected fragments accurately mark the "footprints" of ribosomes during translation, so they are also called ribosome footprints (RFs).

Fig. 1 Ribosome profiling sequencing (Ribo-seq) workflow

Application Field

Medicine: disease mechanism research, disease marker discovery, drug target screening

Plants: stress resistance mechanism, growth and development mechanism, breeding protection research, etc.

Animal husbandry: quality research, animal nutrition, breed breeding, etc.

Food environment: storage and processing conditions optimization, quality identification, food nutrition

CD Genomics Data Analysis Pipeline

Bioinformatics Analysis Content

Ribo-seq data analysis
Reads filtering
Reference genome alignment
Three nucleotide periodicity analysis
Alignment with Codon distribution analysis
Pause sites analysis
Quantification of gene abundance
Sample relationship analysis
Correlation Analysis of Replicates
Principal Component Analysis
Differentially translated genes (DTGs) analysis
GO Enrichment Analysis
ORF identification
Translation quantification for ORF
Differentially translated (DT) ORFs analysis
Evaluation of Coding potential of non-canonical ORFs
Sequence features analysis
The influence of the uORFs upon the mORFs
sORF annotation

How It Works

Table 1 Partial software and database list

Software or database	Uses
fastp	Low quality Reads filtering
STAR	Reference genome alignment
riboWaltz R package	Three nucleotide periodicity analysis
PausePred	Pause sites analysis
RSEM	Quantification of gene abundance

1. What is ORF?

The key to Ribo-seq analysis is identifying ORFs (open reading frames), just as RNA-seq analysis identifies mRNA, lncRNA and circRNA. An ORF is a continuous stretch of codons that begins with a start codon (usually AUG) and ends at a stop codon (usually UAA, UAG or UGA).
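The ORF definition above can be sketched as a simple sequence scan. This is only a minimal illustration, not the ORF caller used in the pipeline: the function name `find_orfs` and the `min_codons` parameter are hypothetical, only the given strand is scanned, and real Ribo-seq ORF identification also relies on footprint evidence rather than sequence alone.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA equivalents of UAA, UAG, UGA


def find_orfs(seq, start_codon="ATG", min_codons=2):
    """Return sorted (start, end) half-open coordinates of ORFs on one strand.

    An ORF runs from a start codon through the first in-frame stop codon.
    For each stop codon, only the most upstream in-frame start is reported.
    min_codons counts the stop codon as well (hypothetical parameter).
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == start_codon:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Usage: the ORF "AUG GCC UAA" spans the whole input
print(find_orfs("AUGGCCUAA"))
```

A start codon with no downstream in-frame stop codon is not reported, which matches the definition of an ORF as ending at a stop codon.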

2. What are the difficulties of Ribo-seq?

Residual rRNA contamination in the experimental library is high
The short length of ribosome footprints makes it difficult to identify real ORFs

3. What scientific questions does Ribo-seq address?

First, it can be used to investigate translation mechanisms: which of the translated genes are translated efficiently, and which elements regulate translation efficiency.
Second, it can deepen transcriptome analysis: which of the differentially expressed genes are being translated, and how efficiently. This narrows the set of differentially expressed genes to those that are actually being translated.
Third, it can explain inconsistencies between transcriptome and proteome results. When a transcript is not translated, or is translated inefficiently, the transcriptome may show a significant difference while the proteome shows little or none.
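Translation efficiency, which all three points refer to, is often estimated as the ratio of a gene's Ribo-seq abundance to its RNA-seq abundance. The sketch below assumes that definition, assumes both abundances are in the same normalized units (for example TPM), and uses a hypothetical `pseudocount` to avoid division by zero; it is not necessarily the calculation used in this pipeline.

```python
import math


def translation_efficiency(ribo_abundance, rna_abundance, pseudocount=1.0):
    """log2 ratio of footprint abundance to mRNA abundance for one gene.

    Both inputs are assumed to be normalized in the same way (e.g. TPM).
    The pseudocount is an illustrative choice, not a fixed convention.
    """
    return math.log2((ribo_abundance + pseudocount) / (rna_abundance + pseudocount))


# Illustration of the third point: mRNA rises about four-fold between
# conditions while footprint abundance stays flat, so the efficiency drops.
control = translation_efficiency(ribo_abundance=50.0, rna_abundance=50.0)
treated = translation_efficiency(ribo_abundance=50.0, rna_abundance=200.0)
print(control, treated)
```

A gene like this would appear differentially expressed in the transcriptome while its protein output could remain largely unchanged.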

Quality Control

The raw sequencing data are passed through a series of filters to obtain high-quality data for subsequent analysis.

Table 1 Read filtering statistics

Sample	Raw Reads Num	Clean Reads Num (%)	Read length	Adapter (%)	low quality (%)	polyA (%)	N (%)
T1	49132379	48934325 (99.6%)	150/150+150/150	94166 (0.38%)	9617 (0.02%)	0 (0%)	106 (0.0%)
T2	51508605	51284781 (99.57%)	150/150+150/150	105050 (0.41%)	13652 (0.03%)	0 (0%)	71 (0.0%)

Table 2 Statistical table of filtered base information

Sample	Raw Data(bp)_before	Q20 (%)	Q30 (%)	N (%)	GC (%)
T1	7369856850	1451146305 (98.57%)	1414372834 (96.08%)	45774 (0.0%)	745364428 (50.63%)
T2	7726290750	1505703187 (98.46%)	1464924432 (95.8%)	43089 (0.0%)	746397174 (48.81%)

RFs distribution statistics

Based on the read length distribution, we retained only reads between 20 nt and 40 nt in length. These reads, which match the expected footprint length, are used for subsequent analysis.
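The length filter described above can be sketched as follows. This is a minimal illustration that assumes reads are already available as `(name, sequence)` pairs and that both bounds are inclusive; the function names are hypothetical and the actual filtering tool is not specified here.

```python
from collections import Counter


def filter_by_length(reads, min_len=20, max_len=40):
    """Yield (name, sequence) pairs whose length lies in [min_len, max_len]."""
    for name, seq in reads:
        if min_len <= len(seq) <= max_len:
            yield name, seq


def length_distribution(reads):
    """Count reads per length, sorted by length (the data behind a length plot)."""
    counts = Counter(len(seq) for _, seq in reads)
    return dict(sorted(counts.items()))


# Usage with toy reads of length 19, 20, 30, 40 and 41
reads = [(f"r{n}", "A" * n) for n in (19, 20, 30, 40, 41)]
kept = list(filter_by_length(reads))
print(length_distribution(kept))
```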

Fig.1 Plot of RFs length distribution

Based on where RFs align on coding genes, we classified them into four categories: CDS, 5'UTR, 3'UTR, and intron. In general, most RFs fell in the CDS and fewer in the UTRs.
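The four-way classification can be sketched for a single gene model as below. The sketch assumes each footprint is reduced to one genomic coordinate (for example its midpoint) that lies within the gene span, that exons and the CDS are given as 0-based half-open intervals, and that the function name `classify_footprint` is hypothetical; the pipeline's own rules for footprints spanning region boundaries are not described here.

```python
def classify_footprint(pos, exons, cds_start, cds_end, strand="+"):
    """Assign a footprint position to CDS, 5'UTR, 3'UTR or Intron.

    pos       : genomic coordinate inside the gene span (assumption)
    exons     : list of (start, end) half-open genomic intervals
    cds_start : genomic start of the coding region (lowest coordinate)
    cds_end   : genomic end of the coding region (half-open)
    """
    if not any(start <= pos < end for start, end in exons):
        return "Intron"
    if cds_start <= pos < cds_end:
        return "CDS"
    # Exonic but outside the CDS: which UTR depends on the strand.
    upstream = pos < cds_start
    if strand == "-":
        upstream = not upstream
    return "5'UTR" if upstream else "3'UTR"


# Usage with a toy three-exon gene on the plus strand
exons = [(100, 200), (300, 400), (500, 600)]
for pos in (120, 160, 250, 560):
    print(pos, classify_footprint(pos, exons, cds_start=150, cds_end=550))
```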

Fig.2 Distribution of RF locations on coding genes

sORF identification

Identifying coding elements is an important task in genomic studies. Common genome annotation pipelines generally consider only proteins longer than 100 amino acids. The coding sequences of these common coding genes are also called consensus coding sequences (CCDS), and the corresponding open reading frames are called CCDS ORFs. These are generally the main protein-coding ORFs and are collectively referred to as mORFs in this report. However, recent studies have shown that some RNA regions traditionally considered non-coding (including lncRNAs, 5'UTRs and 3'UTRs) can in fact be translated into peptides, which are usually shorter than 100 amino acids. These small peptides play diverse roles in organisms, including in ontogeny, muscle contraction and DNA repair. The small ORFs encoding them are typically shorter than 300 nt and are collectively referred to as sORFs in this report. We classify sORFs by source region as follows: (a) sORFs derived from the 5'UTR of a known coding gene are designated uORFs; (b) sORFs derived from the 3'UTR of a known coding gene are designated dORFs.
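The classification rules (a) and (b) can be sketched in transcript coordinates as below. The sketch makes two assumptions that the text does not state: an ORF counts as a uORF or dORF only if it lies entirely within the corresponding UTR, and the 300 nt limit is applied as a strict upper bound. The function name `classify_sorf` is hypothetical.

```python
def classify_sorf(orf_start, orf_end, cds_start, cds_end, max_len_nt=300):
    """Return 'uORF', 'dORF' or None for an ORF on a coding transcript.

    All coordinates are 0-based, half-open transcript positions;
    (cds_start, cds_end) is the annotated main ORF (mORF).
    """
    if orf_end - orf_start >= max_len_nt:
        return None  # too long to count as a small ORF
    if orf_end <= cds_start:
        return "uORF"  # entirely within the 5'UTR
    if orf_start >= cds_end:
        return "dORF"  # entirely within the 3'UTR
    return None  # overlaps the mORF; not handled in this sketch


# Usage: a 45 nt ORF upstream of the mORF and a 39 nt ORF downstream of it
print(classify_sorf(10, 55, cds_start=100, cds_end=1000))
print(classify_sorf(1100, 1139, cds_start=100, cds_end=1000))
```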

Table 3 sORF identification statistics

	uORF	dORF
Number	73557	190578
Average length	45	38

Fig.3 Violin plot of sORF expression distribution

Screening of potentially translatable sORF

ORFscore and RRS were calculated from the abundance and positional distribution of footprints on each sORF, and the Fickett score and Hexamer score were calculated from the sequence characteristics of each sORF. The four values were then integrated to screen for sORFs that are likely to be translated.
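As an illustration of a position-based score, the sketch below follows a commonly used formulation of ORFscore: a chi-squared-like statistic comparing footprint counts in the three reading frames against a uniform expectation, log2-transformed, and made negative when the ORF's own frame is not the dominant one. Whether the pipeline uses exactly this formulation is an assumption, and the function name `orf_score` is hypothetical; RRS, the Fickett score and the Hexamer score are not shown.

```python
import math


def orf_score(frame_counts):
    """ORFscore-style statistic from footprint counts in frames 1, 2 and 3.

    frame_counts[0] is the count in the ORF's own reading frame.
    Returns 0.0 when there are no footprints at all.
    """
    f1, f2, f3 = frame_counts
    expected = (f1 + f2 + f3) / 3  # uniform expectation across frames
    if expected == 0:
        return 0.0
    chi = sum((f - expected) ** 2 / expected for f in (f1, f2, f3))
    score = math.log2(chi + 1)
    # Negative sign when another frame holds more footprints than frame 1.
    if f1 < f2 or f1 < f3:
        score = -score
    return score


# Usage: strong in-frame periodicity, no periodicity, and out-of-frame signal
for counts in ([30, 0, 0], [10, 10, 10], [0, 30, 0]):
    print(counts, orf_score(counts))
```

A high positive value indicates that footprints are concentrated in the ORF's own frame, which is the three-nucleotide periodicity expected of actively translated ORFs.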

Fig.4 Distribution of score values for RRS and ORFscore

Fig.5 Venn diagram of potentially translatable sORFs
