CD Genomics is a bioinformatics data analysis provider. Our team is experienced in Ribo-seq data analysis, and our analysis platform delivers high-quality results with a fast turnaround.
Introduction
Ribosome profiling sequencing (Ribo-seq) is a high-throughput sequencing technology for detecting RNA translation genome-wide (Ingolia, Ghaemmaghami, Newman, & Weissman, 2009). It is currently the mainstream method for studying the translation of RNA into protein. Ribosome-nascent peptide complexes are treated with a low concentration of RNase, which degrades the RNA that is not covered by ribosomes, and the ribosomes are then removed. The remaining ribosome-protected RNA fragments, each about 30 nt long, are sequenced with next-generation sequencing technology. Because these protected fragments accurately mark the positions ("footprints") of ribosomes during translation, they are called ribosome footprints (RFs).
Fig. 1 Ribosome profiling sequencing (Ribo-seq) workflow
Application Field
Medicine: disease mechanism research, disease marker discovery, drug target screening
Plants: stress resistance mechanism, growth and development mechanism, breeding protection research, etc.
Animal husbandry: quality research, animal nutrition, breeding, etc.
Food and environment: optimization of storage and processing conditions, quality identification, food nutrition
CD Genomics Data Analysis Pipeline
Bioinformatics Analysis Content
- Ribo-seq data analysis
  - Reads filtering
  - Reference genome alignment
  - Three-nucleotide periodicity analysis
  - Alignment with Codon distribution analysis
  - Pause sites analysis
  - Quantification of gene abundance
- Sample relationship analysis
  - Correlation Analysis of Replicates
  - Principal Component Analysis
- Differentially translated genes (DTGs) analysis
- GO Enrichment Analysis
- ORF identification
- Translation quantification for ORF
- Differentially translated (DT) ORFs analysis
- Evaluation of Coding potential of non-canonical ORFs
- Sequence features analysis
- The influence of the uORFs upon the mORFs
- sORF annotation
How It Works
Table 1 Partial software and database list
Software or database | Uses |
fastp | Low quality Reads filtering |
STAR | Reference genome alignment |
riboWaltz R package | Three nucleotide periodicity analysis |
PausePred | Pause sites analysis |
RSEM | Quantification of gene abundance |
1. What is an ORF?
The key to Ribo-seq analysis is identifying ORFs (open reading frames), just as RNA-seq analysis identifies mRNA, lncRNA, and circRNA. An ORF is a continuous stretch of codons that begins with a start codon (usually AUG) and ends at a stop codon (usually UAA, UAG or UGA).
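The ORF definition above can be illustrated with a minimal sketch that scans the three forward reading frames of one sequence. It is not the ORF caller used in the pipeline: the function name `find_orfs`, the `min_codons` parameter, and the choice to report only the first AUG before each stop codon are assumptions made for illustration.

```python
START = "AUG"
STOPS = {"UAA", "UAG", "UGA"}


def find_orfs(rna, min_codons=2):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts the start and stop codons (illustrative parameter).
    """
    rna = rna.upper().replace("T", "U")  # accept DNA or RNA input
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame codon by codon.
        for i in range(frame, len(rna) - 2, 3):
            codon = rna[i:i + 3]
            if start is None and codon == START:
                start = i  # first start codon opens an ORF
            elif start is not None and codon in STOPS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence: AUG AAA UAG sits in frame 2.
print(find_orfs("GGAUGAAAUAGCC"))
```

A real analysis would also scan the reverse strand where relevant and use the footprint signal, not the sequence alone, to decide which candidate ORFs are translated.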
2. What are the difficulties of ribosome profiling sequencing?
- Residual rRNA in the experimental samples is high
- Ribosome footprints are short, which makes it difficult to identify the real ORFs
3. Scientific questions that Ribo-seq addresses
- First, it can be used to study the mechanism of translation: which of the translated genes are translated efficiently, and which elements regulate translation efficiency.
- Second, it can deepen transcriptome analysis: which of the differentially expressed genes are being translated, and how efficiently. This narrows the set of differentially expressed genes to those that are actually being translated.
- Third, it can help explain inconsistencies between transcriptome and proteome results. When a transcript is not translated, or is translated inefficiently, the transcriptome may show a significant difference while the proteome shows little or none.
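The translation efficiency referred to in the points above is commonly estimated as the ratio of ribosome footprint abundance to mRNA abundance for the same gene. The sketch below assumes normalized abundances (for example TPM) from matched Ribo-seq and RNA-seq samples. The function name, the pseudocount, and the example values are hypothetical and are not taken from the CD Genomics pipeline.

```python
import math


def translation_efficiency(ribo_abundance, rna_abundance, pseudocount=1.0):
    """log2 ratio of footprint abundance to mRNA abundance.

    The pseudocount (an assumption) avoids division by zero and damps
    the ratio for weakly expressed genes.
    """
    return math.log2((ribo_abundance + pseudocount) /
                     (rna_abundance + pseudocount))


# Hypothetical abundances: (Ribo-seq, RNA-seq) per gene.
genes = {
    "geneA": (300.0, 100.0),  # many footprints per transcript
    "geneB": (20.0, 100.0),   # transcribed but weakly translated
}
for name, (ribo, rna) in genes.items():
    print(name, round(translation_efficiency(ribo, rna), 2))
```

A gene such as the hypothetical `geneB` shows how a transcript can be abundant in the transcriptome while contributing little to the proteome.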
Quality Control
The raw sequencing data is passed through a series of filters to obtain high-quality data for subsequent analysis.
Table 1 Read filtering statistics
Sample | Raw Reads Num | Clean Reads Num (%) | Read length | Adapter (%) | low quality (%) | polyA (%) | N (%) |
T1 | 49132379 | 48934325 (99.6%) | 150/150+150/150 | 94166 (0.38%) | 9617 (0.02%) | 0 (0%) | 106 (0.0%) |
T2 | 51508605 | 51284781 (99.57%) | 150/150+150/150 | 105050 (0.41%) | 13652 (0.03%) | 0 (0%) | 71 (0.0%) |
Table 2 Base statistics of the filtered data
Sample | Raw Data (bp, before filtering) | Q20 (%) | Q30 (%) | N (%) | GC (%) |
T1 | 7369856850 | 1451146305 (98.57%) | 1414372834 (96.08%) | 45774 (0.0%) | 745364428 (50.63%) |
T2 | 7726290750 | 1505703187 (98.46%) | 1464924432 (95.8%) | 43089 (0.0%) | 746397174 (48.81%) |
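Q20 and Q30 in Table 2 are the proportions of bases whose Phred quality score is at least 20 and at least 30. A minimal sketch of that calculation from FASTQ quality strings follows. It assumes Phred+33 encoding, and the function name `q_fraction` is illustrative. In practice these values come from the filtering tool's report.

```python
def q_fraction(quality_strings, threshold, offset=33):
    """Fraction of bases with Phred quality >= threshold.

    quality_strings: iterable of FASTQ quality lines.
    offset: ASCII offset of the encoding (33 assumed here).
    """
    total = 0
    passing = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0


# Toy quality line: 'I' = Q40, '5' = Q20, '#' = Q2.
quals = ["II5#"]
print(q_fraction(quals, 20), q_fraction(quals, 30))
```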
RFs distribution statistics
Based on the read length distribution, we retained only reads between 20 nt and 40 nt long, the expected length range for ribosome footprints, and used them for subsequent analysis.
Fig.1 Plot of RFs length distribution
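The length filter and the length distribution described above can be sketched as follows. The bounds are treated as inclusive, which is an assumption, and the function names are illustrative.

```python
from collections import Counter


def filter_by_length(reads, min_len=20, max_len=40):
    """Keep reads whose length lies in [min_len, max_len] (inclusive)."""
    return [r for r in reads if min_len <= len(r) <= max_len]


def length_distribution(reads):
    """Map read length to number of reads, sorted by length."""
    return dict(sorted(Counter(len(r) for r in reads).items()))


# Toy reads of length 19, 20, 30, 40 and 41.
reads = ["A" * n for n in (19, 20, 30, 40, 41)]
kept = filter_by_length(reads)
print(length_distribution(kept))
```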
Based on the alignment position of RFs on coding genes, we classified RFs into four categories: CDS, 5'UTR, 3'UTR, and Intron. In general, most RFs were located in the CDS region and fewer in the UTR regions.
Fig.2 Distribution of RF locations on coding genes
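The four-way classification above can be sketched for a single plus-strand gene. The gene model, the coordinates, and the extra "intergenic" label for positions outside the gene are hypothetical and serve only to make the example self-contained. A real pipeline would read the gene models from the reference annotation and handle both strands.

```python
from collections import Counter


def classify_position(pos, exons, cds_start, cds_end):
    """Classify a footprint position on a plus-strand coding gene.

    exons: list of half-open (start, end) intervals.
    cds_start, cds_end: half-open CDS bounds in the same coordinates.
    """
    gene_start = min(s for s, _ in exons)
    gene_end = max(e for _, e in exons)
    if not gene_start <= pos < gene_end:
        return "intergenic"  # outside the gene (label is an assumption)
    if not any(s <= pos < e for s, e in exons):
        return "Intron"
    if pos < cds_start:
        return "5'UTR"
    if pos >= cds_end:
        return "3'UTR"
    return "CDS"


# Hypothetical gene: two exons, CDS spanning part of each.
exons = [(100, 200), (300, 400)]
positions = [120, 160, 250, 320, 380]
print(Counter(classify_position(p, exons, 150, 350) for p in positions))
```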
sORF identification
Identifying coding elements is an important task in genomic studies. Common genome annotation pipelines generally consider only proteins longer than 100 amino acids. The coding sequences of these common coding genes are called consensus coding sequences (CCDS), and the corresponding open reading frames are called CCDS ORFs. These are generally the main protein-coding ORFs and are collectively referred to as mORFs in this report. However, recent studies have shown that some RNA regions traditionally considered non-coding (including lncRNA, 5'UTR and 3'UTR) can in fact be translated into peptides, usually less than 100 amino acids in length. These small peptides play diverse roles in organisms, including ontogeny, muscle contraction and DNA repair. The small ORFs encoding them are typically less than 300 nt in length and are collectively referred to as sORFs in this report. We classify sORFs by their source region as follows: (a) an sORF derived from the 5'UTR of a known coding gene is designated a uORF; (b) an sORF derived from the 3'UTR of a known coding gene is designated a dORF.
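The classification rules (a) and (b) can be written as a short function over transcript coordinates. This is a sketch under stated assumptions: coordinates are half-open positions on the transcript, the 300 nt limit is applied strictly, and ORFs that overlap the main CDS are left unclassified. The function name and the example coordinates are hypothetical.

```python
def classify_sorf(orf_start, orf_end, cds_start, cds_end, max_len=300):
    """Return 'uORF', 'dORF', or None for an ORF on a coding transcript.

    All coordinates are half-open transcript positions.
    max_len: sORFs are taken to be shorter than this many nt.
    """
    if orf_end - orf_start >= max_len:
        return None  # too long to count as an sORF
    if orf_end <= cds_start:
        return "uORF"  # lies entirely in the 5'UTR
    if orf_start >= cds_end:
        return "dORF"  # lies entirely in the 3'UTR
    return None  # overlaps the main CDS (not covered by rules a and b)


# Hypothetical transcript with a main CDS at positions 100..700.
print(classify_sorf(10, 55, 100, 700))
print(classify_sorf(720, 758, 100, 700))
```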
Table 3 sORF identification statistics
Statistic | uORF | dORF |
Number | 73557 | 190578 |
Average length | 45 | 38 |
Fig.3 Violin plot of sORF expression distribution
Screening of potentially translatable sORF
ORFscore and the ribosome release score (RRS) were calculated from the abundance and positional distribution of footprints on each sORF, and the Fickett score and Hexamer score were calculated from the sequence features of each sORF. The four values were then combined to screen for sORFs that are likely to be translated.
Fig.4 Distribution of score values for RRS and ORFscore
Fig.5 Potentially translatable sORF Venn diagram
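As a rough illustration of the two footprint-based scores, the sketch below uses formulations that are common in the literature: ORFscore compares the footprint counts in the three reading frames with a uniform expectation, and RRS compares the ORF-to-3'UTR ratio of Ribo-seq reads with that of RNA-seq reads. Whether this pipeline uses exactly these formulas is an assumption, and the function names, the pseudocount, and the example counts are hypothetical.

```python
import math


def orf_score(frame_counts):
    """ORFscore from footprint counts in frames 1, 2 and 3.

    log2 of a chi-square-like statistic against a uniform frame
    distribution, negated when frame 1 is not the dominant frame.
    """
    f1, f2, f3 = frame_counts
    mean = (f1 + f2 + f3) / 3
    if mean == 0:
        return 0.0  # no footprints, no evidence either way
    chi = sum((f - mean) ** 2 / mean for f in (f1, f2, f3))
    score = math.log2(chi + 1)
    return -score if (f1 < f2 or f1 < f3) else score


def ribosome_release_score(ribo_orf, ribo_utr3, rna_orf, rna_utr3,
                           pseudocount=1.0):
    """Ratio of Ribo-seq ORF/3'UTR reads to RNA-seq ORF/3'UTR reads."""
    ribo_ratio = (ribo_orf + pseudocount) / (ribo_utr3 + pseudocount)
    rna_ratio = (rna_orf + pseudocount) / (rna_utr3 + pseudocount)
    return ribo_ratio / rna_ratio


# Hypothetical counts: footprints concentrated in frame 1, few in the 3'UTR.
print(orf_score((90, 5, 5)) > 0)
print(ribosome_release_score(100, 5, 50, 50, pseudocount=0))
```

A strongly positive ORFscore indicates that footprints follow the ORF's reading frame, and a high RRS indicates that ribosomes are released at the stop codon. Both are expected for translated sORFs.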