Sequencing depth is the ratio of the total number of bases (bp) obtained by sequencing to the genome size, and it is one of the indicators used to evaluate sequencing volume. Sequencing depth and genome coverage are positively correlated, and the rate of sequencing errors and false-positive results decreases as the sequencing depth increases. For resequenced individuals, if a paired-end or mate-pair protocol is used, a sequencing depth above 10-15X is generally sufficient for both genome coverage and control of the sequencing error rate. For example, for a genome size of 2 Mb and a sequencing depth of 10X, the total amount of data obtained is 20 Mb.
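The positive relationship between depth and coverage can be illustrated with a small calculation. The sketch below is not taken from the text above: it assumes an idealized model in which reads land uniformly at random on the genome, so that the expected fraction of bases covered at least once at average depth c is 1 - e^(-c). Real libraries show GC bias and mappability problems, so actual coverage is usually lower. The function name expected_coverage is illustrative only.

```python
import math


def expected_coverage(depth: float) -> float:
    """Expected fraction of the genome covered at least once.

    Assumption: reads are placed uniformly at random (idealized Poisson
    model), so the chance that a base receives zero reads is exp(-depth).
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return 1.0 - math.exp(-depth)


if __name__ == "__main__":
    for depth in (1, 5, 10, 15):
        fraction = expected_coverage(depth)
        print(f"{depth:>2}X -> {fraction:.4%} of bases expected to be covered")
```

Under this simplified model, coverage is already very close to complete at 10X, which is consistent with the 10-15X guideline above.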
Sequencing coverage refers to the proportion of the whole genome that is covered by the sequences obtained. For example, if a bacterial genome is sequenced and the coverage is 98%, then 2% of the genome is still not represented in the sequencing data.
De novo literally means "brand new"; in technical terms, it means sequencing from scratch without relying on any pre-existing sequence information. More specifically, an unknown genome is sequenced, and bioinformatics methods are used to splice and assemble the reads in order to obtain a map of that genome.
For coverage, the genome sequence assembled after sequencing usually cannot cover all regions completely, because of gaps between large assembled segments, limited sequencing read lengths, repetitive sequences, and so on. Coverage is the proportion of the whole genome represented in the final result. For example, if a human genome is sequenced and the coverage is 98.5%, then 1.5% of the genome could not be obtained by assembly and analysis. Depth, in turn, is the average number of times a single base in the genome is sequenced. A depth of 30X means that each base is sequenced 30 times on average; note that this is an average. Depth also has a maximum and a minimum value across the genome, which can be obtained from the analysis. In practice, about 30X is usually considered sufficient for good accuracy.
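As a rough illustration of how coverage and depth differ, the following sketch summarizes a list of per-base depths. The input format (one integer per genome position) and the name summarize_depth are assumptions made for this example; the toy values are invented and do not come from a real data set.

```python
def summarize_depth(per_base_depth):
    """Summarize per-base depths (one integer per genome position).

    coverage   : fraction of positions sequenced at least once
    mean_depth : average number of times a base was sequenced
    min/max    : extremes hidden behind the average
    """
    genome_length = len(per_base_depth)
    if genome_length == 0:
        raise ValueError("per_base_depth must not be empty")
    covered_positions = sum(1 for depth in per_base_depth if depth > 0)
    return {
        "coverage": covered_positions / genome_length,
        "mean_depth": sum(per_base_depth) / genome_length,
        "min_depth": min(per_base_depth),
        "max_depth": max(per_base_depth),
    }


# Toy 10-base "genome": two positions received no reads at all.
toy_depths = [0, 12, 30, 45, 33, 0, 28, 31, 40, 21]

if __name__ == "__main__":
    print(summarize_depth(toy_depths))
```

In this toy case the mean depth is 24X while only 80% of positions are covered, which shows why the two figures are reported separately.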
Choosing the appropriate sequencing coverage depends on the specific objectives of the sequencing project, the quality of the sequencing data, and the available resources. In general, higher sequencing coverage can increase the accuracy and completeness of genome assembly and annotation, but it also requires more sequencing reads and computational resources.
1) 1 billion bases is about 1 Gb of data
2) Number of bases = number of sequencing reads * read length
3) Sequencing depth = total number of bases / number of bases in the reference genome
Example: on a HiSeq 4000 in PE150 mode, a total of 600,000,000 reads were sequenced, so the total number of bases is 600,000,000 * 150 = 9 * 10^10.
Dividing by 1 billion gives 90 Gb of data. The human genome is about 3 Gb, so the sequencing depth is 90 / 3 = 30X.
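The three rules and the worked example above can be written as a short script. This is a minimal sketch: the function names are illustrative, and the human genome size is rounded to 3 billion bases as in the example. The reads_needed helper simply inverts the depth formula and is an addition for convenience.

```python
import math

HUMAN_GENOME_SIZE = 3_000_000_000  # about 3 billion bases, rounded as in the example


def total_bases(n_reads: int, read_length: int) -> int:
    """Rule 2: number of bases = number of reads * read length."""
    return n_reads * read_length


def sequencing_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Rule 3: depth = total bases / bases in the reference genome."""
    return total_bases(n_reads, read_length) / genome_size


def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Inverse of rule 3: reads required to reach a target depth."""
    return math.ceil(target_depth * genome_size / read_length)


if __name__ == "__main__":
    n_reads, read_length = 600_000_000, 150
    bases = total_bases(n_reads, read_length)
    print(f"total bases: {bases:.1e} ({bases / 1e9:.0f} billion)")
    print(f"depth: {sequencing_depth(n_reads, read_length, HUMAN_GENOME_SIZE):.0f}X")
    print(f"reads for 30X: {reads_needed(30, read_length, HUMAN_GENOME_SIZE):,}")
```

Running the inverse calculation is a quick sanity check when planning a run: 30X of a 3-billion-base genome at 150 bases per read comes back to the same 600,000,000 reads.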
When it comes to sequencing DNA for genomics research, one critical decision that researchers must make is whether to use single-end or paired-end sequencing reads. Both options have their advantages and disadvantages, and researchers must carefully consider their specific research questions and needs when making this decision.
Single-end sequencing reads involve reading only one end of a DNA fragment, which is typically faster and less expensive than paired-end sequencing, as it requires half the amount of sequencing. However, single-end sequencing reads have limitations in detecting structural variations and identifying the precise location of variants.
Paired-end sequencing reads, on the other hand, involve reading a DNA fragment from both ends. This type of sequencing provides more information about the structure of the genome, allowing for the identification of structural variants and single-nucleotide polymorphisms (SNPs). The distance between the two ends is approximately known, and this information can be used to map the reads over repetitive regions. Paired-end sequencing also effectively extends the read length, enabling the detection of insertions, deletions, repetitive sequences, and other rearrangements.
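To show how the known distance between the two ends can be used, here is a minimal sketch that flags read pairs whose mapped distance deviates from the expected fragment length. Everything in it is hypothetical: the MappedPair record, the expected fragment length of 350 bp, and the tolerance of 100 bp are illustrative values, not parameters of any particular aligner or variant caller.

```python
from dataclasses import dataclass


@dataclass
class MappedPair:
    name: str
    read1_start: int  # leftmost reference position of the first mate
    read2_end: int    # rightmost reference position of the second mate


def fragment_length(pair: MappedPair) -> int:
    """Distance on the reference spanned by the two mates."""
    return pair.read2_end - pair.read1_start


def classify_pair(pair: MappedPair, expected: int = 350, tolerance: int = 100) -> str:
    """Compare the mapped distance with the expected fragment length.

    Assumed thresholds only. A much larger distance is consistent with a
    deletion in the sample; a much smaller one with an insertion.
    """
    length = fragment_length(pair)
    if length > expected + tolerance:
        return "larger_than_expected"
    if length < expected - tolerance:
        return "smaller_than_expected"
    return "concordant"


if __name__ == "__main__":
    pairs = [
        MappedPair("pair_a", 1_000, 1_350),
        MappedPair("pair_b", 5_000, 6_500),
        MappedPair("pair_c", 9_000, 9_120),
    ]
    for pair in pairs:
        print(pair.name, fragment_length(pair), classify_pair(pair))
```

A single-end read carries no such distance information, which is one reason it is less suited to detecting structural variation.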
While paired-end sequencing is more accurate and provides more information, it is also more expensive and requires more sequencing. Researchers must weigh the benefits of increased accuracy and read length against the cost and time required for the sequencing process. If extending the read length is not important for the research question or if increased accuracy can be achieved through other means, such as unique molecular indexing, then single-end sequencing may be a more efficient and cost-effective option.
In addition, paired-end read data can be analyzed as either single-end or paired-end, depending on the needs of the downstream application. For example, both single-end and paired-end sequencing reads can be used for VDJ analysis. If the downstream application requires only the frequencies of the unique CDR3 region, then single-end data are sufficient for CDR3 identification and frequency calculation without loss of data. However, if the analysis needs to be extended to identify the CDR1 and CDR2 regions, the same data can be analyzed as paired-end with read stitching.
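Read stitching can be sketched as follows. This is a simplified illustration, not the method of any specific tool: it assumes error-free reads, takes the reverse complement of the second mate, and joins the two reads at their longest exact overlap. Real mergers also tolerate mismatches and use base qualities. The names stitch_pair and min_overlap are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (upper-case ACGTN)."""
    return seq.translate(_COMPLEMENT)[::-1]


def stitch_pair(read1: str, read2: str, min_overlap: int = 10):
    """Merge a read pair into one longer sequence.

    Assumption: exact matching only. The second mate is reverse-complemented
    and the longest suffix of read1 that equals a prefix of it is used as
    the join. Returns None when no overlap of at least min_overlap exists.
    """
    mate = reverse_complement(read2)
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-overlap:] == mate[:overlap]:
            return read1 + mate[overlap:]
    return None


if __name__ == "__main__":
    # Toy 30-base fragment read as 2 x 20 bases, so the mates overlap by 10.
    fragment = "ACGTTGCAAGGCTTAACCGGTTAGCATCGA"
    read1 = fragment[:20]
    read2 = reverse_complement(fragment[-20:])
    print(stitch_pair(read1, read2) == fragment)
```

When the two mates do not overlap, stitching is not possible and the pair has to be analyzed as two separate reads.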