Quality control: How do you read your FASTQC results?


Sequencing experiments are now routine in many projects and laboratories, but do you really know how to assess the quality of sequencing data? Poor sequencing quality leads to inaccurate results, so the right approach is to check the quality of the raw reads as soon as the raw data is available. The most common tool for this is FastQC.

FastQC is written in Java and can quickly assess the quality of sequencing data, processing several files in parallel when run with multiple threads. It generates a report covering the base quality of the reads, GC content, read length, k-mer distribution, and more, giving a quick overview of data quality.

Basic statistics: This module reports simple summary statistics for the file analyzed.

  • Encoding: indicates the ASCII encoding of the quality values found in this file.
  • Total Sequences: the total number of sequences processed.
  • Filtered Sequences: Sequences marked for filtering will be removed from all analyses if run in Casava mode. The number of sequences removed will be reported here. The total number of sequences above will not include these filtered sequences and will be the actual number of sequences used in the rest of the analysis.
  • Sequence Length: provides the length of the shortest and longest sequences in the set. If all sequences have the same length, only one value will be reported.
  • %GC: the overall GC content (%) across all bases in all sequences.
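The summary values listed above can be reproduced with a few lines of Python. This is a minimal sketch, not FastQC's own code: the function name `basic_statistics` is hypothetical, and it assumes the reads have already been parsed into plain sequence strings.

```python
def basic_statistics(reads):
    """Summarize a list of read sequences (strings of A/C/G/T/N)."""
    if not reads:
        raise ValueError("no reads supplied")
    lengths = [len(r) for r in reads]
    total_bases = sum(lengths)
    # Count G and C over all bases in all sequences.
    gc_bases = sum(r.upper().count("G") + r.upper().count("C") for r in reads)
    return {
        "total_sequences": len(reads),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "percent_gc": 100.0 * gc_bases / total_bases if total_bases else 0.0,
    }


stats = basic_statistics(["ACGT", "GGCC", "AATT"])
```

If `min_length` and `max_length` are equal, only one value needs to be reported, as in the Sequence Length field.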

The sequencing quality at each base position.

The quality score is calculated using the formula -10*log10(p), where p is the probability of error. For example, a base with an error probability of 0.01 has a quality score of 20.
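The conversion between error probability and quality score can be sketched as follows. The function names are hypothetical, and the ASCII offset of 33 is an assumption (it applies to Sanger / Illumina 1.8+ encoding; the Encoding field in the basic statistics tells you which encoding your file uses).

```python
import math


def error_prob_to_phred(p):
    """Quality score Q = -10 * log10(p) for an error probability p."""
    return -10 * math.log10(p)


def phred_to_error_prob(q):
    """Inverse of the formula above: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_quality_string(qual, offset=33):
    """Turn a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption; older Illumina files may use 64.
    """
    return [ord(c) - offset for c in qual]
```

For example, `phred_to_error_prob(30)` gives an error probability of 0.001, which is the 1-in-1,000 error rate usually meant by Q30.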

To visualize the distribution of quality scores at each position across all reads, FastQC uses a box-and-whisker plot. The yellow box represents the interquartile range (25-75%), the lower and upper whiskers mark the 10th and 90th percentiles, the red line is the median, and the blue line is the mean quality. The y-axis shows the quality score, with higher scores indicating better base calls.

In the box-and-whisker plot, the background is divided into three regions: very good quality (green), passable quality (orange), and poor quality (red). Typically, the quality of bases decreases towards the end of the sequencing run, and bases towards the end of the read may fall into the orange or red regions.

To evaluate the overall sequencing quality of a run, the average sequencing quality per base is often used. A sequencing quality score of 30 or above is generally considered very good and is commonly referred to as Q30. However, it is important to note that other factors such as read length, error rate, and coverage also influence the quality of a sequencing run.
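The per-position values behind the box-and-whisker plot described above can be computed roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the percentile uses simple linear interpolation, and FastQC's own percentile method may differ in detail.

```python
from statistics import mean


def percentile(sorted_values, fraction):
    """Linear-interpolation percentile of an already sorted list."""
    if not sorted_values:
        raise ValueError("empty list")
    pos = fraction * (len(sorted_values) - 1)
    low = int(pos)
    high = min(low + 1, len(sorted_values) - 1)
    return sorted_values[low] + (sorted_values[high] - sorted_values[low]) * (pos - low)


def per_position_summary(quality_lists):
    """quality_lists holds one list of Phred scores per read."""
    longest = max(len(q) for q in quality_lists)
    summary = []
    for i in range(longest):
        # Collect the scores of every read that reaches position i.
        column = sorted(q[i] for q in quality_lists if len(q) > i)
        summary.append({
            "mean": mean(column),                # blue line
            "median": percentile(column, 0.50),
            "q25": percentile(column, 0.25),     # lower edge of the box
            "q75": percentile(column, 0.75),     # upper edge of the box
            "p10": percentile(column, 0.10),     # lower whisker
            "p90": percentile(column, 0.90),     # upper whisker
        })
    return summary


summary = per_position_summary([[40, 30], [30, 20], [20, 10]])
```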

The quality of each tile sequenced.

The plot uses a color scale to represent the quality of each tile, with cooler colors indicating higher quality and hotter colors indicating lower quality. By comparing the quality of each tile to the average quality across all tiles, you can identify any deviations from the expected pattern.

Consistently poor quality in certain tiles may indicate a problem with a specific region of the flow cell, such as a physical defect or contamination. Ideally, all tiles should show high quality, indicated by cooler colors on the plot.

The sequencing quality of each sequence.

The sequencing quality of a sequence refers to the accuracy and reliability of the base calls generated by the sequencing machine for that particular sequence. In general, higher quality scores indicate a lower likelihood of errors in the sequence.

A run in which around 90% of reads have a mean quality score of 35 or above is considered very good. A quality score of 35 corresponds to an error rate of about 0.03% (roughly 1 in 3,162), which is very low. However, the threshold for "good quality" may vary with the specific application or analysis being performed.

If a large proportion of sequences in a run have low-quality scores across the board, this could indicate a problem with the sequencing run itself, such as an issue with the sequencing chemistry or a problem with the sample preparation.

If the input is a BAM/SAM file with no quality score recorded, the results of this module will not be displayed. Quality scores are typically recorded in the Phred scale, which assigns a numerical value to the base call quality based on the likelihood of an error occurring.

FastQC raises a warning when the most frequently observed mean quality score is below 27 (a 0.2% error rate) and an error when it is below 20 (a 1% error rate). At lower quality scores, base-call errors become increasingly likely, which can affect downstream analyses and interpretation of the sequencing data.
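The warning and error thresholds for this module can be illustrated with a short sketch. The function name is hypothetical, and rounding each read's mean to the nearest integer before finding the peak is an assumption made for simplicity, not necessarily what FastQC does internally.

```python
from collections import Counter


def per_sequence_quality_flag(quality_lists, warn_below=27, fail_below=20):
    """Flag a run by the most frequently observed mean read quality."""
    # One rounded mean quality per read; empty reads are skipped.
    means = [round(sum(q) / len(q)) for q in quality_lists if q]
    if not means:
        raise ValueError("no quality scores supplied")
    counts = Counter(means)
    peak = max(counts, key=counts.get)  # the mode of the distribution
    if peak < fail_below:
        return "FAIL", peak
    if peak < warn_below:
        return "WARN", peak
    return "PASS", peak
```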

Content of ATCG per base sequenced

In a good-quality sequencing sample, the four lines representing the proportions of the bases at each position should run parallel and close together. If the lines become tangled or diverge at some positions, it may suggest contamination by an overrepresented sequence.

Additionally, if the difference between A and T, or between G and C, is greater than 10% at any position, a "WARN" is reported, and if the difference is greater than 20%, a "FAIL" is reported. This may indicate a bias in library construction or a systematic error in sequencing.
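The check on A/T and G/C balance can be sketched like this. The function name is hypothetical and the code assumes reads are given as plain strings; it applies the 10% and 20% thresholds stated above to the largest difference found at any position.

```python
def per_base_content_flag(reads, warn=10.0, fail=20.0):
    """Return (flag, worst difference in percentage points)."""
    longest = max(len(r) for r in reads)
    worst = 0.0
    for i in range(longest):
        # Bases of all reads that are long enough to cover position i.
        column = [r[i].upper() for r in reads if len(r) > i]
        pct = {b: 100.0 * column.count(b) / len(column) for b in "ACGT"}
        diff = max(abs(pct["A"] - pct["T"]), abs(pct["G"] - pct["C"]))
        worst = max(worst, diff)
    if worst > fail:
        return "FAIL", worst
    if worst > warn:
        return "WARN", worst
    return "PASS", worst
```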

The GC content distribution

The GC content refers to the proportion of guanine (G) and cytosine (C) nucleotides in a DNA sequence, which is known to affect various properties of DNA, such as its stability and melting temperature.

The normal distribution of GC content is often used as a reference distribution for assessing the quality of sequencing libraries, since it is expected that the GC content distribution of a random library should resemble a normal distribution centered around the GC content of the underlying genome. However, if the distribution is unusually shaped, this can indicate some bias or contamination in the library preparation process.

Because FastQC does not know the true GC content of the genome, the module calculates the modal GC content from the observed data and uses it to build a reference normal distribution. As a consequence, a systematic bias that is independent of base position and merely shifts an otherwise normal distribution will not be flagged as an error.

To quantify the deviation from the theoretical distribution, the module sums the deviations from the reference curve. A warning is raised if this sum represents more than 15% of the reads, and an error if it represents more than 30% of the reads.
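The idea of comparing the observed GC distribution with a normal curve built around the modal GC content can be sketched as below. This is an illustrative approximation only: the function name is hypothetical, and the way the mode and spread are estimated here is a simplifying assumption rather than FastQC's exact procedure.

```python
import math
from collections import Counter


def gc_deviation_percent(reads):
    """Sum of |observed - theoretical| read counts, as a % of all reads."""
    gc_values = [
        round(100.0 * (r.upper().count("G") + r.upper().count("C")) / len(r))
        for r in reads if r
    ]
    total = len(gc_values)
    counts = Counter(gc_values)
    mode = max(counts, key=counts.get)  # modal GC content of the data
    # Spread of the observed values around the mode (an assumption).
    variance = sum(n * (gc - mode) ** 2 for gc, n in counts.items()) / total
    sd = math.sqrt(variance)
    if sd == 0:
        return 0.0  # every read has the same GC content
    deviation = 0.0
    for gc in range(101):
        density = math.exp(-((gc - mode) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))
        deviation += abs(counts.get(gc, 0) - density * total)
    return 100.0 * deviation / total
```

A result above 15 would correspond to a warning and above 30 to an error under the thresholds given in the text.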

The N ratio per base in sequencing reads

The N ratio represents the percentage of positions in sequencing reads where the base cannot be confidently determined, and it is typically very low. However, if the N ratio exceeds 5% at any position, this is considered a warning that there may be a problem with the sequencing system. If the N ratio exceeds 20% at any position, this is considered an error.
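The per-position N ratio and its thresholds can be computed with a short sketch. The function name is hypothetical and reads are assumed to be plain strings.

```python
def n_content_flag(reads, warn=5.0, fail=20.0):
    """Return (flag, highest percentage of N found at any position)."""
    longest = max(len(r) for r in reads)
    worst = 0.0
    for i in range(longest):
        # Bases of all reads that are long enough to cover position i.
        column = [r[i].upper() for r in reads if len(r) > i]
        worst = max(worst, 100.0 * column.count("N") / len(column))
    if worst > fail:
        return "FAIL", worst
    if worst > warn:
        return "WARN", worst
    return "PASS", worst
```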

It is important to monitor the N ratio during sequencing to ensure that the data is of high quality and to identify any issues that may be affecting the accuracy of the sequencing reads. If the N ratio exceeds the recommended thresholds, it may be necessary to investigate the cause of the problem and take corrective measures, such as optimizing sequencing conditions or resequencing the affected samples.

The sequence length distribution

A uniform sequence length distribution means that all sequences in the library have the same length. This is typically seen in libraries generated from platforms that produce uniform read lengths, such as Illumina sequencing.

The sequence length distribution can impact downstream analyses such as alignment and assembly. In the case of variable length reads, it may be necessary to trim sequences to a specific length or use specialized tools that can handle variable length reads. It is also important to note that low-quality bases at the end of reads may need to be trimmed to ensure accuracy in downstream analyses.

FastQC raises a warning if the sequences are not all the same length, which alerts the user to potential issues with tools that require uniform read lengths. It raises an error if any sequence has a length of zero, since such a read cannot be used in downstream analyses.
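The two conditions for this module are simple enough to express directly. The function name below is hypothetical; it returns the flag together with the observed length counts.

```python
from collections import Counter


def length_distribution_flag(reads):
    """FAIL on any zero-length read, WARN on mixed lengths, else PASS."""
    lengths = Counter(len(r) for r in reads)
    if 0 in lengths:
        return "FAIL", dict(lengths)
    if len(lengths) > 1:
        return "WARN", dict(lengths)
    return "PASS", dict(lengths)
```

Note that a warning here is expected, and usually harmless, after adapter or quality trimming, because trimmed reads naturally vary in length.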

Sequence Duplication Levels

The duplication level measures how often identical sequences recur in the data. Some duplication is expected when sequencing depth is high, but too much can indicate technical issues, such as over-amplification during library preparation, or sample contamination.

To quantify the degree of duplication, reads with identical sequences are identified and their frequency is counted. The resulting distribution shows the number of duplicated reads (on the vertical axis) as a function of the number of duplications (on the horizontal axis), normalized to the total number of unique reads.

A warning is typically issued when the percentage of non-unique reads (i.e., reads that occur more than once) exceeds 20% of the total number of reads. An error is issued if this percentage exceeds 50%, indicating a significant problem with the data.
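The percentage of non-unique reads can be estimated with a simple counter. This sketch uses a hypothetical function name and counts every copy of any sequence that occurs more than once; FastQC itself works on a subset of the file, which is not modelled here.

```python
from collections import Counter


def duplication_flag(reads, warn=20.0, fail=50.0):
    """Return (flag, percentage of reads whose sequence occurs more than once)."""
    counts = Counter(reads)
    non_unique = sum(n for n in counts.values() if n > 1)
    pct = 100.0 * non_unique / len(reads)
    if pct > fail:
        return "FAIL", pct
    if pct > warn:
        return "WARN", pct
    return "PASS", pct
```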

Identifying and correcting for bias and duplication is an important step in quality control for sequencing data, as it can affect downstream analysis and interpretation.

Over-represented sequences

An over-represented sequence is one that occurs frequently in a set of sequencing reads. A sequence is considered over-represented if it makes up more than 0.1% of all reads. FastQC raises a warning when a sequence exceeds this threshold and an error when it exceeds 1% of the total reads.

Note that only sequences appearing in the first 200,000 reads are tracked, to keep the calculation manageable (recent FastQC versions document this limit as 100,000), so a sequence that first appears later in the file is not counted. Each over-represented sequence is also searched against a list of common contaminants, and a custom list can be supplied on the command line with the -c (--contaminants) option. A hit must be at least 20 bp long with at most one mismatch.

Adapter sequences

The adapters are short DNA sequences that are attached to the ends of the DNA fragments before sequencing to allow for proper binding and identification during the sequencing process.

The plot shows, for each adapter sequence, the cumulative percentage of reads in which that adapter has been seen at each position. In other words, it tracks how the proportion of reads containing the adapter grows along the length of the read.

Once an adapter sequence is observed in a read, it is considered to exist until the end of the read. Therefore, the percentage seen only increases as the length of the read increases since more positions are being observed as the sequencing progresses.
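The cumulative behaviour described above can be reproduced with a small sketch. The function name is hypothetical, and the adapter string in the usage note is only an example value supplied by the caller; exact string matching is a simplification of how a real tool searches for adapters.

```python
def cumulative_adapter_content(reads, adapter):
    """Percentage of reads in which the adapter has been seen by each position."""
    longest = max(len(r) for r in reads)
    first_seen = [0] * longest
    for r in reads:
        pos = r.find(adapter)  # first occurrence, or -1 if absent
        if pos != -1:
            first_seen[pos] += 1
    cumulative, running = [], 0
    for n in first_seen:
        # Once seen, the adapter counts as present to the end of the read.
        running += n
        cumulative.append(100.0 * running / len(reads))
    return cumulative
```

Calling it with, for example, `adapter="AGATCGGAAGAGC"` returns one value per position, and the values never decrease from left to right.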

Kmer content

The k-mer module identifies over-represented k-mers, which are short DNA sequences of k bases, within the reads. The module counts each k-mer at each position in the library and uses a binomial test to look for significant deviations from an even coverage across all positions.

If a k-mer is over-represented with a p-value less than 0.01, FastQC issues a warning. If the p-value is less than 10^-5, it raises an error. The default k-mer length is 5 in older releases (7 in recent ones), and it can be set with the -k or --kmers option to a value between 2 and 10.

FastQC reports every k-mer that shows positionally biased enrichment. The six most biased k-mers are additionally plotted to show their distribution along the read.
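The counting step behind this module can be sketched as follows. This is a simplified illustration with a hypothetical function name: it reports the highest observed/expected ratio of each k-mer at any position and does not implement the binomial test or the p-value thresholds.

```python
from collections import Counter, defaultdict


def kmer_positional_enrichment(reads, k):
    """Map each k-mer to its highest observed/expected ratio at any position."""
    position_counts = defaultdict(Counter)  # k-mer -> position -> count
    windows_per_position = Counter()        # position -> number of k-mers seen
    for r in reads:
        for i in range(len(r) - k + 1):
            position_counts[r[i:i + k]][i] += 1
            windows_per_position[i] += 1
    total_windows = sum(windows_per_position.values())
    enrichment = {}
    for kmer, by_pos in position_counts.items():
        # Expected count assumes the k-mer is spread evenly over all positions.
        overall_rate = sum(by_pos.values()) / total_windows
        enrichment[kmer] = max(
            by_pos[i] / (overall_rate * windows_per_position[i]) for i in by_pos
        )
    return enrichment
```

A k-mer that always occurs at the same position, such as the start of every read, gets a high ratio, which is the kind of positional bias this module is designed to reveal.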

Reference

  1. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/