Base-calling usually refers to the conversion of intensity data into sequences and quality scores. The intensity data are extracted from the images in a preceding image-analysis step.
Base-calling has two aspects: identifying the base-call and assigning a confidence estimate to the call.
Quality scores quantify the probability that a base-call is correct (or wrong).
Base quality scores: Individual bases have quality scores which reflect the likelihood of the base being correct/incorrect
Alignment scores: Probability that an alignment to a given position in the reference genome is correct
Allele scores, SNP scores: Probability that a given allele or SNP was observed (often conditional on the alignment being correct)
Base and alignment scores are single-read scores; SNP scores are consensus scores
Consensus calls use information from multiple reads
Phred scores: A base quality score assigned by the phred software (or by a program based on phred), which is the most prominent base-calling program for capillary sequencing.
A quality score expressed on a logarithmic scale:
Q = -10 log10( probability of an error )
Example: Q20 = 1% error probability
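As a quick check of the formula above, here is a minimal Python sketch that converts between Q values and error probabilities. The function names are my own, chosen for illustration; they are not part of phred.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(p): return the error probability for a Q value."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Return Q = -10 * log10(p) for an error probability p in (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)
```

Every 10 points on the Q scale correspond to a tenfold change in error probability, so Q30 means roughly one error in 1000 calls.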
The Phred method assigns quality scores to a base-call based on observed properties of the base (predictors)
Phred is a two-step process:
1. Training: Given a set of reads, labels indicating which bases are correct, and a set of quality statistics (predictors) for each base, produce a model that can predict error rates for unseen bases.
2. Application: Given new reads and their quality statistics, predict the quality of each base.
Phred is essentially a big lookup table!
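To make the lookup-table idea concrete, here is a deliberately simplified sketch of the two steps. It assumes the predictors have already been discretized into a bin key and that each training base carries a correct/incorrect label; the function names, the pseudocount smoothing, and the default score for unseen bins are my own assumptions, and the real phred procedure for choosing predictor thresholds is more involved than plain counting.

```python
import math
from collections import defaultdict

def train_quality_table(observations, pseudocount=1.0):
    """Training step: observations is an iterable of (bin_key, is_correct).

    Counts errors per predictor bin and stores a rounded Q value for each bin.
    The pseudocount (an assumption of this sketch) avoids log10(0) when a bin
    contains no observed errors.
    """
    counts = defaultdict(lambda: [0, 0])  # bin_key -> [errors, total]
    for bin_key, is_correct in observations:
        counts[bin_key][1] += 1
        if not is_correct:
            counts[bin_key][0] += 1

    table = {}
    for bin_key, (errors, total) in counts.items():
        p_err = (errors + pseudocount) / (total + 2 * pseudocount)
        table[bin_key] = round(-10 * math.log10(p_err))
    return table

def apply_quality_table(table, bin_key, default_q=2):
    """Application step: look up the Q value for a new base's predictor bin.

    Bins never seen in training get a pessimistic default (hypothetical choice).
    """
    return table.get(bin_key, default_q)
```

Note that the highest Q a bin can receive is limited by how many training bases fell into it, which is one reason the size and diversity of the training set matter.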
What goes into a quality table?
Quality predictors are numbers correlated with the quality of a base call, and attempt to quantify concepts such as:
– “Is the signal for the called base much brighter than the others?”
– “Did the spot get suspiciously dim, compared to the beginning?”
– “Does the signal look clean in the next few cycles, and the previous few cycles?”
A training data set
– One of the fundamental questions is: How similar is the training data set to the data set the phred table is applied to? Does it make sense for my data?
– Run-to-run variance, type of sample, etc.; the training set needs diversity
An alignment method and reference
– phred is inherently based on the assumption that the alignments are correct and the reference is well known
– This makes it intrinsically very hard to provide accurate scores for low-quality bases!
– Need high-quality references for training: what are the highest qualities we can get for references?
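Returning to the quality predictors listed earlier, the first two questions (brightness relative to the other channels, and dimming relative to the start of the run) can be turned into numbers along the following lines. This is only an illustrative sketch: the function name, the exact ratios, and the input layout (one intensity per base for a single cycle) are assumptions, not the predictors any particular base-caller uses.

```python
def predictors_for_cycle(intensities, first_cycle_max):
    """Compute two illustrative predictors for one spot in one cycle.

    intensities: dict mapping base ("A", "C", "G", "T") to its intensity.
    first_cycle_max: brightest intensity of the same spot in the first cycle.
    Returns (called_base, purity, decay).
    """
    ranked = sorted(intensities.values(), reverse=True)
    brightest, second = ranked[0], ranked[1]

    # "Is the called base much brighter than the others?"
    total = brightest + second
    purity = brightest / total if total > 0 else 0.0

    # "Did the spot get suspiciously dim, compared to the beginning?"
    decay = brightest / first_cycle_max if first_cycle_max > 0 else 0.0

    called_base = max(intensities, key=intensities.get)
    return called_base, purity, decay
```

Values like these would then be binned and used as the keys of the quality table during training and application.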