Friday, November 12, 2010

What is base calling?

Base-calling converts raw or processed data from a sequencing instrument into sequences and quality scores.

Base-calling usually refers to the conversion of intensity data into sequences and quality scores. The intensity information is extracted from images in a preceding image analysis step.

Base-calling has two aspects: identifying the base-call and assigning a confidence estimate to the call.

Quality scores quantify the probability that a base-call is correct (or wrong).

Base quality scores: Individual bases have quality scores which reflect the likelihood of the base being correct/incorrect

Alignment scores: Probability that an alignment to a given position in the reference genome is correct

Allele scores, SNP scores: Probability that a given allele, SNP was observed (often conditional on the alignment being correct)

Base and alignment scores are single-read scores; SNP scores are consensus scores

Consensus calls use information from multiple reads


Phred scores: A base quality score assigned by the phred software (or by a program based on phred, which is the most prominent base-calling program for capillary sequencing).

A quality score expressed on a logarithmic scale:

Q = -10 log10( probability of an error )

Example: Q20 = 1% error probability
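
To make the logarithmic scale concrete, here is a minimal Python sketch of the conversion in both directions. The function names are illustrative choices for this post and are not taken from the phred software.

```python
import math


def error_prob_to_phred(p_error: float) -> float:
    """Convert an error probability into a quality score: Q = -10 * log10(p)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def phred_to_error_prob(q: float) -> float:
    """Invert the formula: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # Every 10 points of Q correspond to a tenfold drop in error probability.
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")
```

So Q10 corresponds to a 10% error probability, Q20 to 1%, Q30 to 0.1%, and so on.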

The Phred method assigns quality scores to a base-call based on observed properties of the base (predictors)

Phred is a two-step process:

1. Training: Given a set of reads, labels as to which bases are correct, and a set of quality statistics for each base, produce a model that can predict error rates for unseen bases

2. Application: Given new reads and quality statistics, predict the quality for each of the bases.
Phred is essentially a big lookup table!
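
The lookup-table idea can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the actual phred algorithm. It assumes that the predictors have already been discretized into hashable bin keys, and it estimates the error rate per bin by simple counting with a pseudocount. The function names, the pseudocount, and the default score for unseen bins are all assumptions made for this example.

```python
import math
from collections import defaultdict


def train_quality_table(binned_predictors, is_correct, pseudocount=1.0):
    """Training: count errors per predictor bin and store a Q score per bin."""
    counts = defaultdict(lambda: [0, 0])  # bin key -> [errors, total]
    for key, correct in zip(binned_predictors, is_correct):
        counts[key][1] += 1
        if not correct:
            counts[key][0] += 1

    table = {}
    for key, (errors, total) in counts.items():
        # The pseudocount keeps the estimate away from an error rate of 0,
        # which would otherwise give an infinite quality score.
        p_error = (errors + pseudocount) / (total + 2.0 * pseudocount)
        table[key] = round(-10.0 * math.log10(p_error))
    return table


def apply_quality_table(table, binned_predictors, default_q=0):
    """Application: look up the stored Q score for each new base."""
    return [table.get(key, default_q) for key in binned_predictors]


if __name__ == "__main__":
    # Toy training set: 998 error-free "clean" bases, 98 "noisy" bases with 9 errors.
    keys = ["clean"] * 998 + ["noisy"] * 98
    labels = [True] * 998 + [False] * 9 + [True] * 89
    table = train_quality_table(keys, labels)
    print(table)
    print(apply_quality_table(table, ["noisy", "clean", "unseen"]))
```

Note that the "clean" bin tops out at Q30 here even though it contains no observed errors: the highest score a table can assign is limited by how many training bases fall into the bin.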

What goes into a quality table?

Quality predictors are numbers correlated with the quality of a base call, and attempt to quantify concepts such as:
“Is the signal for the called base much brighter than the others?”
“Did the spot get suspiciously dim, compared to the beginning?”
“Does the signal look clean in the next few cycles, and the previous few cycles?”

A training data set
One of the fundamental questions is: How similar is the training data set to the data set the phred table is applied to? Does it make sense for my data?
Run-to-run variance, type of sample, etc. all matter, so the training data needs to be diverse

An alignment method and reference
Phred is inherently based on the assumption that alignments are correct and the reference is well known
This makes it intrinsically very hard to provide accurate scores for low quality bases!
High-quality references are needed for training: what are the highest qualities we can get for references?
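
The three predictor questions listed above could be turned into numbers roughly as follows. This is a hypothetical sketch: it assumes each cycle is given as four non-negative channel intensities, and the predictor definitions (`purity`, `relative_brightness`, `windowed_purity`) are illustrative and not those of any particular base-caller.

```python
def purity(intensities):
    """Is the called base much brighter than the others?
    Fraction of the total signal in the brightest channel (1.0 = perfectly clean)."""
    total = sum(intensities)
    return max(intensities) / total if total > 0 else 0.0


def relative_brightness(cycles, i):
    """Did the spot get suspiciously dim compared to the beginning?
    Brightest channel in cycle i relative to the brightest channel in cycle 0."""
    first = max(cycles[0])
    return max(cycles[i]) / first if first > 0 else 0.0


def windowed_purity(cycles, i, radius=2):
    """Does the signal look clean in the neighbouring cycles?
    Worst purity within `radius` cycles on either side of cycle i."""
    lo = max(0, i - radius)
    hi = min(len(cycles), i + radius + 1)
    return min(purity(c) for c in cycles[lo:hi])


if __name__ == "__main__":
    # Three cycles of made-up four-channel intensities.
    cycles = [
        [100, 0, 0, 0],
        [10, 30, 5, 5],
        [0, 0, 20, 20],
    ]
    for i in range(len(cycles)):
        print(i, purity(cycles[i]), relative_brightness(cycles, i),
              windowed_purity(cycles, i, radius=1))
```

Values like these would then be binned and used as the keys of the quality table.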


