Next Generation Sequencing and Data Analysis: What is base calling?

Base-calling converts raw or processed data from a sequencing instrument into sequences and quality scores.

Base-calling usually refers to the conversion of intensity data into sequences and quality scores. The intensity information is extracted from the instrument's images in a preceding image analysis step.

Base-calling has two aspects: Identifying the base-call and assigning a confidence estimate to the call

Quality scores quantify the probability that a base-call is correct (or wrong)

Base quality scores: Individual bases have quality scores which reflect the likelihood of the base being correct/incorrect

Alignment scores: Probability that an alignment to a given position in the reference genome is correct

Allele scores, SNP scores: Probability that a given allele or SNP was observed (often conditional on the alignment being correct)

Base and alignment scores are single-read scores; SNP scores are consensus scores

Consensus calls use information from multiple reads

Phred scores: A base quality score assigned by the phred software (or by a program based on phred, which is the most prominent base-calling program for capillary sequencing)

A quality score expressed on a logarithmic scale:

Q = -10 log10( probability of an error )

Example: Q20 = 1% error probability
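
The relationship between Q and the error probability can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and not part of phred or any other library.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred-scaled quality score Q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred-scaled quality score for an error probability 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)

# Print the error probability for a few common quality values.
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")
```

Every 10 points of Q correspond to a tenfold drop in error probability, so Q30 is a 0.1% error probability and Q40 is 0.01%.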

The Phred method assigns quality scores to a base-call based on observed properties of the base (predictors)

Phred is a two-step process:

1. Training: Given a set of reads, labels as to which bases are correct, and a set of quality statistics for each base, produce a model that can predict error rates for unseen bases

2. Application: Given new reads and quality statistics, predict the quality for each of the bases.

Phred is essentially a big lookup table!
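
The two steps can be sketched as a toy lookup table. This is not the actual phred algorithm: the binned predictor keys, the pseudocount smoothing, the default quality for unseen bins, and all function names are assumptions made for illustration only.

```python
import math
from collections import defaultdict

def train_quality_table(predictor_bins, is_correct, pseudocount=1.0):
    """Training: estimate an error rate per predictor bin from labelled bases."""
    errors = defaultdict(float)
    totals = defaultdict(float)
    for key, ok in zip(predictor_bins, is_correct):
        totals[key] += 1
        if not ok:
            errors[key] += 1
    table = {}
    for key, n in totals.items():
        # Pseudocounts (an assumption of this sketch) keep the estimated
        # error rate away from 0 when a bin contains no observed errors.
        p_err = (errors[key] + pseudocount) / (n + 2 * pseudocount)
        table[key] = round(-10 * math.log10(p_err))
    return table

def apply_quality_table(table, predictor_bins, default_q=2):
    """Application: look up a quality score for each new base."""
    # Bins never seen in training get a low default quality (assumed value).
    return [table.get(key, default_q) for key in predictor_bins]

# Toy training set: a "clean" bin with 1 error in 100 bases,
# and a "noisy" bin with 5 errors in 10 bases.
bins = ["clean"] * 100 + ["noisy"] * 10
labels = [True] * 99 + [False] + [True] * 5 + [False] * 5
table = train_quality_table(bins, labels)
print(table)
print(apply_quality_table(table, ["clean", "noisy", "unseen"]))
```

The predicted quality of a new base depends only on which bin its predictors fall into, which is why the choice of predictors and training data matters so much.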

What goes into a quality table?

Quality predictors are numbers correlated with the quality of a base call, and attempt to quantify concepts such as:

– “Is the signal for the called base much brighter than the others?”

– “Did the spot get suspiciously dim, compared to the beginning?”

– “Does the signal look clean in the next few cycles, and the previous few cycles?”
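
Each of the three questions above could be turned into a number along the following lines. This is a hypothetical sketch, not the predictors used by phred or by any instrument vendor; it assumes one list of four non-negative channel intensities per cycle, and all function names and the window sizes are made up for illustration.

```python
def brightness_ratio(intensities):
    """Brightest channel relative to the two brightest combined (0.5 to 1.0)."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0

def dimming_ratio(cycle_intensities, cycle, n_early=3):
    """Peak intensity at this cycle relative to the mean peak of the first cycles."""
    early = [max(c) for c in cycle_intensities[:n_early]]
    baseline = sum(early) / len(early)
    return max(cycle_intensities[cycle]) / baseline if baseline > 0 else 0.0

def neighbourhood_cleanliness(cycle_intensities, cycle, window=2):
    """Worst brightness ratio among this cycle and its neighbours within `window`."""
    lo = max(0, cycle - window)
    hi = min(len(cycle_intensities), cycle + window + 1)
    return min(brightness_ratio(c) for c in cycle_intensities[lo:hi])

# Toy data: four cycles, four channels each; the last cycle is dim and mixed.
cycles = [
    [100, 10, 5, 5],
    [8, 90, 6, 4],
    [5, 10, 110, 5],
    [50, 40, 5, 5],
]
for i in range(len(cycles)):
    print(i,
          round(brightness_ratio(cycles[i]), 3),
          round(dimming_ratio(cycles, i), 3),
          round(neighbourhood_cleanliness(cycles, i), 3))
```

Values like these would then be binned and used as the keys of the quality table.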

A training data set

– One of the fundamental questions is: How similar is the training data set to the data set the phred table is applied to? Does it make sense for my data?

– Run-to-run variance, type of sample, etc.; the training data needs to be diverse

An alignment method and reference

– phred is inherently based on the assumption that alignments are correct and the reference is well known

– This makes it intrinsically very hard to provide accurate scores for low quality bases!

– Need high-quality references for training: what are the highest qualities we can get for references?

Friday, November 12, 2010
