Next Generation Sequencing and Data Analysis: November 2010

Tuesday, November 30, 2010

dbSNP file naming convention

>ls ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606

dr-xr-xr-x 2 ftp anonymous 4096 Feb 28 2010 1000Genomes

dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 14:52 ASN1_bin

dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 14:55 ASN1_flat

dr-xr-xr-x 2 ftp anonymous 4096 Nov 29 16:39 ASN1_flat_b131

dr-xr-xr-x 3 ftp anonymous 4096 Feb 28 2010 GWAS_arrays

lr--r--r-- 1 ftp anonymous 17 Oct 28 20:17 README_human_9606_B132.txt -> /snp/00readme.txt

dr-xr-xr-x 3 ftp anonymous 4096 Sep 30 18:55 VCF

dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 15:07 XML

dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 15:08 chr_rpts

dr-xr-xr-x 7 ftp anonymous 4096 Sep 28 20:29 database

dr-xr-xr-x 2 ftp anonymous 167936 Nov 3 15:33 gene_report

dr-xr-xr-x 2 ftp anonymous 4096 Feb 28 2010 genome_report

dr-xr-xr-x 3 ftp anonymous 4096 Oct 14 17:39 genotype

dr-xr-xr-x 27 ftp anonymous 4096 Oct 14 18:20 genotype_by_gene

dr-xr-xr-x 3 ftp anonymous 4096 Feb 28 2010 haplotypes

dr-xr-xr-x 5 ftp anonymous 4096 Nov 15 19:05 misc

dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 15:14 rs_fasta

dr-xr-xr-x 15 ftp anonymous 4096 Apr 5 2010 ss_fasta

dr-xr-xr-x 2 ftp anonymous 45056 Nov 17 21:14 viewBatch

****************************************************

example:

14-12162-MKK.vcf.gz

14 -> Chromosome 14

12162 -> dbSNP population ID (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?pop=12162)

MKK -> Three letter population identifier (http://ccr.coriell.org/sections/collections/NHGRI/?SsId=11)

Friday, November 12, 2010

What is base calling?

Base-calling converts raw or processed data from a sequencing instrument into sequences and quality scores.

Base-calling usually refers to the conversion of intensity data into sequences and quality scores. Intensity information is extracted from images by the image analysis.

Base-calling has two aspects: Identifying the base-call and assigning a confidence estimate to the call

Quality scores quantify the probability that a base-call is correct (or wrong)

Base quality scores: Individual bases have quality scores which reflect the likelihood of the base being correct/incorrect

Alignment scores: Probability than an alignment to a given position in the reference genome is correct

Allele scores, SNP scores: Probability that a given allele, SNP was observed (often conditional on the alignment being correct)

Base and alignment scores: are single read scores; SNP scores are consensus scores

Consensus calls use information from multiple reads

Phred scores: A base quality score assigned by the phred software (or a program based on the phred which is the most prominent base-calling program for capillary sequencing.)

A quality score expressed on a logarithmic scale:

Q = -10 log10( probability of an error )

Example: Q20 = 1% error probability

The Phred method assigns quality scores to a base-call based on observed properties of the base (predictors)

Phred is a two-step process:

1. Training: Given a set of reads, labels as to which bases are correct, and a set of quality statistics for each base, produce a model that can predict error rates for unseen bases

2. Application: Given new reads and quality statistics, predict the quality for each of the bases.

Phred is essentially a big lookup table!

What goes into a quality table?

Quality predictors are numbers correlated with the quality of a base call, and attempt to quantify concepts such as:

–

“Is the signal for the called base much brighter than the others?”

–

“Did the spot get suspiciously dim, compared to the beginning?”

–

“Does the signal look clean in the next few cycles, and the previous few cycles?

A training data set

–

One of the fundamental questions is: How similar is the training data set to the data set the phred table is applied? Does it make sense for my data?

–

Run-to-run variance, type of sample etc.; need diversity in training

An alignment method and reference

–

phred is inherently based on the assumption that alignments are correct andthe reference is well known

–

Makes it intrinsically very hard to provide accurate scores for low quality bases!

–

Need high-quality references for training –what are the highest qualities we can get for references?

Monday, November 8, 2010

Pileup format

samtools

chr1 808922 G A 21 116 53 9 aAaaaaaAA 9=;:@?9:,

chr1 ---> name of this sequence is chr1

808922 ---> coordinate is 808922 (1-based)

G ---> reference base is 'G'

A ---> consensus base is 'A'

21 ---> consensus quality is 21

116 ---> SNP quality is 116

53 ---> maximum mapping quality is 53

9 ---> the number of reads covering the site is 9

aAaaaaaAA ---> read bases are 'aAaaaaaAA'

9=;:@?9:, --->base qualities are '9=;:@?9:,'

consensus quality (21) is the Phred-scaled probability that the consensus is wrong. SNP quality (116) is the Phred-scaled probability that the consensus is identical to the reference.

For SNP calling, SNP quality is of more importance.

For example, to further filter the SNP calls:

samtools.pl varFilter raw.pileup | awk '$6>=20' > final.pileup

SNP call

novoalign -o SAM -F ILMFQ -t120 -r0.2 -q5 -d hg19.ndx -f 7746x2_s_6_1_sequence.txt.gz 7746x2_s_6_2_sequence.txt.gz | grep chr > 7746x2.sam

samtools view -bS 7746x2.sam | samtools sort - 7746x2

samtools rmdup 7746x2.bam 7746x2.sorted.bam

samtools index 7746x2.sorted.bam

samtools view -u -q 20 7746x2.sorted.bam | samtools pileup -vcf /DataDisk/Analysis/Sun/fa/hg19/chr1.fa - | samtools.pl varFilter -D 100 -G 50 > 1.var

awk '($3=="*"&&$6>=50)||($3!="*"&&$6>30)' 1.var > 1.var.filtered

Tuesday, November 2, 2010

delete lines from a file

sed -e '2d' hello.txt # delete line #2 from hello.txt

sed -e '1,36d' hello.txt # delete line 1-36 from hello.txt

awk '!(NR>=1 && NR<=36){print $0}' hello.txt # delete line 1-36 from hello.txt

Clipped alignment. In Smith-Waterman alignment, a sequence may not be aligned from the first residue to the last one. Subsequences at the ends may be clipped off. We introduce operation ʻSʼ to describe (softly) clipped alignment. Here is an example. Suppose the clipped alignment is:

REF:AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGATCGACTGTCAGCTAGTCAGACTAGTCGATCGATGTG

READ: gggGTGTAACC-GACTAGgggg

where on the read sequence, bases in uppercase are matches and bases in lowercase are clipped off. The CIGAR for this alignment is: 3S8M1D6M4S.

Next Generation Sequencing and Data Analysis