Tuesday, November 30, 2010

dbSNP file naming convention

>ls ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606

dr-xr-xr-x 2 ftp anonymous 4096 Feb 28 2010 1000Genomes
dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 14:52 ASN1_bin
dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 14:55 ASN1_flat
dr-xr-xr-x 2 ftp anonymous 4096 Nov 29 16:39 ASN1_flat_b131
dr-xr-xr-x 3 ftp anonymous 4096 Feb 28 2010 GWAS_arrays
lr--r--r-- 1 ftp anonymous 17 Oct 28 20:17 README_human_9606_B132.txt -> /snp/00readme.txt
dr-xr-xr-x 3 ftp anonymous 4096 Sep 30 18:55 VCF
dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 15:07 XML
dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 15:08 chr_rpts
dr-xr-xr-x 7 ftp anonymous 4096 Sep 28 20:29 database
dr-xr-xr-x 2 ftp anonymous 167936 Nov 3 15:33 gene_report
dr-xr-xr-x 2 ftp anonymous 4096 Feb 28 2010 genome_report
dr-xr-xr-x 3 ftp anonymous 4096 Oct 14 17:39 genotype
dr-xr-xr-x 27 ftp anonymous 4096 Oct 14 18:20 genotype_by_gene
dr-xr-xr-x 3 ftp anonymous 4096 Feb 28 2010 haplotypes
dr-xr-xr-x 5 ftp anonymous 4096 Nov 15 19:05 misc
dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 15:14 rs_fasta
dr-xr-xr-x 15 ftp anonymous 4096 Apr 5 2010 ss_fasta
dr-xr-xr-x 2 ftp anonymous 45056 Nov 17 21:14 viewBatch

****************************************************

example:
14-12162-MKK.vcf.gz
14 -> Chromosome 14
12162 -> dbSNP population ID (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?pop=12162)
MKK -> Three letter population identifier (http://ccr.coriell.org/sections/collections/NHGRI/?SsId=11)

Friday, November 12, 2010

What is base calling?

Base-calling converts raw or processed data from a sequencing instrument into sequences and quality scores.

Base-calling usually refers to the conversion of intensity data into sequences and quality scores. Intensity information is extracted from images by the image analysis.

Base-calling has two aspects: Identifying the base-call and assigning a confidence estimate to the call

Quality scores quantify the probability that a base-call is correct (or wrong)

Base quality scores: Individual bases have quality scores which reflect the likelihood of the base being correct/incorrect

Alignment scores: Probability than an alignment to a given position in the reference genome is correct

Allele scores, SNP scores: Probability that a given allele, SNP was observed (often conditional on the alignment being correct)

Base and alignment scores: are single read scores; SNP scores are consensus scores

Consensus calls use information from multiple reads


Phred scores: A base quality score assigned by the phred software (or a program based on the phred which is the most prominent base-calling program for capillary sequencing.)

A quality score expressed on a logarithmic scale:

Q = -10 log10( probability of an error )

Example: Q20 = 1% error probability

The Phred method assigns quality scores to a base-call based on observed properties of the base (predictors)

Phred is a two-step process:

1. Training: Given a set of reads, labels as to which bases are correct, and a set of quality statistics for each base, produce a model that can predict error rates for unseen bases

2. Application: Given new reads and quality statistics, predict the quality for each of the bases.
Phred is essentially a big lookup table!

What goes into a quality table?

Quality predictors are numbers correlated with the quality of a base call, and attempt to quantify concepts such as:
“Is the signal for the called base much brighter than the others?”
“Did the spot get suspiciously dim, compared to the beginning?”
“Does the signal look clean in the next few cycles, and the previous few cycles?
A training data set
One of the fundamental questions is: How similar is the training data set to the data set the phred table is applied? Does it make sense for my data?
Run-to-run variance, type of sample etc.; need diversity in training
An alignment method and reference
phred is inherently based on the assumption that alignments are correct andthe reference is well known
Makes it intrinsically very hard to provide accurate scores for low quality bases!
Need high-quality references for training –what are the highest qualities we can get for references?



Monday, November 8, 2010

Pileup format

samtools

chr1 808922 G A 21 116 53 9 aAaaaaaAA 9=;:@?9:,


chr1 ---> name of this sequence is chr1
808922 ---> coordinate is 808922 (1-based)
G ---> reference base is 'G'
A ---> consensus base is 'A'
21 ---> consensus quality is 21
116 ---> SNP quality is 116
53 ---> maximum mapping quality is 53
9 ---> the number of reads covering the site is 9
aAaaaaaAA ---> read bases are 'aAaaaaaAA'
9=;:@?9:, --->base qualities are '9=;:@?9:,'


consensus quality (21) is the Phred-scaled probability that the consensus is wrong. SNP quality (116) is the Phred-scaled probability that the consensus is identical to the reference.
For SNP calling, SNP quality is of more importance.

For example, to further filter the SNP calls:
samtools.pl varFilter raw.pileup | awk '$6>=20' > final.pileup

SNP call

novoalign -o SAM -F ILMFQ -t120 -r0.2 -q5 -d hg19.ndx -f 7746x2_s_6_1_sequence.txt.gz 7746x2_s_6_2_sequence.txt.gz | grep chr > 7746x2.sam

samtools view -bS 7746x2.sam | samtools sort - 7746x2

samtools rmdup 7746x2.bam 7746x2.sorted.bam

samtools index 7746x2.sorted.bam

samtools view -u -q 20 7746x2.sorted.bam | samtools pileup -vcf /DataDisk/Analysis/Sun/fa/hg19/chr1.fa - | samtools.pl varFilter -D 100 -G 50 > 1.var
awk '($3=="*"&&$6>=50)||($3!="*"&&$6>30)' 1.var > 1.var.filtered

Tuesday, November 2, 2010

delete lines from a file

sed -e '2d' hello.txt # delete line #2 from hello.txt
sed -e '1,36d' hello.txt # delete line 1-36 from hello.txt
awk '!(NR>=1 && NR<=36){print $0}' hello.txt # delete line 1-36 from hello.txt

soft clipped

Clipped alignment. In Smith-Waterman alignment, a sequence may not be aligned from the first residue to the last one. Subsequences at the ends may be clipped off. We introduce operation ʻSʼ to describe (softly) clipped alignment. Here is an example. Suppose the clipped alignment is:
REF:AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGATCGACTGTCAGCTAGTCAGACTAGTCGATCGATGTG
READ: gggGTGTAACC-GACTAGgggg

where on the read sequence, bases in uppercase are matches and bases in lowercase are clipped off. The CIGAR for this alignment is: 3S8M1D6M4S.