Monday, February 27, 2012

Intersecting VCF

We can use vcftools (together with bgzip and tabix) to intersect VCF files. The first 4 columns for each VCF file(chr, position, ref and alt) can be easily overlapped. However it is undefined on how to "intersecting scores" in VCF files because it is not a simple arithmetic operation. To generate a VCF format compatible file, vcftools used the scores from the first VCF file only when intersecting multiple VCF files. This is simple but not a perfect solution.

Another possible solution is to merge the BAM firstly before calling variances. Therefore we will have only one VCF file after calling variances.

Is this perfect solution? Maybe. Merging alignment BAM files from different samples (individuals) is not well studied. We need a well known dataset to benchmark these two methods.

Tuesday, February 21, 2012

MarkDuplicate is very necessary

Recently I did an analysis which basically comparing the positions of nucleosome in two samples to see if the pattern differs.

The target region is only 1500bp in length. HiSeq generated big raw read sequence files. After aligning to hg19, I can see many alignemnts are incorrect because they were mapped to other chromosomes and positions.

The first analysis without MarkDuplicate from Picard in BAM files result in huge peaks of depth of coverage which basically drowns any useful signals. After removing duplicates, the peaks were totally removed and I got strong signals.

Later I found another problem in GATK's DepthOfCoverage which reports 1000 as maximum depth of coverage. For example, even the actual depth of coverage on two positions are 5000 and 1020 respectively, GATK reports the same 1000.

As an alternative solution, samtools can report the actual depth of coverage without truncating.

samtools mpileup -ABr chr1:12345-123456 -d 1000000 x1.bam | awk '{print $4}' > x1.cov
samtools mpileup -ABr chr1:12345-123456 -d 1000000 x2.bam | awk '{print $4}' > x2.cov

Extracting the 4th column that has the depth of coverage.

The final step is plotting with R.


x1=as.numeric(unlist(read.table('x1.cov')))
x2=as.numeric(unlist(read.table('x2.cov')))
pdf(file="1.pdf")
op = par( mfrow = c( 3, 1 ))
barplot(x1)
title("X1")
barplot(x2)
title("X2")
op = par(op)
dev.off()
pdf(file="2.pdf")
plot(x1,type='l',col='green',xlab='chr1:12345-123456',ylab='Depth of Coverage')
lines(x2,type='l',col='red')
legend(10, 8000, c("X1", "X2"), col = c("green","red"), text.col = "green4", lty = c(3,4), pch = c(3,4), merge = TRUE, bg = 'gray90')
dev.off()

Remove any variances that were in CEU population


#use CEU as common variance database, not dbSNP
import sys
def filter_ceu(target_file,background_file='CEU.all.ids'):
x = []
for line in file(background_file):
x.append(line.strip())
mx = {}.fromkeys(x)
for line in file(target_file):
line = line.strip()
if line.startswith('#'):
pass
#print line
else:
toks = line.split('\t')
if not mx.has_key(toks[2]):
print line
else:
pass

if __name__=='__main__':
filter_ceu(sys.argv[1])