Next Generation Sequencing and Data Analysis: MarkDuplicate is very necessary

Recently I did an analysis which basically comparing the positions of nucleosome in two samples to see if the pattern differs.

The target region is only 1500bp in length. HiSeq generated big raw read sequence files. After aligning to hg19, I can see many alignemnts are incorrect because they were mapped to other chromosomes and positions.

The first analysis without MarkDuplicate from Picard in BAM files result in huge peaks of depth of coverage which basically drowns any useful signals. After removing duplicates, the peaks were totally removed and I got strong signals.

Later I found another problem in GATK's DepthOfCoverage which reports 1000 as maximum depth of coverage. For example, even the actual depth of coverage on two positions are 5000 and 1020 respectively, GATK reports the same 1000.

As an alternative solution, samtools can report the actual depth of coverage without truncating.

samtools mpileup -ABr chr1:12345-123456 -d 1000000 x1.bam | awk '{print $4}' > x1.cov
samtools mpileup -ABr chr1:12345-123456 -d 1000000 x2.bam | awk '{print $4}' > x2.cov

Extracting the 4th column that has the depth of coverage.

The final step is plotting with R.


x1=as.numeric(unlist(read.table('x1.cov')))
x2=as.numeric(unlist(read.table('x2.cov')))
pdf(file="1.pdf") 
op = par( mfrow = c( 3, 1 ))
barplot(x1)
title("X1")
barplot(x2)
title("X2")
op = par(op) 
dev.off() 
pdf(file="2.pdf") 
plot(x1,type='l',col='green',xlab='chr1:12345-123456',ylab='Depth of Coverage')
lines(x2,type='l',col='red')
legend(10, 8000, c("X1", "X2"), col = c("green","red"), text.col = "green4", lty = c(3,4), pch = c(3,4), merge = TRUE, bg = 'gray90')
dev.off()

Next Generation Sequencing and Data Analysis

Tuesday, February 21, 2012

MarkDuplicate is very necessary

No comments:

Post a Comment

About Me

Blog Archive