Tuesday, March 13, 2012

Failed: Install and use VAT

tar -zxvf libbios-1.0.0.tar.gz
cd libbios-1.0.0
./configure --prefix=/home/app/VAT/out/libbios-1.0.0;make;make install

tar -zxvf gsl-1.15.tar.gz
cd gsl-1.15
./configure --prefix=/home/app/VAT/out/gsl-1.15;make;make install


#cd gd-libgd
#./configure --prefix=/home/app/VAT/out/libgd

#blatSuite34
unzip blatSuite.34.zip -d /home/app/VAT/out/blatSuite-34

tar -zxvf libs3-2.0.tar.gz
cd libs3-2.0
make DESTDIR=/home/app/VAT/out/libs3 install

export CPPFLAGS="-I/home/app/VAT/out/gsl-1.15/include -I/home/app/VAT/out/libbios-1.0.0/include -I/home/app/VAT/out/libs3/include -ggdb"
export LDFLAGS="-L/home/app/VAT/out/gsl-1.15/lib -L/home/app/VAT/out/libbios-1.0.0/lib -L/home/app/VAT/out/libs3/lib"

tar -zxvf vat-2.0.0.tar.gz
cd vat-2.0.0

#vim default.vatrc
#TABIX_DIR /home/app/tabix
#VAT_EXEC_DIR /home/app/VAT/out/vat/bin

./configure --prefix=/home/app/VAT/out/vat;make;make install
export VAT_CONFIG_FILE=/home/app/.vatrc
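For reference, a minimal ~/.vatrc under this layout might contain just the two keys edited above. The key names are taken from the commented default.vatrc lines; the exact format is assumed, so check default.vatrc in the tarball.

```
TABIX_DIR /home/app/tabix
VAT_EXEC_DIR /home/app/VAT/out/vat/bin
```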



cd /home/app/test
gunzip gencode.v11.annotation.gtf.gz
awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v11.annotation.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v11.annotation.cds.gtpc.ttpc.gtf
/home/app/VAT/out/vat/bin/gencode2interval < gencode.v11.annotation.cds.gtpc.ttpc.gtf > gencode.v11.annotation.cds.gtpc.ttpc.interval
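The awk/grep filter above can be sanity-checked on a tiny fabricated GTF (mini.gtf is a hypothetical two-line file, not real GENCODE data): only HAVANA/ENSEMBL CDS lines tagged protein_coding for both gene and transcript should survive.

```shell
# one qualifying CDS line and one exon line that should be filtered out
printf 'chr1\tHAVANA\tCDS\t100\t200\t.\t+\t0\tgene_type "protein_coding"; transcript_type "protein_coding";\n' > mini.gtf
printf 'chr1\tHAVANA\texon\t100\t200\t.\t+\t0\tgene_type "protein_coding"; transcript_type "protein_coding";\n' >> mini.gtf
# same pipeline as above: keep CDS features with both protein_coding tags
awk '/\t(HAVANA|ENSEMBL)\tCDS\t/' mini.gtf \
  | grep 'gene_type "protein_coding"' \
  | grep 'transcript_type "protein_coding"'
# only the CDS line is printed
```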


./faToTwoBit -noMask /tomato/data/hg19.fasta hg19.2bit

export LD_LIBRARY_PATH=/home/app/VAT/out/libbios-1.0.0/lib:/home/app/VAT/out/libs3/lib:/home/app/VAT/out/gsl-1.15/lib



/home/app/VAT/out/vat/bin/interval2sequences hg19.2bit gencode.v11.annotation.cds.gtpc.ttpc.interval exonic > gencode.v11.annotation.cds.gtpc.ttpc.exonic.fasta
#twoBitReadSeqFrag in chr6 start (53371710) >= end (53371710)
#Segmentation fault
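The twoBitReadSeqFrag error complains about a record with start >= end, so one way to hunt for the trigger is to scan the filtered GTF for 1 bp or inverted CDS records, which can become zero-length once converted to half-open intervals. A sketch on a hypothetical demo.gtf (the chr6 line mirrors the coordinate from the error above):

```shell
# demo.gtf stands in for the filtered GENCODE file
printf 'chr6\tHAVANA\tCDS\t53371710\t53371710\t.\t+\t0\tgene_id "demo1";\n' > demo.gtf
printf 'chr6\tHAVANA\tCDS\t100\t200\t.\t+\t0\tgene_id "demo2";\n' >> demo.gtf
# report records where GTF start >= end (1 bp or inverted features)
awk -F'\t' '$4 >= $5 {print $1 ":" $4 "-" $5}' demo.gtf
# prints chr6:53371710-53371710
```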

/home/app/VAT/out/vat/bin/interval2sequences hg19.2bit gencode.v11.annotation.cds.gtpc.ttpc.interval genomic > gencode.v11.annotation.cds.gtpc.ttpc.genomic.fasta
#pass

/home/app/VAT/out/vat/bin/snpMapper gencode.v11.annotation.cds.gtpc.ttpc.interval gencode.v11.annotation.cds.gtpc.ttpc.genomic.fasta 1
#Segmentation fault

/home/app/VAT/out/vat/bin/indelMapper gencode.v11.annotation.cds.gtpc.ttpc.interval gencode.v11.annotation.cds.gtpc.ttpc.genomic.fasta 2

Thursday, March 1, 2012

R, DESeq and qvalue installation

wget http://cran.cnr.berkeley.edu/src/base/R-2/R-2.14.2.tar.gz
tar -zxvf R-2.14.2.tar.gz
cd R-2.14.2;./configure --prefix=/home/R-2.14.2;make;make install
cd /home/R-2.14.2/bin

wget http://genomics.princeton.edu/storeylab/qvalue/qvalue_1.1.tar.gz
./R CMD INSTALL qvalue_1.1.tar.gz
rm -f qvalue_1.1.tar.gz

#inside an R session (./R), install DESeq from Bioconductor
source("http://www.bioconductor.org/biocLite.R")
biocLite("DESeq")

phastConsElements46way and avsift

wget ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/phastConsElements46way.txt.gz
gunzip phastConsElements46way.txt.gz
mv phastConsElements46way.txt hg19_phastConsElements46way.txt
mv hg19_phastConsElements46way.txt /tomato/data
/tomato/app/annovar/convert2annovar.pl x1.pass.vcf -format vcf4 > x1.annovar
/tomato/app/annovar/annotate_variation.pl -regionanno --buildver hg19 -dbtype mce46way x1.annovar /tomato/data/


/tomato/app/annovar/annotate_variation.pl -downdb -buildver hg19 avsift .
mv hg19_avsift.txt* /tomato/data
/tomato/app/annovar/convert2annovar.pl x1.pass.vcf -format vcf4 > x1.annovar
/tomato/app/annovar/annotate_variation.pl -filter -buildver hg19 -dbtype avsift x1.annovar /tomato/data/

Monday, February 27, 2012

Intersecting VCF

We can use vcftools (together with bgzip and tabix) to intersect VCF files. The first four columns of each VCF file (chr, position, ref, and alt) are easy to overlap. However, it is undefined how to "intersect" the scores in VCF files, because that is not a simple arithmetic operation. To produce VCF-compatible output, vcftools keeps the scores from the first VCF file only when intersecting multiple VCF files. This is simple, but not a perfect solution.
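The "scores from the first file" behavior can be mimicked with plain awk on the matching columns. A toy sketch with two fabricated two-variant files (a.vcf and b.vcf are hypothetical, not real vcftools input); only chr1:100 A>G is shared, and the kept record carries a.vcf's score:

```shell
printf 'chr1\t100\t.\tA\tG\t50\n' >  a.vcf
printf 'chr1\t200\t.\tC\tT\t60\n' >> a.vcf
printf 'chr1\t100\t.\tA\tG\t99\n' >  b.vcf
printf 'chr2\t300\t.\tG\tA\t10\n' >> b.vcf
# keep a.vcf records whose (chr, pos, ref, alt) also occur in b.vcf;
# scores (column 6) come from a.vcf, mirroring vcftools' choice
awk -F'\t' 'NR==FNR {seen[$1 FS $2 FS $4 FS $5]; next}
            ($1 FS $2 FS $4 FS $5) in seen' b.vcf a.vcf
# prints the chr1 100 A G line with score 50
```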

Another possible solution is to merge the BAM files first, before calling variants. That way there is only one VCF file after variant calling.

Is this a perfect solution? Maybe. Merging alignment BAM files from different samples (individuals) is not well studied. We would need a well-known dataset to benchmark the two approaches.

Tuesday, February 21, 2012

MarkDuplicates is very necessary

Recently I did an analysis that basically compared the positions of nucleosomes in two samples to see if the patterns differ.

The target region is only 1500 bp long, yet HiSeq generated large raw read files. After aligning to hg19, I could see that many alignments were incorrect because they mapped to other chromosomes and positions.

The first analysis, run without Picard's MarkDuplicates on the BAM files, produced huge peaks in the depth of coverage that drowned out any useful signal. After removing duplicates, the peaks disappeared and I got strong signals.

Later I found another problem: GATK's DepthOfCoverage caps the reported depth of coverage at 1000. For example, even if the actual depths at two positions are 5000 and 1020, GATK reports 1000 for both.

As an alternative, samtools can report the actual depth of coverage without truncation.

samtools mpileup -ABr chr1:12345-123456 -d 1000000 x1.bam | awk '{print $4}' > x1.cov
samtools mpileup -ABr chr1:12345-123456 -d 1000000 x2.bam | awk '{print $4}' > x2.cov

The awk step extracts the 4th column, which holds the depth of coverage.
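On a fabricated mpileup-style line (chrom, pos, ref, depth, bases, quals; not real samtools output), the extraction looks like:

```shell
# column 4 of mpileup output is the per-position depth
printf 'chr1\t12345\tA\t17\t.........,,,,,,,,\tIIIIIIIIIIIIIIIII\n' \
  | awk '{print $4}'
# prints 17
```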

The final step is plotting with R.


x1=as.numeric(unlist(read.table('x1.cov')))
x2=as.numeric(unlist(read.table('x2.cov')))
pdf(file="1.pdf")
op = par( mfrow = c( 3, 1 ))
barplot(x1)
title("X1")
barplot(x2)
title("X2")
par(op)
dev.off()
pdf(file="2.pdf")
plot(x1,type='l',col='green',xlab='chr1:12345-123456',ylab='Depth of Coverage')
lines(x2,type='l',col='red')
legend(10, 8000, c("X1", "X2"), col = c("green","red"), text.col = "green4", lty = c(3,4), pch = c(3,4), merge = TRUE, bg = 'gray90')
dev.off()

Remove any variants that are in the CEU population


#use CEU as the common variant database, not dbSNP
import sys

def filter_ceu(target_file, background_file='CEU.all.ids'):
    # load the CEU variant IDs into a set for fast membership tests
    with open(background_file) as f:
        ceu_ids = {line.strip() for line in f}
    # print VCF records whose ID (column 3) is not a known CEU variant;
    # header lines ('#') are skipped
    with open(target_file) as f:
        for line in f:
            line = line.strip()
            if line.startswith('#'):
                continue
            toks = line.split('\t')
            if toks[2] not in ceu_ids:
                print(line)

if __name__ == '__main__':
    filter_ceu(sys.argv[1])

Tuesday, January 3, 2012

1000 Genomes Populations

CHB Han Chinese Han Chinese in Beijing, China
CHS Southern Han Chinese Han Chinese South
CDX Dai Chinese Chinese Dai in Xishuangbanna, China
CHD Denver Chinese Chinese in Denver, Colorado (pilot 3 only)
JPT Japanese Japanese in Tokyo, Japan
KHV Kinh Vietnamese Kinh in Ho Chi Minh City, Vietnam

CEU CEPH Utah residents (CEPH) with Northern and Western European ancestry
TSI Tuscan Toscani in Italia
GBR British British in England and Scotland
FIN Finnish Finnish in Finland
IBS Spanish Iberian populations in Spain

YRI Yoruba Yoruba in Ibadan, Nigeria
LWK Luhya Luhya in Webuye, Kenya
GWD Gambian Gambian in Western Division, The Gambia
GHN Ghanaian Ghanaian in Navrongo, Ghana
MAB Malawian Malawian in Blantyre, Malawi

ASW African-American SW African Ancestry in Southwest US
AJM African-American MS African American in Jackson, Mississippi
ACB African-Caribbean African Caribbean in Barbados

MXL Mexican-American Mexican Ancestry in Los Angeles, California
CLM Colombian Colombian in Medellin, Colombia
PEL Peruvian Peruvian in Lima, Peru
PUR Puerto Rican Puerto Rican in Puerto Rico