Monday, June 20, 2011

Indel filter

java -Xmx32g -jar $HOME/tool/gatk/GenomeAnalysisTK.jar -T VariantFiltration -R $HOME/data/hg19.fasta \
-o x1.indel.filtered.vcf \
-B:variant,VCF x1.indel.vcf \
-B:mask,VCF ../SNP/x1.snp.vcf \
--maskName SNP \
--filterExpression "QUAL < 30 || SB > -1.0 || QD < 2.0" \
--filterName "QUAL<30;SB>-1.0;QD<2"

SNP Filter

#uses SB >= 0.10 filter
#uses QD < 5.0 filter
#uses HRun >= 4 filter
#uses clustered SNPs (>= 3 SNPs within 10 bp) filter
#masks out SNPs that overlap the filtered indels

java -Xmx32g -jar $HOME/tool/gatk/GenomeAnalysisTK.jar -T VariantFiltration -R $HOME/data/hg19.fasta \
-o x1.snp.filtered.vcf \
-B:variant,VCF x1.snp.vcf \
-B:mask,VCF ../INDEL/x1.indel.vcf \
--maskName InDel \
--clusterSize 3 \
--clusterWindowSize 10 \
--filterExpression "SB >= 0.10 || QD < 5.0 || HRun >= 4" \
--filterName "SB>=0.1;QD<5.0;HRun>=4;Cluster_3in10;MaskIndel"

#copy header and "PASS" items to "x1.snp.passed.vcf"

awk '/^#/ { print $0;next} $7 == "PASS" {print $0} ' x1.snp.filtered.vcf > x1.snp.passed.vcf



Friday, June 17, 2011

Build Genome Index

#novoalign
#normal
novoindex -k 14 -s 1 hg19.nov.normal.nix hg19.fasta

#solid color space
novoindex -k 14 -s 1 -c hg19.nov.color.nix hg19.fasta

#bisulphite
novoindex -k 14 -s 1 -b hg19.nov.bis.nix hg19.fasta

#normal
bwa index -a bwtsw -p hg19.bwa.normal hg19.fasta

#color
bwa index -a bwtsw -c -p hg19.bwa.color hg19.fasta

wget http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/chr*.fa.gz

novoindex -k 14 -s 1 mm9.nov.normal.nix genomes/mm9/fasta/mm9.fasta
novoindex -k 14 -s 1 -c mm9.nov.color.nix genomes/mm9/fasta/mm9.fasta
novoindex -k 14 -s 1 -b mm9.nov.bis.nix genomes/mm9/fasta/mm9.fasta

#normal
./bwa index -a bwtsw -p mm9.bwa.normal genomes/mm9/fasta/mm9.fasta
./bwa index -a bwtsw -c -p mm9.bwa.color genomes/mm9/fasta/mm9.fasta

Call SNP & Indel

java -Xmx32g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R hg19.fasta -B:dbsnp,VCF dbsnp132.human.vcf -glm BOTH -o X1.vcf -I X1.bam -l INFO -stand_call_conf 20 -stand_emit_conf 10


VAT

VAT

http://archive.gersteinlab.org/proj/VAT/


$pwd
$HOME/VAT

#download&install GNU Scientific Library
$wget http://mirror.clarkson.edu/gnu/gsl/gsl-1.15.tar.gz
$sudo tar -zxvf gsl-1.15.tar.gz;cd gsl-1.15; ./configure; make; make install

#download&install gettext
$wget http://ftp.gnu.org/pub/gnu/gettext/gettext-0.18.1.1.tar.gz
$sudo tar -zxvf gettext-0.18.1.1.tar.gz;cd gettext-0.18.1.1;./configure; make; make install

#download GD Library
http://google-desktop-for-linux-mirror.googlecode.com/files/gd-2.0.35.tar.gz
sudo tar -zxvf gd-2.0.35.tar.gz;cd gd/2.0.35/;./configure; make; make install


#libbios
$wget http://homes.gersteinlab.org/people/lh372/VAT/libbios-1.0.0.tar.gz
$sudo tar -zxvf libbios-1.0.0.tar.gz; cd libbios-1.0.0
$sudo export CPPFLAGS="-I$HOME/VAT/libbios-1.0.0/include"
$sudo export LDFLAGS="-L$HOME/VAT/libbios-1.0.0/lib"
$sudo ./configure;make;make install


#twoBitToFa is required for "interval2sequences"
$wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
$chmod +x twoBitToFa
$cp twoBitToFa $HOME/tool/


#download&install VAT package
$wget http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0.tar.gz
$tar -zxvf vat-1.0.0.tar.gz; cd vat-1.0.0
$mkdir tmp
$./configure --prefix=$HOME/VAT/vat-1.0.0/tmp;make;make install
$mv tmp/* $HOME/VAT/tool


wget ftp://ftp.sanger.ac.uk/pub/gencode/release_3c/gencode.v3c.annotation.GRCh37.gtf.gz
gunzip gencode.v3c.annotation.GRCh37.gtf.gz


wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

export LD_LIBRARY_PATH=$HOME/VAT/lib:$LD_LIBRARY_PATH

cat x2.snp.vcf | snpMapper gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > x2.snp.annotated.vcf

cat x2.indel.vcf | indelMapper gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > x2.indel.annotated.vcf

Thursday, June 16, 2011

An python script to seperate SNP and Indel calls in one VCF


#!/usr/bin/python
def split_snp_indel(fvcf):
basename = fvcf.rstrip('.vcf')
fsnp = file(basename+'.snp.vcf','w')
findel = file(basename+'.indel.vcf','w')
for line in file(fvcf):
line = line.strip()
if line.startswith('#'):
print >>fsnp, line
print >> findel, line
else:
toks = line.split('\t')
ref = toks[3]
alt = toks[4]
if len(ref)> 1 or len(alt)> 1:
print >> findel, line
else:
print >>fsnp, line

if __name__=='__main__':
import sys
split_snp_indel(sys.argv[1])


Wednesday, June 15, 2011

GATK resource bundle

ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/5974/hg19

Monday, June 13, 2011

update GATK and PICARD

#!/bin/bash
rm -fr app/gatk/*
svn co https://svn.broadinstitute.org/Sting/trunk GATK
cd GATK; ant; cd ..; cp GATK/dist/* app/gatk/; rm -fr GATK
rm -fr app/picard/*
svn co https://picard.svn.sourceforge.net/svnroot/picard/trunk PICARD
cd PICARD; ant; cd ..; cp PICRAD/dist/* app/picard/; rm -fr PICARD

Monday, June 6, 2011

Hide a control in Flex

btn_hello.visible = false; 
btn_hello.includeInLayout = false;

Wednesday, June 1, 2011

Ambiguous Code in SOLiD Color Space

An Example from a SOLiD Color Space aligned SAM file:

CS:Z:T3.3300220113.2323222031301323303202033330300301210


Here "CS:Z" field gives the color read sequence "T3.33". The first character after the "CS:Z:" is always "T", after that, all valid code should be range from 0 to 3.

0 -this base is same as previous base
1 - this base is the transversion of previous base
2 - this base is the transition of previous base
3 - this base is the complement of previous base

However sometimes we can observe ambiguous code like "." or "4". You may notice the 3rd base is not in 0~3, it is a "." - an ambiguous code indicates an ambiguous base, could be any base of A, C, G and T.

This ambiguous code will fail the GATK's "CountCovariates" tool. It will complain


Error Caused by: org.broadinstitute.sting.utils.exceptions.UserException$MalformedBAM: SAM/BAM file SAMFileReader{/home/u0592675/A232/441_10xF3.mate.dup.sort.realign.bam}
is malformed: Unrecognized color space in SOLID read, color = . Unfortunately this bam file can not be recalibrated without full color space information because of potential reference bias.



In a simple test, around 1% (58297 out of 5275136) alignments has at least one ambiguous code. So here we have two options to eliminate this problem:

1. make a script to change all ambiguous codes into random valid codes. It looks like:

m = re.compile("\.")
m.sub(random.choice(('0','1','2','3')),CS_Z_STRING)

where CS_Z_STRING is like above example.

However, this may bring a serious problem because all bases execpt for the 1st base are depending on previous base. If we choose a wrong code which probability is 3/4, then all the following bases will be wrong.

2. delete the affected alignments

Considering the small percentage (1%) of ambiguous code affected alignments, it may be wise to delete those troublesome lines.

m1 = re.compile("CS:Z:T[0-3]*?[^0-3\s]")
def delete_ambiguous_line(fin,fout):
fo = file(fout,'w')
i = j = 0
for line in file(fin):
line = line.strip()
if line.startswith('@'):
nline = line
else:
if m1.search(line):
j += 1
else:
i += 1
nline = line
print >>fo,nline
print "Fix ambiguous code - Total: %d, Deleted: %d, Percent of Deleted: %d%%" % (i+j,j,j*100/(i+j))
fo.close()