Wednesday, February 23, 2011

Tomato tutorial

Suppose you have a folder named "test" which looks like:

$pwd

/home/u12345/test

$ll
-rw-r--r-- 1 hiseq hiseq 2145917 Feb 23 11:05 7550X1_1_1.txt.gz
-rw-r--r-- 1 hiseq hiseq 2212841 Feb 23 11:05 7550X1_1_2.txt.gz
-rw-r--r-- 1 hiseq hiseq 2114459 Feb 23 11:05 7550X1_2_1.txt.gz
-rw-r--r-- 1 hiseq hiseq 2135003 Feb 23 11:05 7550X1_2_2.txt.gz
-rw-r--r-- 1 hiseq hiseq 2129933 Feb 23 11:05 7550X1_3_1.txt.gz
-rw-r--r-- 1 hiseq hiseq 2168390 Feb 23 11:05 7550X1_3_2.txt.gz
-rw-r--r-- 1 hiseq hiseq 2147386 Feb 23 11:08 7550X2_5_1.txt.gz

Now you want to do a alignment against "humen genome build 19" using these fastq files.
Following two steps to do this:

1. Create a "cmd.txt" file in this folder with two lines:
$cat > cmd.txt

#e your.email@gmail.com
@hg19.nix,bam
echo 'hello,world'
date


The first line starts with '#e' gives your emails address if you want to be notified once the job is done. It is optional.

The second line starts with '@' gives a keywords based commands. It will be translated into:
"I want to do a alignment against hg19.nix(a novoalign indexed human genome) using all fastq files in this folder, the output should be in .bam format."
If you want the output in .sam format, use "@hg19.nix,sam".
All the fastq files are automatically paired. For above example, if will be:

7550X1_1_1.txt.gz and 7550X1_1_2.txt.gz

7550X1_2_1.txt.gz and 7550X1_2_2.txt.gz

7550X1_3_1.txt.gz and 7550X1_3_2.txt.gz

7550X2_5_1.txt.gz


so the pseudo commands to be executed on the cluster machines will be like:

novoalign -o SAM -d hg19.nix -f 7550X1_1_1.txt.gz 7550X1_1_2.txt.gz > 7550X1_1.sam 2>7550X1_1.err
novoalign -o SAM -d hg19.nix -f 7550X1_2_1.txt.gz 7550X1_2_2.txt.gz > 7550X1_2.sam 2>7550X1_2.err
novoalign -o SAM -d hg19.nix -f 7550X1_3_1.txt.gz 7550X1_3_2.txt.gz > 7550X1_3.sam 2>7550X1_3.err
novoalign -o SAM -d hg19.nix -f 7550X2_5_1.txt.gz > 7550X2_5.sam 2>7550X2_5.err
echo "hello,world"
date
*All lines that do not start with '#' or '@' are just normal Linux commands (in this case, echo 'hello,world' and date ) you want to execute.

2. Create an empty "b" file to tell the tomato this job is ready to "begin".
>touch b

'''Further Reading'''

The "keyword based commands" was designed to provide an easy way to do simple jobs which share standard procedures.
At now, the keywords only recognize "bam","sam" and any words with ".nix". Next, "snp" and "indel" are under development and test.
For example, a command like:

'''@hg19.nix, snp'''

will be translated into a standard procedure which comprise of many steps (commands) to call snp variants:

'''
align...
sort...
remove duplicates...
re-align around indels...
calibrate quality score...
merge lane samples....
sort...
call snp...
filter snp calls....
'''


However, you can always use your own commands in "cmd.txt" if you want (and you know how):

'''
#e your.email@gmail.com
#i 7550X1_1_1.txt.gz , 7550X1_1_1.txt.gz
novoalign -o SAM $'@RG\tID:USER\tPL:ILLUMINA\tLB:LIBRARY\tSM:7550X1\tCN:HCI' -d hg19.nix -f 7550X1_1_1.txt.gz 7550X1_1_1.txt.gz > 7550X1_1.sam 2>7550X1_1.err
SortSam.jar INPUT=7550X1_1.sam OUTPUT=7550X1_1.bam CREATE_INDEX=true SO=coordinate
GenomeAnalysisTK.jar -T RealignerTargetCreator -R hg19.fasta -I 7550X1_1.bam -B:indels,VCF dbsnp132.human.vcf -o 7550X1_1.intervals
...
'''

*The second line starts with '#i' gives the input files(seperated by comma ",") to be used in your commands. In the case, the tomato will download these specified files only.
If '#i' was not given, then all files under this folder will be downloaded into cluster machine as possible input files.


[to be continued]

Monday, February 7, 2011

Craete GATK compatible dbSNP file

$ lftp ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/ByChromosomeNoGeno/00-All.vcf.gz
$ gunzip 00-All.vcf.gz
$ grep -c '#' 00-All.vcf
57

$ tail -n+58 00-All.vcf | awk '$0 !~ "_random" {print}' | awk '$0 !~ "_hap" {print}'| awk '$0 !~ "PAR" {print}' | awk '{sub(/MT/, "M", $0); print}' | awk '{print "chr" $0}'> dbsnp132.human.body.vcf

$ awk '{print $1}' dbsnp132.human.body.vcf | sort -u

chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr2
chr20
chr21
chr22
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrM
chrX
chrY


$ head -n 57 00-All.vcf > dbsnp132.human.head.vcf

$ cat dbsnp132.human.head.vcf dbsnp132.human.body.vcf > dbsnp132.human.vcf

$ rm dbsnp132.human.body.vcf dbsnp132.human.head.vcf

Friday, February 4, 2011

tropico

svn checkout http://tropico.googlecode.com/svn/ tropico

Tuesday, February 1, 2011

7550x.tar

gunzip *.gz
cat `ls *.txt | sort -g` > seq.txt
gzip -c seq.txt > 7550X1_B20AAHABXX_1_1.txt.gz