Tuesday, August 18, 2015

Evaluation on speedseq


Recently read an interesting article from here:

http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3505.html

The author is kind enough to open source it
https://github.com/hall-lab/speedseq


So I checked out the code and make some tests. Why is it faster than popular GATK package? After spending few hours , my conclusions:


1. The speedseq use simplified process.

Basically it use works like:

"mapping"->"remove duplicates"->"Call variants"

while traditional GATK works like:

"mapping"->"remove duplicates"->"realign"->"recalibrate"->"Call variants" plus lots of sorting and indexing between steps.


As indicated in freebayes' website, freebayes internally handle realign and recalibrate.



2. User GNU parallel to parallel the variants calling, because freebayes does not support threading or sub-processing.


So supposedly you assign 8 CPU cores to the speedseq, it will launch a 8-processes pool to call variants using freebayes on each chromosome independently.

3. Other improvements.

For example, use sambamba instead of picard (in my experience sambamba is at least 3x faster than picard). use samblaster to get discordant and split reads simultaneously for later SV and CNV callings right after the alignment.


So what are the catches? Nothing can be perfect.

1. The components are tightly wrapped into the pipeline. Changing of individual parameters are no easy tasks.

2. The variant calling process did not split the BAM actually. If fact it only splits the BAM header for a chromosome based region to accelerate the processing because the caller will not need to go through the whole genome. Each process has to load the full size reference genome and full size BAM file(s). As a result, the memory requirement will be a problem for some machines. Also the disk I/O could be a bottleneck if you have many processes trying to read/write large data on the same time.

No wonder why the authors did the test on a pretty good AWS EC2 c3.8xlarge instance with big memory and large SSD.



Final words: Even it is not perfect, it is a wonderful solution for some people. The best of all, it is free! No need to pay greedy and lovely Broad Institute $$$$$$$$$$$$$$$$$$ for the full GATK package.


Increasing the amount of inotify watchers

echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf && sudo sysctl -p

Saturday, August 15, 2015

The famous NA12878 WGS at 50x

Due to the lacking of true variants from real genome sequencing, many tools use NA12878 for benchmarking. For whole genome sequencing at 50x depth, we can download these two files: ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_1.fastq.gz ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_2.fastq.gz
Some statistics of these two files

NA12878 WGS 50x
Facts ERR194147_1.fastq.gz ERR194147_2.fastq.gz
URI ERR194147_1.fastq.gz ERR194147_2.fastq.gz
Size 48G 49G
Reads# 787265109 787265109
Reads Length 101 101
Coverage 101*787265109/3000000000=26.5 101*787265109/3000000000=26.5