Monday, July 28, 2014
To parallelize the execution of genotyping, one possible solution is to split the BAM file by chromosome (the safe method). For example, we have one big BAM file named "A.bam", and we want to use GATK's UnifiedGenotyper to call variants on it. The command may look something like this (a sketch, assuming GATK 3.x and "hg19.fa" as the reference):
$java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R hg19.fa -I A.bam -o A.vcf
It may take hours for the above command to complete.
#The 1st improvement
As an improvement, we can split "A.bam" by chromosome, then process the split BAM files in parallel. Assuming "A.bam" is coordinate-sorted, we can do something like:
$samtools index A.bam
$for chr in chr{1..22} chrX chrY; do samtools view -bh A.bam $chr > A.$chr.bam; done
Now we have generated 20+ files named "A.chr1.bam", "A.chr2.bam", ... Now call GATK's UnifiedGenotyper on each of them, along these lines:
$for chr in chr{1..22} chrX chrY; do
    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R hg19.fa -I A.$chr.bam -o A.$chr.vcf &
done
$wait
It takes 32 mins to complete. If you look at the output of the above commands, you may find that UnifiedGenotyper does not stop after iterating over "chr1". Since "A.chr1.bam" only contains alignments on chr1, it makes no sense to waste time on the other chromosomes. Let us make another improvement.
#The 2nd improvement
This time, we split the reference FASTA file as well. Suppose we have "hg19.chr1.fa", "hg19.chr2.fa", ... under a folder (each with the ".fai" index and ".dict" dictionary that GATK requires). Now call GATK's UnifiedGenotyper:
$java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R hg19.chr1.fa -I A.chr1.bam -o A.chr1.vcf
Let us see how much time it uses ... Again, UnifiedGenotyper does not stop after iterating over "chr1"; it continues to "chr2", then throws an error message and terminates the process. What's wrong here?
#The header of split BAM files
Let us have a look at the header of "A.bam":
$samtools view -H A.bam
and at the header of "A.chr1.bam":
$samtools view -H A.chr1.bam
They are the same. Since "A.chr1.bam" only has the mappings on chr1, we do not need the information on the other chromosomes, like "@SQ SN:chr2 LN:243199373". Unfortunately, neither "samtools view" nor "bamtools split" can rewrite the header directly, so we have to do it manually. Here is a script to do so (a minimal sketch, assuming samtools is on the PATH and the input BAM is indexed):
#!/bin/bash
# bam_split_by_chr.sh: extract one chromosome from a BAM file and
# rewrite the header so only that chromosome's @SQ line is kept.
# Usage: bam_split_by_chr.sh <input.bam> <chromosome> <output_dir>
BAM=$1
CHR=$2
OUT=$3
NAME=$(basename "$BAM" .bam)
# keep all header lines except the @SQ lines of other chromosomes
samtools view -H "$BAM" | awk -v chr="$CHR" '$1 != "@SQ" || $2 == "SN:"chr' > "$OUT/$NAME.$CHR.header.sam"
# prepend the cleaned header to the alignments on the target
# chromosome, then convert the stream back to BAM
cat "$OUT/$NAME.$CHR.header.sam" <(samtools view "$BAM" "$CHR") | samtools view -bS - > "$OUT/$NAME.$CHR.bam"
rm "$OUT/$NAME.$CHR.header.sam"
Save the above script as "bam_split_by_chr.sh".
#split
$./bam_split_by_chr.sh A.bam chr1 /home/hadoop/output
Now under /home/hadoop/output you will see "A.chr1.bam". OK, with this cleaned BAM file, let us run UnifiedGenotyper again:
$java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R hg19.chr1.fa -I /home/hadoop/output/A.chr1.bam -o A.chr1.vcf
Now it only takes 7 mins to process "A.chr1.bam". Finally, as a side effect, you will no longer be able to merge the split BAM files back into the original "A.bam" using "samtools merge", because their headers now differ. It does not matter, anyway. We do not need to do that.
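Putting it all together, the whole per-chromosome pipeline may look like this (a sketch, assuming the script above, per-chromosome FASTA files named "hg19.chrN.fa", and /home/hadoop/output as the working directory):
$for chr in chr{1..22} chrX chrY; do
    ./bam_split_by_chr.sh A.bam $chr /home/hadoop/output && \
    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
        -R hg19.$chr.fa -I /home/hadoop/output/A.$chr.bam -o A.$chr.vcf &
done
$wait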
Thursday, July 24, 2014
Before splitting a "fastq.gz" file into fragments, we need to count how many lines the raw file contains, so that we can calculate how many lines go into each fragment.
If the raw file is quite big, the counting itself takes a long time, because the file has to be decompressed first.
The common approach is to pipe "zcat" into "wc". Is it the fastest?
#1. use zcat - 9.381s
$time zcat A_R1.fq.gz | wc -l
#2. use zgrep - 16.258s
$time zgrep -Ec "$" A_R1.fq.gz
#3. use pigz - 7.227s
$sudo apt-get install pigz
$time pigz -d -p 1 -c A_R1.fq.gz | wc -l
#4. use pigz with 4 threads - 5.973s
$time pigz -d -p 4 -c A_R1.fq.gz | wc -l
By default, pigz uses all available processors, so "-p" is not necessary. The above command can be simplified to
$pigz -dc A_R1.fq.gz | wc -l
Clearly, our winner is pigz.
I am going to write another blog post on splitting the "fastq.gz" file with only one goal: to be as fast as possible. The plan (a rough sketch follows the list):
1. Count lines
2. Determine how many lines go into each split fastq file
3. Unzip and split the paired "fastq.gz" files
4. Zip the split "fastq" files again.
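Here is a rough sketch of these four steps (the paired file names "A_R1.fq.gz"/"A_R2.fq.gz" and the fragment count are hypothetical):
#!/bin/bash
# split a pair of fastq.gz files into N fragments each
N=8
# 1. count lines (pigz uses all processors by default)
LINES=$(pigz -dc A_R1.fq.gz | wc -l)
# 2. lines per fragment, rounded up to a multiple of 4
# (one FASTQ record is exactly 4 lines)
PER=$(( (LINES / N / 4 + 1) * 4 ))
# 3. unzip and split both mates with the same line count, so that
# fragment i of R1 still pairs with fragment i of R2
pigz -dc A_R1.fq.gz | split -l $PER - A_R1.part_
pigz -dc A_R2.fq.gz | split -l $PER - A_R2.part_
# 4. recompress every fragment
pigz A_R1.part_* A_R2.part_*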