Monday, July 28, 2014

Split the BAM file by chromosome then process it with GATK's UnifiedGenotyper

To parallelize genotyping, one possible solution is to split the BAM file by chromosome (the safe method). For example, we have one big BAM file named "A.bam" and we want to use GATK's UnifiedGenotyper to call variants on it. The command may look like this:

time java -jar /home/hadoop/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/data/fasta/hg19/hg19.fa -I A.bam -o A.vcf
It may take hours for the above command to complete.

#The 1st improvement

As an improvement, we can split "A.bam" by chromosome, then process the split BAM files in parallel:


#list all chromosome names from the @SQ header lines
for i in `samtools view -H A.bam | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
do
    #keep mapped reads (-F 0x4) with mapping quality >= 10 (-q 10) on chromosome $i
    samtools view -h -F 0x4 -q 10 A.bam $i | samtools view -hbS - > A.$i.bam
done
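Note that the region query "A.bam $i" above only works on an indexed BAM file; if "A.bam" has no ".bai" yet, build one first:


samtools index A.bam    #produces A.bam.bai, required for region queries
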
Now we have generated 20+ files named "A.chr1.bam", "A.chr2.bam", and so on. Next, call GATK's UnifiedGenotyper on one of them:

time java -jar /home/hadoop/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/data/fasta/hg19/hg19.fa -I A.chr1.bam -o A.chr1.vcf

real    32m
It takes 32 minutes to complete. If you look at the output of the above command, you may notice that UnifiedGenotyper does not stop after iterating over "chr1". Since A.chr1.bam only contains alignments on chr1, it makes no sense to waste time on the other chromosomes. Let us make another improvement.

#The 2nd improvement

This time, we split the reference FASTA file as well. Suppose we have "hg19.chr1.fa", "hg19.chr2.fa", ... under a folder.
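If the per-chromosome FASTA files do not exist yet, they can be extracted from the full reference with "samtools faidx"; a minimal sketch (assuming "hg19.fa" sits in the current directory):


#index the full reference, then write one FASTA per chromosome
samtools faidx hg19.fa
for c in $(cut -f1 hg19.fa.fai)
do
    samtools faidx hg19.fa $c > hg19.$c.fa
done

Each split FASTA then needs its own sequence dictionary and FASTA index for GATK: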

for i in *.fa
do
   j=${i%.fa}    #strip only the trailing ".fa", e.g. hg19.chr1.fa -> hg19.chr1
   echo $j
   java -jar ~/app/picard/1.114/CreateSequenceDictionary.jar R=$j.fa O=$j.dict
   ~/app/samtools/0.1.19/samtools faidx $j.fa
done
Now call GATK's UnifiedGenotyper:


time java -jar /home/hadoop/bio/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/bio/data/fasta/hg19/hg19.chr1.fa -I A.chr1.bam -o A1.vcf

Let us see how much time it uses ... Again, UnifiedGenotyper does not stop after iterating over "chr1"; it continues on to "chr2", then throws an error message and terminates. What's wrong here?

The header of the split BAM files

Let us have a look at the header of "A.bam":


$samtools view -h A.bam | less

@HD     VN:1.3  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:162008842
...
@RG     ID:HELLO        LB:L01  PL:illumina     PU:barcode      SM:001
@PG     ID:bwa  PN:bwa  VN:0.7.9a-r786 
...

Let us have a look at the header of "A.chr1.bam":


$samtools view -h A.chr1.bam | less

@HD     VN:1.3  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:162008842
...
@RG     ID:HELLO        LB:L01  PL:illumina     PU:barcode      SM:001
@PG     ID:bwa  PN:bwa  VN:0.7.9a-r786 
...

They are the same. Since "A.chr1.bam" only has the mappings on chr1, we do not need the information on the other chromosomes, such as "@SQ SN:chr2 LN:243199373". Unfortunately, neither "samtools view" nor "bamtools split" is able to rewrite the header directly, so we have to do it manually. Here is a script to do so:


#!/bin/bash
#usage: ./bam_split_by_chr.sh <samtools_path> <input_bam> <output_dir>

SAMTOOLS=$1
BAM_INPUT=$2
PATH_OUTPUT=$3
BAMTOOLS=~/bio/app/bamtools/bin/bamtools/2.3.0/bamtools
#create the target directory
mkdir -p ${PATH_OUTPUT}

if [ -f $BAMTOOLS ];
then

 cd ${PATH_OUTPUT} && ${BAMTOOLS} split -in ${BAM_INPUT} -refPrefix "" -reference
 for i in *.chr*bam 
 do 
  sampleName=${i%.bam}
  chrName=${sampleName#*.}
  #rewrite the header: keep only this chromosome's @SQ line (exact match, so chr1 does not also keep chr10), plus mapped (-F 0x4), MAPQ>=10 (-q 10) reads
  ${SAMTOOLS} view -h -F 0x4 -q 10 ${i} | awk -F"\t" -v chr=${chrName} '/^@SQ/ && $2!=("SN:" chr) {next} /^@/ {print; next} $3==chr {print}' | ${SAMTOOLS} view -hbS - > ${PATH_OUTPUT}/${sampleName}.tmp
  mv ${PATH_OUTPUT}/${sampleName}.tmp ${PATH_OUTPUT}/${sampleName}.bam
 done
else
 fileName=$(basename "$BAM_INPUT")
 #extension="${fileName##*.}"
 sampleName="${fileName%.*}"
 #split the bam
 for i in `${SAMTOOLS} view -H ${BAM_INPUT} | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
 do
  ${SAMTOOLS} view -h -F 0x4 -q 10 ${BAM_INPUT} $i | awk -F"\t" -v chr=$i '/^@SQ/ && $2!=("SN:" chr) {next} /^@/ {print; next} $3==chr {print}' | ${SAMTOOLS} view -hbS - > ${PATH_OUTPUT}/${sampleName}.$i.tmp
  mv ${PATH_OUTPUT}/${sampleName}.$i.tmp ${PATH_OUTPUT}/${sampleName}.$i.bam
 done
fi

Save the above script as "bam_split_by_chr.sh".

#split


$chmod +x bam_split_by_chr.sh

$./bam_split_by_chr.sh /home/hadoop/samtools /home/hadoop/A.bam \
/home/hadoop/output

Now under /home/hadoop/output you will see "A.chr1.bam".


$samtools view -h A.chr1.bam | less
@HD     VN:1.3  SO:coordinate
@SQ     SN:chr1 LN:249250621
@RG     ID:HELLO        LB:L01  PL:illumina     PU:barcode      SM:001
@PG     ID:bwa  PN:bwa  VN:0.7.9a-r786 
...

OK, with this cleaned BAM file, let us run UnifiedGenotyper again:


time java -jar /home/hadoop/bio/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/bio/data/fasta/hg19/hg19.chr1.fa -I A.chr1.bam -o A1.vcf

real    7m

Now it only takes 7 minutes to process "A.chr1.bam". Finally, as a side effect, you will no longer be able to merge the split BAM files back into the original "A.bam" using "samtools merge":


$samtools merge A.bam A.chr*.bam
[bam_merge_core] different target sequence name: 'chr1' != 'chr2' in file 'A.chr2.bam'

It does not matter anyway; we do not need to do that, since the headers of the split files intentionally no longer agree.
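As a closing note, once the cleaned per-chromosome BAM files and the split references are in place, all chromosomes can be genotyped concurrently. A minimal sketch using shell background jobs (paths as in the examples above; in practice you may want to cap the number of concurrent JVMs):


#launch one UnifiedGenotyper job per chromosome, then wait for all of them
for bam in A.chr*.bam
do
    chr=${bam%.bam}    #e.g. A.chr1
    chr=${chr#A.}      #e.g. chr1
    java -jar /home/hadoop/bio/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
    -R /home/hadoop/bio/data/fasta/hg19/hg19.$chr.fa -I $bam -o A.$chr.vcf &
done
wait    #block until every background job has finished

The per-chromosome VCFs can then be concatenated downstream if a single call set is needed.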

Thursday, July 24, 2014

Counting lines in "fastq.gz"

Before splitting a "fastq.gz" file into fragments, we need to count how many lines are in the raw file, so that we can calculate how many lines go into each fragment.
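Since a FASTQ record is always exactly 4 lines, the fragment size must be rounded up to a multiple of 4 so that no record is cut in half. A minimal sketch of the arithmetic (N_FRAGMENTS is a hypothetical target):


TOTAL=20000000     #line count of the raw file, see the timings below
N_FRAGMENTS=8      #hypothetical number of fragments we want
PER=$(( (TOTAL + N_FRAGMENTS - 1) / N_FRAGMENTS ))   #ceiling division
PER=$(( (PER + 3) / 4 * 4 ))                         #round up to a multiple of 4
echo $PER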

If the raw file is quite big, the counting itself takes a long time because the file has to be decompressed first.

The common approach is to use "zcat" with "wc". Is it the fastest?



#1. use zcat - 9.381s 
$time zcat A_R1.fq.gz | wc -l
 20000000

 real    0m9.381s
 user    0m7.423s
 sys     0m1.919s


#2. use zgrep - 16.258s
 $time zgrep -Ec "$" A_R1.fq.gz
 20000000

 real    0m16.258s
 user    0m13.109s
 sys     0m0.490s


#3. use pigz - 7.227s
 $sudo apt-get install pigz
 $time pigz -d -p 1 -c A_R1.fq.gz  | wc -l
 20000000

 real    0m7.227s
 user    0m5.544s
 sys     0m0.386s

#4. use pigz with 4 threads - 5.973s
 $time pigz -d -p 4 -c A_R1.fq.gz  | wc -l
 20000000

 real    0m5.973s
 user    0m5.599s
 sys     0m2.200s

By default, pigz uses all available processors, so "-p" is not necessary. The above command can be simplified as:

 $pigz -dc A_R1.fq.gz  | wc -l


Clearly, our winner is pigz.

I am going to write another blog on splitting the "fastq.gz" file with only one goal: to be as fast as possible. The plan (a rough sketch of step 3 follows the list):

 1. Count lines
 2. Determine how many lines per split fastq file
 3. Unzip and split the paired "fastq.gz" files
 4. Zip the split "fastq" files again.
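For step 3, a minimal sketch with "pigz" and "split" (the file names and the 4000000-line fragment size are assumptions; the size must be a multiple of 4, and the same size must be used for both mates so "A_R1" and "A_R2" stay in sync):


pigz -dc A_R1.fq.gz | split -l 4000000 - A_R1.part.   #writes A_R1.part.aa, A_R1.part.ab, ...
pigz A_R1.part.*                                      #step 4: compress each piece again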


Reference:
http://superuser.com/questions/135329/count-lines-in-a-compressed-file



Build hadoop-2.4.1-src

1. Download the Hadoop source: http://apache.cs.utah.edu/hadoop/common/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz

2. Download protobuf 2.5.0: http://protobuf.googlecode.com/files/protobuf-2.5.0.tar.bz2

#build and install protobuf first, then build hadoop from the unpacked source directory
./configure && make && sudo make install && sudo ldconfig
mvn compile && mvn package
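Note that Hadoop 2.4.1 expects ProtocolBuffer 2.5.0 exactly, so it is worth verifying the installed compiler before running Maven:


protoc --version    #should print "libprotoc 2.5.0"; a different version makes the hadoop build fail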