Monday, July 28, 2014

Split the BAM file by chromosome then process it with GATK's UnifiedGenotyper

To parallelize genotyping, one possible solution is to split the BAM file by chromosome (the safe method). For example, we have one big BAM file named "A.bam" and we want to use GATK's UnifiedGenotyper to call variants on it. The command may look like this:

time java -jar /home/hadoop/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/data/fasta/hg19/hg19.fa -I A.bam -o A.vcf
It may take hours for the above command to complete.

#The 1st improvement

As an improvement, we can split "A.bam" by chromosome, then process the split BAM files in parallel:


#list all chromosome names from the @SQ header lines
for i in `samtools view -H A.bam | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
do
    #keep mapped reads (-F 0x4) with mapping quality >= 10 (-q 10) on chromosome $i
    samtools view -h -F 0x4 -q 10 A.bam $i | samtools view -hbS - > A.$i.bam
done
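Note that the region query "A.bam $i" above only works on an indexed BAM file; if "A.bam" has no ".bai" yet, build one first:


samtools index A.bam    #produces A.bam.bai, required for region queries
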
Now we have generated 20+ files named "A.chr1.bam", "A.chr2.bam", and so on. Next, call GATK's UnifiedGenotyper on one of them:

time java -jar /home/hadoop/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/data/fasta/hg19/hg19.fa -I A.chr1.bam -o A.chr1.vcf

real    32m
It takes 32 minutes to complete. If you look at the output of the above command, you may notice that UnifiedGenotyper does not stop after iterating over "chr1". Since A.chr1.bam only contains alignments on chr1, it makes no sense to waste time on the other chromosomes. Let us make another improvement.

#The 2nd improvement

This time, we split the reference FASTA file as well. Suppose we have "hg19.chr1.fa", "hg19.chr2.fa", ... under a folder.
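If the per-chromosome FASTA files do not exist yet, they can be extracted from the full reference with "samtools faidx"; a minimal sketch (assuming "hg19.fa" sits in the current directory):


#index the full reference, then write one FASTA per chromosome
samtools faidx hg19.fa
for c in $(cut -f1 hg19.fa.fai)
do
    samtools faidx hg19.fa $c > hg19.$c.fa
done

Each split FASTA then needs its own sequence dictionary and FASTA index for GATK: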

for i in *.fa
do
   j=${i%.fa}    #strip only the trailing ".fa", e.g. hg19.chr1.fa -> hg19.chr1
   echo $j
   java -jar ~/app/picard/1.114/CreateSequenceDictionary.jar R=$j.fa O=$j.dict
   ~/app/samtools/0.1.19/samtools faidx $j.fa
done
Now call GATK's UnifiedGenotyper:


time java -jar /home/hadoop/bio/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/bio/data/fasta/hg19/hg19.chr1.fa -I A.chr1.bam -o A1.vcf

Let us see how much time it uses ... Again, UnifiedGenotyper does not stop after iterating over "chr1"; it continues on to "chr2", then throws an error message and terminates. What's wrong here?

The header of the split BAM files

Let us have a look at the header of "A.bam":


$samtools view -h A.bam | less

@HD     VN:1.3  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:162008842
...
@RG     ID:HELLO        LB:L01  PL:illumina     PU:barcode      SM:001
@PG     ID:bwa  PN:bwa  VN:0.7.9a-r786 
...

Let us have a look at the header of "A.chr1.bam":


$samtools view -h A.chr1.bam | less

@HD     VN:1.3  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:162008842
...
@RG     ID:HELLO        LB:L01  PL:illumina     PU:barcode      SM:001
@PG     ID:bwa  PN:bwa  VN:0.7.9a-r786 
...

They are the same. Since "A.chr1.bam" only has the mappings on chr1, we do not need the information on the other chromosomes, such as "@SQ SN:chr2 LN:243199373". Unfortunately, neither "samtools view" nor "bamtools split" is able to rewrite the header directly, so we have to do it manually. Here is a script to do so:


#!/bin/bash
#usage: ./bam_split_by_chr.sh <samtools_path> <input_bam> <output_dir>

SAMTOOLS=$1
BAM_INPUT=$2
PATH_OUTPUT=$3
BAMTOOLS=~/bio/app/bamtools/bin/bamtools/2.3.0/bamtools
#create the target directory
mkdir -p ${PATH_OUTPUT}

if [ -f $BAMTOOLS ];
then

 cd ${PATH_OUTPUT} && ${BAMTOOLS} split -in ${BAM_INPUT} -refPrefix "" -reference
 for i in *.chr*bam 
 do 
  sampleName=${i%.bam}
  chrName=${sampleName#*.}
  #rewrite the header: keep only this chromosome's @SQ line (exact match, so chr1 does not also keep chr10), plus mapped (-F 0x4), MAPQ>=10 (-q 10) reads
  ${SAMTOOLS} view -h -F 0x4 -q 10 ${i} | awk -F"\t" -v chr=${chrName} '/^@SQ/ && $2!=("SN:" chr) {next} /^@/ {print; next} $3==chr {print}' | ${SAMTOOLS} view -hbS - > ${PATH_OUTPUT}/${sampleName}.tmp
  mv ${PATH_OUTPUT}/${sampleName}.tmp ${PATH_OUTPUT}/${sampleName}.bam
 done
else
 fileName=$(basename "$BAM_INPUT")
 #extension="${fileName##*.}"
 sampleName="${fileName%.*}"
 #split the bam
 for i in `${SAMTOOLS} view -H ${BAM_INPUT} | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
 do
  ${SAMTOOLS} view -h -F 0x4 -q 10 ${BAM_INPUT} $i | awk -F"\t" -v chr=$i '/^@SQ/ && $2!=("SN:" chr) {next} /^@/ {print; next} $3==chr {print}' | ${SAMTOOLS} view -hbS - > ${PATH_OUTPUT}/${sampleName}.$i.tmp
  mv ${PATH_OUTPUT}/${sampleName}.$i.tmp ${PATH_OUTPUT}/${sampleName}.$i.bam
 done
fi

Save the above script as "bam_split_by_chr.sh".

#split


$chmod +x bam_split_by_chr.sh

$./bam_split_by_chr.sh /home/hadoop/samtools /home/hadoop/A.bam \
/home/hadoop/output

Now under /home/hadoop/output you will see "A.chr1.bam".


$samtools view -h A.chr1.bam | less
@HD     VN:1.3  SO:coordinate
@SQ     SN:chr1 LN:249250621
@RG     ID:HELLO        LB:L01  PL:illumina     PU:barcode      SM:001
@PG     ID:bwa  PN:bwa  VN:0.7.9a-r786 
...

OK, with this cleaned BAM file, let us run UnifiedGenotyper again:


time java -jar /home/hadoop/bio/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/bio/data/fasta/hg19/hg19.chr1.fa -I A.chr1.bam -o A1.vcf

real    7m

Now it only takes 7 minutes to process "A.chr1.bam". Finally, as a side effect, you will no longer be able to merge the split BAM files back into the original "A.bam" using "samtools merge":


$samtools merge A.bam A.chr*.bam
[bam_merge_core] different target sequence name: 'chr1' != 'chr2' in file 'A.chr2.bam'

It does not matter anyway; we do not need to do that, since the headers of the split files intentionally no longer agree.
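As a closing note, once the cleaned per-chromosome BAM files and the split references are in place, all chromosomes can be genotyped concurrently. A minimal sketch using shell background jobs (paths as in the examples above; in practice you may want to cap the number of concurrent JVMs):


#launch one UnifiedGenotyper job per chromosome, then wait for all of them
for bam in A.chr*.bam
do
    chr=${bam%.bam}    #e.g. A.chr1
    chr=${chr#A.}      #e.g. chr1
    java -jar /home/hadoop/bio/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
    -R /home/hadoop/bio/data/fasta/hg19/hg19.$chr.fa -I $bam -o A.$chr.vcf &
done
wait    #block until every background job has finished

The per-chromosome VCFs can then be concatenated downstream if a single call set is needed.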

Thursday, July 24, 2014

Counting lines in "fastq.gz"

Before splitting a "fastq.gz" file into fragments, we need to count how many lines are in the raw file, so that we can calculate how many lines go into each fragment.
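Since a FASTQ record is always exactly 4 lines, the fragment size must be rounded up to a multiple of 4 so that no record is cut in half. A minimal sketch of the arithmetic (N_FRAGMENTS is a hypothetical target):


TOTAL=20000000     #line count of the raw file, see the timings below
N_FRAGMENTS=8      #hypothetical number of fragments we want
PER=$(( (TOTAL + N_FRAGMENTS - 1) / N_FRAGMENTS ))   #ceiling division
PER=$(( (PER + 3) / 4 * 4 ))                         #round up to a multiple of 4
echo $PER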

If the raw file is quite big, the counting itself takes a long time because the file has to be decompressed first.

The common approach is to use "zcat" with "wc". Is it the fastest?



#1. use zcat - 9.381s 
$time zcat A_R1.fq.gz | wc -l
 20000000

 real    0m9.381s
 user    0m7.423s
 sys     0m1.919s


#2. use zgrep - 16.258s
 $time zgrep -Ec "$" A_R1.fq.gz
 20000000

 real    0m16.258s
 user    0m13.109s
 sys     0m0.490s


#3. use pigz - 7.227s
 $sudo apt-get install pigz
 $time pigz -d -p 1 -c A_R1.fq.gz  | wc -l
 20000000

 real    0m7.227s
 user    0m5.544s
 sys     0m0.386s

#4. use pigz with 4 threads - 5.973s
 $time pigz -d -p 4 -c A_R1.fq.gz  | wc -l
 20000000

 real    0m5.973s
 user    0m5.599s
 sys     0m2.200s

By default, pigz uses all available processors, so "-p" is not necessary. The above command can be simplified as:

 $pigz -dc A_R1.fq.gz  | wc -l


Clearly, our winner is pigz.

I am going to write another blog on splitting the "fastq.gz" file with only one goal: to be as fast as possible. The plan (a rough sketch of step 3 follows the list):

 1. Count lines
 2. Determine how many lines per split fastq file
 3. Unzip and split the paired "fastq.gz" files
 4. Zip the split "fastq" files again.
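For step 3, a minimal sketch with "pigz" and "split" (the file names and the 4000000-line fragment size are assumptions; the size must be a multiple of 4, and the same size must be used for both mates so "A_R1" and "A_R2" stay in sync):


pigz -dc A_R1.fq.gz | split -l 4000000 - A_R1.part.   #writes A_R1.part.aa, A_R1.part.ab, ...
pigz A_R1.part.*                                      #step 4: compress each piece again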


Reference:
http://superuser.com/questions/135329/count-lines-in-a-compressed-file



Build hadoop-2.4.1-src

1. Download the Hadoop source: http://apache.cs.utah.edu/hadoop/common/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz

2. Download protobuf 2.5.0: http://protobuf.googlecode.com/files/protobuf-2.5.0.tar.bz2

#build and install protobuf first, then build hadoop from the unpacked source directory
./configure && make && sudo make install && sudo ldconfig
mvn compile && mvn package
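Note that Hadoop 2.4.1 expects ProtocolBuffer 2.5.0 exactly, so it is worth verifying the installed compiler before running Maven:


protoc --version    #should print "libprotoc 2.5.0"; a different version makes the hadoop build fail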