To parallel the execution of genotyping, one possible solution is to split the BAM files by chromosome (the safe method).
For example, we have one big BAM file named "A.bam". Now we want to use GATK's UnifiedGenotyper to call variants on "A.bam".
The command may looks like this:
time java -jar /home/hadoop/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper
-R /home/hadoop/data/fasta/hg19/hg19.fa -I A.bam -o A.vcf
It may take hours to get above command completed.
#The 1st improvement
As an improvement, we can split the "A.bam" by chromosome, then process splitted BAM files in paralle.
so we can do:
for i in `samtools view -H A.bam | awk -F"\t" '/@SQ/{print $2}' | cut -d":" -f2`
do
samtools view -h -F 0x4 -q 10 A.bam $i | samtools view -hbS - > A.$i.bam
done
Now we have generated 20+ files named "A.chr1.bam", "A.chr2.bam",
...
Now call GATK's UnifiedGenotyper:
time java -jar /home/hadoop/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/data/fasta/hg19/hg19.fa -I A.chr1.bam -o A.chr1.vcf
real 32m
It takes 32 mins to complete. If you look at the outputs of above command you may find that UnifiedGenotyper does not stop after iterating "chr1". Since A.chr1.bam only contains alignments on chr1, it make no sense to waste time on other chromosomes.
Let us make another improvement
#The 2nd improvement
This time, we split the reference FASTA file as well.
supposedly we have "hg19.chr1.fa", "hg19.chr2.fa" under a folder.
for i in *.fa
do
j=$(echo $i | cut -d"." -f1)
echo $j
java -jar ~/app/picard/1.114/CreateSequenceDictionary.jar R=$j.fa O=$j.dict
~/app/samtools/0.1.19/samtools faidx $j.fa
done
Now call GATK's UnifiedGenotyper:
time java -jar /home/hadoop/bio/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/bio/data/fasta/hg19/hg19.chr1.fa -I A.chr1.bam -o A1.vcf
Let us see how much time it use ... Again, UnifiedGenotyper does not stop after iterating "chr1", it continue goes to "chr2" then throw an error message and terminate the process.
What's wrong here? The header of splitted BAM files
let us have a look at the header of "A.bam"
$samtools view -h A.bam | less
@HD VN:1.3 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:162008842
...
@RG ID:HELLO LB:L01 PL:illumina PU:barcode SM:001
@PG ID:bwa PN:bwa VN:0.7.9a-r786
...
let us have a look at the header of "A.chr1.bam"
$samtools view -h A.chr1.bam | less
@HD VN:1.3 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:162008842
...
@RG ID:HELLO LB:L01 PL:illumina PU:barcode SM:001
@PG ID:bwa PN:bwa VN:0.7.9a-r786
...
They are the same. Since "A.chr1.bam" only has the mappings on chr1, we do not need information on other chromosomes like
"@SQ SN:chr2 LN:243199373"
Unfortunately, both "samtools view" and "bamtools split" are unable to process the header file directly, we have to do it manually. Here is a script to do so
#!/bin/bash
#usage
SAMTOOLS=$1
BAM_INPUT=$2
PATH_OUTPUT=$3
BAMTOOLS=~/bio/app/bamtools/bin/bamtools/2.3.0/bamtools
#create the target directory
mkdir -p ${PATH_OUTPUT}
if [ -f $BAMTOOLS ];
then
cd ${PATH_OUTPUT} && ~/bio/app/bamtools/bin/bamtools split -in ${BAM_INPUT} -refPrefix "" -reference
for i in *.chr*bam
do
sampleName=${i%.bam}
chrName=${sampleName#*.}
${SAMTOOLS} view -h -F 0x4 -q 10 ${i} | awk '{if(/^@SQ/ && !/^@SQ\tSN:'$chrName'/) next} {if(/^@/) print $0}{if ($3~/^'$chrName'/) print $0}' | ${SAMTOOLS} view -hbS - > ${PATH_OUTPUT}/${sampleName}.tmp
mv ${PATH_OUTPUT}/${sampleName}.tmp ${PATH_OUTPUT}/${sampleName}.bam
done
else
fileName=$(basename "$BAM_INPUT")
#extension="${fileName##*.}"
sampleName="${fileName%.*}"
#split the bam
for i in `${SAMTOOLS} view -H ${BAM_INPUT} | awk -F"\t" '/@SQ/{print $2}' | cut -d":" -f2`
do
${SAMTOOLS} view -h -F 0x4 -q 10 ${BAM_INPUT} $i | awk '{if(/^@SQ/ && !/^@SQ\tSN:'$i'/) next} {if(/^@/) print $0}{if ($3~/^'$i'/) print $0}' | $1 view -hbS - > ${PATH_OUTPUT}/${sampleName}.$i.tmp
mv ${PATH_OUTPUT}/${sampleName}.$i.tmp ${PATH_OUTPUT}/${sampleName}.$i.bam
done
fi
Save above script as "bam_split_by_chr.sh".
#split
$chmod +x bam_split_by_chr.sh
$./bam_split_by_chr.sh /home/hadoop/samtools /home/hadoop/A.bam
/home/hadoop/output
now under /home/hadoop/output you will see "A.chr1.bam".
$samtools view -h A.chr1.bam | less
@HD VN:1.3 SO:coordinate
@SQ SN:chr1 LN:249250621
@RG ID:HELLO LB:L01 PL:illumina PU:barcode SM:001
@PG ID:bwa PN:bwa VN:0.7.9a-r786
...
OK with this cleaned BAM file let us run UnifiedGenotyper again.
time java -jar /home/hadoop/bio/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/bio/data/fasta/hg19/hg19.chr1.fa -I A.chr1.bam -o A1.vcf
real: 7m
Now it only takes 7 mins to process the "A.chr1.bam".
Finally, as a side effect, you will not be able to merge the splitted BAM files into the original "A.bam" again by using
"samtools merge"
$samtools merge A.bam A.chr*.bam
[bam_merge_core] different target sequence name: 'chr1' != 'chr2' in file 'A.chr2.bam'
It does not matter, anyway. We do not need to do that.