Supposedly we have a single BAM file "x.sort.bam" which is the sorted output of an aligner like BWA.
To call variants, we still need many steps. It will take a long time to execute these steps one by one on a single server. To speed up the downstream analysis, we can "split and conquer" the BAM files by chromosomes, this is called "parallel processing".
for i in `samtools view -H x.sort.bam | awk -F"\t" '/@SQ/{print $2}' | cut -d":" -f2`
do
#remove not-mapped-by-pair, low quality reads and output BAMs by chr
samtools view -bh -F 0x4 -q 1 -o x.$i.bam x.sort.bam $i
#index
samtools index x.$i.bam
#all other steps
java -jar GenomeAnalysisTK.jar \
-R hg19.fasta \
-T BaseRecalibrator \
-knownSites ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.norm.vcf \
-knownSites Mills_and_1000G_gold_standard.indels.b37.sites.norm.vcf \
-knownSites 1000G_phase1.indels.b37.norm.vcf \
-I x.$i.bam \
-o x.$i.grp
java -jar GenomeAnalysisTK.jar \
-R hg19.fasta \
-T PrintReads \
-BQSR x.$i.grp \
-I x.$i.bam \
-o x.$i.recal.bam
...
java -jar GenomeAnalysisTK.jar \
-R hg19.fasta \
-T UnifiedGenotyper --genotype_likelihoods_model BOTH
-I x.$i.bam -rf BadCigar -dcov 500 -o x.$i.gatk.vcf
done
After indexing the splitted BAMs, we can let hadoop taking over the BAMs to run other steps in hadoop cluster.
for i in `samtools view -H x.sort.bam | awk -F"\t" '/@SQ/{print $2}' | cut -d":" -f2`
do
#remove not-mapped-by-pair, low quality reads and output BAMs by chr
samtools view -bh -F 0x4 -q 1 -o x.$i.bam x.sort.bam $i
#index
samtools index x.$i.bam
done
#upload data into hadoop
hadoop fs -put *chr*.ba? $hdfsInput
#run hadoop
hadoop jar xxx.jar MyPipeline $hdfsInput $hdfsOutput ...
Hello, How to implement xxx.jar and MyPipeline? Please elaborate on the implementation side which would help many. Any suggestion will be really appreciated.
ReplyDeleteThis comment has been removed by the author.
ReplyDelete