Next Generation Sequencing and Data Analysis: Split the BAM into smaller BAM files by chr (then use hadoop to accelerate the downstream analysis)

Monday, August 5, 2013

Split the BAM into smaller BAM files by chr (then use hadoop to accelerate the downstream analysis)

Supposedly we have a single BAM file "x.sort.bam" which is the sorted output of an aligner like BWA. To call variants, we still need many steps. It will take a long time to execute these steps one by one on a single server. To speed up the downstream analysis, we can "split and conquer" the BAM files by chromosomes, this is called "parallel processing".


for i in `samtools view -H x.sort.bam | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
do

#remove not-mapped-by-pair, low quality reads and output BAMs by chr
samtools view -bh -F 0x4 -q 1 -o x.$i.bam x.sort.bam $i

#index
samtools index x.$i.bam


#all other steps
java -jar GenomeAnalysisTK.jar \
-R hg19.fasta \
-T BaseRecalibrator \
-knownSites ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.norm.vcf \
-knownSites Mills_and_1000G_gold_standard.indels.b37.sites.norm.vcf \
-knownSites 1000G_phase1.indels.b37.norm.vcf \
-I x.$i.bam \
-o x.$i.grp

java -jar GenomeAnalysisTK.jar \
-R hg19.fasta \
-T PrintReads \
-BQSR x.$i.grp \
-I x.$i.bam \
-o x.$i.recal.bam

...

java -jar GenomeAnalysisTK.jar \
-R hg19.fasta \
-T UnifiedGenotyper --genotype_likelihoods_model BOTH 
-I x.$i.bam -rf BadCigar -dcov 500 -o x.$i.gatk.vcf
done

After indexing the splitted BAMs, we can let hadoop taking over the BAMs to run other steps in hadoop cluster.


for i in `samtools view -H x.sort.bam | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
do

#remove not-mapped-by-pair, low quality reads and output BAMs by chr
samtools view -bh -F 0x4 -q 1 -o x.$i.bam x.sort.bam $i

#index
samtools index x.$i.bam

done

#upload data into hadoop
hadoop fs -put *chr*.ba? $hdfsInput

#run hadoop
hadoop jar xxx.jar MyPipeline $hdfsInput $hdfsOutput ...

2 comments:

NGSHadoopJanuary 24, 2014 at 10:26 AM
Hello, How to implement xxx.jar and MyPipeline? Please elaborate on the implementation side which would help many. Any suggestion will be really appreciated.
ReplyDelete
Replies
HymenopteraApril 14, 2014 at 4:56 AM
This comment has been removed by the author.
ReplyDelete
Replies

Add comment

Next Generation Sequencing and Data Analysis

Monday, August 5, 2013

Split the BAM into smaller BAM files by chr (then use hadoop to accelerate the downstream analysis)

2 comments:

About Me

Blog Archive