Tuesday, December 17, 2013

The DistributedShell in Hadoop 2.0, a.k.a. YARN

[update: 01/05/2014]

The holiday season is over. I finally have some time to post an update and add the links to the code. It is just a prototype for now; lots of things are still under development and testing.

Here are the links to the Client and the ApplicationMaster:

https://github.com/home9464/BioinformaticsHadoop/blob/master/Client.java

https://github.com/home9464/BioinformaticsHadoop/blob/master/ApplicationMaster.java



***********************************************
The distributedshell example application in Hadoop release 2.2.0 is a good place to start playing with YARN. I spent some time over the past few days modifying this application to run a simple BWA alignment job.


In the original application, the parameter "-shell_command" specifies the command you want to run. Therefore you can only run a single command line (at most a one-liner chained with "&&"):


$hadoop jar hadoop-yarn-applications-distributedshell-2.2.0.jar \
    org.apache.hadoop.yarn.applications.distributedshell.Client \
    -jar hadoop-yarn-applications-distributedshell-2.2.0.jar \
    -shell_command '/bin/date' -num_containers 2


For a simple paired-end BWA alignment, we need to run:

$/PATH/TO/BWA/bwa aln /PATH/TO/GENOME_INDEX A_1.fastq > A_1.sai
$/PATH/TO/BWA/bwa aln /PATH/TO/GENOME_INDEX A_2.fastq > A_2.sai
$/PATH/TO/BWA/bwa sampe /PATH/TO/GENOME_INDEX A_1.sai A_2.sai A_1.fastq A_2.fastq > A.sam


We therefore need to put all these commands into a script, e.g. "1.sh".

Since this script is going to be executed on the slave nodes, we have three problems:


  1. What is the full path to the BWA executable (bwa) on the slave nodes?
  2. What is the full path to the indexed genome files (hg19.*) on the slave nodes?
  3. What is the full path to the raw sequencing files (*.fastq) on the slave nodes?


For the first and second problems: because we expect to reuse the BWA application and the indexed genome many times, we should keep them on each slave node and, most importantly, at exactly the same full paths:

/home/hadoop/app/bwa-0.7.5/bwa
/home/hadoop/data/bwa/hg19/hg19.*


To do so, we can manually create the directories on each slave node:

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/app/"; done

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/data/"; done


We also need to set up a working directory "job" on each node to hold output files such as the "*.sai" files and the SAM file.

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/job/"; done

Then sync the BWA application and the indexed genome files to each node (slave1 is shown here; repeat for the other slaves, e.g. by looping over the slaves file as above):

$rsync -arvu -e ssh bwa-0.7.5 hadoop@slave1:~/app
$rsync -arvu -e ssh hg19 hadoop@slave1:~/data


The shell script to be executed will then look like:

/home/hadoop/app/bwa-0.7.5/bwa aln /home/hadoop/data/bwa/hg19/hg19 A_1.fastq > /home/hadoop/job/A_1.sai

/home/hadoop/app/bwa-0.7.5/bwa aln /home/hadoop/data/bwa/hg19/hg19 A_2.fastq > /home/hadoop/job/A_2.sai

/home/hadoop/app/bwa-0.7.5/bwa sampe ... > /home/hadoop/job/A.sam


With all that in place, we can now make some changes to the demo code. The new command will look like:

$hadoop jar ... -shell_script 1.sh -num_containers 2 -container_memory 8192
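
The gist of the change is in two places. On the Client side, the script named by "-shell_script" has to be uploaded to HDFS and registered as a LocalResource, so that YARN copies it into each container's working directory before launch. Below is a minimal sketch against the stock YARN 2.2.0 API, not the exact code from the repository linked above; the class and method names (ShellScriptUploader, addScriptResource) are illustrative.

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;
import org.apache.hadoop.yarn.util.Records;

public class ShellScriptUploader {
    /**
     * Upload the local script (e.g. "1.sh") to HDFS and register it as a
     * LocalResource, keyed by the file name it should have in the container.
     */
    static void addScriptResource(Configuration conf, String localScript,
            String appId, Map<String, LocalResource> localResources)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        // Stage the script under the submitter's HDFS home, per application.
        Path dst = new Path(fs.getHomeDirectory(), "demo/" + appId + "/1.sh");
        fs.copyFromLocalFile(false, true, new Path(localScript), dst);
        FileStatus status = fs.getFileStatus(dst);

        LocalResource scriptRsrc = Records.newRecord(LocalResource.class);
        scriptRsrc.setType(LocalResourceType.FILE);
        scriptRsrc.setVisibility(LocalResourceVisibility.APPLICATION);
        scriptRsrc.setResource(ConverterUtils.getYarnUrlFromPath(dst));
        // Timestamp and size must match the HDFS copy or localization fails.
        scriptRsrc.setTimestamp(status.getModificationTime());
        scriptRsrc.setSize(status.getLen());
        localResources.put("1.sh", scriptRsrc);
    }
}

On the ApplicationMaster side, each container is then launched with "/bin/bash 1.sh" as its command instead of the single -shell_command string. Again a sketch under the same assumptions:

import java.util.Collections;
import java.util.Map;
import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.client.api.async.NMClientAsync;
import org.apache.hadoop.yarn.util.Records;

public class ScriptContainerLauncher {
    /** Launch a container that runs the localized script. */
    static void launch(NMClientAsync nmClient, Container container,
            Map<String, LocalResource> localResources) {
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        // Includes the "1.sh" entry registered by the Client above.
        ctx.setLocalResources(localResources);
        ctx.setCommands(Collections.singletonList(
                "/bin/bash 1.sh"
                + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
                + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
        nmClient.startContainerAsync(container, ctx);
    }
}

Because the BWA binary and the genome index sit at identical absolute paths on every slave, the same 1.sh works no matter which nodes the containers land on.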
