[update: 01/05/2014]
The holiday season is over, and I finally have some time to post an update with links to the code. It is just a prototype for now; lots of things are still under development and testing.
Here are the links to the Client and the ApplicationMaster:
https://github.com/home9464/BioinformaticsHadoop/blob/master/Client.java
https://github.com/home9464/BioinformaticsHadoop/blob/master/ApplicationMaster.java
***********************************************
The distributedshell example application in Hadoop release 2.2.0 is a good starting point for playing with YARN. I spent some time over the past few days modifying this application to run a simple BWA alignment job.
In the original application, the parameter "-shell_command" gives the command you want to run, so you can only run a single command line such as "cat && mouse && dog". For example:
$hadoop jar hadoop-yarn-applications-distributedshell-2.2.0.jar \
org.apache.hadoop.yarn.applications.distributedshell.Client \
-jar hadoop-yarn-applications-distributedshell-2.2.0.jar \
-shell_command '/bin/date' -num_containers 2
For a simple BWA alignment, we need to run:
$/PATH/TO/BWA/bwa aln /PATH/TO/GENOME_INDEX A_1.fastq > A_1.sai
$/PATH/TO/BWA/bwa aln /PATH/TO/GENOME_INDEX A_2.fastq > A_2.sai
$/PATH/TO/BWA/bwa sampe /PATH/TO/GENOME_INDEX A_1.sai A_2.sai A_1.fastq A_2.fastq > A.sam
Clearly we need to put all these commands into a script, e.g. "1.sh".
Since this script is going to be executed on the slave nodes, we have three problems to solve:
- what is the full path to binary executable files for BWA (bwa) on slave nodes
- what is the full path to indexed genome files for BWA (hg19.*) on slave nodes
- what is the full path to raw sequencing files (*.fastq) on slave nodes
For the 1st and 2nd problems, because we expect to re-use the BWA application and the indexed genomes many times, we should keep them on every slave node and, most importantly, use exactly the same full paths everywhere:
/home/hadoop/app/bwa-0.7.5/bwa
/home/hadoop/data/bwa/hg19/hg19.*
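Because the script will hard-code these two paths, it may help to confirm on a node that they actually resolve once the files have been copied there. The following is a minimal sketch, not part of the original code: the function name check_bwa_node is hypothetical, and it assumes the index consists of the five files that "bwa index" normally writes (.amb, .ann, .bwt, .pac, .sa).

```shell
#!/bin/sh
# Sketch: verify that the BWA binary and the index files exist on this node.
# check_bwa_node is a hypothetical helper; the extension list assumes the
# standard output of "bwa index".
check_bwa_node() {
    bwa="$1"      # full path to the bwa executable
    index="$2"    # index prefix, e.g. /home/hadoop/data/bwa/hg19/hg19
    status=0
    if [ ! -x "$bwa" ]; then
        echo "missing or not executable: $bwa"
        status=1
    fi
    for ext in amb ann bwt pac sa; do
        if [ ! -f "$index.$ext" ]; then
            echo "missing index file: $index.$ext"
            status=1
        fi
    done
    # 0 when everything was found, 1 otherwise
    return "$status"
}
```

On a slave node this could be called as: check_bwa_node /home/hadoop/app/bwa-0.7.5/bwa /home/hadoop/data/bwa/hg19/hg19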
To set this up, we can create the directories on each slave node over ssh:
$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/app/"; done
$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/data/"; done
We also need to set up a working directory "job" on each node to hold output files such as "*.sai" and the SAM file.
$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/job/"; done
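The three loops above could also be folded into a single pass over the slaves file. This is only a sketch: SLAVES_FILE, REMOTE_USER and DRY_RUN are illustrative variable names that are not in the original code, and by default the function just prints the ssh commands instead of running them.

```shell
#!/bin/sh
# Sketch: create app/, data/ and job/ on every slave with one ssh call per host.
# SLAVES_FILE, REMOTE_USER and DRY_RUN are hypothetical settings.
SLAVES_FILE="${SLAVES_FILE:-hadoop-2.2.0/etc/hadoop/slaves}"
REMOTE_USER="${REMOTE_USER:-hadoop}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

make_dirs_on_slaves() {
    remote_cmd="mkdir -p /home/hadoop/app /home/hadoop/data /home/hadoop/job"
    # "|| [ -n "$host" ]" keeps a last line that has no trailing newline
    while read -r host || [ -n "$host" ]; do
        # skip blank lines and comment lines in the slaves file
        case "$host" in ''|'#'*) continue ;; esac
        if [ "$DRY_RUN" = "1" ]; then
            printf '%s\n' "ssh -n $REMOTE_USER@$host \"$remote_cmd\""
        else
            # -n stops ssh from swallowing the rest of the slaves file on stdin
            ssh -n "$REMOTE_USER@$host" "$remote_cmd"
        fi
    done < "$SLAVES_FILE"
}
```

Calling make_dirs_on_slaves with DRY_RUN=1 is a cheap way to review the host list before setting DRY_RUN=0.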
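Then sync the BWA app and the indexed genome files to each node (shown here for slave1; repeat for every slave). Note that "-e ssh" must be its own option, and that the genome goes under ~/data/bwa so that it matches the path above:
$rsync -arvu -e ssh bwa-0.7.5 hadoop@slave1:~/app
$rsync -arvu -e ssh hg19 hadoop@slave1:~/data/bwa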
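To push the same directory to every slave rather than one host at a time, the rsync call can be wrapped in a loop over the slaves file. A rough sketch follows; sync_to_slaves, SLAVES_FILE, REMOTE_USER and DRY_RUN are hypothetical names, and the rsync invocation is written with "-e ssh" as a separate option followed by the source and then the destination.

```shell
#!/bin/sh
# Sketch: rsync one local directory to the same destination on every slave.
# sync_to_slaves and the three settings below are hypothetical.
SLAVES_FILE="${SLAVES_FILE:-hadoop-2.2.0/etc/hadoop/slaves}"
REMOTE_USER="${REMOTE_USER:-hadoop}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

sync_to_slaves() {
    src="$1"     # local directory, e.g. bwa-0.7.5
    dest="$2"    # remote destination, e.g. '~/app' (quoted so it expands remotely)
    while read -r host || [ -n "$host" ]; do
        # skip blank lines and comment lines in the slaves file
        case "$host" in ''|'#'*) continue ;; esac
        if [ "$DRY_RUN" = "1" ]; then
            printf '%s\n' "rsync -arvu -e ssh $src $REMOTE_USER@$host:$dest"
        else
            # stdin comes from /dev/null so the loop's input is left untouched
            rsync -arvu -e ssh "$src" "$REMOTE_USER@$host:$dest" < /dev/null
        fi
    done < "$SLAVES_FILE"
}
```

For example, sync_to_slaves bwa-0.7.5 '~/app' would print (or, with DRY_RUN=0, run) one rsync command per slave.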
The shell script to be executed will look like:
/home/hadoop/app/bwa-0.7.5/bwa aln /home/hadoop/data/bwa/hg19/hg19 A_1.fastq > /home/hadoop/job/A_1.sai
/home/hadoop/app/bwa-0.7.5/bwa aln /home/hadoop/data/bwa/hg19/hg19 A_2.fastq > /home/hadoop/job/A_2.sai
/home/hadoop/app/bwa-0.7.5/bwa sampe ... > /home/hadoop/job/A.sam
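A slightly more defensive variant of such a script might keep the paths in variables and stop as soon as one step fails, so that sampe never runs on a missing .sai file. This is only a sketch under stated assumptions: the sampe arguments follow the generic command shown earlier in this post, while FASTQ_DIR and SAMPLE are hypothetical parameters, since where the raw fastq files live on the slaves has not been settled yet.

```shell
#!/bin/sh
# Sketch of a parameterised 1.sh. BWA, INDEX and JOB_DIR default to the paths
# used in this post; FASTQ_DIR and SAMPLE are hypothetical.
BWA="${BWA:-/home/hadoop/app/bwa-0.7.5/bwa}"
INDEX="${INDEX:-/home/hadoop/data/bwa/hg19/hg19}"
JOB_DIR="${JOB_DIR:-/home/hadoop/job}"
FASTQ_DIR="${FASTQ_DIR:-.}"
SAMPLE="${SAMPLE:-A}"

align_pair() {
    r1="$FASTQ_DIR/${SAMPLE}_1.fastq"
    r2="$FASTQ_DIR/${SAMPLE}_2.fastq"
    sai1="$JOB_DIR/${SAMPLE}_1.sai"
    sai2="$JOB_DIR/${SAMPLE}_2.sai"
    # The && chain stops at the first failing step and returns its status.
    "$BWA" aln "$INDEX" "$r1" > "$sai1" &&
    "$BWA" aln "$INDEX" "$r2" > "$sai2" &&
    "$BWA" sampe "$INDEX" "$sai1" "$sai2" "$r1" "$r2" > "$JOB_DIR/${SAMPLE}.sam"
}
```

The script would end with a call to align_pair; overriding BWA with a stub makes it possible to try the flow on a machine that has no BWA installed.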
With all that said, we can now make some changes to the demo code. The new command will look like:
$hadoop jar ... -shell_script 1.sh -num_containers 2 -container_memory 8192