The holiday season is over. I finally have some time to post an update and add the link to the code. It is just a prototype for now; lots of things are still under development and testing.
Here are the links to the Client & ApplicationMaster:
https://github.com/home9464/BioinformaticsHadoop/blob/master/Client.java
https://github.com/home9464/BioinformaticsHadoop/blob/master/ApplicationMaster.java
***********************************************
The distributedshell example application in the Hadoop 2.2.0 release is a good starting point for playing with YARN. I spent some time over the past few days modifying this application to run a simple BWA alignment job.
In the original application, the parameter "-shell_command" specifies a single shell command to be executed in each container:
$hadoop jar hadoop-yarn-applications-distributedshell-2.2.0.jar \
org.apache.hadoop.yarn.applications.distributedshell.Client \
-jar hadoop-yarn-applications-distributedshell-2.2.0.jar \
-shell_command '/bin/date' -num_containers 2
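Under the hood, the ApplicationMaster wraps that command in a ContainerLaunchContext, which the NodeManager on a slave node then executes. Here is a minimal sketch of that step (class and method names below are mine for illustration; the real ApplicationMaster.java is considerably more involved):

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.ApplicationConstants;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.client.api.NMClient;
    import org.apache.hadoop.yarn.util.Records;

    public class LaunchSketch {
        // shellCommand holds the value of -shell_command, e.g. "/bin/date"
        void launchContainer(NMClient nmClient, Container container,
                String shellCommand) throws Exception {
            ContainerLaunchContext ctx =
                    Records.newRecord(ContainerLaunchContext.class);
            // The NodeManager runs this command in the container's working
            // directory; stdout/stderr go to the container's log directory.
            ctx.setCommands(Collections.singletonList(shellCommand
                    + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
                    + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
            nmClient.startContainer(container, ctx);
        }
    }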
For a simple BWA alignment, we need to do:
$/PATH/TO/BWA/bwa aln /PATH/TO/GENOME_INDEX A_1.fastq > A_1.sai
$/PATH/TO/BWA/bwa aln /PATH/TO/GENOME_INDEX A_2.fastq > A_2.sai
$/PATH/TO/BWA/bwa sampe /PATH/TO/GENOME_INDEX A_1.sai A_2.sai A_1.fastq A_2.fastq > A.sam
Apparently we need to put all these commands into a script, e.g. "1.sh".
Since this script is going to be executed on the slave nodes, we have three problems:
- what is the full path to the BWA binary executable (bwa) on the slave nodes?
- what is the full path to the indexed genome files for BWA (hg19.*) on the slave nodes?
- what is the full path to the raw sequencing files (*.fastq) on the slave nodes?
For the 1st and 2nd problems: because we expect to re-use the BWA application and the indexed genomes many times, we should keep them on every slave node and, most importantly, under exactly the same full paths:
/home/hadoop/app/bwa-0.7.5/bwa
/home/hadoop/data/bwa/hg19/hg19.*
To do so we can manually create the directories on each slave node:
$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/app/"; done
$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/data/"; done
Also we need to set up a working directory "job" on each node to hold output files like the "*.sai" files and the SAM file.
$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/job/"; done
Then sync the BWA application and the indexed genome files to each node:
$rsync -arvu -e ssh bwa-0.7.5 hadoop@slave1:~/app
$rsync -arvu -e ssh hg19 hadoop@slave1:~/data
The shell script to be executed will look like this:
/home/hadoop/app/bwa-0.7.5/bwa aln /home/hadoop/data/bwa/hg19/hg19 A_1.fastq > /home/hadoop/job/A_1.sai
/home/hadoop/app/bwa-0.7.5/bwa aln /home/hadoop/data/bwa/hg19/hg19 A_2.fastq > /home/hadoop/job/A_2.sai
/home/hadoop/app/bwa-0.7.5/bwa sampe ... > /home/hadoop/job/A.sam
With all that said, we can now make some changes to the demo code. The new command will look like this:
$hadoop jar ... -shell_script 1.sh -num_containers 2 -container_memory 8192
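The core of the change is shipping the script to the containers. Below is a minimal sketch of the client-side part, assuming the script is staged through HDFS (all class and variable names here are illustrative, not the actual code in Client.java):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.yarn.api.records.LocalResource;
    import org.apache.hadoop.yarn.api.records.LocalResourceType;
    import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
    import org.apache.hadoop.yarn.util.ConverterUtils;
    import org.apache.hadoop.yarn.util.Records;

    public class ShellScriptSketch {
        // shellScriptPath is the local path given via -shell_script, e.g. "1.sh"
        static Map<String, LocalResource> setupScript(Configuration conf,
                String shellScriptPath, String appId) throws Exception {
            FileSystem fs = FileSystem.get(conf);
            // Stage the script in a per-application directory on HDFS
            Path dst = new Path(fs.getHomeDirectory(), appId + "/1.sh");
            fs.copyFromLocalFile(new Path(shellScriptPath), dst);
            FileStatus status = fs.getFileStatus(dst);

            // Describe the file so the NodeManager can localize it into each
            // container's working directory before the command runs
            LocalResource scriptRsrc = Records.newRecord(LocalResource.class);
            scriptRsrc.setType(LocalResourceType.FILE);
            scriptRsrc.setVisibility(LocalResourceVisibility.APPLICATION);
            scriptRsrc.setResource(ConverterUtils.getYarnUrlFromPath(dst));
            scriptRsrc.setTimestamp(status.getModificationTime());
            scriptRsrc.setSize(status.getLen());

            Map<String, LocalResource> localResources =
                    new HashMap<String, LocalResource>();
            localResources.put("1.sh", scriptRsrc); // name seen inside the container
            return localResources;
        }
    }

The ApplicationMaster then attaches this LocalResource map to each container's ContainerLaunchContext and launches "sh 1.sh" instead of a single -shell_command value.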