Friday, December 20, 2013

How does YARN execute my job

I found some interesting "behind the curtain" stuff that helps to understand how YARN executes a job.

In the configuration file "yarn-site.xml", we can specify where LocalResource files are stored. Under that directory we find "usercache", which holds APPLICATION-level resources that are automatically cleaned up after a job finishes, and "filecache", which holds PUBLIC-level resources that are NOT deleted when the job finishes; they are kept as a cache and only removed when disk space becomes a concern.
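In our cluster that location is controlled by the "yarn.nodemanager.local-dirs" property; a minimal snippet would look like the following (the value is assumed from the paths that appear later in this post, so adjust it for your own setup):


$cat hadoop-2.2.0/etc/hadoop/yarn-site.xml
...
<property>
  <!-- where the NodeManager keeps localized files (usercache/, filecache/, nmPrivate/) -->
  <name>yarn.nodemanager.local-dirs</name>
  <value>/home/hadoop/localdirs</value>
</property>
...
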

In our script, we can add a command at the end that copies everything in the current working directory back to somewhere on the master machine so we can check it out.

$rsync -arvue ssh * hadoop@master:~/tmp/

Then if we have a look at hadoop@master:~/tmp/ we can find a few things:


container_tokens
default_container_executor.sh
launch_container.sh
tmp/


Here we find four entries: three files and a "tmp" folder. "container_tokens" is about security, so we can leave it for now. Let us have a look at the two scripts:



$cat default_container_executor.sh
#!/bin/bash

echo $$ > /home/hadoop/localdirs/nmPrivate/container_1387575550242_0001_01_000002.pid.tmp

/bin/mv -f /home/hadoop/localdirs/nmPrivate/container_1387575550242_0001_01_000002.pid.tmp /home/hadoop/localdirs/nmPrivate/container_1387575550242_0001_01_000002.pid
exec setsid /bin/bash -c "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/container_1387575550242_0001_01_000002/launch_container.sh"


"default_container_executor.sh" actually create a new session and call "launch_container.sh" in the container (slave).


$cat launch_container.sh

#!/bin/bash



export NM_HTTP_PORT="8042"
export LOCAL_DIRS="/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001"
export HADOOP_COMMON_HOME="/home/hadoop/hadoop-2.2.0"
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"
export NM_AUX_SERVICE_mapreduce_shuffle="AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
"
export HADOOP_YARN_HOME="/home/hadoop/hadoop-2.2.0"
export HADOOP_TOKEN_FILE_LOCATION="/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/container_1387575550242_0001_01_000002/container_tokens"
export NM_HOST="n1"
export JVM_PID="$$"
export USER="hadoop"
export HADOOP_HDFS_HOME="/home/hadoop/hadoop-2.2.0"
export PWD="/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/container_1387575550242_0001_01_000002"
export CONTAINER_ID="container_1387575550242_0001_01_000002"
export NM_PORT="48810"
export HOME="/home/"
export LOGNAME="hadoop"
export HADOOP_CONF_DIR="/home/hadoop/hadoop-2.2.0/etc/hadoop"
export MALLOC_ARENA_MAX="4"
export LOG_DIRS="/home/hadoop/logs/application_1387575550242_0001/container_1387575550242_0001_01_000002"
ln -sf "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/filecache/12/A_R2.fq" "A_R2.fq"
ln -sf "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/filecache/11/A_R1.fq" "A_R1.fq"
ln -sf "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/filecache/10/1.sh" "script.sh"
exec /bin/bash -c "/bin/bash script.sh 1>>stdout.txt 2>>stderr.txt"


OK, we can see lots of stuff. Most lines of the script just define environment variables. The "ln -sf ..." lines create symbolic links from the filecache into the current working directory. All these steps prepare for the last command:



exec /bin/bash -c "/bin/bash script.sh 1>>stdout.txt 2>>stderr.txt"


This time our own script "script.sh" is called in the current working directory. Inside our script we can use the previously defined environment variables such as "LOG_DIRS", just like the original example application "DistributedShell" from hadoop-2.2.0 does.
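For example (just a sketch; LOG_DIRS, CONTAINER_ID and NM_HOST are the variables exported by the launch_container.sh shown above, and in this run LOG_DIRS happens to be a single directory), our script could drop its own log file where YARN collects the container logs:


#!/bin/bash
# write a marker into the container's log directory, next to the other YARN logs
echo "container $CONTAINER_ID running on $NM_HOST" >> "$LOG_DIRS/script.log"
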

Since all LocalResource files are also linked into the current directory, our script can now work on these files directly.

#!/bin/bash
echo "Running"
cat A_R1.fq A_R2.fq > A.fq
echo "Moving results back"
rsync -arvue ssh * hadoop@master:/somewhere/to/store/outputs/
echo "Done"






Tuesday, December 17, 2013

The DistributedShell in hadoop 2.0 a.k.a YARN

[update: 01/05/2014]

The holiday season is over. I finally have some time to make an update and add the links to the code. It is just a prototype for now; lots of things are still under development and testing.

Here is the link to the Client & ApplicationMaster:

https://github.com/home9464/BioinformaticsHadoop/blob/master/Client.java

https://github.com/home9464/BioinformaticsHadoop/blob/master/ApplicationMaster.java



***********************************************
The distributedshell example application in Hadoop release 2.2.0 is a good place to start playing with YARN. I spent some time in the past few days modifying this application to run a simple BWA alignment job.


In the original application, the parameter "-shell_command" gives the command you want to run, so you can only run a single command like "cat && mouse && dog".


$hadoop jar hadoop-yarn-applications-distributedshell-2.2.0.jar \
org.apache.hadoop.yarn.applications.distributedshell.Client \
-jar hadoop-yarn-applications-distributedshell-2.2.0.jar \
-shell_command '/bin/date' -num_containers 2


For a simple BWA alignment, we need to do:

$/PATH/TO/BWA/bwa aln /PATH/TO/GENOME_INDEX A_1.fastq > A_1.sai
$/PATH/TO/BWA/bwa aln /PATH/TO/GENOME_INDEX A_2.fastq > A_2.sai
$/PATH/TO/BWA/bwa sampe /PATH/TO/GENOME_INDEX A_1.sai A_2.sai A_1.fastq A_2.fastq > A.sam


Apparently we need to put all these commands into a script, e.g. "1.sh".

Since this script is going to be executed on the slave nodes, we have three problems:


  1. what is the full path to the BWA executable (bwa) on the slave nodes?
  2. what is the full path to the indexed genome files for BWA (hg19.*) on the slave nodes?
  3. what is the full path to the raw sequencing files (*.fastq) on the slave nodes?


For the 1st and 2nd problems, because we expect to re-use the BWA application and the indexed genome many times, we should keep them on each slave node and, most importantly, at exactly the same full paths:

/home/hadoop/app/bwa-0.7.5/bwa
/home/hadoop/data/bwa/hg19/hg19.*


To do so we can manually create the directories on each slave node:

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/app/"; done

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/data/"; done


Also we need to set up a working directory "job" on each node to hold output files like "*.sai" and the SAM file.

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/job/"; done

Then sync the BWA application and the indexed genome files to each node:

$rsync -arvue ssh bwa-0.7.5 hadoop@slave1:~/app
$rsync -arvue ssh hg19 hadoop@slave1:~/data/bwa
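
Syncing to slave1 alone is just for illustration; to push the files to every slave node we can reuse the same loop as above (assuming the same ~/app and ~/data/bwa layout so the paths match the script below):

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do rsync -arvue ssh bwa-0.7.5 hadoop@$i:~/app; done
$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do rsync -arvue ssh hg19 hadoop@$i:~/data/bwa; done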


The shell script to be executed ("1.sh") will look like:

/home/hadoop/app/bwa-0.7.5/bwa aln /home/hadoop/data/bwa/hg19/hg19 A_1.fastq > /home/hadoop/job/A_1.sai

/home/hadoop/app/bwa-0.7.5/bwa aln /home/hadoop/data/bwa/hg19/hg19 A_2.fastq > /home/hadoop/job/A_2.sai

/home/hadoop/app/bwa-0.7.5/bwa sampe ... > /home/hadoop/job/A.sam


With all that said, now we can make some changes to the demo code. The new command will look like:

$hadoop jar ... -shell_script 1.sh -num_containers 2 -container_memory 8192








Friday, December 6, 2013

Back to work today

After refreshing myself with a 3-week vacation.

Then I learned some news this morning.

After the FDA's action, it looks like "gene based disease prediction" still has a long way to go in real health care, let alone "personalized medicine", which has been imagined for a long time.