Friday, December 20, 2013

How does YARN execute my job

I found some interesting "behind the curtain" stuff that helps to understand how YARN executes a job.

In the configuration file "yarn-site.xml", we can specify where LocalResource files are stored. Under that directory we find "usercache", which holds APPLICATION-level resources that are automatically cleaned up after a job finishes, and "filecache", which holds PUBLIC-level resources that are NOT deleted when the job finishes; they are kept as a cache and only removed when disk space becomes a concern.
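In our cluster that location is controlled by the "yarn.nodemanager.local-dirs" property; a minimal snippet would look like the following (the value is assumed from the paths that appear later in this post, so adjust it for your own setup):


$cat hadoop-2.2.0/etc/hadoop/yarn-site.xml
...
<property>
  <!-- where the NodeManager keeps localized files (usercache/, filecache/, nmPrivate/) -->
  <name>yarn.nodemanager.local-dirs</name>
  <value>/home/hadoop/localdirs</value>
</property>
...
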

In our script, we can add a command at the end that copies everything in the current working directory back to somewhere on the master machine so we can check it out.

$rsync -arvue ssh * hadoop@master:~/tmp/

Then if we have a look at hadoop@master:~/tmp/ we can find a few things:


container_tokens
default_container_executor.sh
launch_container.sh
tmp/


Here we find four entries: three files and a "tmp" folder. "container_tokens" is about security, so we can leave it for now. Let us have a look at the two scripts:



$cat default_container_executor.sh
#!/bin/bash

echo $$ > /home/hadoop/localdirs/nmPrivate/container_1387575550242_0001_01_000002.pid.tmp

/bin/mv -f /home/hadoop/localdirs/nmPrivate/container_1387575550242_0001_01_000002.pid.tmp /home/hadoop/localdirs/nmPrivate/container_1387575550242_0001_01_000002.pid
exec setsid /bin/bash -c "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/container_1387575550242_0001_01_000002/launch_container.sh"


"default_container_executor.sh" actually create a new session and call "launch_container.sh" in the container (slave).


$cat launch_container.sh

#!/bin/bash



export NM_HTTP_PORT="8042"
export LOCAL_DIRS="/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001"
export HADOOP_COMMON_HOME="/home/hadoop/hadoop-2.2.0"
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"
export NM_AUX_SERVICE_mapreduce_shuffle="AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
"
export HADOOP_YARN_HOME="/home/hadoop/hadoop-2.2.0"
export HADOOP_TOKEN_FILE_LOCATION="/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/container_1387575550242_0001_01_000002/container_tokens"
export NM_HOST="n1"
export JVM_PID="$$"
export USER="hadoop"
export HADOOP_HDFS_HOME="/home/hadoop/hadoop-2.2.0"
export PWD="/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/container_1387575550242_0001_01_000002"
export CONTAINER_ID="container_1387575550242_0001_01_000002"
export NM_PORT="48810"
export HOME="/home/"
export LOGNAME="hadoop"
export HADOOP_CONF_DIR="/home/hadoop/hadoop-2.2.0/etc/hadoop"
export MALLOC_ARENA_MAX="4"
export LOG_DIRS="/home/hadoop/logs/application_1387575550242_0001/container_1387575550242_0001_01_000002"
ln -sf "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/filecache/12/A_R2.fq" "A_R2.fq"
ln -sf "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/filecache/11/A_R1.fq" "A_R1.fq"
ln -sf "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/filecache/10/1.sh" "script.sh"
exec /bin/bash -c "/bin/bash script.sh 1>>stdout.txt 2>>stderr.txt"


OK, we can see lots of stuff. Most lines of the script just define environment variables. The "ln -sf ..." lines create symbolic links from the filecache into the current working directory. All these steps prepare for the last command:



exec /bin/bash -c "/bin/bash script.sh 1>>stdout.txt 2>>stderr.txt"


This time our own script "script.sh" is called in the current working directory. Inside our script we can use the previously defined environment variables such as "LOG_DIRS", just like the original example application "DistributedShell" from hadoop-2.2.0 does.
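For example (just a sketch; LOG_DIRS, CONTAINER_ID and NM_HOST are the variables exported by the launch_container.sh shown above, and in this run LOG_DIRS happens to be a single directory), our script could drop its own log file where YARN collects the container logs:


#!/bin/bash
# write a marker into the container's log directory, next to the other YARN logs
echo "container $CONTAINER_ID running on $NM_HOST" >> "$LOG_DIRS/script.log"
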

Since all LocalResource files are also linked into the current directory, our script can now work on these files directly.

#!/bin/bash
echo "Running"
cat A_R1.fq A_R2.fq > A.fq
echo "Moving results back"
rsync -arvue ssh * hadoop@master:/somewhere/to/store/outputs/
echo "Done"






Tuesday, December 17, 2013

The DistributedShell in hadoop 2.0 a.k.a YARN

[update: 01/05/2014]

The holiday season is over. I finally have some time to make an update and add the links to the code. It is just a prototype for now; lots of things are still under development and testing.

Here is the link to the Client & ApplicationMaster:

https://github.com/home9464/BioinformaticsHadoop/blob/master/Client.java

https://github.com/home9464/BioinformaticsHadoop/blob/master/ApplicationMaster.java



***********************************************
The distributedshell example application in Hadoop release 2.2.0 is a good place to start playing with YARN. I spent some time in the past few days modifying this application to run a simple BWA alignment job.


In the original application, the parameter "-shell_command" gives the command you want to run, so you can only run a single command like "cat && mouse && dog".


$hadoop jar hadoop-yarn-applications-distributedshell-2.2.0.jar \
org.apache.hadoop.yarn.applications.distributedshell.Client \
-jar hadoop-yarn-applications-distributedshell-2.2.0.jar \
-shell_command '/bin/date' -num_containers 2


For a simple BWA alignment, we need to do:

$/PATH/TO/BWA/bwa aln /PATH/TO/GENOME_INDEX A_1.fastq > A_1.sai
$/PATH/TO/BWA/bwa aln /PATH/TO/GENOME_INDEX A_2.fastq > A_2.sai
$/PATH/TO/BWA/bwa sampe /PATH/TO/GENOME_INDEX A_1.sai A_2.sai A_1.fastq A_2.fastq > A.sam


Apparently we need to put all these commands into a script, e.g. "1.sh".

Since this script is going to be executed on the slave nodes, we have three problems:


  1. what is the full path to the BWA executable (bwa) on the slave nodes?
  2. what is the full path to the indexed genome files for BWA (hg19.*) on the slave nodes?
  3. what is the full path to the raw sequencing files (*.fastq) on the slave nodes?


For the 1st and 2nd problems, because we expect to re-use the BWA application and the indexed genome many times, we should keep them on each slave node and, most importantly, at exactly the same full paths:

/home/hadoop/app/bwa-0.7.5/bwa
/home/hadoop/data/bwa/hg19/hg19.*


To do so we can manually create the directories on each slave node:

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/app/"; done

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/data/"; done


Also we need to set up a working directory "job" on each node to hold output files like "*.sai" and the SAM file.

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/job/"; done

Then sync the BWA application and the indexed genome files to each node:

$rsync -arvue ssh bwa-0.7.5 hadoop@slave1:~/app
$rsync -arvue ssh hg19 hadoop@slave1:~/data/bwa
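
Syncing to slave1 alone is just for illustration; to push the files to every slave node we can reuse the same loop as above (assuming the same ~/app and ~/data/bwa layout so the paths match the script below):

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do rsync -arvue ssh bwa-0.7.5 hadoop@$i:~/app; done
$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do rsync -arvue ssh hg19 hadoop@$i:~/data/bwa; done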


The shell script to be executed ("1.sh") will look like:

/home/hadoop/app/bwa-0.7.5/bwa aln /home/hadoop/data/bwa/hg19/hg19 A_1.fastq > /home/hadoop/job/A_1.sai

/home/hadoop/app/bwa-0.7.5/bwa aln /home/hadoop/data/bwa/hg19/hg19 A_2.fastq > /home/hadoop/job/A_2.sai

/home/hadoop/app/bwa-0.7.5/bwa sampe ... > /home/hadoop/job/A.sam


With all that said, now we can make some changes to the demo code. The new command will look like:

$hadoop jar ... -shell_script 1.sh -num_containers 2 -container_memory 8192








Friday, December 6, 2013

Back to work today

After refreshing myself with a 3-week vacation.

Then I learned some news this morning.

After the FDA's action, it looks like "gene based disease prediction" still has a long way to go in real health care, let alone "personalized medicine", which has been imagined for a long time.