In the configuration file "yarn-site.xml" we can specify where LocalResource files are stored on each node. Under that local directory we find "usercache", which holds APPLICATION-level resources that are automatically cleaned up once the job finishes, and "filecache", which holds PUBLIC-level resources that are NOT deleted when the job finishes; they are kept as a cache and only removed when disk space becomes a concern.
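The location of these local directories is controlled by the "yarn.nodemanager.local-dirs" property. As a sketch, assuming the same layout as the paths that appear in the scripts below, yarn-site.xml would contain something like:
$cat /home/hadoop/hadoop-2.2.0/etc/hadoop/yarn-site.xml   (excerpt)
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/home/hadoop/localdirs</value>
</property>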
In our script, we can add a command at the end that copies everything in the current working directory to somewhere on the master machine so we can inspect it:
$rsync -arvue ssh * hadoop@master:~/tmp/
If we then take a look at hadoop@master:~/tmp/, we find the following:
container_tokens
default_container_executor.sh
launch_container.sh
tmp/
Here we find four entries (files and folders). "container_tokens" is related to security, so we can leave it aside for now. Let us have a look at the two scripts:
$cat default_container_executor.sh
#!/bin/bash
echo $$ > /home/hadoop/localdirs/nmPrivate/container_1387575550242_0001_01_000002.pid.tmp
/bin/mv -f /home/hadoop/localdirs/nmPrivate/container_1387575550242_0001_01_000002.pid.tmp /home/hadoop/localdirs/nmPrivate/container_1387575550242_0001_01_000002.pid
exec setsid /bin/bash -c "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/container_1387575550242_0001_01_000002/launch_container.sh"
"default_container_executor.sh" actually create a new session and call "launch_container.sh" in the container (slave).
$cat launch_container.sh
#!/bin/bash
export NM_HTTP_PORT="8042"
export LOCAL_DIRS="/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001"
export HADOOP_COMMON_HOME="/home/hadoop/hadoop-2.2.0"
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"
export NM_AUX_SERVICE_mapreduce_shuffle="AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
"
export HADOOP_YARN_HOME="/home/hadoop/hadoop-2.2.0"
export HADOOP_TOKEN_FILE_LOCATION="/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/container_1387575550242_0001_01_000002/container_tokens"
export NM_HOST="n1"
export JVM_PID="$$"
export USER="hadoop"
export HADOOP_HDFS_HOME="/home/hadoop/hadoop-2.2.0"
export PWD="/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/container_1387575550242_0001_01_000002"
export CONTAINER_ID="container_1387575550242_0001_01_000002"
export NM_PORT="48810"
export HOME="/home/"
export LOGNAME="hadoop"
export HADOOP_CONF_DIR="/home/hadoop/hadoop-2.2.0/etc/hadoop"
export MALLOC_ARENA_MAX="4"
export LOG_DIRS="/home/hadoop/logs/application_1387575550242_0001/container_1387575550242_0001_01_000002"
ln -sf "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/filecache/12/A_R2.fq" "A_R2.fq"
ln -sf "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/filecache/11/A_R1.fq" "A_R1.fq"
ln -sf "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/filecache/10/1.sh" "script.sh"
exec /bin/bash -c "/bin/bash script.sh 1>>stdout.txt 2>>stderr.txt"
OK, we can see a lot here. Most lines of the script just define environment variables. The "ln -sf ..." lines create soft links from the file cache into the current working directory. All of these steps prepare for the last command:
exec /bin/bash -c "/bin/bash script.sh 1>>stdout.txt 2>>stderr.txt"
This time our own script, "script.sh", is called in the current working directory. Inside our script we can use the environment variables defined above, such as "LOG_DIRS", just as the original "DistributedShell" example application shipped with hadoop-2.2.0 does.
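For instance (a minimal sketch, not code taken from DistributedShell itself; the file name "extra.log" is just an example), our script could append its own messages to a file under "LOG_DIRS" so that they are kept together with the container's logs:
#!/bin/bash
# LOG_DIRS, CONTAINER_ID, NM_HOST and USER are exported by launch_container.sh above.
# Here LOG_DIRS holds a single directory; in general it may be a comma-separated list.
echo "container $CONTAINER_ID running on $NM_HOST as $USER" >> "$LOG_DIRS/extra.log"
echo "working directory: $PWD" >> "$LOG_DIRS/extra.log"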
Since all LocalResource files are also linked into the current working directory, our script can now operate on these files directly. For example:
#!/bin/bash
echo "Running"
# A_R1.fq and A_R2.fq are the LocalResource files linked into the working directory
cat A_R1.fq A_R2.fq > A.fq
echo "Moving results back"
# copy everything in the working directory back to the master
# (this requires passwordless ssh from the slave nodes to the master)
rsync -arvue ssh * hadoop@master:/somewhere/to/store/outputs/
echo "Done"