Saturday, February 15, 2014

BWA vs. SNAP

Goal

Compare BWA to SNAP

Approach





Preparation

#app


#data: simulated from hg19 chr20, paired-end, ~30x average coverage
$wgsim -N 10000000 -r 0.01 -1 100 -2 100 -S11 -e0 chr20.fa A1.fq A2.fq > mut.txt
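wgsim records each read's simulated origin in the read name, which is what makes the TP/FP evaluation of alignments possible. A minimal sketch of pulling the true coordinates back out (the exact name layout is an assumption; verify it against your wgsim version):

```shell
# wgsim names reads roughly as chrom_pos1_pos2_errors..._counter/mate
# (assumed layout; check your wgsim version). An alignment can be
# counted as a true positive when the aligner's reported chrom/pos
# matches these fields. The read name below is a made-up example.
name="chr20_123456_123789_0:0:0_0:0:0_a1/1"
true_chrom=$(echo "$name" | cut -d'_' -f1)
true_pos=$(echo "$name" | cut -d'_' -f2)
echo "$true_chrom:$true_pos"
```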



Running Time




BWA SNAP
30m20s 20m01s



Mapping metrics





Variants Concordance





Impression (so far)


1) SNAP is ~1.5x faster than BWA (memory consumption was not evaluated)
2) SNAP produced more true-positive and fewer false-positive alignments than BWA
3) SNAP generated both more TP (8512 vs. 740) and more FP (2715 vs. 596) variants than BWA
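The 1.5x figure follows from the wall times in the table; converting both to seconds:

```shell
# Convert the measured wall-clock times to seconds and compute the
# BWA/SNAP ratio behind the "~1.5x faster" claim.
bwa_s=$((30 * 60 + 20))    # 30m20s -> 1820 s
snap_s=$((20 * 60 + 1))    # 20m01s -> 1201 s
awk -v a="$bwa_s" -v b="$snap_s" 'BEGIN { printf "%.2fx\n", a/b }'   # -> 1.52x
```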






STAR RNASeq Aligner

STAR


#download app
$wget https://rna-star.googlecode.com/files/STAR_2.3.0e.Linux_x86_64.tgz

$tar -zxvf STAR_2.3.0e.Linux_x86_64.tgz

$cd STAR_2.3.0e.Linux_x86_64/


#splice junction data
$wget ftp://ftp2.cshl.edu/gingeraslab/tracks/STARrelease/STARgenomes/SpliceJunctionDatabases/gencode.v14.annotation.gtf.sjdb

#create the directory for the genome index
$mkdir hg19

#generate the genome index with splice junction annotations
$./STAR --runMode genomeGenerate --genomeDir hg19 --genomeFastaFiles /projects/confidential_sequence/home/sun/data/hg19/hg19.fa --runThreadN 4 --sjdbFileChrStartEnd gencode.v14.annotation.gtf.sjdb --sjdbOverhang 99
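A note on the --sjdbOverhang value: per the STAR documentation it should be max(read length) - 1, which is where 99 comes from for 100 bp reads.

```shell
# --sjdbOverhang = max(read length) - 1, per the STAR manual.
# For the 100 bp reads used here that gives 99.
read_len=100
echo $((read_len - 1))   # -> 99
```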

$mv hg19 ~/data/star_hg19

$cd ~

$mkdir -p test/STAR; cd ~/test/STAR


#full paths for the input files did NOT work; instead, create softlinks to them in the local folder.
$ln -s   .

#run
$time STAR_2.3.0e.Linux_x86_64/STAR --genomeDir ~/data/star_hg19 --readFilesIn *.fastq.gz --outFileNamePrefix  --runThreadN 10 --readFilesCommand zcat 1>std.txt 2>err.txt

The output is a SAM file named "OUTPUT.Aligned.out.sam" (the first part comes from --outFileNamePrefix).

By using an uncompressed suffix array index, STAR trades memory (>30 GB for the human genome) for speed.




Monday, February 10, 2014

Node.js is fun

I have been playing with Node.js for a few days and really love it. I am trying to build a prototype website with workflows.

Use case:



  • log into the system
  • upload a dataset in a supported format (fastq, sam/bam, vcf, bed, etc.)
  • describe the dataset
  • choose components/nodes; each node is an independent operation
  • connect components with edges as a DAG (directed acyclic graph)
  • fine-tune each component if necessary (add/remove/change parameter settings)
  • execute the flow
  • monitor the progress (check, terminate, pause)
  • check the output of each component and the final result
  • save the flow for future use, sharing, or publishing
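The "execute the flow" step implies running the DAG's components in dependency order. A hedged sketch using tsort from GNU coreutils (the component names here are made up for illustration):

```shell
# tsort reads whitespace-separated "from to" edge pairs and prints the
# nodes in a valid topological order -- the order a workflow engine
# would execute the components of this (hypothetical) DAG.
printf '%s\n' \
    "upload describe" \
    "describe align" \
    "align call_variants" | tsort
```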


#download the NoFlo.js flow demo
$git clone https://github.com/noflo/dataflow-noflo.git flow

$cd flow

$npm install

$grunt build

#start a simple http server to serve the contents (Python 2; with Python 3 use "python3 -m http.server")
$python -m SimpleHTTPServer

#You may see a log message like this:
#Serving HTTP on 0.0.0.0 port 8000

Now start a browser like Chrome, type in 

"192.168.1.2:8000/demo/" 

#192.168.1.2 is my ip address. YMMV.

You will see a very nice dataflow graph like this:





You can add/delete/drag/move nodes and edges. Cool!



Friday, December 20, 2013

How does YARN execute my job

Found some interesting "behind the curtain" stuff that helps to understand how YARN works.

In the configuration file "yarn-site.xml" we can specify where LocalResource files are stored. Under that directory we can find "usercache", which holds APPLICATION-level resources that are automatically cleaned up after the job finishes, and "filecache", which holds PUBLIC-level resources that are NOT deleted when the job finishes; they are kept as a cache and only removed if disk space becomes a concern.
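For reference, the directory in question is set via the yarn.nodemanager.local-dirs property; a minimal yarn-site.xml fragment (the path is an example matching this cluster's layout):

```xml
<!-- Where the NodeManager localizes resources. Under this directory
     live usercache/ (per-application, cleaned up after the job) and
     filecache/ (PUBLIC resources, kept as a cache). -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/home/hadoop/localdirs</value>
</property>
```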

In our script, we can add a command at the end that copies everything in the current working directory to somewhere on the master machine, so we can inspect it.

$rsync -arvue ssh * hadoop@master:~/tmp/

Then if we have a look at hadoop@master:~/tmp/ we find a few things:


container_tokens
default_container_executor.sh
launch_container.sh
tmp/


Here we find four entries (three files and a folder). "container_tokens" is about security, so we can leave it for now. Let us have a look at the two scripts:



$cat default_container_executor.sh
#!/bin/bash

echo $$ > /home/hadoop/localdirs/nmPrivate/container_1387575550242_0001_01_000002.pid.tmp

/bin/mv -f /home/hadoop/localdirs/nmPrivate/container_1387575550242_0001_01_000002.pid.tmp /home/hadoop/localdirs/nmPrivate/container_1387575550242_0001_01_000002.pid
exec setsid /bin/bash -c "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/container_1387575550242_0001_01_000002/launch_container.sh"


"default_container_executor.sh" records the container's PID, then creates a new session (setsid) and calls "launch_container.sh" inside the container on the slave node.


$cat launch_container.sh

#!/bin/bash



export NM_HTTP_PORT="8042"
export LOCAL_DIRS="/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001"
export HADOOP_COMMON_HOME="/home/hadoop/hadoop-2.2.0"
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"
export NM_AUX_SERVICE_mapreduce_shuffle="AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
"
export HADOOP_YARN_HOME="/home/hadoop/hadoop-2.2.0"
export HADOOP_TOKEN_FILE_LOCATION="/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/container_1387575550242_0001_01_000002/container_tokens"
export NM_HOST="n1"
export JVM_PID="$$"
export USER="hadoop"
export HADOOP_HDFS_HOME="/home/hadoop/hadoop-2.2.0"
export PWD="/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/container_1387575550242_0001_01_000002"
export CONTAINER_ID="container_1387575550242_0001_01_000002"
export NM_PORT="48810"
export HOME="/home/"
export LOGNAME="hadoop"
export HADOOP_CONF_DIR="/home/hadoop/hadoop-2.2.0/etc/hadoop"
export MALLOC_ARENA_MAX="4"
export LOG_DIRS="/home/hadoop/logs/application_1387575550242_0001/container_1387575550242_0001_01_000002"
ln -sf "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/filecache/12/A_R2.fq" "A_R2.fq"
ln -sf "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/filecache/11/A_R1.fq" "A_R1.fq"
ln -sf "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/filecache/10/1.sh" "script.sh"
exec /bin/bash -c "/bin/bash script.sh 1>>stdout.txt 2>>stderr.txt"


OK, we can see lots of stuff. Most lines of the script just define environment variables. Some lines ("ln -sf ...") make soft links from the filecache into the current working directory. All these steps prepare for the last command:



exec /bin/bash -c "/bin/bash script.sh 1>>stdout.txt 2>>stderr.txt"


This time our own script "script.sh" is called in the current working directory. In our scripts we can use the previously defined environment variables such as "LOG_DIRS", just like the original example application "DistributedShell" from hadoop-2.2.0 does.
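As a minimal illustration of that pattern (LOG_DIRS is faked locally here with mktemp; in a real container YARN exports it as shown in launch_container.sh):

```shell
# Simulate the environment launch_container.sh would provide.
export LOG_DIRS=$(mktemp -d)
# A user script can then write its progress into the container's
# log directory instead of hard-coding a path.
echo "bwa alignment started" >> "$LOG_DIRS/progress.log"
cat "$LOG_DIRS/progress.log"
```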

Since all LocalResource files are also linked into the current directory, our script can now operate on them directly.

#!/bin/bash
echo "Running"
cat A_R1.fq A_R2.fq > A.fq
echo "Moving results back"
rsync -arvue ssh * hadoop@master:/somewhere/to/store/outputs/
echo "Done"






Tuesday, December 17, 2013

The DistributedShell in hadoop 2.0 a.k.a YARN

[update: 01/05/2014]

The holiday season is over. I finally have some time to make an update and add the link to the code. It is just a prototype now; lots of things are still under development and testing.

Here is the link to the Client & ApplicationMaster:

https://github.com/home9464/BioinformaticsHadoop/blob/master/Client.java

https://github.com/home9464/BioinformaticsHadoop/blob/master/ApplicationMaster.java



***********************************************
The distributedshell example application in the Hadoop 2.2.0 release is a good starting point for playing with YARN. I spent some time in the past few days modifying this application to run a simple BWA alignment job.


In the original application, the parameter "-shell_command" specifies the command you want to run, so you can only pass a single command line such as "cat && mouse && dog".


$hadoop jar hadoop-yarn-applications-distributedshell-2.2.0.jar \
org.apache.hadoop.yarn.applications.distributedshell.Client \
-jar hadoop-yarn-applications-distributedshell-2.2.0.jar \
-shell_command '/bin/date' -num_containers 2


For a simple BWA alignment, we need to run:

$/PATH/TO/BWA/bwa aln /PATH/TO/GENOME_INDEX A_1.fastq > A_1.sai
$/PATH/TO/BWA/bwa aln /PATH/TO/GENOME_INDEX A_2.fastq > A_2.sai
$/PATH/TO/BWA/bwa sampe /PATH/TO/GENOME_INDEX A_1.sai A_2.sai A_1.fastq A_2.fastq > A.sam


Apparently we need to put all these commands into a script, e.g. "1.sh".

Since this script is going to be executed on slave nodes, here we have three problems:


  1. what is the full path to the BWA binary (bwa) on the slave nodes?
  2. what is the full path to the indexed genome files for BWA (hg19.*) on the slave nodes?
  3. what is the full path to the raw sequencing files (*.fastq) on the slave nodes?


For the 1st and 2nd problems: because we expect to re-use the BWA application and the indexed genome many times, we should keep them on every slave node and, most importantly, under exactly the same full paths:

/home/hadoop/app/bwa-0.7.5/bwa
/home/hadoop/data/bwa/hg19/hg19.*


To do so we can manually create the directories on each slave node:

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/app/"; done

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/data/"; done


We also need to set up a working directory "job" on each node to hold output files such as "*.sai" and the SAM file.

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/job/"; done

Then sync the BWA application and the indexed genome files to each node:

$rsync -arvu -e ssh bwa-0.7.5 hadoop@slave1:~/app
$rsync -arvu -e ssh hg19 hadoop@slave1:~/data


The shell script to be executed will look like this:

/home/hadoop/app/bwa-0.7.5/bwa aln /home/hadoop/data/bwa/hg19/hg19 A_1.fastq > /home/hadoop/job/A_1.sai

/home/hadoop/app/bwa-0.7.5/bwa aln /home/hadoop/data/bwa/hg19/hg19 A_2.fastq > /home/hadoop/job/A_2.sai

/home/hadoop/app/bwa-0.7.5/bwa sampe ... > /home/hadoop/job/A.sam


With all that said, we can now make some changes to the demo code. The new command looks like:

$hadoop jar ... -shell_script 1.sh -num_containers 2 -container_memory 8192








Friday, December 6, 2013

Back to work today

After refreshing myself with a 3-week vacation.

Then I learned some news this morning -


After FDA's action



It looks like "gene-based disease prediction" still has a long way to go in real health care, let alone "personalized medicine", which has been imagined for a long time.