Monday, April 20, 2015

Run a BWA job with docker

To run a BWA mapping job inside Docker, I want to create three containers: one for the "bwa" executable, one for the "bwa" genome index, and one for the input FASTQ files. All the benefits can be summarized in one word: "isolation".
  • 1. The bwa executable. I would put it into a volume /bioapp/bwa/0.7.9a/ in a Docker image named "yings/bioapp".
  • 2. The reference genome index, created with "bwa index". I would put it into a volume /biodata/hg19/index/bwa/ in a Docker image named "yings/biodata".
  • 3. The input FASTQ files, assumed to be under "/home/hadoop/fastq" on the host.
Create a image name "bioapp" with tag "v04202015" copy all files and folders under "app" to the "/bioapp/" in the container. For simplicity here other operations like installing dependencies were not included in the Dockerfile.
The Dockerfile looks like this:

    FROM ubuntu:14.04
    RUN mkdir -p /bioapp
    COPY app /bioapp/
    VOLUME /bioapp
    ENTRYPOINT /usr/bin/tail -f /dev/null
 
Build the bioapp image

$docker build -t yings/bioapp:v04202015 .
Create a image name "biodata" with tag "v04202015" copy all files and folders under "data" to the "/biodata/" in the container
 $cat > Dockerfile << EOF
 FROM ubuntu:14.04
 RUN mkdir -p /biodata
 COPY data /biodata/
 VOLUME /biodata
 ENTRYPOINT /usr/bin/tail -f /dev/null
 EOF
Build the biodata image
 $docker build -t yings/biodata:v04202015 .
Start the biodata container as a daemon and name it "biodata"
$docker run -d --name biodata yings/biodata:v04202015
Start the bioapp container as a daemon and name it "bioapp"
$docker run -d --name bioapp yings/bioapp:v04202015
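A quick sanity check that both data volume containers stayed up (just filtering the "docker ps" output by the container names chosen above):

#both "biodata" and "bioapp" should be listed as running
$docker ps | grep -E "biodata|bioapp"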
Now we should have two data volume containers running in the background. It is time to launch the final executor container.
$docker run -it --volumes-from biodata --volumes-from bioapp -v /home/hadoop/fastq:/fastq ubuntu:14.04 /bin/bash
These parameters mean:
  • "-it": run the executor container interactively
  • "--volumes-from biodata": load the data volume from the container "biodata" (do not confuse it with the image "yings/biodata")
  • "--volumes-from bioapp": load the data volume from the container "bioapp" (again, do not confuse it with the image "yings/bioapp")
  • "-v /home/hadoop/fastq:/fastq": mount the folder "/home/hadoop/fastq" on the host to "/fastq" in the executor container
  • "ubuntu:14.04": the standard image, as our OS
  • "/bin/bash": the command to be executed as the entry point
If everything goes well, you will see that you are now root inside the executor container:
root@5927eecc8530:/#
Is bwa there?
root@5927eecc8530:/# ls -R /bioapp/
Is genome index there?
root@5927eecc8530:/# ls -R /biodata/
Is fastq there?
root@5927eecc8530:/# ls -R /fastq/
Launch the job and save the alignment as "/fastq/A.sam"
root@5927eecc8530:/# /bioapp/bwa/0.7.9a/bwa mem -t 8 -R '@RG\tID:group_id\tPL:illumina\tSM:sample_id' /biodata/hg19/index/bwa/ucsc.hg19 /fastq/A_R1.fastq.gz /fastq/A_R2.fastq.gz > /fastq/A.sam
The process should complete in a few minutes, because the input FASTQ files used for this test are very small. Now you can safely exit the executor container by pressing "Ctrl-D". Since we mounted the host folder "/home/hadoop/fastq" to "/fastq" in the executor container, back on the host we will see the persistent output "A.sam" under "/home/hadoop/fastq". However, if we use "/biodata" or "/bioapp" as the output folder, for example "bwa ... > /biodata/A.sam", then the output is NOT persistent: if you remove the "biodata" container, all changes made inside it will be lost (stopping and restarting the container is fine, as long as it is not removed).
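For scripted use, the whole job can also run as a one-shot, non-interactive container. This is just a sketch using the same images, volumes, and paths as above (not how the job was run in this post), followed by a check on the host:

#one-shot run: the container is removed automatically when bwa finishes
$docker run --rm --volumes-from biodata --volumes-from bioapp -v /home/hadoop/fastq:/fastq ubuntu:14.04 /bin/bash -c "/bioapp/bwa/0.7.9a/bwa mem -t 8 /biodata/hg19/index/bwa/ucsc.hg19 /fastq/A_R1.fastq.gz /fastq/A_R2.fastq.gz > /fastq/A.sam"

#back on the host: the output written to the bind-mounted folder persists
$ls -lh /home/hadoop/fastq/A.sam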

Friday, April 17, 2015

Docker container to host reference application and library?

Recently I have been considering migrating from an AMI to a Docker image
for the reference repository, with an eye toward launching 20 or more on-demand AWS EC2 c3.4xlarge instances for WGS data analysis.

===AMI===

Pros:
    Fast to start
    No installation required
    No configuration required
    Secure

Cons:
    AWS only
    Building an AMI is a little time-consuming
    Version control is tricky
    Volume size will increase gradually


===Docker image===

Pros:
    Easy to build
    Deploy on any cloud or local platforms
    Easy to tag or version control
    Seems more popular and cooler

Cons:
    Where to host the repository? Docker Hub or S3? Many considerations
    Pulling images from a single repository onto many worker nodes at the same time is not practical; the networking would be challenging (see the sketch after this list)
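One workaround I am considering for both of these cons: skip a registry entirely and ship image tarballs through S3 with "docker save" / "docker load". A rough sketch, assuming a reasonably recent AWS CLI on every node and a hypothetical bucket named "my-bucket":

#on the build machine: export the image and upload the tarball to S3
$docker save yings/bioapp:v04202015 | gzip > bioapp-v04202015.tar.gz
$aws s3 cp bioapp-v04202015.tar.gz s3://my-bucket/images/bioapp-v04202015.tar.gz

#on each worker node: download the tarball and load it into the local Docker
$aws s3 cp s3://my-bucket/images/bioapp-v04202015.tar.gz - | gunzip | docker load

S3 should handle many worker nodes downloading in parallel far better than a single registry host, which is exactly the fan-out concern above.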



Definitely need more time to do lots of tests. Work in progress...
     

   

Friday, April 10, 2015

Lightweight MRAppMaster?

Hadoop v2 moves the master application from the Master Node into a container. The benefit is that this greatly reduces the workload on the Master Node when launching multiple MapReduce applications, and it makes the Hadoop framework scalable well beyond thousands of Worker Nodes.

However, this creates a problem in a small cluster: the master application "MRAppMaster" now occupies a whole container.

Suppose we have five Worker Nodes and every node has just one
container. Since "MRAppMaster" uses one container, only four containers are left for real processing, so 20% of the computing resources are "wasted". We can mitigate this problem by assigning two containers per node; that way only 10% of the computing resources are wasted. However, if we divide a node into two containers, each container's memory and CPU are also cut in half. Memory is precious in many bioinformatics applications: with less than 8GB of memory your aligner will probably fail.

If your Hadoop cluster has more than 10 nodes, this overhead is small enough that you do not need to worry about it.


How much space does a file use in HDFS

I have downloaded NA12878 from http://www.illumina.com/platinumgenomes/.

The total size is 572GB from 36 gzipped FASTQ files.

Now we want to put this dataset into HDFS with a replication factor of 3. How much space will it use?

#before:
$hdfs dfsadmin -report | less

Configured Capacity: 39378725642240 (35.81 TB)
Present Capacity: 36885873225728 (33.55 TB)
DFS Remaining: 28098049122304 (25.56 TB)
DFS Used: 8787824103424 (7.99 TB)
DFS Used%: 23.82%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0


$time hdfs dfs -put NA12878 /fastq/

real    113m45.058s
user    32m0.252s

sys     23m34.897s


#after:
$hdfs dfsadmin -report | less

Configured Capacity: 39378725642240 (35.81 TB)
Present Capacity: 36885873225728 (33.55 TB)
DFS Remaining: 26241290846208 (23.87 TB)
DFS Used: 10644582379520 (9.68 TB)
DFS Used%: 28.86%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0


Comparing before and after, we can see that the 572GB dataset actually uses 1.69TB (9.68TB - 7.99TB) of HDFS space. That is 3x, exactly matching the "dfs.replication" value in "hdfs-site.xml".
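For a quicker per-directory answer than diffing two dfsadmin reports, "hdfs dfs -du" works too; depending on the Hadoop version it prints the logical size and, in newer releases, the raw space consumed (logical size × replication) side by side:

#logical size of the uploaded dataset (and, on newer versions, the replicated usage)
$hdfs dfs -du -s -h /fastq/NA12878

#the replication factor currently configured
$hdfs getconf -confKey dfs.replication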



Saturday, April 4, 2015

Xen step 1

##########################################
##########################################
sudo vim /etc/network/interfaces

# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
# The loopback network interface
auto lo em1 xenbr0
iface lo inet loopback

# The primary network interface
#auto em1
#iface em1 inet dhcp

iface xenbr0 inet dhcp
bridge_ports em1
iface em1 inet manual

##########################################
#
#cat > /etc/xen/n0-hvm.cfg
##########################################
builder = "hvm"
name = "n0-hvm"
memory = "10240"
vcpus = 1
vif = ['']
disk = ['phy:/dev/vg0/lv0,hda,w','file:/home/hadoop/ubuntu-14.04.2-server-amd64.iso,hdc:cdrom,r']
vnc = 1
boot="dc"

xl create /etc/xen/n0-hvm.cfg

##########################################
#Connect to the installation process using VNC
##########################################
sudo apt-get install gvncviewer
gvncviewer localhost:0

##########################################
#list Virtual Machines
##########################################
xl list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0 53285    12     r-----      81.2
n0-hvm                                       2 10240     1     r-----       8.4
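For reference, the xl subcommands I use most often once the guest is running (domain name as defined in the config above):

#attach to the guest's console (detach with Ctrl-])
xl console n0-hvm

#graceful shutdown, or force-kill if the guest is stuck
xl shutdown n0-hvm
xl destroy n0-hvm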


##########################################
##########################################
sudo apt-get install virtinst virt-viewer

#"-r 10240": 10240MB memory
#"-n n0": name of this virtual machine
#"-f /dev/vg0/lv0": LV disk path
#"-c /home/hadoop/ubuntu-14.04.2-server-amd64.iso": path to the ISO
#"--vcpus 1": use 1 CPU
#"--network bridge=xenbr0": use xenbr0 as network

virt-install -n n0 \
--vcpus 1 \
-r 10240 \
--network bridge=xenbr0 \
-f /dev/vg0/lv0 \
-c /home/hadoop/ubuntu-14.04.2-server-amd64.iso
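Once virt-install has defined the guest, virt-viewer (installed above) can attach to its graphical console to finish the OS installation; minimal usage (it may need an explicit --connect URI depending on the libvirt setup):

#open the graphical console of the new guest named "n0"
virt-viewer n0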