To run a BWA mapping job inside Docker, I want to separate three components: a container for the "bwa" executable, a container for the "bwa" genome index, and the input FASTQ files on the host. The benefits can be summarized in one word: "isolation".
1. The bwa executable. I will put it into a volume /bioapp/bwa/0.7.9a/ in a Docker image named "yings/bioapp".
2. The reference genome index, created with "bwa index". I will put it into a volume /biodata/hg19/index/bwa/ in a Docker image named "yings/biodata".
3. The input FASTQ files. Assume they can be found under "/home/hadoop/fastq" on the host; they will be mounted directly into the executor container rather than baked into an image.
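For orientation, here is one possible layout on the host before we start. The exact file names are assumptions for illustration (a "bwa index" run produces the .amb/.ann/.bwt/.pac/.sa files), but they are consistent with the commands used later:

$find app data /home/hadoop/fastq -type f
app/bwa/0.7.9a/bwa
data/hg19/index/bwa/ucsc.hg19.amb
data/hg19/index/bwa/ucsc.hg19.ann
data/hg19/index/bwa/ucsc.hg19.bwt
data/hg19/index/bwa/ucsc.hg19.pac
data/hg19/index/bwa/ucsc.hg19.sa
/home/hadoop/fastq/A_R1.fastq.gz
/home/hadoop/fastq/A_R2.fastq.gz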
Create a image name "bioapp" with tag "v04202015"
copy all files and folders under "app" to the "/bioapp/" in the container. For simplicity here other operations like installing dependencies were not included in the Dockerfile.
The Dockerfile looks like this (the "tail -f /dev/null" entrypoint does nothing but keep the container alive, so its volume stays available to other containers):
FROM ubuntu:14.04
RUN mkdir -p /bioapp
COPY app /bioapp/
VOLUME /bioapp
ENTRYPOINT /usr/bin/tail -f /dev/null
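As an aside, if you preferred to compile bwa inside the image instead of copying a prebuilt binary, a sketch could look like the following. The git tag name "0.7.9a" is an assumption here; check the tags in the upstream repository:

FROM ubuntu:14.04
RUN apt-get update && apt-get install -y build-essential zlib1g-dev git
RUN git clone https://github.com/lh3/bwa.git /tmp/bwa && \
    cd /tmp/bwa && git checkout 0.7.9a && make && \
    mkdir -p /bioapp/bwa/0.7.9a && cp bwa /bioapp/bwa/0.7.9a/
VOLUME /bioapp
ENTRYPOINT /usr/bin/tail -f /dev/null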
Build the bioapp image:
$docker build -t yings/bioapp:v04202015 .
Create a image name "biodata" with tag "v04202015"
copy all files and folders under "data" to the "/biodata/" in the container
$cat >Dockerfile <<EOF
FROM ubuntu:14.04
RUN mkdir -p /biodata
COPY data /biodata/
VOLUME /biodata
ENTRYPOINT /usr/bin/tail -f /dev/null
EOF
Build the biodata image:
$docker build -t yings/biodata:v04202015 .
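Both images should now be listed:

$docker images | grep yings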
Start the biodata container as a daemon, naming it "biodata":
$docker run -d --name biodata yings/biodata:v04202015
Start the bioapp container as a daemon, naming it "bioapp":
$docker run -d --name bioapp yings/bioapp:v04202015
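A quick check that both data volume containers are up (the NAMES column should show "biodata" and "bioapp"):

$docker ps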
Now we should have two data volume containers running in the background. It is time to launch the final executor container:
$docker run -it --volumes-from biodata --volumes-from bioapp -v /home/hadoop/fastq:/fastq ubuntu:14.04 /bin/bash
Those parameters mean:
"-it" run the executor container interactively
"--volumes-from biodata" load the data volumes from the container "biodata" (do not confuse it with the image "yings/biodata")
"--volumes-from bioapp" load the data volumes from the container "bioapp" (again, do not confuse it with the image "yings/bioapp")
"-v /home/hadoop/fastq:/fastq" mount the directory "/home/hadoop/fastq" on the host to "/fastq" in the executor container
"ubuntu:14.04" the standard base image, serving as our OS
"/bin/bash" the command to be executed as the entry point
If everything goes well, you will see that you are now root inside the executor container:
root@5927eecc8530:/#
Is bwa there?
root@5927eecc8530:/# ls -R /bioapp/
Is genome index there?
root@5927eecc8530:/# ls -R /biodata/
Is fastq there?
root@5927eecc8530:/# ls -R /fastq/
Launch the job, saving the alignment as "/fastq/A.sam":
root@5927eecc8530:/# /bioapp/bwa/0.7.9a/bwa mem -t 8 -R '@RG\tID:group_id\tPL:illumina\tSM:sample_id' /biodata/hg19/index/bwa/ucsc.hg19 /fastq/A_R1.fastq.gz /fastq/A_R2.fastq.gz > /fastq/A.sam
Because the input FASTQ files in this test are very small, the process should complete in a few minutes.
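Before leaving the container, it is worth a quick sanity check that the alignment was written:

root@5927eecc8530:/# ls -l /fastq/A.sam
root@5927eecc8530:/# head -n 3 /fastq/A.sam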
Now you can safely terminate the executor container by pressing "Ctrl-D".
Since we mounted the host directory "/home/hadoop/fastq" to "/fastq" in the executor container, back on the host we will find the persistent output "A.sam" under "/home/hadoop/fastq".
However, if we had used "/biodata" or "/bioapp" as the output folder, for example "bwa ... > /biodata/A.sam", the output would NOT be persistent. If you remove the "biodata" container, all changes made inside it are lost. (Stopping and restarting the container is fine, as long as it is not removed.)
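When the data volume containers are no longer needed, stop and remove them; adding "-v" to "docker rm" also removes the volumes they own:

$docker stop bioapp biodata
$docker rm -v bioapp biodata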