To run a BWA mapping job inside Docker, I want to create three containers: one for the "bwa" executable, one for the "bwa" genome index, and one for the input FASTQ files. All the benefits can be summarized by one word: "isolation".
- 1. The bwa executable. I would put it into a volume /bioapp/bwa/0.7.9a/ in a Docker image named "yings/bioapp".
- 2. The reference genome index, created with "bwa index". I would put it into a volume /biodata/hg19/index/bwa/ in a Docker image named "yings/biodata".
- 3. The input FASTQ files, assumed to live under "/home/hadoop/fastq" on the host.
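Before building anything, it helps to see the host layout the steps below assume. The directories "app" and "data" are the build contexts referenced by the Dockerfiles; this sketch is purely illustrative (the actual contents of the author's directories are not shown in the post):

```shell
# Illustrative layout only: "app" is copied into the bioapp image and
# "data" into the biodata image by the COPY instructions below.
mkdir -p app/bwa/0.7.9a          # would hold the bwa binary
mkdir -p data/hg19/index/bwa     # would hold the "bwa index" output
ls -R app data
```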
Create an image named "bioapp" with tag "v04202015", copying all files and folders under "app" to "/bioapp/" in the container. For simplicity, other operations such as installing dependencies are not included in the Dockerfile.
The Dockerfile looks like this:

FROM ubuntu:14.04
RUN mkdir -p /bioapp
COPY app /bioapp/
VOLUME /bioapp
ENTRYPOINT /usr/bin/tail -f /dev/null
Build the bioapp image:

$ docker build -t yings/bioapp:v04202015 .
Create an image named "biodata" with tag "v04202015", copying all files and folders under "data" to "/biodata/" in the container:

$ cat >Dockerfile <<EOF
FROM ubuntu:14.04
RUN mkdir -p /biodata
COPY data /biodata/
VOLUME /biodata
ENTRYPOINT /usr/bin/tail -f /dev/null
EOF
Build the biodata image:

$ docker build -t yings/biodata:v04202015 .
Start the biodata container as a daemon, naming it "biodata":

$ docker run -d --name biodata yings/biodata:v04202015
Start the bioapp container as a daemon, naming it "bioapp":

$ docker run -d --name bioapp yings/bioapp:v04202015
Now we should have two data volume containers running in the background. It is time to launch the final executor container:

$ docker run -it --volumes-from biodata --volumes-from bioapp -v /home/hadoop/fastq:/fastq ubuntu:14.04 /bin/bash
Those parameters mean:

- "-it": run the executor container interactively
- "--volumes-from biodata": load the data volume from the container "biodata" (do not confuse it with the image "yings/biodata")
- "--volumes-from bioapp": load the data volume from the container "bioapp" (again, do not confuse it with the image "yings/bioapp")
- "-v /home/hadoop/fastq:/fastq": mount the directory "/home/hadoop/fastq" on the host to "/fastq" in the executor container
- "ubuntu:14.04": the standard image, as our OS
- "/bin/bash": the command to be executed as the entry point

If everything goes well, you will see you are root now in the executor container:

root@5927eecc8530:/#
Is bwa there?

root@5927eecc8530:/# ls -R /bioapp/

Is the genome index there?

root@5927eecc8530:/# ls -R /biodata/

Are the FASTQ files there?

root@5927eecc8530:/# ls -R /fastq/
Launch the job, saving the alignment as "/fastq/A.sam":

root@5927eecc8530:/# /bioapp/bwa/0.7.9a/bwa mem -t 8 -R '@RG\tID:group_id\tPL:illumina\tSM:sample_id' /biodata/hg19/index/bwa/ucsc.hg19 /fastq/A_R1.fastq.gz /fastq/A_R2.fastq.gz > /fastq/A.sam
The process should complete in a few minutes, because the input FASTQ files are very small (this is just a test). Now you can safely exit the executor container by pressing "Ctrl-D". Since we mounted the host directory "/home/hadoop/fastq" to "/fastq" in the executor container, back on the host we will see the persistent output "A.sam" under "/home/hadoop/fastq". However, if we use "/biodata" or "/bioapp" as the output folder, for example "bwa ... > /biodata/A.sam", the output is NOT persistent: if you remove the "biodata" container, all changes in that container will be lost. (Stopping and restarting the container is fine, as long as it is not removed.)
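The persistence difference above can be sketched with two short experiments (this assumes a working Docker daemon and the containers built earlier; "persisted.txt" is a hypothetical file name used only for illustration):

```shell
# 1. Writes under a bind mount survive the container that made them:
docker run --rm -v /home/hadoop/fastq:/fastq ubuntu:14.04 \
    sh -c 'echo test > /fastq/persisted.txt'
ls /home/hadoop/fastq/persisted.txt   # still on the host after the container is gone

# 2. Writes into a data-volume container live only as long as that container:
docker stop biodata && docker start biodata   # stop/start: data survives
docker rm -f biodata                          # remove: data in /biodata is lost
```

The practical rule of thumb: write job output only to bind-mounted host directories, and treat the data-volume containers as read-only inputs.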