Friday, October 2, 2015

Add and delete users in batch

#add some users to group "research" with initial password "<username>123"

g=research
#make sure the group exists first
sudo groupadd -f ${g}
for u in t1 t2 t3
do
#create the account with a home directory, then set the Unix and Samba passwords to "<username>123"
sudo useradd -d /home/${u} -m ${u} -g ${g}
echo ${u}:${u}123 | sudo chpasswd
echo -ne "${u}123\n${u}123\n" | sudo smbpasswd -s -a ${u}
done


#delete these users as well as their home directory

for u in t1 t2 t3
do 
sudo userdel -r $u 2>/dev/null
done


#configure samba

sudo apt-get install -y samba
cd /etc/samba/
sudo cp smb.conf smb.conf.bak
sudo vim smb.conf

############################
[global]
security = user

[store]
path = /store/public
valid users = @research
browsable = yes
writable = yes
guest ok = no
############################

sudo mkdir -p /store/public
sudo chmod -R 0755 /store/public
sudo chown -R hadoop:research /store/public
sudo service smbd restart
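
#To quickly verify that one of the new users can reach the share, a simple smoke test
#(assuming the smbclient package is installed and the server is local; "t1123" is the password set by the loop above)
smbclient //localhost/store -U t1 -c 'ls'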

Thursday, September 10, 2015

Redirect nohup's output

Some novice Linux users like to run a long-running pipeline with nohup:

"nohup mycommand"

which by default writes stdout and stderr to a file named "nohup.out" in the current working directory. If we have many of these files, it is easy to lose track of them because they all share the same name under different folders.

While we could tell the users to redirect their output to a managed path, it is hard to get them to run something like

"nohup mycommand 1>/log/a.log 2>&1"

It is just too complicated for them.

To solve this problem without changing their old habit, we can redefine nohup. Add the following code to the end of "/etc/bash.bashrc":

function nohup () {
    #give each run a unique log file under /log/tmp/
    UUID=$(cat /proc/sys/kernel/random/uuid)
    logfile=/log/tmp/${UUID}.txt
    printf "log:\n${logfile}\n"
    /usr/bin/nohup "$@" 1> ${logfile} 2>&1
}

The next time a user launches something with nohup, the system will print a message like this:

log:

/log/tmp/16170f02-7413-49fb-a8ea-0b5c621cdd93.txt


Tuesday, August 18, 2015

Evaluation on speedseq


Recently read an interesting article from here:

http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3505.html

The authors were kind enough to open-source it:
https://github.com/hall-lab/speedseq


So I checked out the code and ran some tests. Why is it faster than the popular GATK package? After spending a few hours on it, here are my conclusions:


1. speedseq uses a simplified process.

Basically it works like:

"mapping"->"remove duplicates"->"Call variants"

while traditional GATK works like:

"mapping"->"remove duplicates"->"realign"->"recalibrate"->"Call variants" plus lots of sorting and indexing between steps.


As indicated on the FreeBayes website, FreeBayes handles realignment and recalibration internally.



2. It uses GNU parallel to parallelize variant calling, because FreeBayes does not support threading or sub-processes.


So if you assign 8 CPU cores to speedseq, it will launch a pool of 8 processes to call variants with FreeBayes on each chromosome independently, as sketched below.
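
The idea looks roughly like this sketch. It is not speedseq's exact command line; the reference, BAM and chromosome names are placeholders:

#launch a pool of 8 FreeBayes processes, one chromosome per process
parallel -j 8 "freebayes -f ref.fa -r {} aligned.bam > {}.vcf" ::: chr1 chr2 chr3 chrX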

3. Other improvements.

For example, it uses sambamba instead of Picard (in my experience sambamba is at least 3x faster than Picard), and it uses samblaster to extract discordant and split reads right after alignment, so they are ready for later SV and CNV calling.
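
A rough reconstruction of that alignment stage (my own sketch, not speedseq's exact commands; file names and thread counts are made up):

#align, extract discordant/split reads and mark duplicates in one pass, then sort
bwa mem -t 8 ref.fa A_R1.fastq.gz A_R2.fastq.gz \
| samblaster --discordantFile disc.sam --splitterFile split.sam \
| sambamba view -S -f bam -t 8 -o aligned.bam /dev/stdin
sambamba sort -t 8 -o aligned.sorted.bam aligned.bam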


So what are the catches? Nothing can be perfect.

1. The components are tightly wrapped into the pipeline. Changing individual parameters is no easy task.

2. The variant calling process does not actually split the BAM. In fact it only rewrites the BAM header for a chromosome-based region to accelerate processing, so the caller does not need to walk through the whole genome. Each process still has to load the full-size reference genome and the full-size BAM file(s). As a result, the memory requirement will be a problem for some machines, and disk I/O can become a bottleneck when many processes try to read/write large files at the same time.

No wonder the authors ran their tests on a pretty good AWS EC2 c3.8xlarge instance with lots of memory and a large SSD.



Final words: even though it is not perfect, it is a wonderful solution for some people. Best of all, it is free! No need to pay the greedy and lovely Broad Institute $$$$$$$$$$$$$$$$$$ for the full GATK package.


Increasing the amount of inotify watchers

echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf && sudo sysctl -p
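
#to confirm the new limit is active
cat /proc/sys/fs/inotify/max_user_watches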

Saturday, August 15, 2015

The famous NA12878 WGS at 50x

Due to the lack of known true variants from real genome sequencing, many tools use NA12878 for benchmarking. For whole-genome sequencing at 50x depth, we can download these two files:
ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_1.fastq.gz
ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_2.fastq.gz
Some statistics of these two files:

NA12878 WGS 50x
Facts         ERR194147_1.fastq.gz                ERR194147_2.fastq.gz
Size          48G                                 49G
Reads#        787265109                           787265109
Read length   101                                 101
Coverage      101*787265109/3000000000 = 26.5x    101*787265109/3000000000 = 26.5x

Combining the two mates gives roughly 2 x 26.5x = 53x, which is where the "50x" comes from.
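
For the record, these numbers can be reproduced with plain shell tools; this is just one way to count them, and it takes a while on files of this size:

#number of reads = number of FASTQ lines / 4
zcat ERR194147_1.fastq.gz | wc -l | awk '{print $1/4}'
#length of the first read
zcat ERR194147_1.fastq.gz | head -2 | tail -1 | awk '{print length($0)}'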

Friday, June 12, 2015

Download sequencing data from illumina's basespace

Spent 30 minutes creating a script to download sequencing data from Illumina's BaseSpace, using its REST API.

Here is the code:

https://gist.github.com/anonymous/91c5accc3988cb233347

Tuesday, May 12, 2015

Simple SAMBA

sudo apt-get install -y samba
sudo mkdir -p /project/share
sudo chmod -R a+rwx /project/share
###############################
#/etc/samba/smb.conf
[share]
path = /project/share
available = yes
guest only=yes
read only = no
browseable = yes
public = yes
writable = yes
###############################
sudo service smbd restart

#access the share from a Windows machine:
\\10.2.5.212\share
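
From another Linux box the same share can be checked with smbclient or mounted over cifs (the IP is just the one from this example; mounting needs the cifs-utils package):

#list the shares anonymously
smbclient -N -L //10.2.5.212
#or mount the share as guest
sudo mount -t cifs //10.2.5.212/share /mnt -o guest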

Monday, April 20, 2015

Run a BWA job with docker

To run a BWA mapping job inside Docker, I want to create three containers: one for the "bwa" executable, one for the "bwa" genome index, and one for the input FASTQ files. All the benefits can be summarized by one word: "isolation".
  • 1. The bwa executable application. I will put it into a volume /bioapp/bwa/0.7.9a/ in a Docker image named "yings/bioapp".
  • 2. The reference genome index, which was created using "bwa index". I will put it into a volume /biodata/hg19/index/bwa/ in a Docker image named "yings/biodata".
  • 3. The input FASTQ files, assumed to be under "/home/hadoop/fastq" on the host.
Create a image name "bioapp" with tag "v04202015" copy all files and folders under "app" to the "/bioapp/" in the container. For simplicity here other operations like installing dependencies were not included in the Dockerfile.
The Dockerfile looks like this:

    FROM ubuntu:14.04
    RUN mkdir -p /bioapp
    COPY app /bioapp/
    VOLUME /bioapp
    ENTRYPOINT /usr/bin/tail -f /dev/null
 
build the bioapp image

$docker build -t yings/bioapp:v04202015 .
Create a image name "biodata" with tag "v04202015" copy all files and folders under "data" to the "/biodata/" in the container
 $cat >Dockerfile<
 FROM ubuntu:14.04
 RUN mkdir -p /biodata
 COPY data /biodata/
 VOLUME /biodata
 ENTRYPOINT /usr/bin/tail -f /dev/null
 EOF
Build the biodata image
 $docker build -t yings/biodata:v04202015 .
Start the biodata container as daemon, name it as "biodata"
$docker run -d --name biodata yings/biodata:v04202015
Start the bioapp container as daemon, name it as "bioapp"
$docker run -d --name bioapp yings/bioapp:v04202015
Now we should have two data-volume containers running in the background. It is time to launch the final executor container:
$docker run -it --volumes-from biodata --volumes-from bioapp -v /home/hadoop/fastq:/fastq ubuntu:14.04 /bin/bash
Those parameters mean:
  • "-it": run the executor container interactively
  • "--volumes-from biodata": load the data volume from container "biodata" (do not confuse it with the image "yings/biodata")
  • "--volumes-from bioapp": load the data volume from container "bioapp" (again, do not confuse it with the image "yings/bioapp")
  • "-v /home/hadoop/fastq:/fastq": mount the folder "/home/hadoop/fastq" on the host to "/fastq" in the executor container
  • "ubuntu:14.04": the standard image, as our OS
  • "/bin/bash": the command to be executed as the entry point
If everything goes well, you will see that you are root in the executor container:
root@5927eecc8530:/#
Is bwa there?
root@5927eecc8530:/# ls -R /bioapp/
Is genome index there?
root@5927eecc8530:/# ls -R /biodata/
Is fastq there?
root@5927eecc8530:/# ls -R /fastq/
Launch the job, save the alignment as "/fastq/output/A.sam"
root@5927eecc8530:/#/bioapp/bwa/0.7.9a/bwa mem -t 8 -R '@RG\tID:group_id\tPL:illumina\tSM:sample_id' /biodata/hg19/index/bwa/ucsc.hg19 /fastq/A_R1.fastq.gz  /fastq/A_R2.fastq.gz > /fastq/A.sam
The process should complete in a few minutes because the input FASTQ files are very small; this is just a test. Now you can safely terminate the executor container by pressing "Ctrl-D". Since we previously mounted the host folder "/home/hadoop/fastq" to "/fastq" in the executor container, back on the host we will see the persistent output "A.sam" under "/home/hadoop/fastq". However, if we use "/biodata" or "/bioapp" as the output folder, for example "bwa ... > /biodata/A.sam", the output is NOT persistent: if you terminate the "biodata" container, all changes in that container are lost. (Stopping and restarting the container is OK, as long as it is not terminated.)
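
Once the interactive run works, the same job can be fired as a one-shot, non-interactive container so nothing has to be typed inside it; a sketch reusing the paths above:

docker run --rm --volumes-from biodata --volumes-from bioapp \
  -v /home/hadoop/fastq:/fastq ubuntu:14.04 \
  /bin/bash -c "/bioapp/bwa/0.7.9a/bwa mem -t 8 /biodata/hg19/index/bwa/ucsc.hg19 \
    /fastq/A_R1.fastq.gz /fastq/A_R2.fastq.gz > /fastq/A.sam"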

Friday, April 17, 2015

Docker container to host reference application and library?

Recently I have been considering migrating from AMIs to Docker images
for the reference repository, thinking about launching 20 or more on-demand AWS EC2 c3.4xlarge instances for WGS data analysis.

===AMI===

Pros:
    Fast to start
    No installation required
    No configuration required
    Secure

Cons:
    AWS only
    Building AMI is a little time-consuming
    Version control is tricky
    Volume size will increase gradually


===Docker image===

Pros:
    Easy to build
    Deploy on any cloud or local platforms
    Easy to tag or version control
    Seems more popular and cooler

Cons:
    Where to host the repository?  The Hub or S3? Many considerations
    Pulling images from a single repository onto lots of worker nodes at the same time does not scale; networking would be challenging (a sketch of the S3 idea is below)
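
One workaround for the repository bottleneck would be to ship images as tarballs on S3 and load them locally on each node; a rough sketch (the bucket name is made up):

#on the build machine: save the image and push it to S3
docker save yings/bioapp:v04202015 | gzip > bioapp_v04202015.tar.gz
aws s3 cp bioapp_v04202015.tar.gz s3://my-docker-images/

#on each worker node: pull from S3 and load it
aws s3 cp s3://my-docker-images/bioapp_v04202015.tar.gz - | gunzip | docker load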



Definitely need more time to do lots of tests. Work in progress...
     

   

Friday, April 10, 2015

Lightweight MRAppMaster?

Hadoop v2 moves the master application from the Master Node into a container. The benefit is that this greatly reduces the workload on the Master Node when launching multiple MapReduce applications, and it makes the Hadoop framework scale more easily beyond thousands of Worker Nodes.

However, this brings a problem on a small cluster: the master application "MRAppMaster" will now occupy a whole container.

Suppose we have five Worker Nodes and every node has just one container. Since "MRAppMaster" uses one container, only four containers can be used for real processing, so 20% of the computing resources are "wasted". We can mitigate this problem by assigning two containers per node; that way only 10% of the computing resources are wasted. However, if we divide a node into two containers, every container's memory and CPU are cut in half too. Memory is precious in many bioinformatics applications: with less than 8G of memory your aligner will probably fail. An illustrative two-containers-per-node configuration is sketched below.
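
For example, to carve a node with 16GB reserved for YARN into two 8GB containers, the relevant knobs look roughly like this (illustrative values only, not a recommendation):

#yarn-site.xml (excerpt)
<property><name>yarn.nodemanager.resource.memory-mb</name><value>16384</value></property>
<property><name>yarn.scheduler.maximum-allocation-mb</name><value>8192</value></property>

#mapred-site.xml (excerpt)
<property><name>mapreduce.map.memory.mb</name><value>8192</value></property>
<property><name>mapreduce.reduce.memory.mb</name><value>8192</value></property>
<property><name>yarn.app.mapreduce.am.resource.mb</name><value>8192</value></property>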

If your Hadoop cluster has more than 10 nodes, do not bother taking this into consideration.


How much space does a file use in HDFS

I have downloaded NA12878 from http://www.illumina.com/platinumgenomes/.

The total size is 572GB, from 36 gzipped FASTQ files.

Now we want to put this dataset into HDFS with a replication factor of 3. How much space will it use?

#before:
$hdfs dfsadmin -report | less

Configured Capacity: 39378725642240 (35.81 TB)
Present Capacity: 36885873225728 (33.55 TB)
DFS Remaining: 28098049122304 (25.56 TB)
DFS Used: 8787824103424 (7.99 TB)
DFS Used%: 23.82%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0


$time hdfs dfs -put NA12878 /fastq/

real    113m45.058s
user    32m0.252s

sys     23m34.897s


#after:
$hdfs dfsadmin -report | less

Configured Capacity: 39378725642240 (35.81 TB)
Present Capacity: 36885873225728 (33.55 TB)
DFS Remaining: 26241290846208 (23.87 TB)
DFS Used: 10644582379520 (9.68 TB)
DFS Used%: 28.86%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0


Comparing before and after, we can see the 572GB dataset actually uses 1.69TB (9.68TB - 7.99TB) of HDFS space. That is 3x, exactly the value of "dfs.replication" in "hdfs-site.xml".
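
To double-check how much space a single path takes and how its blocks are replicated, something like this works:

#logical size of the dataset
hdfs dfs -du -s -h /fastq/NA12878
#block-level report, including average block replication
hdfs fsck /fastq/NA12878 -blocks | tail -20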



Saturday, April 4, 2015

Xen step1

##########################################
##########################################
sudo vim /etc/network/interfaces

# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
# The loopback network interface
auto lo em1 xenbr0
iface lo inet loopback

# The primary network interface
#auto em1
#iface em1 inet dhcp

iface xenbr0 inet dhcp
bridge_ports em1
iface em1 inet manual

##########################################
#
#cat > /etc/xen/n0-hvm.cfg
##########################################
builder = "hvm"
name = "n0-hvm"
memory = "10240"
vcpus = 1
vif = ['']
disk = ['phy:/dev/vg0/lv0,hda,w','file:/home/hadoop/ubuntu-14.04.2-server-amd64.iso,hdc:cdrom,r']
vnc = 1
boot="dc"

xl create /etc/xen/n0-hvm.cfg

##########################################
#Connect to the installation process using VNC
##########################################
sudo apt-get install gvncviewer
gvncviewer localhost:0

##########################################
#list Virtual Machines
##########################################
xl list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0 53285    12     r-----      81.2
n0-hvm                                       2 10240     1     r-----       8.4


##########################################
##########################################
sudo apt-get install virtinst virt-viewer

#"-r 10240": 10240MB memory
#"-n n0": name of this virtual machine
#"-f /dev/vg0/lv0": LV disk path
#"-c /home/hadoop/ubuntu-14.04.2-server-amd64.iso": path to the ISO
#"--vcpus 1": use 1 CPU
#"--network bridge=xenbr0": use xenbr0 as network

virt-install -n n0 \
--vcpus 1 \
-r 10240 \
--network bridge=xenbr0 \
-f /dev/vg0/lv0 \
-c /home/hadoop/ubuntu-14.04.2-server-amd64.iso

Friday, March 6, 2015

Download from BaseSpace using R

#under shell
sudo apt-get update
sudo apt-get install libcurl4-gnutls-dev
sudo R

#under R
source('http://bioconductor.org/biocLite.R')
biocLite('RCurl')
biocLite('BaseSpaceR')
quit()


#Start R again
library(BaseSpaceR)
ACCESS_TOKEN<- ''
PROJECT_ID<- '3289289'
aAuth<- AppAuth(access_token = ACCESS_TOKEN)
selProj <- Projects(aAuth, id = PROJECT_ID, simplify = TRUE) 
sampl <- listSamples(selProj, limit= 1000)
inSample <- Samples(aAuth, id = Id(sampl), simplify = TRUE)
for(s in inSample)
{
    f <- listFiles(s, Extensions = ".gz")
    print(Name(f))
    getFiles(aAuth, id= Id(f), destDir = 'fastq/', verbose = TRUE)
}
Reference: http://seqanswers.com/forums/showthread.php?t=47633

Friday, February 20, 2015

Install bcl2fastq-1.8.4 and bcl2fastq 2.15.0 to Ubuntu 14.04 Server 64 Bit

Goal:

Build a streamlined workflow from sequencer to analysis results, aka a "one-stop solution".

After a user starts an experiment by pushing a button on the sequencer's touch screen, all of the downstream analysis is done automatically.


The outputs from the MiSeq will be transferred to another Linux machine, where the intensity files will be converted into FASTQ files. After that, depending on the type of the experiment, an appropriate pipeline is launched on a cluster computer to process the data. Finally, a report is generated and uploaded to a web server for the user to view. Everything is done automatically, without any manual operations; no bioinformaticians or computer scientists are involved in the process. A sketch of the triggering glue is below.
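
The glue on the Linux machine can be as simple as a cron job that watches for finished runs and kicks off the conversion. A minimal sketch, assuming runs land under /data/miseq_runs and relying on the RTAComplete.txt marker the sequencer writes when a run finishes (paths and sample-sheet handling are left out):

#!/bin/bash
#convert any finished run that has not been processed yet
for run in /data/miseq_runs/*/
do
    if [ -f "${run}RTAComplete.txt" ] && [ ! -d "${run}fastq_out" ]
    then
        /usr/local/bin/configureBclToFastq.pl \
            --input-dir "${run}Data/Intensities/BaseCalls" \
            --output-dir "${run}fastq_out" \
        && make -C "${run}fastq_out" -j 8 \
        && echo "conversion done for ${run}"   #launch the downstream pipeline here
    fi
done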

Here I list the steps for the installation of "bcl2fastq" on the Linux machine.


#system
Ubuntu 14.04 Server 64 bit

#package
bcl2fastq-1.8.4

#dependency
sudo apt-get install alien dpkg-dev debhelper build-essential xsltproc gnuplot -y

#make a tmp folder
mkdir -p ~/tmp ; cd ~/tmp

#download the bcl2fastq RPM package from illumina. 
#The tar ball source code failed to compile on my system

wget ftp://webdata:webdata@ussd-ftp.illumina.com/Downloads/Software/bcl2fastq/bcl2fastq-1.8.4-Linux-x86_64.rpm

#http://seqanswers.com/forums/showthread.php?t=45649
sudo alien -i bcl2fastq-1.8.4-Linux-x86_64.rpm
curl -kL http://install.perlbrew.pl | bash
echo "source ~/perl5/perlbrew/etc/bashrc" >> ~/.bash_profile
perlbrew install perl-5.14.4
perlbrew switch perl-5.14.4
perlbrew install-cpanm

#install expat-2.1.0
wget -O expat-2.1.0.tar.gz "http://downloads.sourceforge.net/project/expat/expat/2.1.0/expat-2.1.0.tar.gz?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fexpat%2Ffiles%2Fexpat%2F2.1.0%2F&ts=1424461084&use_mirror=softlayer-dal"
tar -zxvf expat-2.1.0.tar.gz
cd expat-2.1.0 && ./configure && make && sudo make install

#install XML-Parser-2.41
wget http://pkgs.fedoraproject.org/repo/pkgs/perl-XML-Parser/XML-Parser-2.41.tar.gz/c320d2ffa459e6cdc6f9f59c1185855e/XML-Parser-2.41.tar.gz
tar -zxvf XML-Parser-2.41.tar.gz
cd XML-Parser-2.41 && perl Makefile.PL && make && sudo make install

#install XML module
cpanm XML/Simple.pm

#exit
exit

#the installation is done. Now make a test run.

#To run bcl2fastq we have to switch to a less-strict PERL environment
perlbrew switch perl-5.14.4 

#assume "test/Data/Intensities/BaseCalls" is the output folder from your sequecning machine
/usr/local/bin/configureBclToFastq.pl --input-dir test/Data/Intensities/BaseCalls --output-dir output 

#change to new output folder "output" and start the "from bcl to fastq" conversion
cd output && make -j $(grep -c ^processor /proc/cpuinfo)

#if you find "INFO: all completed successfully" in the last line of output then the test passed
#check out the output fastq files, here the "000000000-ABCDEF" is the FCID (Flow Cell ID)
ls -al Project_000000000-ABCDEF/Sample_lane1/




##################################
UPDATE!!!

Interestingly, Illumina claimed:
"""
Use the bcl2fastq 2.15.0 conversion software to convert NextSeq 500 or HiSeq X output. 
Version 2.15.0 is only for use with NextSeq and HiSeq X data. 
Use bcl2fastq 1.8.4 for MiSeq and HiSeq data conversion. 
The software is available for download in either an rpm or tarball (.tar.gz) format.
"""
http://support.illumina.com/downloads.html

However, I found that after the MiSeq's outputs were uploaded to BaseSpace, they were actually converted by bcl2fastq 2.15.0.
One key difference is that "SampleSheet.csv" has different formats for bcl2fastq 1.8.4 and bcl2fastq 2.15.0;
the "SampleSheet.csv" generated by the MiSeq matches the format for bcl2fastq 2.15.0.

It is a breeze to install bcl2fastq2-v2.15.0 on Ubuntu from source code.
It is mission impossible (almost; I spent a whole day trying different methods and then gave up) to do so with bcl2fastq-1.8.4 from source code.

wget ftp://webdata2:webdata2@ussd-ftp.illumina.com/downloads/Software/bcl2fastq/bcl2fastq2-v2.15.0.4.tar.gz
tar -zxvf bcl2fastq2-v2.15.0.4.tar.gz
cd bcl2fastq && mkdir build && cd build
../src/configure --prefix=/home/hadoop/tool/bcl2fastq && make && make install


#now make a test. Assuming "exp123" is our input folder
#first, all the ".bcl" files MUST be gzipped, otherwise the job will fail (this needs improvement, illumina!)

find exp123 -name "*.bcl" -exec gzip {} \;

#pull the trigger  
/home/hadoop/tool/bcl2fastq/bin/bcl2fastq -R exp123 -o exp123_fastq

Nice! I take back this morning's complaints about Illumina's software engineering team, made in frustration while installing and running bcl2fastq 1.8.4.

Good job, Illumina team, on bcl2fastq 2.15.0. You have my love again.