Wednesday, August 27, 2014

AWS AMI for Hadoop DataNode

##Create an AWS AMI as the snapshot for a Hadoop DataNode.
##This image should contain all necessary tools, packages and libraries
##used by the pipeline and your Hadoop application.

##Assume we have already started an Ubuntu 14.04 64-bit PV instance.
##The public IP for the instance is 12.34.56.78 and our private key file was saved as
##/home/hadoop/.ssh/aws.pem

AWS_IP=12.34.56.78
KEYFILE=/home/hadoop/.ssh/aws.pem
USER=ubuntu
########################
#step1. log in our AWS instance
########################
ssh -i ${KEYFILE} ${USER}@${AWS_IP}

########################
#step2. install basic packages
########################
sudo apt-get update
sudo apt-get install openjdk-7-jdk -y
sudo apt-get install make -y
sudo apt-get install cmake -y
sudo apt-get install gcc -y
sudo apt-get install g++ -y
sudo apt-get install zlib1g-dev -y
sudo apt-get install unzip -y
sudo apt-get install libncurses5-dev -y
sudo apt-get install r-base -y
sudo apt-get install python-dev -y
sudo apt-get install python-dateutil -y
sudo apt-get install python-psutil -y
sudo apt-get install python-pip -y
sudo apt-get install maven2 -y
sudo apt-get install libxml2-dev -y
sudo apt-get install gradle -y
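#equivalent one-liner (same packages as above; handy when scripting the AMI build)
sudo apt-get install -y openjdk-7-jdk make cmake gcc g++ zlib1g-dev unzip libncurses5-dev r-base python-dev python-dateutil python-psutil python-pip maven2 libxml2-dev gradle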

#install R packages (the commands below are typed into the R console opened by "sudo R")
sudo R
source('http://www.bioconductor.org/biocLite.R')
biocLite('edgeR')
biocLite('DESeq')
biocLite('limma')
#... all other necessary packages, then quit the console with q()
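#alternatively, a non-interactive install of the same packages via Rscript
#(a sketch; same Bioconductor bootstrap script, no console session needed)
sudo Rscript -e "source('http://www.bioconductor.org/biocLite.R'); biocLite(c('edgeR','DESeq','limma'))"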

########################
#step3. create folder structure
########################
TOOLS_HOME=~/tools
BIO_HOME=~/bioinformatics
APP=$BIO_HOME/app
DATA=$BIO_HOME/data

mkdir -p ${TOOLS_HOME}
mkdir -p ${APP}
mkdir -p ${DATA}

########################
#step4. install hadoop under ~/tools/
########################
cd $TOOLS_HOME 
wget http://apache.cs.utah.edu/hadoop/common/hadoop-2.5.0/hadoop-2.5.0.tar.gz
tar -zxvf hadoop-2.5.0.tar.gz &&  rm -f hadoop-2.5.0.tar.gz && cd
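#optional: expose Hadoop and Java on the shell environment (a minimal sketch; the JAVA_HOME
#path is Ubuntu 14.04's default for openjdk-7, adjust if your layout differs)
cat >> ~/.bashrc <<'EOF'
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=$HOME/tools/hadoop-2.5.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF
source ~/.bashrc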

#install s3cmd under ~/tools/
cd $TOOLS_HOME 
wget https://github.com/s3tools/s3cmd/archive/master.zip
unzip master.zip
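#the archive unpacks to s3cmd-master; install it and configure AWS credentials
#("s3cmd --configure" prompts for your access/secret keys - a sketch, adjust as needed)
cd s3cmd-master && sudo python setup.py install && cd -
s3cmd --configure
#example (hypothetical bucket): s3cmd get s3://your-bucket/sample_1.fastq ${DATA}/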

########################
#step5. install bioinformatics applications
########################

#BWA
cd $APP
wget http://downloads.sourceforge.net/project/bio-bwa/bwa-0.7.9a.tar.bz2
tar -jxvf bwa-0.7.9a.tar.bz2 && rm -fr bwa-0.7.9a.tar.bz2
cd bwa-0.7.9a && make && cd $APP
mkdir -p $APP/bwa/0.7.9a
find bwa-0.7.9a -executable -type f -print0 | xargs -0 -I {} mv {} $APP/bwa/0.7.9a/
rm -fr bwa-0.7.9a

#bowtie1
cd ${APP}
wget http://downloads.sourceforge.net/project/bowtie-bio/bowtie/1.0.1/bowtie-1.0.1-linux-x86_64.zip
unzip bowtie-1.0.1-linux-x86_64.zip && rm bowtie-1.0.1-linux-x86_64.zip
mkdir -p ${APP}/bowtie/1.0.1 && mv bowtie-1.0.1/* ${APP}/bowtie/1.0.1/ && rm -fr bowtie-1.0.1/

#bowtie2
cd ${APP}
wget http://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.2.3/bowtie2-2.2.3-linux-x86_64.zip
unzip bowtie2-2.2.3-linux-x86_64.zip && rm bowtie2-2.2.3-linux-x86_64.zip
mkdir -p ${APP}/bowtie/2.2.3 && mv bowtie2-2.2.3/* ${APP}/bowtie/2.2.3/ && rm -fr bowtie2-2.2.3/

#SNAP
cd ${APP}
curl http://snap.cs.berkeley.edu/downloads/snap-1.0beta.10-linux.tar.gz | tar xvz
mkdir -p ${APP}/snap/1.0beta.10/
mv snap-1.0beta.10-linux/* ${APP}/snap/1.0beta.10/ && rm -fr snap-1.0beta.10-linux

#GSNAP
cd ${APP}
curl http://research-pub.gene.com/gmap/src/gmap-gsnap-2014-06-10.tar.gz | tar xvz
cd gmap-2014-06-10 && ./configure --prefix=${APP}/gmap/2014-06-10/ && make && make install
rm -fr gmap-2014-06-10

#STAR
cd ${APP}
curl https://rna-star.googlecode.com/files/STAR_2.3.0e.Linux_x86_64.tgz | tar xvz
mkdir -p ${APP}/star/2.3.0e/
mv STAR_2.3.0e.Linux_x86_64/* ${APP}/star/2.3.0e/ && rm -fr STAR_2.3.0e.Linux_x86_64 

#Tophat2
cd ${APP}
curl http://ccb.jhu.edu/software/tophat/downloads/tophat-2.0.11.Linux_x86_64.tar.gz | tar xvz 
mkdir -p ${APP}/tophat/2.0.11 && mv tophat-2.0.11.Linux_x86_64/* ${APP}/tophat/2.0.11/ && rm -fr tophat-2.0.11.Linux_x86_64

#cufflinks
cd ${APP}
curl http://cufflinks.cbcb.umd.edu/downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz | tar xvz 
mkdir -p ${APP}/cufflinks/2.2.1 && mv cufflinks-2.2.1.Linux_x86_64/* ${APP}/cufflinks/2.2.1/ && rm -fr cufflinks-2.2.1.Linux_x86_64

#HTSeq
cd ${APP}
sudo apt-get install python-pip -y
sudo pip install numpy
sudo pip install scipy
curl https://pypi.python.org/packages/source/H/HTSeq/HTSeq-0.6.1p1.tar.gz | tar xvz 
cd HTSeq-0.6.1p1 &&  python setup.py build && sudo python setup.py install && cd -
sudo rm -fr HTSeq-0.6.1p1

#samtools
cd ${APP}
wget http://downloads.sourceforge.net/project/samtools/samtools/0.1.19/samtools-0.1.19.tar.bz2
tar -jxvf samtools-0.1.19.tar.bz2 && cd samtools-0.1.19/ && make && cd ${APP}
mkdir -p $APP/samtools/0.1.19/
find  samtools-0.1.19 -executable -type f -print0 | xargs -0 -I {} mv {} $APP/samtools/0.1.19/
rm -fr samtools-0.1.19*

#picard
cd ${APP}
wget http://downloads.sourceforge.net/project/picard/picard-tools/1.114/picard-tools-1.114.zip
unzip picard-tools-1.114.zip
mkdir -p ${APP}/picard/1.114/ && mv picard-tools-1.114/* ${APP}/picard/1.114/
rm -fr picard-tools-1.114 picard-tools-1.114.zip


#bamtools
cd $APP
#clone into bamtools-src so we do not later move bamtools/ into its own subdirectory
git clone https://github.com/pezmaster31/bamtools.git bamtools-src
cd $APP/bamtools-src && mkdir build && cd build && cmake .. && make && cd $APP
mkdir -p $APP/bamtools/2.3.0/ && mv bamtools-src/* $APP/bamtools/2.3.0/ && rm -fr bamtools-src


########################
#step6. install bioinformatics annotation files
########################

#hg19
mkdir -p ${DATA}/fasta/hg19/ && cd ${DATA}/fasta/hg19/
for i in {1..22} X Y M; do wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${i}.fa.gz; done
gunzip *.gz

#create index and dict for each chromosome files 
for i in *.fa
do
j=$(echo $i | cut -d"." -f1)
echo $j
java -jar ${APP}/picard/1.114/CreateSequenceDictionary.jar R=$j.fa O=$j.dict
${APP}/samtools/0.1.19/samtools faidx $j.fa
done
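#the whole-genome index builds below expect a combined hg19.fa, which the per-chromosome
#downloads above did not create; assemble it in chromosome order (a sketch)
cd ${DATA}/fasta/hg19/
for i in {1..22} X Y M; do cat chr${i}.fa; done > hg19.fa
${APP}/samtools/0.1.19/samtools faidx hg19.fa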

#build hg19 genome index for BWA
mkdir -p $DATA/index/hg19/bwa/
${APP}/bwa/0.7.9a/bwa index -p ${DATA}/index/hg19/bwa/hg19 ${DATA}/fasta/hg19/hg19.fa

#build hg19 genome index for novoalign
#(assumes Novocraft was installed under ${APP}/novocraft/3.02.05/; it is not covered in step 5)
mkdir -p ${DATA}/index/hg19/novoalign/
${APP}/novocraft/3.02.05/novoindex -k 14 -s 1 ${DATA}/index/hg19/novoalign/hg19.nix ${DATA}/fasta/hg19/hg19.fa

#build hg19 genome index for bowtie1
mkdir -p $DATA/index/hg19/bowtie1/
$APP/bowtie/1.0.1/bowtie-build $DATA/fasta/hg19/hg19.fa $DATA/index/hg19/bowtie1/hg19
cp $DATA/fasta/hg19/hg19.fa $DATA/index/hg19/bowtie1/

#build hg19 genome index for bowtie2
mkdir -p $DATA/index/hg19/bowtie2/
$APP/bowtie/2.2.3/bowtie2-build $DATA/fasta/hg19/hg19.fa $DATA/index/hg19/bowtie2/hg19
cp $DATA/fasta/hg19/hg19.fa $DATA/index/hg19/bowtie2/

#build hg19 genome index for SNAP (requires at least 64 GB of memory)
mkdir -p ${DATA}/index/snap/hg19
sudo sysctl vm.overcommit_memory=1
${APP}/snap/1.0beta.10/snap index ${DATA}/fasta/hg19/hg19.fa ${DATA}/index/snap/hg19
#${APP}/snap/1.0beta.10/snap paired ${DATA}/index/snap/hg19

#TODO: build hg19 genome index for GSNAP

#TODO: build hg19 genome index for STAR

#dbsnp
mkdir -p ${DATA}/misc/ && cd ${DATA}/misc/
curl ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All.vcf.gz | gunzip -c > dbsnp_hg19.vcf
#split by chromosome; match $1 exactly so chr1 does not also grab chr10-chr19
for chromosome in {1..22} X Y M;
do
awk -v c="$chromosome" '/^#/{print $0;next} $1 == "chr"c {print $0}' dbsnp_hg19.vcf > dbsnp_hg19.chr${chromosome}.vcf
done

Extend /home partition under VMware+Ubuntu

Say I want to add 512 GB more space to an existing Ubuntu guest hosted by VMware Workstation.


0. Power off your Ubuntu guest, then switch to VMware Workstation:
right-click the machine -> "Settings" -> "Hard Disk" in the left panel -> click "Utilities" in the right panel -> "Expand" -> enter the new total size. Since my system already has 512 GB and I want to add another 512 GB, the total would be "1024".

1. Check the existing disk partitions and write down the biggest number in the End column; here it is 3221225471.

$sudo fdisk -l

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      499711      248832   83  Linux
/dev/sda2          501758  2147481599  1073489921    5  Extended
/dev/sda3      2147481600  3221225471   536871936   83  Linux
/dev/sda5          501760  2147481599  1073489920   8e  Linux LVM



2. Add new partition

$sudo fdisk /dev/sda

press "n" to create a new partition; when prompted for "Partition type", enter "p".

when prompted for the first sector, enter 3221225472 (3221225471 + 1)

when prompted for the last sector, accept the default (use all remaining space)

then press "w" to write the new partition table to disk.

3. reboot 

$sudo reboot

4. Run "sudo fdisk -l" again. This time you should see a new partition, "/dev/sda4".

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      499711      248832   83  Linux
/dev/sda2          501758  2147481599  1073489921    5  Extended
/dev/sda3      2147481600  3221225471   536871936   83  Linux
/dev/sda4      3221225472  4294967294   536870911+  83  Linux
/dev/sda5          501760  2147481599  1073489920   8e  Linux LVM


5. Find the volume group device under "/dev/"; it is named "<hostname>-vg", e.g. "master-vg" if your hostname is master. Mine is "cloud-vg" below.
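You can list it with:

$ls -d /dev/*-vg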


6. Extend the volume group

$sudo vgextend cloud-vg /dev/sda4

You should see screen outputs like this:
  No physical volume label read from /dev/sda4
  Physical volume "/dev/sda4" successfully created

  Volume group "cloud-vg" successfully extended


7. Extend the logical volume

#add 512G

$sudo lvextend -L+512G /dev/cloud-vg/root

#or add all free space

$sudo lvextend -l+100%FREE /dev/cloud-vg/root

You should see screen outputs like this:

  Extending logical volume root to 1.98 TiB
  Logical volume root successfully resized


8. Final step - resize the filesystem
$sudo resize2fs /dev/cloud-vg/root


9. Check out the disk usage now

$df -h

MongoDB start and stop

#MongoDB

#create the data and log directories
mkdir ~/mongo && mkdir -p ~/log/mongodb

#start
mongod --fork --dbpath ~/mongo --logpath ~/log/mongodb/main.log


#shutdown
mongod --shutdown --dbpath ~/mongo
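#verify the server is up (a quick sanity check; assumes the mongo shell is installed)
pgrep mongod
mongo --eval 'db.runCommand({ping: 1})'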

Tuesday, August 19, 2014

bamsplitter

#!/bin/bash

#split the BAM file by chromosome on the NameNode's local disk
pathSamtools=$1 #the full path to samtools, e.g. /X/Y/Z/samtools
bamInputFile=$2 #input BAM file name, e.g. /A/B/C/d.bam
bamOutputDir=$3 #output folder name for the split BAM files; e.g. for "E" the full path will be /A/B/C/E/


#bamInputFile="/A/B/C/d.bam"

#/A/B/C/
bamOutputPath=${bamInputFile%/*}"/"${bamOutputDir}

#"d.bam"
bamFileName=$(basename "$bamInputFile")

#"d"
bamFileNameNoExt=${bamFileName%%.*}

#"/A/B/C/d.bam.bai"
bamFileIndex=${bamInputFile}."bai"

#"/A/B/C/d"
bamFilePathNoExt=${bamInputFile%%.*}

#"bam"
#bamFileNameExt="${bam##*.}"

bam=${bamInputFile}

numCPU=$(grep -c ^processor /proc/cpuinfo)

sorted=$(${pathSamtools} view -H ${bamInputFile} | grep SO:coordinate)
if [ -z "$sorted" ]; then
        #unsorted: sort first, then index
        ${pathSamtools} sort ${bamInputFile} ${bamFilePathNoExt}".sort"
        bam=${bamFilePathNoExt}".sort.bam"
        ${pathSamtools} index ${bam}
else
        #sorted: create the BAM index only if it does not exist
        if [ ! -f "${bamFileIndex}" ]; then
                ${pathSamtools} index ${bam}
        fi
fi

#create output folder if not exist
if [ ! -d "${bamOutputPath}" ]; then
        mkdir -p ${bamOutputPath}
fi

#split the BAM by chromosome
for i in `${pathSamtools} view -H ${bam} | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
do
#keep only this chromosome's @SQ header line (the trailing tab stops SN:chr1 from matching chr10)
${pathSamtools} view -@ ${numCPU} -h -F 0x4 -q 10 ${bam} $i  | \
awk '{if(/^@SQ/ && !/^@SQ\tSN:'$i'\t/) next} {if(/^@/) print $0}{if ($3 == "'$i'") print $0}' | \
${pathSamtools} view -@ ${numCPU} -hbS - > ${bamOutputPath}/${bamFileNameNoExt}.$i.bam 2>/dev/null
done
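#example usage (hypothetical paths): writes d.chr1.bam, d.chr2.bam, ... under /A/B/C/by_chr/
#./bamsplitter.sh /home/hadoop/bioinformatics/app/samtools/0.1.19/samtools /A/B/C/d.bam by_chr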

Friday, August 1, 2014

Size of container

In Hadoop YARN, a container is a subunit of a physical DataNode. The size of a container greatly affects MapReduce performance, especially when the application itself supports multiple threads.

Let us say we have one DataNode with 4 cores and 8 GB memory, and we want to run BWA on the input "A_1.fastq". What are the options?

1) One container per DataNode. This container has all 4 cores and 6.4 GB memory (we do not want to starve the host DataNode). So we have only one BWA process, running like "bwa mem -t 4 ... A_1.fastq", with 6.4 GB of memory available to it.

2) Four containers per DataNode, each with 1 core and 1.6 GB memory. We have to split "A_1.fastq" into "A_1_1.fastq" ... "A_4_1.fastq", then start 4 parallel BWA processes running like "bwa mem -t 1 ... A_1_1.fastq", "bwa mem -t 1 ... A_2_1.fastq", etc., with 1.6 GB of memory available per process.

Finally we have to merge the resulting SAM files. Since our goal is to optimize the execution, the question is: "Which one is faster?"

Before jumping to the answer, we have to consider:

1) Smaller available memory means the input FASTQ files must be small; otherwise the process will fail.

2) The overhead: splitting and merging add time to the overall run, and every BWA process has to load the genome index into memory before it can map any reads.

Since BWA itself supports multiple threads, it seems like the best choice is option (1), one container per DataNode. Is it the best solution? No! Why? Because the ApplicationMaster itself will occupy one container.

Assume we have 5 DataNodes, each with one container. When we start a YARN-based MapReduce application, the ResourceManager will find a container to host the ApplicationMaster, and the ApplicationMaster will then start the computing containers. Since the ApplicationMaster occupies a full container, we are left with only 4 computing containers. That is a waste of computing resources, since we know the ApplicationMaster does not need anywhere near that much (4 cores and 6.4 GB memory). The figure below shows the node (red box) that is running the ApplicationMaster without doing the "real computation".

[Figure: cluster node list with the ApplicationMaster's node highlighted in red]
If we "ssh" into that "ApplicationMaster" node, we can see it is running a process named "MRAppMaster".

[Screenshot: process listing on the ApplicationMaster node showing the MRAppMaster process]
In other words, we wasted 20% of the computing resources. That is not a problem if you are running a 100-node cluster; in that case only 1% of the resources is "wasted". However, we do not need a big boss if we are a small team.

What about two containers per DataNode? Then we would have 2*5 = 10 containers in total, with only 10% of them wasted on the ApplicationMaster. But we run into the multiple-container problem again: the overhead...

This is just one of the tradeoff, or balancing, problems that we encounter here and there.
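For reference, the container sizes discussed above are controlled by a handful of yarn-site.xml settings. A minimal sketch for the 4-core/8 GB DataNode in option (2); the property names are standard YARN 2.x, while the values are just this example's:

<!-- resources one NodeManager offers to containers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>6553</value> <!-- ~6.4 GB, leaving headroom for the host DataNode -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
<!-- per-container allocation bounds -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1638</value> <!-- ~1.6 GB per container => up to 4 containers per node -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>6553</value>
</property>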