Monday, October 6, 2014

Install Oozie on Ubuntu 14.04 with Hadoop 2.5.0

Assume we use /home/hadoop/tools/ as the installation path for oozie-4.0.1, with hadoop-2.5.0 already installed under /home/hadoop/tools/hadoop-2.5.0.



1. download & build
cd /home/hadoop/tools/
curl http://apache.cs.utah.edu/oozie/4.0.1/oozie-4.0.1.tar.gz | tar -zx
cd /home/hadoop/tools/oozie-4.0.1
bin/mkdistro.sh -DskipTests -Dhadoopversion=2.5.0  -DjavaVersion=1.7 -DtargetJavaVersion=1.7

2. copy the built distro to a new folder
cp -R /home/hadoop/tools/oozie-4.0.1/distro/target/oozie-4.0.1-distro/oozie-4.0.1 /home/hadoop/tools/oozie
cd /home/hadoop/tools/oozie/bin
echo "PATH=$PWD:\$PATH" >> ~/.bash_profile
source ~/.bash_profile
#now "/home/hadoop/tools/oozie" is the home path for oozie


3. prepare for oozie web console
cd /home/hadoop/tools/oozie/
mkdir libext
cp /home/hadoop/tools/oozie-4.0.1/hadooplibs/target/oozie-4.0.1-hadooplibs.tar.gz .
tar xzvf oozie-4.0.1-hadooplibs.tar.gz
cp oozie-4.0.1/hadooplibs/hadooplib-2.3.0.oozie-4.0.1/* libext/
cd libext
wget http://extjs.com/deploy/ext-2.2.zip
rm -fr /home/hadoop/tools/oozie/oozie-4.0.1  /home/hadoop/tools/oozie/oozie-4.0.1-hadooplibs.tar.gz

4. add these two proxy-user properties to "core-site.xml" under "/home/hadoop/tools/hadoop-2.5.0/etc/hadoop"

<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>



5. Start oozie
oozie-setup.sh prepare-war
#oozie-setup.sh sharelib create -fs hdfs://localhost:8020
oozie-setup.sh db create -run
oozied.sh start
oozie admin -oozie http://localhost:11000/oozie -status

6. Run an example
cd /home/hadoop/tools/oozie/
tar -zxvf oozie-examples.tar.gz
find examples/ -name "job.properties" -exec sed -i "s/localhost/master/g" '{}' \;
hdfs dfs -rm -f -r /user/hadoop/examples
hdfs dfs -put examples examples
oozie job -oozie http://master:11000/oozie -config examples/apps/map-reduce/job.properties -run
#now open a web browser and access "http://master:11000/oozie", you will see the submitted job.
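You can also check on the submitted job from the command line. A quick sketch (replace <job-id> with the ID printed by the "-run" command above):
#list recent jobs and inspect one of them
oozie jobs -oozie http://master:11000/oozie
oozie job -oozie http://master:11000/oozie -info <job-id>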
#reference:
http://www.quora.com/Has-anyone-tried-installing-oozie-4-0-0-with-apache-hadoop-2-2-0-and-java1-7-0_45-on-ubuntu

Monday, September 22, 2014

Learn Spark with Python

1. Install Spark
cd ~/tools/
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz
tar -zxvf spark-1.1.0.tgz

2. Build spark for hadoop2
cd ~/tools/spark-1.1.0
SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly


3. Install py4j
sudo pip install py4j


4. Modify ~/.bash_profile by adding two lines
export SPARK_HOME=$HOME/tools/spark-1.1.0
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

5. source the ~/.bash_profile
source ~/.bash_profile

6. Test. Start a python shell and type
import pyspark
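If the import succeeds, a quick end-to-end check is to run the Pi example that ships with the Spark sources via spark-submit. This is just a sketch; it assumes SPARK_HOME is set as in step 4 and uses a local master with 2 threads:
$SPARK_HOME/bin/spark-submit --master local[2] $SPARK_HOME/examples/src/main/python/pi.py 10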



Wednesday, August 27, 2014

AWS AMI for Hadoop DataNode





##Create an AWS AMI as the snapshot for hadoop datanode
##This image should contain all necessary tools, packages and libraries
##to be used by the pipeline and your hadoop-application

##assume the we already started a Ubuntu14.04-64-PV instance
##Public IP for the instance is 12.34.56.78 and our private key file was saved as 
##/home/hadoop/.ssh/aws.pem

AWS_IP=12.34.56.78
KEYFILE=/home/hadoop/.ssh/aws.pem
USER=ubuntu
########################
#step1. log in our AWS instance
########################
ssh -i ~/.ssh/aws.pem ubuntu@${AWS_IP}

########################
#step2. install basic packages
########################
sudo apt-get install openjdk-7-jdk -y
sudo apt-get install make -y
sudo apt-get install cmake -y
sudo apt-get install gcc -y
sudo apt-get install g++ -y
sudo apt-get install zlib1g-dev -y
sudo apt-get install unzip -y
sudo apt-get install libncurses5-dev -y
sudo apt-get install r-base -y
sudo apt-get install python-dev -y
sudo apt-get install python-dateutil -y
sudo apt-get install python-psutil -y
sudo apt-get install python-pip -y
sudo apt-get install maven2 -y
sudo apt-get install libxml2-dev -y
sudo apt-get install gradle -y

#install R packages (the lines below "sudo R" are typed at the R prompt)
sudo R
source('http://www.bioconductor.org/biocLite.R')
biocLite('edgeR')
biocLite('DESeq')
biocLite('limma')
#... all other necessary packages
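The same installation can be scripted non-interactively, which is handy when baking the AMI. A sketch (same packages as above):
sudo Rscript -e "source('http://www.bioconductor.org/biocLite.R'); biocLite(c('edgeR','DESeq','limma'))"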

########################
#step3. create folder structure
########################
TOOLS_HOME=~/tools
BIO_HOME=~/bioinformatics
APP=$BIO_HOME/app
DATA=$BIO_HOME/data

mkdir -p ${TOOLS_HOME}
mkdir -p ${APP}
mkdir -p ${DATA}
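These variables are used by all the later steps, so it may help to persist them the same way ~/.bash_profile is used elsewhere in these notes. A sketch:
echo "export TOOLS_HOME=~/tools" >> ~/.bash_profile
echo "export BIO_HOME=~/bioinformatics" >> ~/.bash_profile
echo "export APP=\$BIO_HOME/app" >> ~/.bash_profile
echo "export DATA=\$BIO_HOME/data" >> ~/.bash_profile
source ~/.bash_profile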

########################
#step4. install hadoop under ~/tools/
########################
cd $TOOLS_HOME 
wget http://apache.cs.utah.edu/hadoop/common/hadoop-2.5.0/hadoop-2.5.0.tar.gz
tar -zxvf hadoop-2.5.0.tar.gz && rm -f hadoop-2.5.0.tar.gz && cd

#install s3tools under ~/tools/
cd $TOOLS_HOME 
wget https://github.com/s3tools/s3cmd/archive/master.zip
unzip master.zip
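The zip only unpacks the sources; to put the s3cmd command on the PATH you can install it in place. A sketch (the s3cmd-master folder name comes from the GitHub zip):
cd s3cmd-master && sudo python setup.py install && cd -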

########################
#step5. install bioinformatics applications
########################

#BWA
cd $APP
wget http://downloads.sourceforge.net/project/bio-bwa/bwa-0.7.9a.tar.bz2
tar -jxvf bwa-0.7.9a.tar.bz2 && rm -fr bwa-0.7.9a.tar.bz2
cd bwa-0.7.9a && make && cd $APP
mkdir -p $APP/bwa/0.7.9a
find bwa-0.7.9a -executable -type f -print0 | xargs -0 -I {} mv {} $APP/bwa/0.7.9a/
rm -fr bwa-0.7.9a

#bowtie1
cd ${APP}
wget http://downloads.sourceforge.net/project/bowtie-bio/bowtie/1.0.1/bowtie-1.0.1-linux-x86_64.zip
unzip bowtie-1.0.1-linux-x86_64.zip && rm bowtie-1.0.1-linux-x86_64.zip
mkdir -p ${APP}/bowtie/1.0.1 && mv bowtie-1.0.1/* ${APP}/bowtie/1.0.1/ && rm -fr bowtie-1.0.1/

#bowtie2
cd ${APP}
wget http://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.2.3/bowtie2-2.2.3-linux-x86_64.zip
unzip bowtie2-2.2.3-linux-x86_64.zip && rm bowtie2-2.2.3-linux-x86_64.zip
mkdir -p ${APP}/bowtie/2.2.3 && mv bowtie2-2.2.3/* ${APP}/bowtie/2.2.3/ && rm -fr bowtie2-2.2.3/

#SNAP
cd ${APP}
curl http://snap.cs.berkeley.edu/downloads/snap-1.0beta.10-linux.tar.gz | tar xvz
mkdir -p ${APP}/snap/1.0beta.10/
mv snap-1.0beta.10-linux/* ${APP}/snap/1.0beta.10/ && rm -fr snap-1.0beta.10-linux

#GSNAP
cd ${APP}
curl http://research-pub.gene.com/gmap/src/gmap-gsnap-2014-06-10.tar.gz | tar xvz
cd gmap-2014-06-10 && ./configure --prefix=${APP}/gmap/2014-06-10/ && make && make install
rm -fr gmap-2014-06-10

#STAR
cd ${APP}
curl https://rna-star.googlecode.com/files/STAR_2.3.0e.Linux_x86_64.tgz | tar xvz
mkdir -p ${APP}/star/2.3.0e/
mv STAR_2.3.0e.Linux_x86_64/* ${APP}/star/2.3.0e/ && rm -fr STAR_2.3.0e.Linux_x86_64 

#Tophat2
cd ${APP}
curl http://ccb.jhu.edu/software/tophat/downloads/tophat-2.0.11.Linux_x86_64.tar.gz | tar xvz 
mkdir -p ${APP}/tophat/2.0.11 && mv tophat-2.0.11.Linux_x86_64/* ${APP}/tophat/2.0.11/ && rm -fr tophat-2.0.11.Linux_x86_64

#cufflinks
cd ${APP}
curl http://cufflinks.cbcb.umd.edu/downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz | tar xvz 
mkdir -p ${APP}/cufflinks/2.2.1 && mv cufflinks-2.2.1.Linux_x86_64/* ${APP}/cufflinks/2.2.1/ && rm -fr cufflinks-2.2.1.Linux_x86_64

#HTSeq
cd ${APP}
sudo apt-get install -y python-pip
sudo pip install numpy
sudo pip install scipy
curl https://pypi.python.org/packages/source/H/HTSeq/HTSeq-0.6.1p1.tar.gz | tar xvz 
cd HTSeq-0.6.1p1 &&  python setup.py build && sudo python setup.py install && cd -
sudo rm -fr HTSeq-0.6.1p1

#samtools
cd ${APP}
wget http://downloads.sourceforge.net/project/samtools/samtools/0.1.19/samtools-0.1.19.tar.bz2
tar -jxvf samtools-0.1.19.tar.bz2 && cd samtools-0.1.19/ && make && cd
mkdir -p $APP/samtools/0.1.19/
find  samtools-0.1.19 -executable -type f -print0 | xargs -0 -I {} mv {} $APP/samtools/0.1.19/
rm -fr samtools-0.1.19*

#picard
cd ${APP}
wget http://downloads.sourceforge.net/project/picard/picard-tools/1.114/picard-tools-1.114.zip
unzip picard-tools-1.114.zip
mkdir -p ${APP}/picard/1.114/ && mv picard-tools-1.114/* ${APP}/picard/1.114/
rm -fr picard-tools-1.114 picard-tools-1.114.zip


#bamtools
cd $APP
git clone https://github.com/pezmaster31/bamtools.git
cd $APP/bamtools && mkdir build && cd build && cmake .. && make
mkdir -p $APP/bamtools/2.3.0/ && mv $APP/bamtools/* $APP/bamtools/2.3.0/


########################
#step6. install bioinformatics annotation files
########################

#hg19
mkdir -p ${DATA}/fasta/hg19/ && cd ${DATA}/fasta/hg19/
for i in {1..22} X Y M; do wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${i}.fa.gz; done
gunzip *.gz
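The whole-genome index builders below (bwa, bowtie, SNAP, ...) expect a single hg19.fa, so concatenate the per-chromosome files first. A sketch (the chromosome order is my assumption):
cd ${DATA}/fasta/hg19/
cat chr{1..22}.fa chrX.fa chrY.fa chrM.fa > hg19.fa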

#create index and dict for each chromosome files 
for i in *.fa
do
j=$(echo $i | cut -d"." -f1)
echo $j
java -jar ${APP}/picard/1.114/CreateSequenceDictionary.jar R=$j.fa O=$j.dict
${APP}/samtools/0.1.19/samtools faidx $j.fa
done

#build hg19 genome index for BWA
mkdir -p $DATA/index/hg19/bwa/
${APP}/bwa/0.7.9a/bwa index -p ${DATA}/index/hg19/bwa/hg19 ${DATA}/fasta/hg19/hg19.fa

#build hg19 genome index for novoalign
mkdir -p ${DATA}/index/hg19/novoalign
${APP}/novocraft/3.02.05/novoindex -k 14 -s 1 ${DATA}/index/hg19/novoalign/hg19.nix ${DATA}/fasta/hg19/hg19.fa

#build hg19 genome index for bowtie1
mkdir -p $DATA/index/hg19/bowtie1/
$APP/bowtie/1.0.1/bowtie-build $DATA/fasta/hg19/hg19.fa $DATA/index/hg19/bowtie1/hg19
cp $DATA/fasta/hg19/hg19.fa $DATA/index/hg19/bowtie1/

#build hg19 genome index for bowtie2
mkdir -p $DATA/index/hg19/bowtie2/
$APP/bowtie/2.2.3/bowtie2-build $DATA/fasta/hg19/hg19.fa $DATA/index/hg19/bowtie2/hg19
cp $DATA/fasta/hg19/hg19.fa $DATA/index/hg19/bowtie2/

#build hg19 genome index for SNAP -require at least 64 GB of memory
mkdir -p ${DATA}/index/snap/hg19
sudo sysctl vm.overcommit_memory=1
${APP}/snap/1.0beta.10/snap index ${DATA}/fasta/hg19/hg19.fa ${DATA}/index/snap/hg19
#{APP}/snap/1.0beta.10/snap paired ${DATA}/index/snap/hg19

#TODO: build hg19 genome index for GSNAP

#TODO: build hg19 genome index for STAR

#dbsnp
mkdir -p ${DATA}/misc/ && cd ${DATA}/misc/
curl ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All.vcf.gz | gunzip -c > dbsnp_hg19.vcf
for chromosome in {1..22} X Y M; 
do
awk -v c="$chromosome" '/^#/{print $0;next} $1~"^chr"c {print $0}' dbsnp_hg19.vcf > dbsnp_hg19.chr${chromosome}.vcf
done

Extend /home partition under VMWare+Ubuntu

Say I want to add 512 GB more space to an existing Ubuntu VM hosted by VMware Workstation.


0. Shut down your Ubuntu system, then switch to VMware Workstation.
Right-click the machine -> "Settings" -> "Hard Disk" in the left panel -> click "Utilities" in the right panel -> "Expand" -> enter the new total space. Since my system already has 512 GB and I want to add 512 GB, the new total is "1024".

1. Check the existing disk partitions. Write down the biggest number in the End column; here it is 3221225471.

$sudo fdisk -l

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      499711      248832   83  Linux
/dev/sda2          501758  2147481599  1073489921    5  Extended
/dev/sda3      2147481600  3221225471   536871936   83  Linux
/dev/sda5          501760  2147481599  1073489920   8e  Linux LVM



2. Add new partition

$sudo fdisk /dev/sda

press "n" to create new partition, when when prompting "Partition type" , enter "p".

when prompting First sector, enter 3221225472 (3221225471 + 1) 

when prompting Last sector, accept defaults (use all remaining space)

then press "w" to write to disk.

3. reboot 

$sudo reboot

4. Run "sudo fdisk -l" again. This time you should see a new partition  "/dev/sda4"

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      499711      248832   83  Linux
/dev/sda2          501758  2147481599  1073489921    5  Extended
/dev/sda3      2147481600  3221225471   536871936   83  Linux
/dev/sda4      3221225472  4294967294   536870911+  83  Linux

/dev/sda5          501760  2147481599  1073489920   8e  Linux LVM


5. Find a folder under "/dev/" whose name ends with "-vg". It is named "<hostname>-vg", e.g. "master-vg" if your hostname is master (the examples below use "cloud-vg").


6. Extend the volume group

$sudo vgextend cloud-vg /dev/sda4

You should see screen outputs like this:
  No physical volume label read from /dev/sda4
  Physical volume "/dev/sda4" successfully created

  Volume group "cloud-vg" successfully extended


7. Extend the logical volume

#add 512G

$sudo lvextend -L+512G /dev/cloud-vg/root

#or add all free space

$sudo lvextend -l+100%FREE /dev/cloud-vg/root

You should see screen outputs like this:

  Extending logical volume root to 1.98 TiB
  Logical volume root successfully resized


8. Resize the filesystem
$sudo resize2fs /dev/cloud-vg/root


9. Check the disk usage now

$df -h

MongoDB start and stop

#MongoDB

#create
mkdir ~/mongo && mkdir -p ~/log/mongodb

#start
mongod --fork --dbpath ~/mongo --logpath ~/log/mongodb/main.log


#shutdown
mongod --shutdown --dbpath ~/mongo
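To verify the server came up before shutting it down, a quick check with the mongo shell (default port assumed):

#ping the server
mongo --eval 'db.runCommand({ ping: 1 })'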

Tuesday, August 19, 2014

bamsplitter

#!/bin/bash

#split the BAM file by chromosome on NameNode's local disk
pathSamtools=$1 #the full path to samtools, e.g. /X/Y/Z/samtools
bamInputFile=$2 #input BAM file name, e.g. /A/B/C/d.bam
bamOutputDir=$3 #output folder name for the split BAM files, e.g. "E"; the full path will be /A/B/C/E/


#bamInputFile="/A/B/C/d.bam"

#/A/B/C/
bamOutputPath=${bamInputFile%/*}"/"${bamOutputDir}

#"d.bam"
bamFileName=$(basename "$bamInputFile")

#"d"
bamFileNameNoExt=${bamFileName%%.*}

#"/A/B/C/d.bam.bai"
bamFileIndex=${bamInputFile}."bai"

#"/A/B/C/d"
bamFilePathNoExt=${bamInputFile%%.*}

#"bam"
#bamFileNameExt="${bam##*.}"

bam=${bamInputFile}

numCPU=$(grep -c ^processor /proc/cpuinfo)

sorted=$(${pathSamtools} view -H ${bamInputFile} | grep SO:coordinate)
if [ -z "$sorted" ];then
        #unsorted: sort first
        ${pathSamtools} sort ${bamInputFile} ${bamFilePathNoExt}".sort"
        bam=${bamFilePathNoExt}".sort.bam"
        #create BAM index
        ${pathSamtools} index ${bam}
else #sorted
        if [ ! -f "${bamFileIndex}" ];then #create BAM index if it does not exist
                ${pathSamtools} index ${bam}
        fi
fi

#create output folder if not exist
if [ ! -d "${bamOutputPath}" ]; then
        mkdir -p ${bamOutputPath}
fi

#split the BAM by chromosome
for i in `${pathSamtools} view -H ${bam} | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
do
${pathSamtools} view -@ ${numCPU} -h -F 0x4 -q 10 ${bam} $i  | \
awk '{if(/^@SQ/ && !/^@SQ\tSN:'$i'/) next} {if(/^@/) print $0}{if ($3~/^'$i'/) print $0}' | \
${pathSamtools} view -@ ${numCPU} -hbS - > ${bamOutputPath}/${bamFileNameNoExt}.$i.bam 2>/dev/null
done
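A usage sketch, assuming the script is saved as "bamsplitter.sh" (the output folder name "splitted" is just an example):

chmod +x bamsplitter.sh
./bamsplitter.sh /home/hadoop/app/samtools/0.1.19/samtools /A/B/C/d.bam splitted
#the per-chromosome files end up as /A/B/C/splitted/d.chr1.bam, /A/B/C/splitted/d.chr2.bam, ...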

Friday, August 1, 2014

Size of container

In Hadoop YARN, a container is a sub-unit of a physical DataNode. The size of a container greatly affects the performance of MapReduce, especially when the application itself supports multiple threads.

Let us say we have one DataNode with 4 cores and 8 GB of memory, and we want to run BWA with input "A_1.fastq". What are the options?

1) 1 container per DataNode. This container has all 4 cores and 6.4 GB memory (we do not want to starve the host DataNode). So we have only one BWA process running like "bwa mem -t 4 ... A_1.fastq" with 6.4GB available memory per BWA process. 

2) 4 containers, each with 1 core and 1.6 GB of memory. So we have to split "A_1.fastq" into "A_1_1.fastq" ... "A_4_1.fastq", then start 4 parallel BWA processes such as "bwa mem -t 1 ... A_1_1.fastq" and "bwa mem -t 1 ... A_2_1.fastq", etc., with 1.6 GB of available memory per process.

Finally, we have to merge the resulting SAM files. Since our goal is to optimize the execution, the question is: "Which one is faster?"

Before jumping to the answer, now we have to consider: 

1) Less available memory means the input FASTQ files must be smaller, otherwise the process will fail.

2) The overhead. Splitting and merging add some time to the overall running time, and every BWA process has to load the genome index into memory before mapping the reads.

Since BWA itself supports multiple threads, it seems like the best choice is option (1) - one container per DataNode. Is it the best solution? No. Why? Because the ApplicationMaster itself will occupy one container.

Assume we have 5 DataNodes and each DataNode has one container. When we start a YARN-based MapReduce application, the ResourceManager finds a container to host the ApplicationMaster, and the ApplicationMaster subsequently starts the computing containers. The ApplicationMaster itself occupies a full container, so we only have 4 computing containers left. That is a waste of computing resources, since we know the ApplicationMaster does not need that many resources (4 cores and 6.4 GB of memory). The figure below shows the node in the red box running the ApplicationMaster without doing any of the "real" computation.







If we "ssh" into that "ApplicationMaster" node, we can see it is running a process named "MRAppMaster".














In other words, we wasted 20% of the computing resources. That is not a problem if you are running a 100-node cluster; in that case only 1% of the resources are "wasted". However, we do not need a big boss if we are a small team.

What about two containers per DataNode? Then we would have 2*5 = 10 containers in total, with 10% of them wasted. But we run into the multiple-container problem again - the overhead...

This is just one of the tradeoff, or balancing, problems that we encounter here and there.
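For reference, the per-node container budget that drives this tradeoff is bounded by a few properties in yarn-site.xml. A sketch for the 4-core / 8 GB DataNode above; the exact values are illustrative, not a recommendation:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>6554</value> <!-- roughly 6.4 GB handed to YARN on this node -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>6554</value> <!-- option (1): allow one big container -->
</property>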


Monday, July 28, 2014

Split the BAM file by chromosome then process it with GATK's UnifiedGenotyper

To parallelize the execution of genotyping, one possible solution is to split the BAM files by chromosome (the safe method). For example, we have one big BAM file named "A.bam" and we want to use GATK's UnifiedGenotyper to call variants on it. The command may look like this:

time java -jar /home/hadoop/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/data/fasta/hg19/hg19.fa -I A.bam -o A.vcf
It may take hours for the above command to complete.

The 1st improvement

As an improvement, we can split "A.bam" by chromosome, then process the split BAM files in parallel:


for i in `samtools view -H A.bam | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
do
    samtools view -h -F 0x4 -q 10 A.bam $i | samtools view -hbS - > A.$i.bam
done
Now we have generated 20+ files named "A.chr1.bam", "A.chr2.bam", ... Now call GATK's UnifiedGenotyper:

time java -jar /home/hadoop/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/data/fasta/hg19/hg19.fa -I A.chr1.bam -o A.chr1.vcf

real    32m
It takes 32 minutes to complete. If you look at the output of the above command, you may find that UnifiedGenotyper does not stop after iterating over "chr1". Since A.chr1.bam only contains alignments on chr1, it makes no sense to waste time on the other chromosomes. Let us make another improvement.

The 2nd improvement

This time we split the reference FASTA file as well. Suppose we have "hg19.chr1.fa", "hg19.chr2.fa", ... under a folder.

for i in *.fa  
do  
   j=$(echo $i | cut -d"." -f1)  
   echo $j  
   java -jar ~/app/picard/1.114/CreateSequenceDictionary.jar R=$j.fa O=$j.dict  
   ~/app/samtools/0.1.19/samtools faidx $j.fa  
done  
Now call GATK's UnifiedGenotyper:


time java -jar /home/hadoop/bio/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/bio/data/fasta/hg19/hg19.chr1.fa -I A.chr1.bam -o A1.vcf

Let us see how much time it uses ... Again, UnifiedGenotyper does not stop after iterating over "chr1"; it continues to "chr2", then throws an error message and terminates. What is wrong here? The header of the split BAM files. Let us have a look at the header of "A.bam":


$samtools view -h A.bam | less

@HD     VN:1.3  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:162008842
...
@RG     ID:HELLO        LB:L01  PL:illumina     PU:barcode      SM:001
@PG     ID:bwa  PN:bwa  VN:0.7.9a-r786 
...

let us have a look at the header of "A.chr1.bam"


$samtools view -h A.chr1.bam | less

@HD     VN:1.3  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:162008842
...
@RG     ID:HELLO        LB:L01  PL:illumina     PU:barcode      SM:001
@PG     ID:bwa  PN:bwa  VN:0.7.9a-r786 
...

They are the same. Since "A.chr1.bam" only has the mappings on chr1, we do not need information on the other chromosomes, such as "@SQ SN:chr2 LN:243199373". Unfortunately, neither "samtools view" nor "bamtools split" can rewrite the header directly, so we have to do it manually. Here is a script to do so:


#!/bin/bash
#usage: bam_split_by_chr.sh <path to samtools> <input BAM> <output path>

SAMTOOLS=$1
BAM_INPUT=$2
PATH_OUTPUT=$3
BAMTOOLS=~/bio/app/bamtools/bin/bamtools/2.3.0/bamtools
#create the target directory
mkdir -p ${PATH_OUTPUT}

if [ -f $BAMTOOLS ];
then

 cd ${PATH_OUTPUT} && ${BAMTOOLS} split -in ${BAM_INPUT} -refPrefix "" -reference
 for i in *.chr*bam 
 do 
  sampleName=${i%.bam}
  chrName=${sampleName#*.}
  ${SAMTOOLS} view -h -F 0x4 -q 10 ${i} | awk '{if(/^@SQ/ && !/^@SQ\tSN:'$chrName'/) next} {if(/^@/) print $0}{if ($3~/^'$chrName'/) print $0}' | ${SAMTOOLS} view -hbS - > ${PATH_OUTPUT}/${sampleName}.tmp
  mv ${PATH_OUTPUT}/${sampleName}.tmp ${PATH_OUTPUT}/${sampleName}.bam
 done
else
 fileName=$(basename "$BAM_INPUT")
 #extension="${fileName##*.}"
 sampleName="${fileName%.*}"
 #split the bam
 for i in `${SAMTOOLS} view -H ${BAM_INPUT} | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
 do
  ${SAMTOOLS} view -h -F 0x4 -q 10 ${BAM_INPUT} $i | awk '{if(/^@SQ/ && !/^@SQ\tSN:'$i'/) next} {if(/^@/) print $0}{if ($3~/^'$i'/) print $0}' | ${SAMTOOLS} view -hbS - > ${PATH_OUTPUT}/${sampleName}.$i.tmp
  mv ${PATH_OUTPUT}/${sampleName}.$i.tmp ${PATH_OUTPUT}/${sampleName}.$i.bam
 done
fi

Save the above script as "bam_split_by_chr.sh", then split:


$chmod +x bam_split_by_chr.sh

$./bam_split_by_chr.sh /home/hadoop/samtools /home/hadoop/A.bam /home/hadoop/output

Now under /home/hadoop/output you will see "A.chr1.bam":


$samtools view -h A.chr1.bam | less
@HD     VN:1.3  SO:coordinate
@SQ     SN:chr1 LN:249250621
@RG     ID:HELLO        LB:L01  PL:illumina     PU:barcode      SM:001
@PG     ID:bwa  PN:bwa  VN:0.7.9a-r786 
...

OK, with this cleaned BAM file let us run UnifiedGenotyper again.


time java -jar /home/hadoop/bio/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/bio/data/fasta/hg19/hg19.chr1.fa -I A.chr1.bam -o A1.vcf

real:   7m

Now it only takes 7 minutes to process "A.chr1.bam". Finally, as a side effect, you will not be able to merge the split BAM files back into the original "A.bam" using "samtools merge":


$samtools merge A.bam A.chr*.bam
[bam_merge_core] different target sequence name: 'chr1' != 'chr2' in file 'A.chr2.bam'

It does not matter, anyway. We do not need to do that.
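Putting the pieces together, here is a sketch of running the per-chromosome jobs in parallel. It naively launches all of them at once, so make sure the machine has enough memory; paths follow the examples above:

for i in /home/hadoop/output/A.chr*.bam
do
    chr=$(basename $i .bam); chr=${chr##*.}
    java -jar /home/hadoop/bio/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
        -R /home/hadoop/bio/data/fasta/hg19/hg19.${chr}.fa -I $i -o A.${chr}.vcf &
done
wait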

Thursday, July 24, 2014

Counting lines in "fastq.gz"

Before splitting the "fastq.gz" into fragments, we need to count how many lines are in the raw file; then we can calculate how many lines go into each fragment.

If the raw file is quite big, the counting itself will take a long time because the file has to be decompressed first.

The common approach is using "zcat" with "wc". Is it the fastest?



#1. use zcat - 9.381s 
$time zcat A_R1.fq.gz | wc -l
 20000000

 real    0m9.381s
 user    0m7.423s
 sys     0m1.919s


#2. use zgrep - 16.258s
 $time zgrep -Ec "$" A_R1.fq.gz
 20000000

 real    0m16.258s
 user    0m13.109s
 sys     0m0.490s


#3. use pigz - 7.227s
 $sudo apt-get install pigz
 $time pigz -d -p 1 -c A_R1.fq.gz  | wc -l
 20000000

 real    0m7.227s
 user    0m5.544s
 sys     0m0.386s

#4. use pigz with 4 threads - 5.973s
 $time pigz -d -p 4 -c A_R1.fq.gz  | wc -l
 20000000

 real    0m5.973s
 user    0m5.599s
 sys     0m2.200s

By default, pigz uses all available processors, so "-p" is not necessary. The above command can be simplified as:

 $pigz -dc A_R1.fq.gz  | wc -l


Clearly, our winner is pigz.

I am going to write another blog on splitting the "fastq.gz" file with only one goal - as fast as possible. 

 1. Count lines
 2. Determine how many lines go into each split fastq file (see the sketch below)
 3. Unzip and split the paired "fastq.gz" files
 4. Zip the split "fastq" files again.
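As a preview of step 2, the line count per split file should stay a multiple of 4 so that no FASTQ record is cut in half. A sketch (the number of fragments is an assumption):

TOTAL=$(pigz -dc A_R1.fq.gz | wc -l)   #20000000 in the example above
N=8                                    #assumption: how many fragments we want
READS=$(( TOTAL / 4 ))
LINES_PER_FRAGMENT=$(( (READS + N - 1) / N * 4 ))
echo ${LINES_PER_FRAGMENT}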


Reference:
http://superuser.com/questions/135329/count-lines-in-a-compressed-file



Build hadoop-2.4.1-src

1. Download the Hadoop source:
http://apache.cs.utah.edu/hadoop/common/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz

2. Download protobuf 2.5.0 (required by the Hadoop build), then build and install it:
http://protobuf.googlecode.com/files/protobuf-2.5.0.tar.bz2

./configure && make && sudo make install && sudo ldconfig

3. Build Hadoop from the unpacked source directory:
mvn compile && mvn package
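"mvn package" alone does not produce a deployable tarball; the BUILDING.txt shipped with the source suggests something along these lines (a sketch, run from the hadoop-2.4.1-src directory):

mvn package -Pdist -DskipTests -Dtar
#add -Pnative as well if you want the native libraries compiled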


Monday, June 9, 2014

VirtualBox 4.3.12 configuration

File->"Preferences"->Click "Network" in left panel -> Add a new NAT network

Name -> "MyNAT"
CIDR: 192.168.0.0/24
Click the "Port forwarding" add port "22" in Ipv4.
Name "SSH"
Protocol "TCP"
Host IP "10.2.2.72" (run "cmd"->"ipconfig" to get it)
Host Port "22"
Guest IP "192.168.0.4" (The IP of the primary virtual machine)
Guest Port "22"


For each virtual machine, in "Settings"->"network"->"Adapter1"->"NAT Network" choose "MyNAT"
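The same NAT network can probably be created from the command line with VBoxManage; the syntax below is from memory and should be treated as a sketch (the GUI steps above are the tested path):

VBoxManage natnetwork add --netname MyNAT --network "192.168.0.0/24" --enable
VBoxManage natnetwork modify --netname MyNAT --port-forward-4 "SSH:tcp:[]:22:[192.168.0.4]:22"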

Wednesday, May 21, 2014

Windows+VirtualBox+Ubuntu+Xen+Ubuntu


What we have:

    Native machine installed with Windows 7.

Now I want to play with some virtual machines:

    1. Install VirtualBox into Windows 7, as a hypervisor
    2. Install Ubuntu(1) into VirtualBox, as a guest OS
    3. Install Xen into Ubuntu(1), as a nested hypervisor
    4. Install Ubuntu(2) into Xen, as a guest OS

Versions:
    VirtualBox 4.3.10 r93012
    Ubuntu 14.04 Server 
    Xen 4.4


It is quite easy to install Ubuntu into VirtualBox (steps 1 and 2). One thing to remember: do not assign all of the allocated space to Ubuntu(1). Suppose we decide to give 100 GB of disk space to the VMs inside VirtualBox; then Ubuntu(1), as the dom0 OS, should use as little space as possible, say 2 GB. The remaining 98 GB of disk space should be reserved for one or more Ubuntu(2) guest OSes. Otherwise Xen will hang when trying to "bring up CPUs".


#1. get the Xen package
$sudo apt-get install xen-hypervisor-4.4-amd64

#2. Set Xen as the default loading OS
$sudo sed -i 's/GRUB_DEFAULT=.*\+/GRUB_DEFAULT="Xen 4.4-amd64"/' /etc/default/grub
sudo update-grub

#3. use xl as toolstack
$sudo sed -i 's/TOOLSTACK=.*\+/TOOLSTACK="xl"/' /etc/default/xen
$sudo reboot

$sudo xl list

#4. Create primary partition on remaining disk space
sudo fdisk /dev/sda

... /dev/sda3

#5. restart 

sudo reboot

#6. create physical storage

sudo pvcreate /dev/sda3

#7. create volume group "xen-vg" on /dev/sda3

sudo vgcreate xen-vg /dev/sda3

#8. take a look at PVs. 

sudo pvs

  PV         VG     Fmt  Attr PSize  PFree

  /dev/sda3  xen-vg lvm2 a--  51.62g 51.62g

#9. create logical volume "ubuntu" guest OS with 10G disk space
sudo lvcreate -L 10G -n ubuntu /dev/xen-vg


#10. Tell it where to get the image - Ubuntu 14.10 (a.k.a. utopic)
sudo mkdir -p /var/lib/xen/images/ubuntu-netboot
$cd /var/lib/xen/images/ubuntu-netboot
$sudo wget http://ubuntu.cs.utah.edu/ubuntu/dists/utopic/main/installer-amd64/current/images/netboot/xen/initrd.gz
$sudo wget http://ubuntu.cs.utah.edu/ubuntu/dists/utopic/main/installer-amd64/current/images/netboot/xen/vmlinuz

#11. Configuration
$sudo vim /etc/xen/ubuntu.cfg

name = "ubuntu"
memory = 512
disk = ['phy:/dev/xen-vg/ubuntu,xvda,w']
vif = [' ']
kernel = "/var/lib/xen/images/ubuntu-netboot/vmlinuz"
ramdisk = "/var/lib/xen/images/ubuntu-netboot/initrd.gz"
extra = "debian-installer/exit/always_halt=true -- console=hvc0"

#12. Start installing the Ubuntu(2) 
sudo xl create /etc/xen/ubuntu.cfg -c  
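Once the installer finishes and the guest has rebooted, you can reattach to its console at any time (and detach again with Ctrl+]):

sudo xl console ubuntu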




Reference 

https://help.ubuntu.com/community/Xen

Wednesday, April 23, 2014

LocalCache in hadoop MRv2 aka YARN

The old DistributedCache is deprecated in the new API of hadoop 2.2.3. Now the suggested method is Job.addCacheFile.

By default, cached files are added to a special folder on each slave node.

Assume we have two slave nodes, namely "n1" and "n2". In the configuration file "yarn-site.xml" under /HADOOP_PATH/etc/hadoop/, you will find a property like this:

<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/home/hadoop/localdirs</value>
</property>


Here "/home/hadoop/localdirs" is the home path for all cached files. 

>ls /home/hadoop/localdirs
filecache  nmPrivate  usercache


If we add cache files per application, the cache files will be put under "usercache".





//conf is the job Configuration; fs is the HDFS FileSystem (e.g. FileSystem.get(conf))
Job job = Job.getInstance(conf, "HELLO");
FileStatus[] fileStatus = fs.listStatus(new Path("/user/hadoop/data/"));
for (FileStatus f : fileStatus)
{
        job.addCacheFile(new URI(f.getPath().toUri().toString() + "#_" + f.getPath().getName()));
}



Now switch to machine n1 or n2 and have a look at the local cache directory; you will find the cached file somewhere like this:

>find /hadoop/localdirs/ -name "hg19.fa"

/hadoop/localdirs/usercache/hadoop/appcache/application_1398026166795_0007/container_1398026166795_0007_01_000010/hg19.fa

Using the cache files inside the MR application is the interesting part. Our Mapper looks like this:





public static class MyMapper extends Mapper<Text, Text, Text, Text>
{
        private List<Path> cacheFiles;

        @Override
        protected void setup(Context context) throws IOException
        {
                /*
                URI[] uris = context.getCacheFiles();
                for (URI u : uris)
                {
                        System.out.println("CACHED:" + u.getPath());
                }
                */
                Path[] uris = context.getLocalCacheFiles();
                for (Path u : uris)
                {
                        System.out.println("CACHED:" + u.toString());
                }
        }

        //static Log LOG = LogFactory.getLog(MyMapper.class);
        @Override
        public void map(Text keySampleName, Text valueSplitFileName, Context context) throws IOException, InterruptedException
        {
                //use the cached files here.
                //context.write(keySampleName, valueSplitFileName);
        }
}



The new API to retrieve cached files is "context.getCacheFiles()", while "context.getLocalCacheFiles()" is deprecated. However, context.getCacheFiles() returns a URI array in which each element is an HDFS path (hdfs://master:50011/user/hadoop/...), while context.getLocalCacheFiles() returns a Path array in which each element is a local path (/hadoop/localdirs/usercache/...).

Besides, each container on every slave node will have a copy of all cached files, which means that if slave node "n1" runs 10 containers, you get 10 copies of the cached files on n1.