Monday, October 6, 2014

Install Oozie on Ubuntu 14.04 with Hadoop 2.5.0

Assume we use /home/hadoop/tools/ as the installation path for oozie-4.0.1, with hadoop-2.5.0 already installed under /home/hadoop/tools/hadoop-2.5.0.



1. download & build
cd /home/hadoop/tools/
curl http://apache.cs.utah.edu/oozie/4.0.1/oozie-4.0.1.tar.gz | tar -zx
cd /home/hadoop/tools/oozie-4.0.1
bin/mkdistro.sh -DskipTests -Dhadoopversion=2.5.0  -DjavaVersion=1.7 -DtargetJavaVersion=1.7

2. copy the built distro to a new folder
cp -R /home/hadoop/tools/oozie-4.0.1/distro/target/oozie-4.0.1-distro/oozie-4.0.1 /home/hadoop/tools/oozie
cd /home/hadoop/tools/oozie/bin
echo "PATH=$PWD:\$PATH" >> ~/.bash_profile
source ~/.bash_profile
#now "/home/hadoop/tools/oozie" is the home path for oozie


3. prepare for oozie web console
cd /home/hadoop/tools/oozie/
mkdir libext
cp /home/hadoop/tools/oozie-4.0.1/hadooplibs/target/oozie-4.0.1-hadooplibs.tar.gz .
tar xzvf oozie-4.0.1-hadooplibs.tar.gz
cp oozie-4.0.1/hadooplibs/hadooplib-2.3.0.oozie-4.0.1/* libext/
cd libext
wget http://extjs.com/deploy/ext-2.2.zip
rm -fr /home/hadoop/tools/oozie/oozie-4.0.1  /home/hadoop/tools/oozie/oozie-4.0.1-hadooplibs.tar.gz

4. add these two proxy-user properties to "core-site.xml" under "/home/hadoop/tools/hadoop-2.5.0/etc/hadoop"

<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>



5. Start oozie
oozie-setup.sh prepare-war
#oozie-setup.sh sharelib create -fs hdfs://localhost:8020
oozie-setup.sh db create -run
oozied.sh start
oozie admin -oozie http://localhost:11000/oozie -status

6. Run an example
cd /home/hadoop/tools/oozie/
tar -zxvf oozie-examples.tar.gz
find examples/ -name "job.properties" -exec sed -i "s/localhost/master/g" '{}' \;
hdfs dfs -rm -f -r /user/hadoop/examples
hdfs dfs -put examples examples
oozie job -oozie http://master:11000/oozie -config examples/apps/map-reduce/job.properties -run
#now open a web browser and access "http://master:11000/oozie", you will see the submitted job.
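You can also check on the submitted job from the command line. A quick sketch (replace <job-id> with the ID printed by the "-run" command above):
#list recent jobs and inspect one of them
oozie jobs -oozie http://master:11000/oozie
oozie job -oozie http://master:11000/oozie -info <job-id>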
#reference:
http://www.quora.com/Has-anyone-tried-installing-oozie-4-0-0-with-apache-hadoop-2-2-0-and-java1-7-0_45-on-ubuntu

Monday, September 22, 2014

Learn Spark with Python

1. Install Spark
cd ~/tools/
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz
tar -zxvf spark-1.1.0.tgz

2. Build spark for hadoop2
cd ~/tools/spark-1.1.0
SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly


3. Install py4j
sudo pip install py4j


4. Modify ~/.bash_profile by adding two lines
export SPARK_HOME=$HOME/tools/spark-1.1.0
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

5. source the ~/.bash_profile
source ~/.bash_profile

6. Test. Start a python shell and type
import pyspark
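If the import succeeds, a quick end-to-end check is to run the Pi example that ships with the Spark sources via spark-submit. This is just a sketch; it assumes SPARK_HOME is set as in step 4 and uses a local master with 2 threads:
$SPARK_HOME/bin/spark-submit --master local[2] $SPARK_HOME/examples/src/main/python/pi.py 10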



Wednesday, August 27, 2014

AWS AMI for Hadoop DataNode





##Create an AWS AMI as the snapshot for hadoop datanode
##This image should contain all necessary tools, packages and libraries
##to be used by the pipeline and your hadoop-application

##assume the we already started a Ubuntu14.04-64-PV instance
##Public IP for the instance is 12.34.56.78 and our private key file was saved as 
##/home/hadoop/.ssh/aws.pem

AWS_IP=12.34.56.78
KEYFILE=/home/hadoop/.ssh/aws.pem
USER=ubuntu
########################
#step1. log in our AWS instance
########################
ssh -i ~/.ssh/aws.pem ubuntu@${AWS_IP}

########################
#step2. install basic packages
########################
sudo apt-get install openjdk-7-jdk -y
sudo apt-get install make -y
sudo apt-get install cmake -y
sudo apt-get install gcc -y
sudo apt-get install g++ -y
sudo apt-get install zlib1g-dev -y
sudo apt-get install unzip -y
sudo apt-get install libncurses5-dev -y
sudo apt-get install r-base -y
sudo apt-get install python-dev -y
sudo apt-get install python-dateutil -y
sudo apt-get install python-psutil -y
sudo apt-get install python-pip -y
sudo apt-get install maven2 -y
sudo apt-get install libxml2-dev -y
sudo apt-get install gradle -y

#install R packages (the lines below "sudo R" are typed at the R prompt)
sudo R
source('http://www.bioconductor.org/biocLite.R')
biocLite('edgeR')
biocLite('DESeq')
biocLite('limma')
#... all other necessary packages
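The same installation can be scripted non-interactively, which is handy when baking the AMI. A sketch (same packages as above):
sudo Rscript -e "source('http://www.bioconductor.org/biocLite.R'); biocLite(c('edgeR','DESeq','limma'))"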

########################
#step3. create folder structure
########################
TOOLS_HOME=~/tools
BIO_HOME=~/bioinformatics
APP=$BIO_HOME/app
DATA=$BIO_HOME/data

mkdir -p ${TOOLS_HOME}
mkdir -p ${APP}
mkdir -p ${DATA}
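These variables are used by all the later steps, so it may help to persist them the same way ~/.bash_profile is used elsewhere in these notes. A sketch:
echo "export TOOLS_HOME=~/tools" >> ~/.bash_profile
echo "export BIO_HOME=~/bioinformatics" >> ~/.bash_profile
echo "export APP=\$BIO_HOME/app" >> ~/.bash_profile
echo "export DATA=\$BIO_HOME/data" >> ~/.bash_profile
source ~/.bash_profile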

########################
#step4. install hadoop under ~/tools/
########################
cd $TOOLS_HOME 
wget http://apache.cs.utah.edu/hadoop/common/hadoop-2.5.0/hadoop-2.5.0.tar.gz
tar -zxvf hadoop-2.5.0.tar.gz && rm -f hadoop-2.5.0.tar.gz && cd

#install s3tools under ~/tools/
cd $TOOLS_HOME 
wget https://github.com/s3tools/s3cmd/archive/master.zip
unzip master.zip
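The zip only unpacks the sources; to put the s3cmd command on the PATH you can install it in place. A sketch (the s3cmd-master folder name comes from the GitHub zip):
cd s3cmd-master && sudo python setup.py install && cd -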

########################
#step5. install bioinformatics applications
########################

#BWA
cd $APP
wget http://downloads.sourceforge.net/project/bio-bwa/bwa-0.7.9a.tar.bz2
tar -jxvf bwa-0.7.9a.tar.bz2 && rm -fr bwa-0.7.9a.tar.bz2
cd bwa-0.7.9a && make && cd $APP
mkdir -p $APP/bwa/0.7.9a
find bwa-0.7.9a -executable -type f -print0 | xargs -0 -I {} mv {} $APP/bwa/0.7.9a/
rm -fr bwa-0.7.9a

#bowtie1
cd ${APP}
wget http://downloads.sourceforge.net/project/bowtie-bio/bowtie/1.0.1/bowtie-1.0.1-linux-x86_64.zip
unzip bowtie-1.0.1-linux-x86_64.zip && rm bowtie-1.0.1-linux-x86_64.zip
mkdir -p ${APP}/bowtie/1.0.1 && mv bowtie-1.0.1/* ${APP}/bowtie/1.0.1/ && rm -fr bowtie-1.0.1/

#bowtie2
cd ${APP}
wget http://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.2.3/bowtie2-2.2.3-linux-x86_64.zip
unzip bowtie2-2.2.3-linux-x86_64.zip && rm bowtie2-2.2.3-linux-x86_64.zip
mkdir -p ${APP}/bowtie/2.2.3 && mv bowtie2-2.2.3/* ${APP}/bowtie/2.2.3/ && rm -fr bowtie2-2.2.3/

#SNAP
cd ${APP}
curl http://snap.cs.berkeley.edu/downloads/snap-1.0beta.10-linux.tar.gz | tar xvz
mkdir -p ${APP}/snap/1.0beta.10/
mv snap-1.0beta.10-linux/* ${APP}/snap/1.0beta.10/ && rm -fr snap-1.0beta.10-linux

#GSNAP
cd ${APP}
curl http://research-pub.gene.com/gmap/src/gmap-gsnap-2014-06-10.tar.gz | tar xvz
cd gmap-2014-06-10 && ./configure --prefix=${APP}/gmap/2014-06-10/ && make && make install
rm -fr gmap-2014-06-10

#STAR
cd ${APP}
curl https://rna-star.googlecode.com/files/STAR_2.3.0e.Linux_x86_64.tgz | tar xvz
mkdir -p ${APP}/star/2.3.0e/
mv STAR_2.3.0e.Linux_x86_64/* ${APP}/star/2.3.0e/ && rm -fr STAR_2.3.0e.Linux_x86_64 

#Tophat2
cd ${APP}
curl http://ccb.jhu.edu/software/tophat/downloads/tophat-2.0.11.Linux_x86_64.tar.gz | tar xvz 
mkdir -p ${APP}/tophat/2.0.11 && mv tophat-2.0.11.Linux_x86_64/* ${APP}/tophat/2.0.11/ && rm -fr tophat-2.0.11.Linux_x86_64

#cufflinks
cd ${APP}
curl http://cufflinks.cbcb.umd.edu/downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz | tar xvz 
mkdir -p ${APP}/cufflinks/2.2.1 && mv cufflinks-2.2.1.Linux_x86_64/* ${APP}/cufflinks/2.2.1/ && rm -fr cufflinks-2.2.1.Linux_x86_64

#HTSeq
cd ${APP}
sudo apt-get install -y python-pip
sudo pip install numpy
sudo pip install scipy
curl https://pypi.python.org/packages/source/H/HTSeq/HTSeq-0.6.1p1.tar.gz | tar xvz 
cd HTSeq-0.6.1p1 &&  python setup.py build && sudo python setup.py install && cd -
sudo rm -fr HTSeq-0.6.1p1

#samtools
cd ${APP}
wget http://downloads.sourceforge.net/project/samtools/samtools/0.1.19/samtools-0.1.19.tar.bz2
tar -jxvf samtools-0.1.19.tar.bz2 && cd samtools-0.1.19/ && make && cd
mkdir -p $APP/samtools/0.1.19/
find  samtools-0.1.19 -executable -type f -print0 | xargs -0 -I {} mv {} $APP/samtools/0.1.19/
rm -fr samtools-0.1.19*

#picard
cd ${APP}
wget http://downloads.sourceforge.net/project/picard/picard-tools/1.114/picard-tools-1.114.zip
unzip picard-tools-1.114.zip
mkdir -p ${APP}/picard/1.114/ && mv picard-tools-1.114/* ${APP}/picard/1.114/
rm -fr picard-tools-1.114 picard-tools-1.114.zip


#bamtools
cd $APP
git clone https://github.com/pezmaster31/bamtools.git
cd $APP/bamtools && mkdir build && cd build && cmake .. && make
mkdir -p $APP/bamtools/2.3.0/ && mv $APP/bamtools/* $APP/bamtools/2.3.0/


########################
#step6. install bioinformatics annotation files
########################

#hg19
mkdir -p ${DATA}/fasta/hg19/ && cd ${DATA}/fasta/hg19/
for i in {1..22} X Y M; do wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${i}.fa.gz; done
gunzip *.gz
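The whole-genome index builders below (bwa, bowtie, SNAP, ...) expect a single hg19.fa, so concatenate the per-chromosome files first. A sketch (the chromosome order is my assumption):
cd ${DATA}/fasta/hg19/
cat chr{1..22}.fa chrX.fa chrY.fa chrM.fa > hg19.fa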

#create index and dict for each chromosome files 
for i in *.fa
do
j=$(echo $i | cut -d"." -f1)
echo $j
java -jar ${APP}/picard/1.114/CreateSequenceDictionary.jar R=$j.fa O=$j.dict
${APP}/samtools/0.1.19/samtools faidx $j.fa
done

#build hg19 genome index for BWA
mkdir -p $DATA/index/hg19/bwa/
${APP}/bwa/0.7.9a/bwa index -p ${DATA}/index/hg19/bwa/hg19 ${DATA}/fasta/hg19/hg19.fa

#build hg19 genome index for novoalign
mkdir -p ${DATA}/index/hg19/novoalign
${APP}/novocraft/3.02.05/novoindex -k 14 -s 1 ${DATA}/index/hg19/novoalign/hg19.nix ${DATA}/fasta/hg19/hg19.fa

#build hg19 genome index for bowtie1
mkdir -p $DATA/index/hg19/bowtie1/
$APP/bowtie/1.0.1/bowtie-build $DATA/fasta/hg19/hg19.fa $DATA/index/hg19/bowtie1/hg19
cp $DATA/fasta/hg19/hg19.fa $DATA/index/hg19/bowtie1/

#build hg19 genome index for bowtie2
mkdir -p $DATA/index/hg19/bowtie2/
$APP/bowtie/2.2.3/bowtie2-build $DATA/fasta/hg19/hg19.fa $DATA/index/hg19/bowtie2/hg19
cp $DATA/fasta/hg19/hg19.fa $DATA/index/hg19/bowtie2/

#build hg19 genome index for SNAP -require at least 64 GB of memory
mkdir -p ${DATA}/index/snap/hg19
sudo sysctl vm.overcommit_memory=1
${APP}/snap/1.0beta.10/snap index ${DATA}/fasta/hg19/hg19.fa ${DATA}/index/snap/hg19
#{APP}/snap/1.0beta.10/snap paired ${DATA}/index/snap/hg19

#TODO: build hg19 genome index for GSNAP

#TODO: build hg19 genome index for STAR

#dbsnp
mkdir -p ${DATA}/misc/ && cd ${DATA}/misc/
curl ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All.vcf.gz | gunzip -c > dbsnp_hg19.vcf
for chromosome in {1..22} X Y M; 
do
awk -v c="$chromosome" '/^#/{print $0;next} $1~"^chr"c {print $0}' dbsnp_hg19.vcf > dbsnp_hg19.chr${chromosome}.vcf
done

Extend /home partition under VMWare+Ubuntu

Say I want to add 512 GB more space to an existing Ubuntu VM hosted by VMware Workstation.


0. Shut down your Ubuntu system, then switch to VMware Workstation.
Right-click the machine -> "Settings" -> "Hard Disk" in the left panel -> click "Utilities" in the right panel -> "Expand" -> enter the new total space. Since my system already has 512 GB and I want to add 512 GB, the new total is "1024".

1. Check the existing disk partitions. Write down the biggest number in the End column; here it is 3221225471.

$sudo fdisk -l

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      499711      248832   83  Linux
/dev/sda2          501758  2147481599  1073489921    5  Extended
/dev/sda3      2147481600  3221225471   536871936   83  Linux
/dev/sda5          501760  2147481599  1073489920   8e  Linux LVM



2. Add new partition

$sudo fdisk /dev/sda

press "n" to create new partition, when when prompting "Partition type" , enter "p".

when prompting First sector, enter 3221225472 (3221225471 + 1) 

when prompting Last sector, accept defaults (use all remaining space)

then press "w" to write to disk.

3. reboot 

$sudo reboot

4. Run "sudo fdisk -l" again. This time you should see a new partition  "/dev/sda4"

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      499711      248832   83  Linux
/dev/sda2          501758  2147481599  1073489921    5  Extended
/dev/sda3      2147481600  3221225471   536871936   83  Linux
/dev/sda4      3221225472  4294967294   536870911+  83  Linux

/dev/sda5          501760  2147481599  1073489920   8e  Linux LVM


5. Find a folder under "/dev/" whose name ends with "-vg". It is named "<hostname>-vg", e.g. "master-vg" if your hostname is master (the examples below use "cloud-vg").


6. Extend the volume group

$sudo vgextend cloud-vg /dev/sda4

You should see screen outputs like this:
  No physical volume label read from /dev/sda4
  Physical volume "/dev/sda4" successfully created

  Volume group "cloud-vg" successfully extended


7. Extend the logical volume

#add 512G

$sudo lvextend -L+512G /dev/cloud-vg/root

#or add all free space

$sudo lvextend -l+100%FREE /dev/cloud-vg/root

You should see screen outputs like this:

  Extending logical volume root to 1.98 TiB
  Logical volume root successfully resized


8. Resize the filesystem
$sudo resize2fs /dev/cloud-vg/root


9. Check the disk usage now

$df -h

MongoDB start and stop

#MongoDB

#create
mkdir ~/mongo && mkdir -p ~/log/mongodb

#start
mongod --fork --dbpath ~/mongo --logpath ~/log/mongodb/main.log


#shutdown
mongod --shutdown --dbpath ~/mongo
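To verify the server came up before shutting it down, a quick check with the mongo shell (default port assumed):

#ping the server
mongo --eval 'db.runCommand({ ping: 1 })'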

Tuesday, August 19, 2014

bamsplitter

#!/bin/bash

#split the BAM file by chromosome on NameNode's local disk
pathSamtools=$1 #the full path to samtools, e.g. /X/Y/Z/samtools
bamInputFile=$2 #input BAM file name, e.g. /A/B/C/d.bam
bamOutputDir=$3 #output folder name for the split BAM files, e.g. "E"; the full path will be /A/B/C/E/


#bamInputFile="/A/B/C/d.bam"

#/A/B/C/
bamOutputPath=${bamInputFile%/*}"/"${bamOutputDir}

#"d.bam"
bamFileName=$(basename "$bamInputFile")

#"d"
bamFileNameNoExt=${bamFileName%%.*}

#"/A/B/C/d.bam.bai"
bamFileIndex=${bamInputFile}."bai"

#"/A/B/C/d"
bamFilePathNoExt=${bamInputFile%%.*}

#"bam"
#bamFileNameExt="${bam##*.}"

bam=${bamInputFile}

numCPU=$(grep -c ^processor /proc/cpuinfo)

sorted=$(${pathSamtools} view -H ${bamInputFile} | grep SO:coordinate)
if [ -z "$sorted" ];then
        #unsorted: sort first
        ${pathSamtools} sort ${bamInputFile} ${bamFilePathNoExt}".sort"
        bam=${bamFilePathNoExt}".sort.bam"
        #create BAM index
        ${pathSamtools} index ${bam}
else #sorted
        if [ ! -f "${bamFileIndex}" ];then #create BAM index if it does not exist
                ${pathSamtools} index ${bam}
        fi
fi

#create output folder if not exist
if [ ! -d "${bamOutputPath}" ]; then
        mkdir -p ${bamOutputPath}
fi

#split the BAM by chromosome
for i in `${pathSamtools} view -H ${bam} | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
do
${pathSamtools} view -@ ${numCPU} -h -F 0x4 -q 10 ${bam} $i  | \
awk '{if(/^@SQ/ && !/^@SQ\tSN:'$i'/) next} {if(/^@/) print $0}{if ($3~/^'$i'/) print $0}' | \
${pathSamtools} view -@ ${numCPU} -hbS - > ${bamOutputPath}/${bamFileNameNoExt}.$i.bam 2>/dev/null
done
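A usage sketch, assuming the script is saved as "bamsplitter.sh" (the output folder name "splitted" is just an example):

chmod +x bamsplitter.sh
./bamsplitter.sh /home/hadoop/app/samtools/0.1.19/samtools /A/B/C/d.bam splitted
#the per-chromosome files end up as /A/B/C/splitted/d.chr1.bam, /A/B/C/splitted/d.chr2.bam, ...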

Friday, August 1, 2014

Size of container

In Hadoop YARN, a container is a sub-unit of a physical DataNode. The size of a container greatly affects the performance of MapReduce, especially when the application itself supports multiple threads.

Let us say we have one DataNode with 4 cores and 8 GB of memory, and we want to run BWA with input "A_1.fastq". What are the options?

1) 1 container per DataNode. This container has all 4 cores and 6.4 GB memory (we do not want to starve the host DataNode). So we have only one BWA process running like "bwa mem -t 4 ... A_1.fastq" with 6.4GB available memory per BWA process. 

2) 4 containers, each with 1 core and 1.6 GB of memory. So we have to split "A_1.fastq" into "A_1_1.fastq" ... "A_4_1.fastq", then start 4 parallel BWA processes such as "bwa mem -t 1 ... A_1_1.fastq" and "bwa mem -t 1 ... A_2_1.fastq", etc., with 1.6 GB of available memory per process.

Finally, we have to merge the resulting SAM files. Since our goal is to optimize the execution, the question is: "Which one is faster?"

Before jumping to the answer, now we have to consider: 

1) Less available memory means the input FASTQ files must be smaller, otherwise the process will fail.

2) The overhead. Splitting and merging add some time to the overall running time, and every BWA process has to load the genome index into memory before mapping the reads.

Since BWA itself supports multiple threads, it seems like the best choice is option (1) - one container per DataNode. Is it the best solution? No. Why? Because the ApplicationMaster itself will occupy one container.

Assume we have 5 DataNodes and each DataNode has one container. When we start a YARN-based MapReduce application, the ResourceManager finds a container to host the ApplicationMaster, and the ApplicationMaster subsequently starts the computing containers. The ApplicationMaster itself occupies a full container, so we only have 4 computing containers left. That is a waste of computing resources, since we know the ApplicationMaster does not need that many resources (4 cores and 6.4 GB of memory). The figure below shows the node in the red box running the ApplicationMaster without doing any of the "real" computation.







If we "ssh" into that "ApplicationMaster" node, we can see it is running a process named "MRAppMaster".














In other words, we wasted 20% of the computing resources. That is not a problem if you are running a 100-node cluster; in that case only 1% of the resources are "wasted". However, we do not need a big boss if we are a small team.

What about two containers per DataNode? Then we would have 2*5 = 10 containers in total, with 10% of them wasted. But we run into the multiple-container problem again - the overhead...

This is just one of the tradeoff, or balancing, problems that we encounter here and there.
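For reference, the per-node container budget that drives this tradeoff is bounded by a few properties in yarn-site.xml. A sketch for the 4-core / 8 GB DataNode above; the exact values are illustrative, not a recommendation:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>6554</value> <!-- roughly 6.4 GB handed to YARN on this node -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>6554</value> <!-- option (1): allow one big container -->
</property>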


Monday, July 28, 2014

Split the BAM file by chromosome then process it with GATK's UnifiedGenotyper

To parallelize the execution of genotyping, one possible solution is to split the BAM files by chromosome (the safe method). For example, we have one big BAM file named "A.bam" and we want to use GATK's UnifiedGenotyper to call variants on it. The command may look like this:

time java -jar /home/hadoop/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/data/fasta/hg19/hg19.fa -I A.bam -o A.vcf
It may take hours for the above command to complete.

The 1st improvement

As an improvement, we can split "A.bam" by chromosome, then process the split BAM files in parallel:


for i in `samtools view -H A.bam | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
do
    samtools view -h -F 0x4 -q 10 A.bam $i | samtools view -hbS - > A.$i.bam
done
Now we have generated 20+ files named "A.chr1.bam", "A.chr2.bam", ... Now call GATK's UnifiedGenotyper:

time java -jar /home/hadoop/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/data/fasta/hg19/hg19.fa -I A.chr1.bam -o A.chr1.vcf

real    32m
It takes 32 minutes to complete. If you look at the output of the above command, you may find that UnifiedGenotyper does not stop after iterating over "chr1". Since A.chr1.bam only contains alignments on chr1, it makes no sense to waste time on the other chromosomes. Let us make another improvement.

The 2nd improvement

This time we split the reference FASTA file as well. Suppose we have "hg19.chr1.fa", "hg19.chr2.fa", ... under a folder.

for i in *.fa  
do  
   j=$(echo $i | cut -d"." -f1)  
   echo $j  
   java -jar ~/app/picard/1.114/CreateSequenceDictionary.jar R=$j.fa O=$j.dict  
   ~/app/samtools/0.1.19/samtools faidx $j.fa  
done  
Now call GATK's UnifiedGenotyper:


time java -jar /home/hadoop/bio/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/bio/data/fasta/hg19/hg19.chr1.fa -I A.chr1.bam -o A1.vcf

Let us see how much time it uses ... Again, UnifiedGenotyper does not stop after iterating over "chr1"; it continues to "chr2", then throws an error message and terminates. What is wrong here? The header of the split BAM files. Let us have a look at the header of "A.bam":


$samtools view -h A.bam | less

@HD     VN:1.3  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:162008842
...
@RG     ID:HELLO        LB:L01  PL:illumina     PU:barcode      SM:001
@PG     ID:bwa  PN:bwa  VN:0.7.9a-r786 
...

let us have a look at the header of "A.chr1.bam"


$samtools view -h A.chr1.bam | less

@HD     VN:1.3  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:162008842
...
@RG     ID:HELLO        LB:L01  PL:illumina     PU:barcode      SM:001
@PG     ID:bwa  PN:bwa  VN:0.7.9a-r786 
...

They are the same. Since "A.chr1.bam" only has the mappings on chr1, we do not need information on the other chromosomes, such as "@SQ SN:chr2 LN:243199373". Unfortunately, neither "samtools view" nor "bamtools split" can rewrite the header directly, so we have to do it manually. Here is a script to do so:


#!/bin/bash
#usage: bam_split_by_chr.sh <path to samtools> <input BAM> <output path>

SAMTOOLS=$1
BAM_INPUT=$2
PATH_OUTPUT=$3
BAMTOOLS=~/bio/app/bamtools/bin/bamtools/2.3.0/bamtools
#create the target directory
mkdir -p ${PATH_OUTPUT}

if [ -f $BAMTOOLS ];
then

 cd ${PATH_OUTPUT} && ${BAMTOOLS} split -in ${BAM_INPUT} -refPrefix "" -reference
 for i in *.chr*bam 
 do 
  sampleName=${i%.bam}
  chrName=${sampleName#*.}
  ${SAMTOOLS} view -h -F 0x4 -q 10 ${i} | awk '{if(/^@SQ/ && !/^@SQ\tSN:'$chrName'/) next} {if(/^@/) print $0}{if ($3~/^'$chrName'/) print $0}' | ${SAMTOOLS} view -hbS - > ${PATH_OUTPUT}/${sampleName}.tmp
  mv ${PATH_OUTPUT}/${sampleName}.tmp ${PATH_OUTPUT}/${sampleName}.bam
 done
else
 fileName=$(basename "$BAM_INPUT")
 #extension="${fileName##*.}"
 sampleName="${fileName%.*}"
 #split the bam
 for i in `${SAMTOOLS} view -H ${BAM_INPUT} | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
 do
  ${SAMTOOLS} view -h -F 0x4 -q 10 ${BAM_INPUT} $i | awk '{if(/^@SQ/ && !/^@SQ\tSN:'$i'/) next} {if(/^@/) print $0}{if ($3~/^'$i'/) print $0}' | ${SAMTOOLS} view -hbS - > ${PATH_OUTPUT}/${sampleName}.$i.tmp
  mv ${PATH_OUTPUT}/${sampleName}.$i.tmp ${PATH_OUTPUT}/${sampleName}.$i.bam
 done
fi

Save the above script as "bam_split_by_chr.sh", then split:


$chmod +x bam_split_by_chr.sh

$./bam_split_by_chr.sh /home/hadoop/samtools /home/hadoop/A.bam /home/hadoop/output

Now under /home/hadoop/output you will see "A.chr1.bam":


$samtools view -h A.chr1.bam | less
@HD     VN:1.3  SO:coordinate
@SQ     SN:chr1 LN:249250621
@RG     ID:HELLO        LB:L01  PL:illumina     PU:barcode      SM:001
@PG     ID:bwa  PN:bwa  VN:0.7.9a-r786 
...

OK, with this cleaned BAM file let us run UnifiedGenotyper again.


time java -jar /home/hadoop/bio/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R /home/hadoop/bio/data/fasta/hg19/hg19.chr1.fa -I A.chr1.bam -o A1.vcf

real:   7m

Now it only takes 7 minutes to process "A.chr1.bam". Finally, as a side effect, you will not be able to merge the split BAM files back into the original "A.bam" using "samtools merge":


$samtools merge A.bam A.chr*.bam
[bam_merge_core] different target sequence name: 'chr1' != 'chr2' in file 'A.chr2.bam'

It does not matter, anyway. We do not need to do that.
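Putting the pieces together, here is a sketch of running the per-chromosome jobs in parallel. It naively launches all of them at once, so make sure the machine has enough memory; paths follow the examples above:

for i in /home/hadoop/output/A.chr*.bam
do
    chr=$(basename $i .bam); chr=${chr##*.}
    java -jar /home/hadoop/bio/app/gatk/3.1-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
        -R /home/hadoop/bio/data/fasta/hg19/hg19.${chr}.fa -I $i -o A.${chr}.vcf &
done
wait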

Thursday, July 24, 2014

Counting lines in "fastq.gz"

Before splitting the "fastq.gz" into fragments, we need to count how many lines are in the raw file; then we can calculate how many lines go into each fragment.

If the raw file is quite big, the counting itself will take a long time because the file has to be decompressed first.

The common approach is using "zcat" with "wc". Is it the fastest?



#1. use zcat - 9.381s 
$time zcat A_R1.fq.gz | wc -l
 20000000

 real    0m9.381s
 user    0m7.423s
 sys     0m1.919s


#2. use zgrep - 16.258s
 $time zgrep -Ec "$" A_R1.fq.gz
 20000000

 real    0m16.258s
 user    0m13.109s
 sys     0m0.490s


#3. use pigz - 7.227s
 $sudo apt-get install pigz
 $time pigz -d -p 1 -c A_R1.fq.gz  | wc -l
 20000000

 real    0m7.227s
 user    0m5.544s
 sys     0m0.386s

#4. use pigz with 4 threads - 5.973s
 $time pigz -d -p 4 -c A_R1.fq.gz  | wc -l
 20000000

 real    0m5.973s
 user    0m5.599s
 sys     0m2.200s

By default, pigz uses all available processors, so "-p" is not necessary. The above command can be simplified as:

 $pigz -dc A_R1.fq.gz  | wc -l


Clearly, our winner is pigz.

I am going to write another blog on splitting the "fastq.gz" file with only one goal - as fast as possible. 

 1. Count lines
 2. Determine how many lines go into each split fastq file (see the sketch below)
 3. Unzip and split the paired "fastq.gz" files
 4. Zip the split "fastq" files again.
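As a preview of step 2, the line count per split file should stay a multiple of 4 so that no FASTQ record is cut in half. A sketch (the number of fragments is an assumption):

TOTAL=$(pigz -dc A_R1.fq.gz | wc -l)   #20000000 in the example above
N=8                                    #assumption: how many fragments we want
READS=$(( TOTAL / 4 ))
LINES_PER_FRAGMENT=$(( (READS + N - 1) / N * 4 ))
echo ${LINES_PER_FRAGMENT}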


Reference:
http://superuser.com/questions/135329/count-lines-in-a-compressed-file



Build hadoop-2.4.1-src

1. Download the Hadoop source:
http://apache.cs.utah.edu/hadoop/common/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz

2. Download protobuf 2.5.0 (required by the Hadoop build), then build and install it:
http://protobuf.googlecode.com/files/protobuf-2.5.0.tar.bz2

./configure && make && sudo make install && sudo ldconfig

3. Build Hadoop from the unpacked source directory:
mvn compile && mvn package
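"mvn package" alone does not produce a deployable tarball; the BUILDING.txt shipped with the source suggests something along these lines (a sketch, run from the hadoop-2.4.1-src directory):

mvn package -Pdist -DskipTests -Dtar
#add -Pnative as well if you want the native libraries compiled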


Monday, June 9, 2014

VirtualBox 4.3.12 configuration

File->"Preferences"->Click "Network" in left panel -> Add a new NAT network

Name -> "MyNAT"
CIDR: 192.168.0.0/24
Click the "Port forwarding" add port "22" in Ipv4.
Name "SSH"
Protocol "TCP"
Host IP "10.2.2.72" (run "cmd"->"ipconfig" to get it)
Host Port "22"
Guest IP "192.168.0.4" (The IP of the primary virtual machine)
Guest Port "22"


For each virtual machine, in "Settings"->"network"->"Adapter1"->"NAT Network" choose "MyNAT"
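The same NAT network can probably be created from the command line with VBoxManage; the syntax below is from memory and should be treated as a sketch (the GUI steps above are the tested path):

VBoxManage natnetwork add --netname MyNAT --network "192.168.0.0/24" --enable
VBoxManage natnetwork modify --netname MyNAT --port-forward-4 "SSH:tcp:[]:22:[192.168.0.4]:22"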

Wednesday, May 21, 2014

Windows+VirtualBox+Ubuntu+Xen+Ubuntu


What we have:

    Native machine installed with Windows 7.

Now I want to play with some virtual machines:

    1. Install VirtualBox into Windows 7, as a hypervisor
    2. Install Ubuntu(1) into VirtualBox, as a guest OS
    3. Install Xen into Ubuntu(1), as a nested hypervisor
    4. Install Ubuntu(2) into Xen, as a guest OS

Versions:
    VirtualBox 4.3.10 r93012
    Ubuntu 14.04 Server 
    Xen 4.4


It is quite easy to install Ubuntu into VirtualBox (steps 1 and 2). One thing to remember: do not assign all of the allocated space to Ubuntu(1). Suppose we decide to give 100 GB of disk space to the VMs inside VirtualBox; then Ubuntu(1), as the dom0 OS, should use as little space as possible, say 2 GB. The remaining 98 GB of disk space should be reserved for one or more Ubuntu(2) guest OSes. Otherwise Xen will hang when trying to "bring up CPUs".


#1. get the Xen package
$sudo apt-get install xen-hypervisor-4.4-amd64

#2. Set Xen as the default loading OS
$sudo sed -i 's/GRUB_DEFAULT=.*\+/GRUB_DEFAULT="Xen 4.4-amd64"/' /etc/default/grub
sudo update-grub

#3. use xl as toolstack
$sudo sed -i 's/TOOLSTACK=.*\+/TOOLSTACK="xl"/' /etc/default/xen
$sudo reboot

$sudo xl list

#4. Create primary partition on remaining disk space
sudo fdisk /dev/sda

... /dev/sda3

#5. restart 

sudo reboot

#6. create physical storage

sudo pvcreate /dev/sda3

#7. create volume group "xen-vg" on /dev/sda3

sudo vgcreate xen-vg /dev/sda3

#8. take a look at PVs. 

sudo pvs

  PV         VG     Fmt  Attr PSize  PFree

  /dev/sda3  xen-vg lvm2 a--  51.62g 51.62g

#9. create logical volume "ubuntu" guest OS with 10G disk space
sudo lvcreate -L 10G -n ubuntu /dev/xen-vg


#10. Tell it where to get the image - Ubuntu 14.10 (a.k.a. utopic)
sudo mkdir -p /var/lib/xen/images/ubuntu-netboot
$cd /var/lib/xen/images/ubuntu-netboot
$sudo wget http://ubuntu.cs.utah.edu/ubuntu/dists/utopic/main/installer-amd64/current/images/netboot/xen/initrd.gz
$sudo wget http://ubuntu.cs.utah.edu/ubuntu/dists/utopic/main/installer-amd64/current/images/netboot/xen/vmlinuz

#11. Configuration
$sudo vim /etc/xen/ubuntu.cfg

name = "ubuntu"
memory = 512
disk = ['phy:/dev/xen-vg/ubuntu,xvda,w']
vif = [' ']
kernel = "/var/lib/xen/images/ubuntu-netboot/vmlinuz"
ramdisk = "/var/lib/xen/images/ubuntu-netboot/initrd.gz"
extra = "debian-installer/exit/always_halt=true -- console=hvc0"

#12. Start installing the Ubuntu(2) 
sudo xl create /etc/xen/ubuntu.cfg -c  
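Once the installer finishes and the guest has rebooted, you can reattach to its console at any time (and detach again with Ctrl+]):

sudo xl console ubuntu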




Reference 

https://help.ubuntu.com/community/Xen

Wednesday, April 23, 2014

LocalCache in hadoop MRv2 aka YARN

The old DistributedCache is deprecated in the new API of hadoop 2.2.3. Now the suggested method is Job.addCacheFile.

By default, cached files are added to a special folder on each slave node.

Assume we have two slave nodes, namely "n1" and "n2". In the configuration file "yarn-site.xml" under /HADOOP_PATH/etc/hadoop/, you will find a property like this:

<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/home/hadoop/localdirs</value>
</property>


Here "/home/hadoop/localdirs" is the home path for all cached files. 

>ls /home/hadoop/localdirs
filecache  nmPrivate  usercache


If we add cache files per application, the cache files will be put under "usercache".





//conf is the job Configuration; fs is the HDFS FileSystem (e.g. FileSystem.get(conf))
Job job = Job.getInstance(conf, "HELLO");
FileStatus[] fileStatus = fs.listStatus(new Path("/user/hadoop/data/"));
for (FileStatus f : fileStatus)
{
        job.addCacheFile(new URI(f.getPath().toUri().toString() + "#_" + f.getPath().getName()));
}



Now switch to machine n1 or n2 and have a look at the local cache directory; you will find the cached file somewhere like this:

>find /hadoop/localdirs/ -name "hg19.fa"

/hadoop/localdirs/usercache/hadoop/appcache/application_1398026166795_0007/container_1398026166795_0007_01_000010/hg19.fa

Using the cache files inside the MR application is the interesting part. Our Mapper looks like this:





public static class MyMapper extends Mapper<Text, Text, Text, Text>
{
        private List<Path> cacheFiles;

        @Override
        protected void setup(Context context) throws IOException
        {
                /*
                URI[] uris = context.getCacheFiles();
                for (URI u : uris)
                {
                        System.out.println("CACHED:" + u.getPath());
                }
                */
                Path[] uris = context.getLocalCacheFiles();
                for (Path u : uris)
                {
                        System.out.println("CACHED:" + u.toString());
                }
        }

        //static Log LOG = LogFactory.getLog(MyMapper.class);
        @Override
        public void map(Text keySampleName, Text valueSplitFileName, Context context) throws IOException, InterruptedException
        {
                //use the cached files here.
                //context.write(keySampleName, valueSplitFileName);
        }
}



The new API to retrieve cached files is "context.getCacheFiles()", while "context.getLocalCacheFiles()" is deprecated. However, context.getCacheFiles() returns a URI array in which each element is an HDFS path (hdfs://master:50011/user/hadoop/...), while context.getLocalCacheFiles() returns a Path array in which each element is a local path (/hadoop/localdirs/usercache/...).

Besides, each container on every slave node will have a copy of all cached files, which means that if slave node "n1" runs 10 containers, you get 10 copies of the cached files on n1.