Thursday, October 10, 2013

Implement a Hadoop pipeline for NGS data analysis - 1


Set up a Hadoop cluster

1. Software


Windows 8 + VMware Workstation 9.0 + Ubuntu-13.04-x64-server + hadoop-2.2.0


Name the first Ubuntu installation "master"; this will be the master server (namenode + secondary namenode).

Then snapshot/clone 3 images from "master" and name them "n1", "n2" and "n3". These 3 images will be the slave nodes.


2. Hadoop configuration


There are too many steps to list one by one here; Google is your friend. Do not forget to change /etc/hosts and /etc/hostname on n1, n2 and n3, and to set up passwordless SSH from the master to the slaves (a sketch follows below). Start Hadoop and open a web browser to see if it works: http://192.168.1.2:8088/cluster (192.168.1.2 is the IP of my master server).
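
One step worth spelling out is passwordless SSH, since start-dfs.sh and start-yarn.sh log into every slave over SSH. A minimal sketch, assuming the same "hadoop" user exists on all four machines:

  #on the master, as the hadoop user
  ssh-keygen -t rsa -P ""
  ssh-copy-id hadoop@n1
  ssh-copy-id hadoop@n2
  ssh-copy-id hadoop@n3

  #verify: this should log in without a password prompt
  ssh n1 hostname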

Back in the web UI, you should be able to see 3 active nodes in the Cluster Metrics table.
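
You can also check from the command line with the stock Hadoop tools:

  #should list n1, n2 and n3 in RUNNING state
  yarn node -list

  #should report 3 live datanodes
  hdfs dfsadmin -report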


Get all necessary apps and data


1. Install the BWA package



  #get the code
  git clone https://github.com/lh3/bwa.git
  cd bwa

  #install two required packages
  sudo apt-get install gcc
  sudo apt-get install zlib1g-dev

  #compile
  make

  #delete other files, only keep the executable bwa
  find . ! -name "bwa" -exec rm -fr {} \;
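
Running ./bwa with no arguments should print its usage message. The .bash_profile below adds /home/hadoop/tools/bwa to the PATH, so that is where I park the binary (adjust if you keep your tools elsewhere):

  #sanity check: prints the bwa usage message
  ./bwa

  #put the executable where the PATH expects it
  mkdir -p /home/hadoop/tools/bwa
  cp bwa /home/hadoop/tools/bwa/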


2. Download some reference genome data - for testing purposes, we don't need all of it


  wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz
  wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr2.fa.gz
  zcat *.gz > hg19.fasta
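
A quick check that the concatenation worked, listing the sequence headers in the combined FASTA:

  #should print exactly two headers: >chr1 and >chr2
  grep "^>" hg19.fasta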

3. Build genome index for BWA

  bwa index -p hg19 hg19.fasta
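
This takes a while on whole chromosomes. When it finishes, the -p prefix leaves five index files next to the FASTA:

  #expect hg19.amb, hg19.ann, hg19.bwt, hg19.pac and hg19.sa (plus hg19.fasta itself)
  ls hg19.*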

########################################


#/etc/hosts
192.168.221.128 prime
192.168.221.133 n1
192.168.221.139 n2
192.168.221.140 n3

#/etc/hostname
prime

sudo apt-get install vim
sudo apt-get install openssh-server
sudo apt-get install screen
sudo apt-get install openjdk-7-jdk 
sudo apt-get install eclipse-platform

wget http://apache.cs.utah.edu/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
tar -zxvf hadoop-2.2.0.tar.gz

#.bash_profile
alias hj="hadoop jar"
alias hf="hadoop fs"
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export DISPLAY=192.168.1.133:0.0
export HADOOP_HOME=/home/hadoop/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/hadoop/hadoop-2.2.0/lib/native
export EC2_HOME=$HOME/tools/ec2-api-tools-1.6.11.0
export PATH=$PATH:/home/hadoop/hadoop-2.2.0/bin
export PATH=$PATH:/home/hadoop/hadoop-2.2.0/sbin
export PATH=$PATH:/home/hadoop/tools/bwa:/home/hadoop/tools/snap-0.15.4-linux
export PATH=$PATH:$HOME/tools/ec2-api-tools-1.6.11.0/bin
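
#reload the profile and make sure the hadoop binary resolves
source ~/.bash_profile
#should print Hadoop 2.2.0
hadoop version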


#hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

#core-site.xml
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://prime:50011/</value>
        <final>true</final>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/dfstmp</value>
    </property>
</configuration>

#hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/dfsname</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/dfsdata</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>prime:50012</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>

#yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.log-dirs</name>
        <value>/home/hadoop/logs</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>prime:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>prime:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>prime:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>prime:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>prime:8088</value>
    </property>
</configuration>
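
Hadoop creates most of these local directories on demand, but it does not hurt to create the ones referenced above up front:

#on every node (master and slaves)
mkdir -p /home/hadoop/dfstmp /home/hadoop/dfsname /home/hadoop/dfsdata /home/hadoop/logs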



#slaves
n1
n2
n3
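
With the configs done, push them to the slaves, format HDFS once, and start the daemons from the master (this assumes the same hadoop-2.2.0 layout on every node):

#copy the configuration to every slave
for node in n1 n2 n3; do
    scp $HADOOP_CONF_DIR/* $node:$HADOOP_CONF_DIR/
done

#format HDFS (first time only), then start everything
hdfs namenode -format
start-dfs.sh
start-yarn.sh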
