tag:blogger.com,1999:blog-75158467434465812142024-03-18T02:48:09.139-07:00Next Generation Sequencing and Data AnalysisNotes and scratchesThe Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.comBlogger184125tag:blogger.com,1999:blog-7515846743446581214.post-75033233807924836322017-02-11T03:18:00.002-08:002017-02-11T04:20:37.724-08:00Create 5 brokers in 1 kafka server<span style="font-family: "courier new" , "courier" , monospace;">Why are we doing this:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">To test the performance of Kafka with multiple partitions and multiple consumers on a single topic.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<br />
<span style="font-family: "courier new" , "courier" , monospace;">What are we going to do:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">1. Create 5 brokers </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">2. Create 3 </span><span style="font-family: "courier new" , "courier" , monospace;">partitions for a topic</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">3. Create 2 producers that send messages to the topic, which distributes them evenly across the 3 partitions</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">4. Create 3 consumers, one for each partition, to consume messages in parallel, in order within each partition</span><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">What do we need:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">1. Ubuntu 14.04</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">2. <a href="http://apache.cs.utah.edu/kafka/0.10.1.1/kafka_2.11-0.10.1.1.tgz">kafka_2.11-0.10.1.1</a></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">#1. create configuration file</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">cd /PATH/kafka_2.11-0.10.1.1/config</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">for i in $(seq 1 5);</span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">do</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">cp server.properties server.properties${i};</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">#change id</span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">sed -i "s/broker.id=0/broker.id=${i}/g" server.properties${i};</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">#change port</span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">port=$(expr ${i} + 9092)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">sed -i "s/#listeners=PLAINTEXT:\/\/:9092/listeners=PLAINTEXT:\/\/:${port}/g" server.properties${i};</span><br />
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">#change log file</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">sed -i "s/log.dirs=\/tmp\/kafka-logs/log.dirs=\/tmp\/kafka-logs${i}/g" server.properties${i};</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">done</span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
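<span style="font-family: "courier new" , "courier" , monospace;">The loop above edits each copy in place and does not check that the substitutions matched. A minimal sketch of the same three changes as one function that also verifies the result; make_broker_config and its arguments are hypothetical names, and the template is assumed to contain the stock lines targeted by the sed commands above:</span><br />

```shell
# Sketch: derive one broker config from a template and verify the result.
# make_broker_config is a hypothetical helper, not part of Kafka.
make_broker_config() {
    local template="$1"   # e.g. config/server.properties
    local out="$2"        # e.g. config/server.properties1
    local id="$3"         # broker id; the port becomes 9092 + id, as in the loop above
    local port=$((9092 + id))
    # Using "|" as the sed delimiter avoids escaping the slashes in paths and URLs.
    sed -e "s|^broker.id=0|broker.id=${id}|" \
        -e "s|^#listeners=PLAINTEXT://:9092|listeners=PLAINTEXT://:${port}|" \
        -e "s|^log.dirs=/tmp/kafka-logs|log.dirs=/tmp/kafka-logs${id}|" \
        "$template" > "$out"
    # Return non-zero if any of the three substitutions did not apply.
    grep -q "^broker.id=${id}\$" "$out" &&
    grep -q "^listeners=PLAINTEXT://:${port}\$" "$out" &&
    grep -q "^log.dirs=/tmp/kafka-logs${id}\$" "$out"
}

# Usage: for i in $(seq 1 5); do make_broker_config server.properties server.properties${i} ${i}; done
```

<span style="font-family: "courier new" , "courier" , monospace;">A non-zero return value means the template did not look as expected, for example because the listeners line was already uncommented.</span><br />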
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">#2. create starting script "</span>kafka-start.sh"</span><br />
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<span style="font-family: "courier new" , "courier" , monospace;">####################################</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">#start zookeeper</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">zookeeper-server-start.sh /PATH/kafka_2.11-0.10.1.1/config/zookeeper.properties &</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"># start 5 kafka brokers </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">kafka-server-start.sh /PATH/kafka_2.11-0.10.1.1/config/server.properties1 &</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">kafka-server-start.sh /PATH/kafka_2.11-0.10.1.1/config/server.properties2 &</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">kafka-server-start.sh /PATH/kafka_2.11-0.10.1.1/config/server.properties3 &</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">kafka-server-start.sh /PATH/kafka_2.11-0.10.1.1/config/server.properties4 &</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">kafka-server-start.sh /PATH/kafka_2.11-0.10.1.1/config/server.properties5 &</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">####################################</span></div>
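<span style="font-family: "courier new" , "courier" , monospace;">The script starts ZooKeeper and all brokers in the background at the same moment, so a broker may try to connect before ZooKeeper is listening. A minimal sketch of a wait step, assuming bash and the ZooKeeper port 2181 used in step #4; retry_until and port_open are hypothetical helper names:</span><br />

```shell
# Sketch: retry a command once per second until it succeeds or the timeout expires.
# retry_until and port_open are hypothetical helpers.
retry_until() {
    local timeout="$1"; shift
    local waited=0
    until "$@"; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}

# True if something accepts TCP connections on host:port (relies on bash's /dev/tcp).
port_open() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Possible use inside kafka-start.sh (commented out so the sketch has no side effects):
# zookeeper-server-start.sh /PATH/kafka_2.11-0.10.1.1/config/zookeeper.properties &
# retry_until 30 port_open localhost 2181 || exit 1
# kafka-server-start.sh /PATH/kafka_2.11-0.10.1.1/config/server.properties1 &
```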
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">#3. Run the script above to launch the brokers, then type "jps". You should see 5 Kafka processes</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">$jps</span></div>
<span style="font-family: "courier new" , "courier" , monospace;">8700 Jps</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">4928 Kafka</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">4924 Kafka</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">4925 Kafka</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">4926 Kafka</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">4927 Kafka</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">4923 QuorumPeerMain</span><br />
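<span style="font-family: "courier new" , "courier" , monospace;">Counting the lines by eye gets tedious when the brokers are restarted often. A small sketch that counts the broker processes in jps output; count_kafka is a hypothetical helper name:</span><br />

```shell
# Sketch: count broker JVMs in jps output read from stdin.
# count_kafka is a hypothetical helper.
count_kafka() {
    # jps prints "PID MAINCLASS" per line; brokers show up with the class name "Kafka".
    awk '$2 == "Kafka" { n++ } END { print n + 0 }'
}

# Usage: jps | count_kafka
```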
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">#4. Create a topic "Hello" with 3 partitions</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic Hello</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">#4.1. Check out the partitions</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$kafka-topics.sh --describe --zookeeper localhost:2181 --topic Hello</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">Topic: Hello Partition: 0 ...</span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Topic: Hello Partition: 1 ... </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Topic: Hello Partition: 2 ...</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<br />
<span style="font-family: "courier new" , "courier" , monospace;">#5. Open 3 terminals, then start 3 consumers that consume messages only from partition 0, 1 and 2 respectively</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">kafka-console-consumer.sh --bootstrap-server localhost:9093 --new-consumer --partition 0 --topic Hello</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">kafka-console-consumer.sh --bootstrap-server localhost:9093 --new-consumer --partition 1 --topic Hello</span><br />
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">kafka-console-consumer.sh --bootstrap-server localhost:9093 --new-consumer --partition 2 --topic Hello</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">#6. Open 2 terminals to produce some messages</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$kafka-console-producer.sh --broker-list localhost:9093,localhost:9094,localhost:9095,localhost:9096,localhost:9097 --topic Hello</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">m1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">m2</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">m3</span><br />
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">$kafka-console-producer.sh --broker-list localhost:9093,localhost:9094,localhost:9095,localhost:9096,localhost:9097 --topic Hello</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">m4</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">m5</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">m6</span><br />
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<span style="font-family: "courier new" , "courier" , monospace;">#7. Switch back to the terminals from step #5. You will see that each consumer has consumed exactly 2 messages, for example:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">m1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">m4</span><span style="font-family: "courier new" , "courier" , monospace;"></span><br />
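<span style="font-family: "courier new" , "courier" , monospace;">Why do m1 and m4 end up in the same consumer? A toy model, assuming that messages without a key are spread round-robin over the partitions and that both producers happen to start at partition 0 (a real producer may start at any partition); round_robin is a hypothetical helper name:</span><br />

```shell
# Sketch: idealized round-robin assignment of messages to partitions.
# Assumption: every producer starts counting at partition 0.
round_robin() {
    local partitions="$1"; shift
    local i=0
    local msg
    for msg in "$@"; do
        echo "partition $((i % partitions)): $msg"
        i=$((i + 1))
    done
}

round_robin 3 m1 m2 m3   # first producer
round_robin 3 m4 m5 m6   # second producer
```

<span style="font-family: "courier new" , "courier" , monospace;">In this model m1 and m4 both land on partition 0, which matches what the first consumer shows.</span><br />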
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">#8. Let Kafka do the load balancing. To do so, we must assign all consumers to one group</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 8.1. create a file named "kafka.consumer.group" with one line:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> </span><span style="font-family: "courier new" , "courier" , monospace;">group.id=group1</span><span style="font-family: "courier new" , "courier" , monospace;"> </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 8.2 Launch kafka consumer with that file</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> $kafka-console-consumer.sh --bootstrap-server localhost:9093 --consumer.config kafka.consumer.group --new-consumer --topic Hello</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">#9. Now messages will be evenly distributed across all consumers without specifying a partition</span><br />
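<span style="font-family: "courier new" , "courier" , monospace;">The config file from step 8.1 can also be generated, which helps when trying several group names. A minimal sketch; write_group_config is a hypothetical helper name:</span><br />

```shell
# Sketch: create the one-line consumer config from step 8.1.
# write_group_config is a hypothetical helper.
write_group_config() {
    local file="$1"
    local group="$2"
    printf 'group.id=%s\n' "$group" > "$file"
}

# Usage, followed by the command from step 8.2 in each of the 3 terminals:
# write_group_config kafka.consumer.group group1
# kafka-console-consumer.sh --bootstrap-server localhost:9093 --consumer.config kafka.consumer.group --new-consumer --topic Hello
```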
The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com0tag:blogger.com,1999:blog-7515846743446581214.post-55888430312194482402016-12-01T13:10:00.005-08:002016-12-01T13:10:56.285-08:00Mount SSD as swap<span style="font-family: Courier New, Courier, monospace;"># use SSD as swap</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo parted /dev/xvdf</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>mklabel gpt</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>mkpart primary 0 100%</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo mkfs.ext4 /dev/xvdf1</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo mkdir -p ssd</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo mount /dev/xvdf1 ssd</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo fallocate -l 160G ssd/swapfile</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo chmod 600 ssd/swapfile</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo mkswap ssd/swapfile</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo swapon ssd/swapfile</span><br />
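<span style="font-family: Courier New, Courier, monospace;">The parted step above is interactive. A sketch of the same sequence as a single function with a dry-run switch, so the commands can be reviewed before anything touches the disk; setup_swap, run and DRY_RUN are hypothetical names, and the parted call uses script mode (-s) with 0% as the start, which is an assumption on my side and not what the interactive session above typed:</span><br />

```shell
# Sketch: the swap setup as one function. With DRY_RUN=1 (the default here)
# the commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_swap() {
    local dev="$1"    # whole disk, e.g. /dev/xvdf (its contents are destroyed)
    local mnt="$2"    # mount point, created if missing
    local size="$3"   # swap file size, e.g. 160G
    run sudo parted -s "$dev" mklabel gpt mkpart primary 0% 100%
    run sudo mkfs.ext4 "${dev}1"
    run sudo mkdir -p "$mnt"
    run sudo mount "${dev}1" "$mnt"
    run sudo fallocate -l "$size" "$mnt/swapfile"
    run sudo chmod 600 "$mnt/swapfile"
    run sudo mkswap "$mnt/swapfile"
    run sudo swapon "$mnt/swapfile"
}

setup_swap /dev/xvdf ssd 160G
```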
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#LVM</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo pvcreate /dev/xvdb1 /dev/xvdc1 </span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo vgcreate sysvg /dev/xvdb1 /dev/xvdc1 </span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo lvcreate -l 100%FREE -n syslv sysvg</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo lvdisplay</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#make file system</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo mkfs.ext4 /dev/sysvg/syslv</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo mkdir -p /ssd</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo mount /dev/sysvg/syslv /ssd</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com2tag:blogger.com,1999:blog-7515846743446581214.post-85022711639391177202016-08-10T11:17:00.001-07:002016-08-10T11:18:06.479-07:00install credentials for AWS CLI<span style="font-family: "courier new" , "courier" , monospace;">mkdir -p $HOME/.aws</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">aws_key_id=YOUR_KEY_ID</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">aws_secret_access_key=YOUR_SECRET_KEY</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">printf "[default]\nregion = us-west-2\n" > $HOME/.aws/config</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">printf "[default]\naws_access_key_id=${aws_key_id}\naws_secret_access_key=${aws_secret_access_key}\n" > $HOME/.aws/credentials</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">chmod 600 $HOME/.aws/config $HOME/.aws/credentials</span><br />
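<span style="font-family: "courier new" , "courier" , monospace;">In the commands above the files exist with default permissions until the final chmod runs, and the keys are pasted into the printf format string. A sketch that sets the umask first and passes the keys as printf arguments; write_aws_files is a hypothetical helper name, and the directory argument exists mainly so the function can be tried outside $HOME:</span><br />

```shell
# Sketch: write the AWS CLI files with restrictive permissions from the start.
# write_aws_files is a hypothetical helper.
write_aws_files() {
    local dir="$1"
    local key_id="$2"
    local secret="$3"
    local region="${4:-us-west-2}"
    (
        umask 077   # newly created files get mode 600, directories 700
        mkdir -p "$dir"
        # %s keeps "%" or "\" inside a key from being read as printf format syntax.
        printf '[default]\nregion = %s\n' "$region" > "$dir/config"
        printf '[default]\naws_access_key_id=%s\naws_secret_access_key=%s\n' \
            "$key_id" "$secret" > "$dir/credentials"
    )
}

# Usage: write_aws_files "$HOME/.aws" YOUR_KEY_ID YOUR_SECRET_KEY
```

<span style="font-family: "courier new" , "courier" , monospace;">The umask only affects files created by the function. Files that already exist keep their old mode, so the chmod above is still useful in that case.</span><br />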
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com1tag:blogger.com,1999:blog-7515846743446581214.post-9853576455979760472016-08-10T11:14:00.001-07:002016-08-10T11:14:07.048-07:00Create a dockerized bcl2fastq<span style="font-family: Courier New, Courier, monospace;">1. Create a working dir with name "bcl2fast_docker"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">>mkdir bcl2fast_docker</span><br />
<span style="font-family: Courier New, Courier, monospace;">>cd bcl2fast_docker</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">2. Download the bcl2fastq package from Illumina. The zip archive contains "bcl2fastq2-v2.17.1.14-Linux-x86_64.rpm".</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">>wget ftp://webdata2:webdata2@ussd-ftp.illumina.com/downloads/software/bcl2fastq/bcl2fastq2-v2.17.1.14-Linux-x86_64.zip</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">>unzip bcl2fastq2-v2.17.1.14-Linux-x86_64.zip</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">3. Create a wrapper named "launch.py" under "src/" to run bcl2fastq (the Dockerfile below copies src/* and calls launch.py)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">4. Create the Dockerfile</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">FROM centos:centos7</span><br />
<span style="font-family: Courier New, Courier, monospace;">MAINTAINER yourname@yourcompany.com</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">COPY src/* /opt/bcl2fastq/</span><br />
<span style="font-family: Courier New, Courier, monospace;">COPY bcl2fastq2-v2.17.1.14-Linux-x86_64.rpm .</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">RUN yum install -y epel-release \</span><br />
<span style="font-family: Courier New, Courier, monospace;"> && yum install -y python34 bcl2fastq2-v2.17.1.14-Linux-x86_64.rpm \</span><br />
<span style="font-family: Courier New, Courier, monospace;"> && rm bcl2fastq2-v2.17.1.14-Linux-x86_64.rpm \</span><br />
<span style="font-family: Courier New, Courier, monospace;"> && yum install -y pigz \</span><br />
<span style="font-family: Courier New, Courier, monospace;"> && chmod +x /opt/bcl2fastq/launch.py</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"># User will also need to mount input data at /data/input/&lt;flowcellname&gt;</span><br />
<span style="font-family: Courier New, Courier, monospace;">VOLUME /data/output</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />
<span style="font-family: Courier New, Courier, monospace;">ENTRYPOINT ["python", "/opt/bcl2fastq/launch.py"]</span><br />
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">5. Build the image with name "bcl2fastq"</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">>docker build -t bcl2fastq .</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">6. Run the image</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">>docker run --rm \</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">-v /path/to/raw_bcl_input:/data/input:ro \</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">-v /path/to/fastq_output:/data/output \</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">bcl2fastq:latest</span></div>
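<span style="font-family: Courier New, Courier, monospace;">With -v, Docker creates a missing host directory instead of failing, so a mistyped input path can lead to a run on an empty folder. A sketch that checks both directories and then prints the command from step 6; bcl2fastq_run_cmd is a hypothetical helper name, and it only prints the command without starting a container:</span><br />

```shell
# Sketch: validate the host directories, then print the docker run command.
# bcl2fastq_run_cmd is a hypothetical helper.
bcl2fastq_run_cmd() {
    local input="$1"
    local output="$2"
    [ -d "$input" ]  || { echo "missing input dir: $input" >&2; return 1; }
    [ -d "$output" ] || { echo "missing output dir: $output" >&2; return 1; }
    echo "docker run --rm -v $input:/data/input:ro -v $output:/data/output bcl2fastq:latest"
}

# Usage: bcl2fastq_run_cmd /path/to/raw_bcl_input /path/to/fastq_output
```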
The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com0tag:blogger.com,1999:blog-7515846743446581214.post-50612595861490820832015-10-02T22:28:00.001-07:002015-10-05T10:42:28.187-07:00Add and delete users in batch<span style="font-family: Courier New, Courier, monospace;">#add some users to group "research" with initial password "&lt;username&gt;123"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">g=research</span><br />
<span style="font-family: Courier New, Courier, monospace;">for u in t1 t2 t3</span><br />
<span style="font-family: Courier New, Courier, monospace;">do </span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo useradd -d /home/${u} -m ${u} -g ${g}</span><br />
<span style="font-family: Courier New, Courier, monospace;">echo ${u}:${u}123 | sudo chpasswd</span><br />
<span style="font-family: Courier New, Courier, monospace;">echo -ne "${u}123\n${u}123\n" | sudo smbpasswd -a ${u}</span><br />
<span style="font-family: Courier New, Courier, monospace;">done</span><br />
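<span style="font-family: Courier New, Courier, monospace;">chpasswd reads one "user:password" pair per line, so the passwords for all users can be produced first and set in a single call. A small sketch; initial_passwords is a hypothetical helper name, and the "123" suffix follows the convention above:</span><br />

```shell
# Sketch: print one "user:password" line per user, ready to pipe into chpasswd.
# initial_passwords is a hypothetical helper.
initial_passwords() {
    local u
    for u in "$@"; do
        printf '%s:%s123\n' "$u" "$u"
    done
}

# Usage: initial_passwords t1 t2 t3 | sudo chpasswd
```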
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#delete these users as well as their home directory</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">for u in t1 t2 t3</span><br />
<span style="font-family: Courier New, Courier, monospace;">do </span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo userdel -r $u 2>/dev/null</span><br />
<span style="font-family: Courier New, Courier, monospace;">done</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#configure samba</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install -y samba</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd /etc/samba/</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo cp smb.conf smb.conf.bak</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo vim smb.conf</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">############################</span><br />
<span style="font-family: Courier New, Courier, monospace;">[global]</span><br />
<span style="font-family: Courier New, Courier, monospace;">security = user</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">[store]</span><br />
<span style="font-family: Courier New, Courier, monospace;">path = /store/public</span><br />
<span style="font-family: Courier New, Courier, monospace;">valid users = @research</span><br />
<span style="font-family: Courier New, Courier, monospace;">browsable = yes</span><br />
<span style="font-family: Courier New, Courier, monospace;">writable = yes</span><br />
<span style="font-family: Courier New, Courier, monospace;">guest ok = no</span><br />
<span style="font-family: Courier New, Courier, monospace;">############################</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">sudo mkdir -p /store/public</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo chmod -R 0775 /store/public</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo chown -R hadoop:research /store/public</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo service smbd restart</span><br />
<div>
<br /></div>
The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com0tag:blogger.com,1999:blog-7515846743446581214.post-16429867530731536822015-09-10T12:21:00.003-07:002015-09-10T12:25:06.257-07:00Redirect nohup's output<span style="font-family: Courier New, Courier, monospace;">Some novice Linux users like to run a long-running pipeline using nohup </span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">"nohup mycommand"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">which by default writes stdout and stderr to a file named "nohup.out" in the current working directory. With many of these files around, it is easy to lose track of them, because they all have the same name in different folders.</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">While we can tell users to redirect their output to a managed path, it is hard to get them to run something like </span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">"nohup &lt;command&gt; 1>/log/a.log 2>&1" </span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">It is just too complicated.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">To solve this problem without changing their old habit, we can redefine nohup. Add the following code to the end of "/etc/bash.bashrc"</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">function nohup () {</span><br />
<span style="font-family: Courier New, Courier, monospace;">UUID=$(cat /proc/sys/kernel/random/uuid)</span><br />
<span style="font-family: Courier New, Courier, monospace;">logfile=/log/tmp/${UUID}.txt</span><br />
<span style="font-family: Courier New, Courier, monospace;">printf "log:\n${logfile}\n"</span><br />
<span style="font-family: Courier New, Courier, monospace;">/usr/bin/nohup "$@" 1> ${logfile} 2>&1</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">The next time a user launches nohup, the shell prints a message like this:</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">log:</span><br />
<span style="font-family: Courier New, Courier, monospace;"></span><br />
<span style="font-family: Courier New, Courier, monospace;">/log/tmp/16170f02-7413-49fb-a8ea-0b5c621cdd93.txt</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
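<span style="font-family: Courier New, Courier, monospace;">The function assumes that /log/tmp already exists and is writable. A variant sketch that creates the directory and makes it configurable; nohup_logged and NOHUP_LOG_DIR are hypothetical names, and the function is named differently from nohup here only so it can be tried without touching /etc/bash.bashrc:</span><br />

```shell
# Sketch: run a command under nohup and send its output to a uniquely named log file.
# nohup_logged and NOHUP_LOG_DIR are hypothetical names.
nohup_logged() {
    local dir="${NOHUP_LOG_DIR:-/log/tmp}"
    local id logfile
    # Fall back to time and PID where the kernel UUID file is unavailable.
    id=$(cat /proc/sys/kernel/random/uuid 2>/dev/null) || id="$(date +%s)-$$"
    logfile="$dir/$id.txt"
    mkdir -p "$dir" || return 1
    printf 'log:\n%s\n' "$logfile"
    # Quoting "$@" keeps arguments that contain spaces intact;
    # "command" bypasses any shell function that is also called nohup.
    command nohup "$@" > "$logfile" 2>&1
}

# Usage: NOHUP_LOG_DIR=/log/tmp nohup_logged mycommand arg1 arg2 &
```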
<span style="font-family: Courier New, Courier, monospace;"><br /></span>The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com0tag:blogger.com,1999:blog-7515846743446581214.post-85552315235266728932015-08-18T11:57:00.000-07:002015-08-18T12:00:00.367-07:00Evaluation on speedseq<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">I recently read an interesting article here:</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3505.html</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">The authors were kind enough to open-source the code:</span><br />
<span style="font-family: Courier New, Courier, monospace;">https://github.com/hall-lab/speedseq</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">So I checked out the code and ran some tests. Why is it faster than the popular GATK package? After spending a few hours on it, my conclusions are:</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<br />
<span style="font-family: Courier New, Courier, monospace;">1. speedseq uses a simplified process.</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">Basically it works like:</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">"mapping"->"remove duplicates"->"Call variants"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">while traditional GATK works like:</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">"mapping"->"remove duplicates"->"realign"->"recalibrate"->"Call variants" plus lots of sorting and indexing between steps.</span><br />
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">As indicated on the freebayes website, freebayes handles realignment and recalibration internally.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />
<div>
<span style="font-family: Courier New, Courier, monospace;">2. It uses GNU parallel to parallelize variant calling, because freebayes does not support threading or sub-processing.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">So if you assign 8 CPU cores to speedseq, it will launch a pool of 8 processes that call variants with freebayes on each chromosome independently.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">3. Other improvements.</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">For example, it uses sambamba instead of Picard (in my experience sambamba is at least 3x faster than Picard). It also uses samblaster to extract discordant and split reads simultaneously, right after the alignment, for later SV and CNV calling.</span><br />
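<span style="font-family: 'Courier New', Courier, monospace;">The per-chromosome fan-out from point 2 can be sketched in Python as follows. This is only an illustration of the idea, not speedseq's actual code (speedseq itself drives freebayes through GNU parallel). The file names, the chromosome list, and the helper names build_region_commands and run_all are hypothetical; the only real pieces are the freebayes options -f (reference FASTA) and -r (region).</span><br />

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess


def build_region_commands(reference, bam, chromosomes):
    """Return one freebayes command (as an argv list) per chromosome."""
    return [["freebayes", "-f", reference, "-r", chrom, bam] for chrom in chromosomes]


def run_all(commands, workers, runner=None):
    """Run the commands on a pool of `workers` threads; results keep input order."""
    if runner is None:
        # Default runner: execute the command and return the VCF text from stdout.
        def runner(cmd):
            return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, commands))


# Hypothetical inputs, for illustration only.
chromosomes = ["chr%d" % i for i in range(1, 23)] + ["chrX", "chrY"]
commands = build_region_commands("ucsc.hg19.fasta", "sample.bam", chromosomes)
print(len(commands), "jobs; first:", " ".join(commands[0]))
```

<span style="font-family: 'Courier New', Courier, monospace;">Calling run_all(commands, workers=8) would then keep 8 freebayes processes busy at a time. Note that every one of them still opens the same reference and BAM file.</span><br />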
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">So what are the catches? Nothing can be perfect.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">1. The components are tightly wrapped into the pipeline. Changing individual parameters is not easy.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">2. The variant calling process does not actually split the BAM. In fact, it only splits the BAM header into chromosome-based regions, which accelerates processing because each caller does not need to go through the whole genome. Each process still has to load the full-size reference genome and the full-size BAM file(s). As a result, the memory requirement will be a problem for some machines. Disk I/O could also become a bottleneck if many processes try to read/write large data at the same time.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">No wonder the authors ran their tests on a pretty good AWS EC2 c3.8xlarge instance with big memory and a large SSD.</span><br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">Final words: even though it is not perfect, it is a wonderful solution for some people. Best of all, it is free! No need to pay the greedy and lovely Broad Institute $$$$$$$$$$$$$$$$$$ for the full GATK package.</span><br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span></div>
The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com1tag:blogger.com,1999:blog-7515846743446581214.post-91545324781955468312015-08-18T11:20:00.001-07:002015-08-18T11:20:30.424-07:00Increasing the amount of inotify watchersecho fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf && sudo sysctl -pThe Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com0tag:blogger.com,1999:blog-7515846743446581214.post-46081775032752206742015-08-15T02:43:00.001-07:002015-08-15T02:44:49.071-07:00The famous NA12878 WGS at 50x<div>
Because true variant sets from real genome sequencing are scarce, many tools use NA12878 for benchmarking.
For whole genome sequencing at 50x depth, we can download these two files:
<a href="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_1.fastq.gz">ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_1.fastq.gz</a>
<a href="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_2.fastq.gz">ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_2.fastq.gz</a>
</div>
Some statistics of these two files<br />
<style type="text/css">.nobrtable br { display: none } tr {text-align: center;} tr.alt td {background-color: #eeeecc; color: black;} tr {text-align: center;} caption {caption-side:bottom;}</style>
<br />
<div class="nobrtable">
<table border="2" bordercolor="#0033FF" cellpadding="10" cellspacing="0" style="background-color: #99ffff; border-collapse: collapse; width: 100%px;">
<caption>NA12878 WGS 50x</caption>
<tbody>
<tr style="background-color: #0033ff; color: white; padding-bottom: 4px; padding-top: 5px;">
<th>Facts</th>
<th>ERR194147_1.fastq.gz</th>
<th>ERR194147_2.fastq.gz</th>
</tr>
<tr class="alt">
<td>URI</td>
<td><a href="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_1.fastq.gz">ERR194147_1.fastq.gz</a></td>
<td><a href="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_2.fastq.gz">ERR194147_2.fastq.gz</a></td>
</tr>
<tr class="alt">
<td>Size</td>
<td>48G</td>
<td>49G</td>
</tr>
<tr>
<td>Read count</td>
<td>787265109</td>
<td>787265109</td>
</tr>
<tr class="alt">
<td>Read length (bp)</td>
<td>101</td>
<td>101</td>
</tr>
<tr class="alt">
<td>Coverage (per file; both files together give about 53x)</td>
<td>101*787265109/3000000000=26.5</td>
<td>101*787265109/3000000000=26.5</td>
</tr>
</tbody></table>
</div>
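The coverage row can be reproduced with a few lines of Python. This is a small sketch that uses the same assumption as the table, a genome size of 3,000,000,000 bases; the function name coverage is only for illustration.<br />

```python
GENOME_SIZE = 3_000_000_000  # approximate human genome size, as used in the table


def coverage(read_count, read_length, genome_size=GENOME_SIZE):
    """Mean depth = total sequenced bases / genome size."""
    return read_count * read_length / genome_size


per_file = coverage(787265109, 101)
total = 2 * per_file  # the two files hold the two mates of each read pair
print("per file: %.1fx, both files: %.1fx" % (per_file, total))
```

Each file alone gives about 26.5x, so the "50x" in the title refers to the two files combined.<br />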
The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com0tag:blogger.com,1999:blog-7515846743446581214.post-180264524511579672015-06-12T22:03:00.000-07:002015-06-12T22:09:48.328-07:00Download sequencing data from illumina's basespace<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">Spent 30 mins creating a script to download sequencing data from illumina's <a href="https://basespace.illumina.com/" target="_blank">BaseSpace</a>, by using its <a href="https://developer.basespace.illumina.com/docs/content/documentation/rest-api/api-reference#APIReference" target="_blank">REST API</a></span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;">Here is the code:</span><br />
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Helvetica Neue, Arial, Helvetica, sans-serif;"><a href="https://gist.github.com/anonymous/91c5accc3988cb233347">https://gist.github.com/anonymous/91c5accc3988cb233347</a></span><br />
<br />The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com1tag:blogger.com,1999:blog-7515846743446581214.post-90500079207564690892015-05-12T13:23:00.002-07:002015-05-12T13:23:54.498-07:00Simple SAMBA<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install -y samba</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir /project/share</span><br />
<span style="font-family: Courier New, Courier, monospace;">chmod -R a+rxw /project/share</span><br />
<span style="font-family: Courier New, Courier, monospace;">###############################</span><br />
<span style="font-family: Courier New, Courier, monospace;">#/etc/samba/smb.conf</span><br />
<span style="font-family: Courier New, Courier, monospace;">[share]</span><br />
<span style="font-family: Courier New, Courier, monospace;">path = /project/share</span><br />
<span style="font-family: Courier New, Courier, monospace;">available = yes</span><br />
<span style="font-family: Courier New, Courier, monospace;">guest only=yes</span><br />
<span style="font-family: Courier New, Courier, monospace;">read only = no</span><br />
<span style="font-family: Courier New, Courier, monospace;">browseable = yes</span><br />
<span style="font-family: Courier New, Courier, monospace;">public = yes</span><br />
<span style="font-family: Courier New, Courier, monospace;">writable = yes</span><br />
<span style="font-family: Courier New, Courier, monospace;">###############################</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo service smbd restart</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">\\10.2.5.212\share</span><br />
<div>
<br /></div>
The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com0tag:blogger.com,1999:blog-7515846743446581214.post-14046939208470153472015-04-20T12:25:00.002-07:002015-04-21T01:04:10.499-07:00Run a BWA job with docker<div>
<span style="font-family: Courier New, Courier, monospace;">To run a BWA mapping job inside Docker, I want to create three containers: one for the "bwa" executable, one for the "bwa" genome index, and one for the input FASTQ files. All the benefits can be summarized in one word: "isolation".
</span></div>
<ul>
<li><span style="font-family: Courier New, Courier, monospace;">1. The bwa executable application. I would put it into a volume /bioapp/bwa/0.7.9a/ in a Docker image named "yings/bioapp"
</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">2. The reference genome index which was created using "bwa index".
I would put it into a volume /biodata/hg19/index/bwa/ in a docker image named "yings/biodata"
</span></li>
<li><span style="font-family: Courier New, Courier, monospace;">3. The input FASTQ files. I assume they can be found under "/home/hadoop/fastq" on the host.
</span></li>
</ul>
<div>
<span style="font-family: Courier New, Courier, monospace;">Create an image named "bioapp" with tag "v04202015" and
copy all files and folders under "app" to "/bioapp/" in the container. For simplicity, other operations such as installing dependencies are not included in the Dockerfile.
</span></div>
<span style="font-family: Courier New, Courier, monospace;">The Dockerfile looks like this:</span><br />
<pre class="brush:bash"> FROM ubuntu:14.04
RUN mkdir -p /bioapp
COPY app /bioapp/
VOLUME /bioapp
ENTRYPOINT /usr/bin/tail -f /dev/null</pre>
<span style="font-family: monospace;"><span style="white-space: pre;"> </span></span><div>
<pre class="brush:bash"><eof -f="" -p="" app="" bin="" bioapp="" copy="" dev="" entrypoint="" eof="" from="" mkdir="" null="" pre="" run="" tail="" ubuntu:14.04="" usr="" volume=""><span style="font-family: Courier New, Courier, monospace;">Build the bioapp image
</span><pre class="brush:bash"><span style="font-family: Courier New, Courier, monospace;">$docker build -t yings/bioapp:v04202015 .
</span></pre>
<span style="font-family: Courier New, Courier, monospace;">
Create an image named "biodata" with tag "v04202015" and
copy all files and folders under "data" to "/biodata/" in the container
</span><pre class="brush:bash"><span style="font-family: Courier New, Courier, monospace;"> $cat >Dockerfile &lt;&lt;EOF
FROM ubuntu:14.04
RUN mkdir -p /biodata
COPY data /biodata/
VOLUME /biodata
ENTRYPOINT /usr/bin/tail -f /dev/null
EOF
</span></pre>
<span style="font-family: Courier New, Courier, monospace;">
Build the biodata image
</span><pre class="brush:bash"><span style="font-family: Courier New, Courier, monospace;"> $docker build -t yings/biodata:v04202015 .
</span></pre>
<span style="font-family: Courier New, Courier, monospace;">
Start the biodata container as a daemon and name it "biodata"
</span><pre class="brush:bash"><span style="font-family: Courier New, Courier, monospace;">$docker run -d --name biodata yings/biodata:v04202015
</span></pre>
<span style="font-family: Courier New, Courier, monospace;">
Start the bioapp container as a daemon and name it "bioapp"
</span><pre class="brush:bash"><span style="font-family: Courier New, Courier, monospace;">$docker run -d --name bioapp yings/bioapp:v04202015
</span></pre>
<span style="font-family: Courier New, Courier, monospace;">
Now we should have two data volume containers running in the background. It is time to launch the final executor container
</span><pre class="brush:bash"><span style="font-family: Courier New, Courier, monospace;">$docker run -it --volumes-from biodata --volumes-from bioapp -v /home/hadoop/fastq:/fastq ubuntu:14.04 /bin/bash
</span></pre>
<span style="font-family: Courier New, Courier, monospace;">
Those parameters mean:
"-it" run the executor container interactively
"--volumes-from biodata" load the data volume from the container "biodata" (do not confuse it with the image "yings/biodata")
"--volumes-from bioapp" load the data volume from the container "bioapp" (again, do not confuse it with the image "yings/bioapp")
"-v /home/hadoop/fastq:/fastq" mount the host directory "/home/hadoop/fastq" at "/fastq" in the executor container.
"ubuntu:14.04" the standard image that serves as our OS
"/bin/bash" the command to be executed when the container starts.
If everything goes well, you will see that you are now root in the executor container
</span><pre class="brush:bash"><span style="font-family: Courier New, Courier, monospace;">root@5927eecc8530:/#
</span></pre>
<span style="font-family: Courier New, Courier, monospace;">
Is bwa there?
</span><pre class="brush:bash"><span style="font-family: Courier New, Courier, monospace;">root@5927eecc8530:/# ls -R /bioapp/
</span></pre>
<span style="font-family: Courier New, Courier, monospace;">
Is genome index there?
</span><pre class="brush:bash"><span style="font-family: Courier New, Courier, monospace;">root@5927eecc8530:/# ls -R /biodata/
</span></pre>
<span style="font-family: Courier New, Courier, monospace;">
Is fastq there?
</span><pre class="brush:bash"><span style="font-family: Courier New, Courier, monospace;">root@5927eecc8530:/# ls -R /fastq/
</span></pre>
<span style="font-family: Courier New, Courier, monospace;">
Launch the job and save the alignment as "/fastq/A.sam"
</span><pre class="brush:bash"><span style="font-family: Courier New, Courier, monospace;">root@5927eecc8530:/# /bioapp/bwa/0.7.9a/bwa mem -t 8 -R '@RG\tID:group_id\tPL:illumina\tSM:sample_id' /biodata/index/hg19/bwa/ucsc.hg19 /fastq/A_R1.fastq.gz /fastq/A_R2.fastq.gz > /fastq/A.sam
</span></pre>
<span style="font-family: Courier New, Courier, monospace;">
The process should complete in a few minutes because the input FASTQ files are very small test files.
Now you can safely terminate the executor container by pressing "Ctrl-D".
Since we mounted the host directory "/home/hadoop/fastq" at "/fastq" in the executor container, back on the host we will see the persistent output "A.sam" under "/home/hadoop/fastq".
However, if we use "/biodata" or "/bioapp" as the output folder, for example "bwa ... > /biodata/A.sam", then the output is NOT persistent. If you terminate (remove) the "biodata" container, all the changes made in that container will be lost. (Stopping and restarting the container is OK, as long as it is not terminated.)
</span>
</eof></pre>
</div>
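<span style="font-family: Courier New, Courier, monospace;">When this kind of job has to be launched from a script, the executor command can be assembled programmatically. The following is a minimal Python sketch, not part of the original setup: the helper name executor_command is hypothetical, and it only builds the same "docker run" argument list used above (-it, --volumes-from, -v) without running it.</span><br />

```python
import shlex


def executor_command(volume_containers, host_dir, mount_point,
                     image="ubuntu:14.04", command="/bin/bash"):
    """Build the argv for an interactive executor container."""
    argv = ["docker", "run", "-it"]
    for name in volume_containers:
        # One --volumes-from per data volume container (container name, not image name).
        argv += ["--volumes-from", name]
    # Bind-mount the host directory so that output written there persists on the host.
    argv += ["-v", "%s:%s" % (host_dir, mount_point), image, command]
    return argv


argv = executor_command(["biodata", "bioapp"], "/home/hadoop/fastq", "/fastq")
print(" ".join(shlex.quote(a) for a in argv))
```

<span style="font-family: Courier New, Courier, monospace;">The list can be passed to subprocess.run as-is, which avoids shell quoting problems with paths.</span><br />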
The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com32tag:blogger.com,1999:blog-7515846743446581214.post-13469561383893205222015-04-17T23:45:00.001-07:002015-04-17T23:53:49.185-07:00Docker container to host reference application and library?<span style="font-family: Courier New, Courier, monospace;">Recently I am considering to migrate from AMI to Docker Image</span><br />
<span style="font-family: Courier New, Courier, monospace;">for the reference repository. </span><span style="font-family: 'Courier New', Courier, monospace;">I am thinking about launching 20 or more on-demand AWS EC2 c3.4xlarge instances for WGS data analysis.</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">===AMI===</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">Pros:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Fast to start</span><br />
<span style="font-family: Courier New, Courier, monospace;"> No installation required</span><br />
<span style="font-family: Courier New, Courier, monospace;"> No configuration required</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Secure</span><br />
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<span style="font-family: Courier New, Courier, monospace;">Cons:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> AWS only</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Building AMI is a little time-consuming</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Version control is tricky</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Volume size will increase gradually</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">===Docker image</span><span style="font-family: 'Courier New', Courier, monospace;">===</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">Pros:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Easy to build</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Deploy on any cloud or local platforms</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> Easy to tag or version control</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Seems more popular and cooler</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">Cons:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Where to host repository? The Hub or S3? Many considerations</span><br />
<span style="font-family: Courier New, Courier, monospace;">    Pulling images from one repository </span><span style="font-family: 'Courier New', Courier, monospace;">on lots of worker nodes at the same time may not be practical. The network could become a bottleneck.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">Definitely need more time to do lots of tests. </span><span style="font-family: 'Courier New', Courier, monospace;">Work in progress...</span><br />
<span style="font-family: Courier New, Courier, monospace;"> </span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">  </span>The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com0tag:blogger.com,1999:blog-7515846743446581214.post-10248644306056339062015-04-10T16:27:00.003-07:002015-04-10T16:27:55.792-07:00Lightweight MRAppMaster?<span style="font-family: Courier New, Courier, monospace;">Hadoop v2 moves the application master from the Master Node into a container. The advantage is that it greatly reduces the workload on the Master Node when launching multiple MapReduce applications, and it makes the Hadoop framework scale more easily beyond thousands of Worker Nodes.</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">However, this brings a problem in a small cluster: the application master "MRAppMaster" now occupies a whole container. </span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">Suppose we have five Worker Nodes and every node has just one container. Since "MRAppMaster" will use one container, only four containers are left for real processing, so 20% of the computing resources are "wasted". We can mitigate this problem by assigning two containers per node, so that only 10% of the computing resources are wasted. However, if we divide a node into two containers, each container gets only half of the node's memory and CPU. Memory is precious in many bioinformatics applications; with less than 8G of memory your aligner will probably fail.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">If your Hadoop cluster has more than 10 nodes, the overhead is small enough that you do not need to worry about it.</span><br />
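<span style="font-family: 'Courier New', Courier, monospace;">The trade-off is simple arithmetic, sketched below in Python. The function name appmaster_overhead and the 16G of memory per node are assumptions for illustration only; the five-node cluster is the example from above.</span><br />

```python
def appmaster_overhead(nodes, containers_per_node, node_memory_gb):
    """Return (fraction of containers taken by MRAppMaster, memory per container in GB)."""
    total_containers = nodes * containers_per_node
    wasted_fraction = 1 / total_containers  # one container goes to MRAppMaster
    memory_per_container = node_memory_gb / containers_per_node
    return wasted_fraction, memory_per_container


# Five worker nodes with a hypothetical 16G of memory each.
for containers in (1, 2):
    wasted, mem = appmaster_overhead(5, containers, 16)
    print("%d container(s)/node: %.0f%% wasted, %.0fG per container"
          % (containers, wasted * 100, mem))
```

<span style="font-family: 'Courier New', Courier, monospace;">Doubling the containers halves the waste, but it also halves the memory each task can use.</span><br />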
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<br />The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com0tag:blogger.com,1999:blog-7515846743446581214.post-15463932513903000922015-04-10T16:13:00.002-07:002015-04-10T16:14:10.799-07:00How much space does a file use in HDFS<span style="font-family: Courier New, Courier, monospace;">I have downloaded NA12878 from http://www.illumina.com/platinumgenomes/.</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">The total size is 572GB from 36 gzipped FASTQ files.</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">Now we want to put this dataset into HDFS with a replication factor of 3. How much space will it use? </span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#before:</span><br />
<span style="font-family: Courier New, Courier, monospace;">$hdfs dfsadmin -report | less</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">Configured Capacity: 39378725642240 (35.81 TB)</span><br />
<span style="font-family: Courier New, Courier, monospace;">Present Capacity: 36885873225728 (33.55 TB)</span><br />
<span style="font-family: Courier New, Courier, monospace;">DFS Remaining: 28098049122304 (25.56 TB)</span><br />
<span style="font-family: Courier New, Courier, monospace;">DFS Used: 8787824103424 (7.99 TB)</span><br />
<span style="font-family: Courier New, Courier, monospace;">DFS Used%: 23.82%</span><br />
<span style="font-family: Courier New, Courier, monospace;">Under replicated blocks: 0</span><br />
<span style="font-family: Courier New, Courier, monospace;">Blocks with corrupt replicas: 0</span><br />
<span style="font-family: Courier New, Courier, monospace;">Missing blocks: 0</span><br />
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">$time hdfs dfs -put NA12878 /fastq/</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">real 113m45.058s</span><br />
<span style="font-family: Courier New, Courier, monospace;">user 32m0.252s</span><br />
<span style="font-family: Courier New, Courier, monospace;">sys 23m34.897s</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />
<span style="font-family: Courier New, Courier, monospace;">#after:</span><br />
<span style="font-family: Courier New, Courier, monospace;">$hdfs dfsadmin -report | less</span><br />
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<span style="font-family: Courier New, Courier, monospace;">Configured Capacity: 39378725642240 (35.81 TB)</span><br />
<span style="font-family: Courier New, Courier, monospace;">Present Capacity: 36885873225728 (33.55 TB)</span><br />
<span style="font-family: Courier New, Courier, monospace;">DFS Remaining: 26241290846208 (23.87 TB)</span><br />
<span style="font-family: Courier New, Courier, monospace;">DFS Used: 10644582379520 (9.68 TB)</span><br />
<span style="font-family: Courier New, Courier, monospace;">DFS Used%: 28.86%</span><br />
<span style="font-family: Courier New, Courier, monospace;">Under replicated blocks: 0</span><br />
<span style="font-family: Courier New, Courier, monospace;">Blocks with corrupt replicas: 0</span><br />
<span style="font-family: Courier New, Courier, monospace;">Missing blocks: 0</span><br />
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">Comparing before and after, we can see that the 572GB dataset actually uses 1.69TB (9.68TB - 7.99TB) of HDFS space. That is about 3x the input size, matching the value of "dfs.replication</span><span style="font-family: 'Courier New', Courier, monospace;">" in "hdfs-site.xml".</span></div>
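<span style="font-family: Courier New, Courier, monospace;">The same comparison can be done on the raw byte counts from the two reports. This is a small Python sketch; it assumes that the 572GB figure and the TB values printed by hdfs are in binary units (1GB = 1024^3 bytes).</span><br />

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

used_before = 8787824103424    # "DFS Used" before the upload
used_after = 10644582379520    # "DFS Used" after the upload
input_bytes = 572 * GIB        # assumed size of the 36 FASTQ files

delta = used_after - used_before
ratio = delta / input_bytes
print("extra space used: %.2f TB, %.2fx the input size" % (delta / TIB, ratio))
```

<span style="font-family: Courier New, Courier, monospace;">The ratio comes out slightly above 3, which is consistent with three replicas of every block plus rounding in the 572GB figure.</span><br />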
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<br /></div>
<div>
<br /></div>
The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com0tag:blogger.com,1999:blog-7515846743446581214.post-82049118864913025922015-04-04T00:28:00.000-07:002015-04-04T00:28:07.051-07:00Xen step1<span style="font-family: Courier New, Courier, monospace;">##########################################</span><br />
<span style="font-family: Courier New, Courier, monospace;">##########################################</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo vim /etc/network/interfaces</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"># This file describes the network interfaces available on your system</span><br />
<span style="font-family: Courier New, Courier, monospace;"># and how to activate them. For more information, see interfaces(5).</span><br />
<span style="font-family: Courier New, Courier, monospace;"># The loopback network interface</span><br />
<span style="font-family: Courier New, Courier, monospace;">auto lo em1 xenbr0</span><br />
<span style="font-family: Courier New, Courier, monospace;">iface lo inet loopback</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"># The primary network interface</span><br />
<span style="font-family: Courier New, Courier, monospace;">#auto em1</span><br />
<span style="font-family: Courier New, Courier, monospace;">#iface em1 inet dhcp</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">iface xenbr0 inet dhcp</span><br />
<span style="font-family: Courier New, Courier, monospace;">bridge_ports em1</span><br />
<span style="font-family: Courier New, Courier, monospace;">iface em1 inet manual</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">##########################################</span><br />
<span style="font-family: Courier New, Courier, monospace;">#</span><br />
<span style="font-family: Courier New, Courier, monospace;">#cat > /etc/xen/n0-hvm.cfg</span><br />
<span style="font-family: Courier New, Courier, monospace;">##########################################</span><br />
<span style="font-family: Courier New, Courier, monospace;">builder = "hvm"</span><br />
<span style="font-family: Courier New, Courier, monospace;">name = "n0-hvm"</span><br />
<span style="font-family: Courier New, Courier, monospace;">memory = "10240"</span><br />
<span style="font-family: Courier New, Courier, monospace;">vcpus = 1</span><br />
<span style="font-family: Courier New, Courier, monospace;">vif = ['']</span><br />
<span style="font-family: Courier New, Courier, monospace;">disk = ['phy:/dev/vg0/lv0,hda,w','file:/home/hadoop/ubuntu-14.04.2-server-amd64.iso,hdc:cdrom,r']</span><br />
<span style="font-family: Courier New, Courier, monospace;">vnc = 1</span><br />
<span style="font-family: Courier New, Courier, monospace;">boot="dc"</span><br />
<span class="Apple-tab-span" style="white-space: pre;"><span style="font-family: Courier New, Courier, monospace;"> </span></span><br />
<span style="font-family: Courier New, Courier, monospace;">xl create /etc/xen/n0-hvm.cfg</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">##########################################</span><br />
<span style="font-family: Courier New, Courier, monospace;">#Connect to the installation process using VNC</span><br />
<span style="font-family: Courier New, Courier, monospace;">##########################################</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install gvncviewer</span><br />
<span style="font-family: Courier New, Courier, monospace;">gvncviewer localhost:0</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">##########################################</span><br />
<span style="font-family: Courier New, Courier, monospace;">#list Virtual Machines</span><br />
<span style="font-family: Courier New, Courier, monospace;">##########################################</span><br />
<span style="font-family: Courier New, Courier, monospace;">xl list</span><br />
<span style="font-family: Courier New, Courier, monospace;">Name ID Mem VCPUs State Time(s)</span><br />
<span style="font-family: Courier New, Courier, monospace;">Domain-0 0 53285 12 r----- 81.2</span><br />
<span style="font-family: Courier New, Courier, monospace;">n0-hvm 2 10240 1 r----- 8.4</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
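<span style="font-family: Courier New, Courier, monospace;">##########################################</span><br />
<span style="font-family: Courier New, Courier, monospace;">#Alternative: create the virtual machine with virt-install</span><br />
<span style="font-family: Courier New, Courier, monospace;">##########################################</span><br />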
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install virtinst virt-viewer</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#"-r 10240": 10240MB memory</span><br />
<span style="font-family: Courier New, Courier, monospace;">#"-n n0": name of this virtual machine</span><br />
<span style="font-family: Courier New, Courier, monospace;">#"-f /dev/vg0/lv0": LV disk path</span><br />
<span style="font-family: Courier New, Courier, monospace;">#"-c /home/hadoop/ubuntu-14.04.2-server-amd64.iso": path to the ISO</span><br />
<span style="font-family: Courier New, Courier, monospace;">#"--vcpus 1": use 1 CPU</span><br />
<span style="font-family: Courier New, Courier, monospace;">#"--network bridge=xenbr0": use xenbr0 as network</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">virt-install -n n0 \</span><br />
<span style="font-family: Courier New, Courier, monospace;">--vcpus 1 \</span><br />
<span style="font-family: Courier New, Courier, monospace;">-r 10240 \</span><br />
<span style="font-family: Courier New, Courier, monospace;">--network bridge=xenbr0 \</span><br />
<span style="font-family: Courier New, Courier, monospace;">-f /dev/vg0/lv0 \</span><br />
<span style="font-family: Courier New, Courier, monospace;">-c /home/hadoop/ubuntu-14.04.2-server-amd64.iso</span><br />
<br />The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com0tag:blogger.com,1999:blog-7515846743446581214.post-57696220241759430732015-03-06T23:49:00.001-08:002015-03-06T23:55:26.360-08:00Download from BaseSpace using R<pre>
#under shell
sudo apt-get update
sudo apt-get install libcurl4-gnutls-dev
sudo R
#under R: install RCurl and BaseSpaceR from Bioconductor
source('http://bioconductor.org/biocLite.R')
biocLite('RCurl')
biocLite('BaseSpaceR')
quit()
#Start R again
library(BaseSpaceR)
#replace with your own BaseSpace access token and project ID
ACCESS_TOKEN <- '<YOUR TOKEN>'
PROJECT_ID <- '3289289'
#authenticate, then look up the project and its samples
aAuth <- AppAuth(access_token = ACCESS_TOKEN)
selProj <- Projects(aAuth, id = PROJECT_ID, simplify = TRUE)
sampl <- listSamples(selProj, limit = 1000)
inSample <- Samples(aAuth, id = Id(sampl), simplify = TRUE)
#list the .gz files of each sample and download them into 'fastq/'
for (s in inSample) {
  f <- listFiles(s, Extensions = ".gz")
  print(Name(f))
  getFiles(aAuth, id = Id(f), destDir = 'fastq/', verbose = TRUE)
}
</pre>
Reference: http://seqanswers.com/forums/showthread.php?t=47633The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com0tag:blogger.com,1999:blog-7515846743446581214.post-30441534226040617802015-02-20T13:54:00.005-08:002015-02-20T16:57:11.629-08:00Install bcl2fastq-1.8.4 and bcl2fastq 2.15.0 to Ubuntu 14.04 Server 64 Bit<span style="font-family: Courier New, Courier, monospace;">Goal:</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">Build a streamlined workflow that goes directly from the sequencer to the analysis results, aka a "one-stop solution".</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">After a user starts an experiment by pushing a button on the sequencer's touch screen, all the downstream analyses run automatically. </span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />
<span style="font-family: Courier New, Courier, monospace;"><b><span style="color: red;">The output from the MiSeq is transferred to another Linux machine, where the base call (BCL) files are converted into FASTQ files.</span></b> After that, depending on the type of experiment, an appropriate pipeline is launched on a cluster to process the data. Finally, a report is generated and uploaded to a web server for the user to view. </span><span style="font-family: 'Courier New', Courier, monospace;">Everything is done automatically, </span><span style="font-family: 'Courier New', Courier, monospace;">without any manual operations. No bioinformaticians or computer scientists are involved in the process. </span><br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">Here I list the steps for the installation of "bcl2fastq" on the Linux machine.</span><br />
<div>
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span></div>
<br />
<span style="font-family: Courier New, Courier, monospace;">#system</span><br />
<span style="font-family: Courier New, Courier, monospace;">Ubuntu 14.04 Server 64 bit</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#package</span><br />
<span style="font-family: Courier New, Courier, monospace;">bcl2fastq-1.8.4</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#dependency</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install alien dpkg-dev debhelper build-essential xsltproc gnuplot -y</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#make a tmp folder</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p ~/tmp ; cd ~/tmp</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#download the bcl2fastq RPM package from Illumina. </span><br />
<span style="font-family: Courier New, Courier, monospace;">#The source tarball failed to compile on my system</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">wget ftp://webdata:webdata@ussd-ftp.illumina.com/Downloads/Software/bcl2fastq/bcl2fastq-1.8.4-Linux-x86_64.rpm</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#http://seqanswers.com/forums/showthread.php?t=45649</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo alien -i bcl2fastq-1.8.4-Linux-x86_64.rpm</span><br />
<span style="font-family: Courier New, Courier, monospace;">curl -kL http://install.perlbrew.pl | bash</span><br />
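<span style="font-family: Courier New, Courier, monospace;">echo "source ~/perl5/perlbrew/etc/bashrc" >> ~/.bash_profile</span><br />
<span style="font-family: Courier New, Courier, monospace;">source ~/perl5/perlbrew/etc/bashrc</span><br />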
<span style="font-family: Courier New, Courier, monospace;">perlbrew install perl-5.14.4</span><br />
<span style="font-family: Courier New, Courier, monospace;">perlbrew switch perl-5.14.4</span><br />
<span style="font-family: Courier New, Courier, monospace;">perlbrew install-cpanm</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#install expat-2.1.0</span><br />
<span style="font-family: Courier New, Courier, monospace;">wget -O expat-2.1.0.tar.gz 'http://downloads.sourceforge.net/project/expat/expat/2.1.0/expat-2.1.0.tar.gz?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fexpat%2Ffiles%2Fexpat%2F2.1.0%2F&ts=1424461084&use_mirror=softlayer-dal'</span><br />
<span style="font-family: Courier New, Courier, monospace;">tar -zxvf expat-2.1.0.tar.gz</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd expat-2.1.0 && ./configure && make && sudo make install</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#install XML-Parser-2.41</span><br />
<span style="font-family: Courier New, Courier, monospace;">wget http://pkgs.fedoraproject.org/repo/pkgs/perl-XML-Parser/XML-Parser-2.41.tar.gz/c320d2ffa459e6cdc6f9f59c1185855e/XML-Parser-2.41.tar.gz</span><br />
<span style="font-family: Courier New, Courier, monospace;">tar -zxvf XML-Parser-2.41.tar.gz</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd XML-Parser-2.41 && perl Makefile.PL && make && sudo make install</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#install XML module</span><br />
<span style="font-family: Courier New, Courier, monospace;">cpanm XML/Simple.pm</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#exit</span><br />
<span style="font-family: Courier New, Courier, monospace;">exit</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#the installation is done. Now make a test run.</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#To run bcl2fastq we have to switch to a less strict Perl environment</span><br />
<span style="font-family: Courier New, Courier, monospace;">perlbrew switch perl-5.14.4 </span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#assume "test/Data/Intensities/BaseCalls" is the output folder from your sequencing machine</span><br />
<span style="font-family: Courier New, Courier, monospace;">/usr/local/bin/configureBclToFastq.pl --input-dir test/Data/Intensities/BaseCalls --output-dir output </span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#change to new output folder "output" and start the "from bcl to fastq" conversion</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd output && make -j $(grep -c ^processor /proc/cpuinfo)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#if you find "INFO: all completed successfully" in the last line of output then the test passed</span><br />
<span style="font-family: Courier New, Courier, monospace;">#check out the output fastq files, here the "000000000-ABCDEF" is the FCID (Flow Cell ID)</span><br />
<span style="font-family: Courier New, Courier, monospace;">ls -al Project_000000000-ABCDEF/Sample_lane1/</span><br />
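<span style="font-family: Courier New, Courier, monospace;">#For an automated pipeline, the success check above can be scripted. This is only a sketch: it assumes the make output was saved to a log file (for example with "make ... 2>&1 | tee make.log"), and "conversion_succeeded" is a hypothetical helper name, not something bcl2fastq provides.</span><br />

```shell
# Hypothetical helper: return 0 only if the last non-empty line of the given
# log file contains the success marker quoted in this post.
conversion_succeeded() {
    log_file="$1"
    [ -r "$log_file" ] || return 2
    # awk keeps the last line that has at least one field (is not blank).
    last_line="$(awk 'NF { line = $0 } END { print line }' "$log_file")"
    case "$last_line" in
        *"INFO: all completed successfully"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (inside the "output" folder):
# make -j "$(grep -c ^processor /proc/cpuinfo)" 2>&1 | tee make.log
# conversion_succeeded make.log && echo "bcl to fastq conversion passed"
```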
<div>
<br />
<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">##################################</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">UPDATE!!!</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">Interestingly, Illumina states: </span><br />
<span style="font-family: Courier New, Courier, monospace;">"""</span><br />
<span style="font-family: Courier New, Courier, monospace;">Use the bcl2fastq 2.15.0 conversion software to convert NextSeq 500 or HiSeq X output. </span><br />
<span style="font-family: Courier New, Courier, monospace;">Version 2.15.0 is only for use with NextSeq and HiSeq X data. </span><br />
<span style="font-family: Courier New, Courier, monospace;">Use bcl2fastq 1.8.4 for MiSeq and HiSeq data conversion. </span><br />
<span style="font-family: Courier New, Courier, monospace;">The software is available for download in either an rpm or tarball (.tar.gz) format.</span><br />
<span style="font-family: Courier New, Courier, monospace;">"""</span><br />
<span style="font-family: Courier New, Courier, monospace;">http://support.illumina.com/downloads.html</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">However, I found that after the MiSeq output was uploaded to BaseSpace, it was actually </span><br />
<span style="font-family: Courier New, Courier, monospace;">converted by bcl2fastq 2.15.0.</span><br />
<span style="font-family: Courier New, Courier, monospace;">One key difference is that the "SampleSheet.csv" has different formats for "bcl2fastq 1.8.4" and "bcl2fastq 2.15.0". </span><br />
<span style="font-family: Courier New, Courier, monospace;">The "SampleSheet.csv" generated by the MiSeq matches the format for "bcl2fastq 2.15.0".</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">It is a breeze to install bcl2fastq2-v2.15.0 on Ubuntu from source code. </span><br />
<span style="font-family: Courier New, Courier, monospace;">Doing the same with bcl2fastq-1.8.4 is (almost) mission impossible: I spent a whole day trying different methods and then gave up.</span><br />
<br />
<span style="font-family: Courier New, Courier, monospace;">wget ftp://webdata2:webdata2@ussd-ftp.illumina.com/downloads/Software/bcl2fastq/bcl2fastq2-v2.15.0.4.tar.gz</span><br />
<span style="font-family: Courier New, Courier, monospace;">tar -zxvf bcl2fastq2-v2.15.0.4.tar.gz</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd bcl2fastq && mkdir build && cd build</span><br />
<span style="font-family: Courier New, Courier, monospace;">../src/configure --prefix=/home/hadoop/tool/bcl2fastq && make && make install</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#now run a test. Assuming "exp123" is our input folder</span><br />
<span style="font-family: Courier New, Courier, monospace;">#first, all the ".bcl" files MUST be gzipped, otherwise the job will fail (this needs improvement, Illumina!)</span><br />
<br />
<span style="font-family: Courier New, Courier, monospace;">find exp123 -name "*.bcl" -exec gzip {} \;</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#pull the trigger </span><br />
<span style="font-family: Courier New, Courier, monospace;">/home/hadoop/tool/bcl2fastq/bin/bcl2fastq -R exp123 -o exp123_fastq</span><br />
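<span style="font-family: Courier New, Courier, monospace;">#The gzip step can be wrapped so a script can verify that no uncompressed ".bcl" file is left before bcl2fastq starts. A minimal sketch; "count_uncompressed_bcl" and "compress_bcl" are hypothetical helper names, and it assumes "find" and "gzip" are installed.</span><br />

```shell
# Hypothetical helper: print how many uncompressed .bcl files remain under
# the given run folder.
count_uncompressed_bcl() {
    find "$1" -type f -name '*.bcl' | wc -l | tr -d ' '
}

# Hypothetical helper: gzip every .bcl file under the run folder.
# "-exec gzip {} +" hands many files to each gzip call, which starts far
# fewer processes than "-exec gzip {} \;" on a large run.
compress_bcl() {
    find "$1" -type f -name '*.bcl' -exec gzip {} +
}

# Usage:
# compress_bcl exp123
# [ "$(count_uncompressed_bcl exp123)" = "0" ] && echo "ready for bcl2fastq"
```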
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">Nice! Sorry, I take back the complaints I made this morning about Illumina's software engineering team </span><span style="font-family: 'Courier New', Courier, monospace;">after the frustration of installing and running bcl2fastq 1.8.4. </span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">Good job on bcl2fastq 2.15.0, Illumina team. You have my love again.</span></div>
<div>
<br /></div>
The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com6tag:blogger.com,1999:blog-7515846743446581214.post-57221191219828512652014-10-06T17:25:00.002-07:002014-10-06T17:29:16.828-07:00Install oozie on ubuntu 14.04 with hadoop-2.5.0<span style="font-family: Courier New, Courier, monospace;">Assuming we use /home/hadoop/tools/ as the installation path for oozie-4.0.1 with hadoop-2.5.0 installed on </span><span style="font-family: 'Courier New', Courier, monospace;">/home/hadoop/tools/hadoop-2.5.0.</span><span style="font-family: 'Courier New', Courier, monospace;"> </span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">1. download & build</span><br />
<pre style="background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); margin-bottom: 10px; padding: 9.5px; word-break: break-all; word-wrap: break-word;"><div>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">cd /home/hadoop/tools/</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">curl http://apache.cs.utah.edu/oozie/4.0.1/oozie-4.0.1.tar.gz | tar -zx</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">cd /home/hadoop/tools/oozie-4.0.1</span></span><br />
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
<span style="font-family: Courier New, Courier, monospace;">
</span></div>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">bin/mkdistro.sh -DskipTests -Dhadoopversion=2.5.0 -DjavaVersion=1.7 -DtargetJavaVersion=1.7</span></span><br />
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
</div>
</div>
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
<span style="font-family: 'Courier New', Courier, monospace;"></span></div>
</pre>
<br />
<span style="font-family: Courier New, Courier, monospace;">2. make a new folder for the build</span><br />
<pre style="background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); margin-bottom: 10px; padding: 9.5px; word-break: break-all; word-wrap: break-word;"><div>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">cp -R /home/hadoop/tools/oozie-4.0.1/distro/target/oozie-4.0.1-distro/oozie-4.0.1 /home/hadoop/tools/oozie</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">cd /home/hadoop/tools/oozie/bin</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">echo "PATH=$PWD:\$PATH" >> ~/.bash_profile</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">source ~/.bash_profile</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">#now "/home/hadoop/tools/oozie" is the home path for oozie</span></span><br />
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
</div>
</div>
</pre>
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">3. prepare for oozie web console</span><br />
<pre style="background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); margin-bottom: 10px; padding: 9.5px; word-break: break-all; word-wrap: break-word;"><div>
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
<span style="font-family: Courier New, Courier, monospace;">
</span></div>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">cd /home/hadoop/tools/oozie/</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">mkdir libext</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">cp /home/hadoop/tools/oozie-4.0.1/hadooplibs/target/oozie-4.0.1-hadooplibs.tar.gz .</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">tar xzvf oozie-4.0.1-hadooplibs.tar.gz</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">cp oozie-4.0.1/hadooplibs/hadooplib-2.3.0.oozie-4.0.1/* libext/</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">cd libext</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">wget http://extjs.com/deploy/ext-2.2.zip</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">rm -fr /home/hadoop/tools/oozie/oozie-4.0.1 /home/hadoop/tools/oozie/oozie-4.0.1-hadooplibs.tar.gz</span></span><br />
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
<span style="font-family: Courier New, Courier, monospace;">
</span></div>
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
</div>
</div>
</pre>
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">4. add these two lines to "core-site.xml" under "/home/hadoop/tools/hadoop-2.5.0/etc/hadoop"</span><br />
<pre style="background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); margin-bottom: 10px; padding: 9.5px; word-break: break-all; word-wrap: break-word;"><div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
<span style="font-family: Courier New, Courier, monospace;">
</span></div>
<property><name>hadoop.proxyuser.hadoop.hosts</name><value>*</value></property>
<property><name>hadoop.proxyuser.hadoop.groups</name><value>*</value></property>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;"></span></span></pre>
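<span style="font-family: Courier New, Courier, monospace;">#Before starting oozie it may help to confirm that both proxyuser properties really are in core-site.xml. The sketch below is an assumption-laden check, not an official Hadoop tool: "has_proxyuser_settings" is a hypothetical helper, and it only matches a name element written on one line without extra whitespace.</span><br />

```shell
# Hypothetical helper: return 0 only if the config file names both the
# hadoop.proxyuser.<user>.hosts and hadoop.proxyuser.<user>.groups
# properties. The user defaults to "hadoop", as in this post.
has_proxyuser_settings() {
    conf_file="$1"
    proxy_user="${2:-hadoop}"
    # -F matches the text literally, so the dots are not regex wildcards.
    grep -F -q "<name>hadoop.proxyuser.${proxy_user}.hosts</name>" "$conf_file" &&
    grep -F -q "<name>hadoop.proxyuser.${proxy_user}.groups</name>" "$conf_file"
}

# Usage:
# has_proxyuser_settings /home/hadoop/tools/hadoop-2.5.0/etc/hadoop/core-site.xml hadoop \
#     && echo "proxyuser settings found"
```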
<br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<br />
<div>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;"><br /></span></span></div>
<span style="font-family: Courier New, Courier, monospace;">5. Start oozie</span><br />
<pre style="background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); margin-bottom: 10px; padding: 9.5px; word-break: break-all; word-wrap: break-word;"><div>
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
<span style="font-family: Courier New, Courier, monospace;">
</span></div>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">oozie-setup.sh prepare-war</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">#oozie-setup.sh sharelib create -fs hdfs://localhost:8020</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">oozie-setup.sh db create -run</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">oozied.sh start</span></span><br />
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">oozie admin -oozie http://localhost:11000/oozie -status</span></span><br />
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
</div>
</div>
</pre>
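<span style="font-family: Courier New, Courier, monospace;">#The oozie server can take a little while to come up after "oozied.sh start", so a script may want to poll the status command instead of calling it once. A rough sketch: "wait_for_normal" and the POLL_SECONDS variable are hypothetical names of mine, and it assumes the status command prints "System mode: NORMAL" once the server is ready.</span><br />

```shell
# Hypothetical helper: run the given status command up to <tries> times and
# return 0 as soon as its output contains "System mode: NORMAL".
# POLL_SECONDS (assumed name, default 5) is the pause between attempts.
wait_for_normal() {
    tries="$1"
    shift
    attempt=0
    while [ "$attempt" -lt "$tries" ]; do
        # Ignore the command's own exit status; only its output is checked.
        status_output="$("$@" 2>/dev/null || true)"
        case "$status_output" in
            *"System mode: NORMAL"*) return 0 ;;
        esac
        attempt=$((attempt + 1))
        sleep "${POLL_SECONDS:-5}"
    done
    return 1
}

# Usage:
# wait_for_normal 12 oozie admin -oozie http://localhost:11000/oozie -status \
#     && echo "oozie is up"
```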
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">6. Run an example</span><br />
<pre style="background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); margin-bottom: 10px; padding: 9.5px; word-break: break-all; word-wrap: break-word;"><div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
<span style="font-family: 'Courier New', Courier, monospace;">cd /home/hadoop/tools/oozie/</span></div>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">tar -zxvf oozie-examples.tar.gz</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">find examples/ -name "job.properties" -exec sed -i "s/localhost/master/g" '{}' \;</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">hdfs dfs -rm -f -r /user/hadoop/examples </span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">hdfs dfs -put examples examples</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">oozie job -oozie http://master:11000/oozie -config examples/apps/map-reduce/job.properties -run</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">
</span></span>
<span style="font-family: Courier New, Courier, monospace;"><span style="white-space: normal;">#now open a web browser and access "http://master:11000/oozie", you will see the submitted job.</span></span>
</pre>
<span style="font-family: Courier New, Courier, monospace;">#reference:</span><br />
<span style="font-family: Courier New, Courier, monospace;">http://www.quora.com/Has-anyone-tried-installing-oozie-4-0-0-with-apache-hadoop-2-2-0-and-java1-7-0_45-on-ubuntu</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com3tag:blogger.com,1999:blog-7515846743446581214.post-1266518347095847162014-09-22T01:01:00.002-07:002014-10-06T17:18:38.182-07:00Learn Spark with Python<span style="font-family: Courier New, Courier, monospace;">1. Install Spark</span><br />
<pre style="background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); color: #333333; font-family: Menlo, 'Lucida Console', monospace; font-size: 13px; line-height: 20px; margin-bottom: 10px; padding: 9.5px; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
<span style="font-family: Courier New, Courier, monospace;">cd ~/tools/</span></div>
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
<span style="font-family: 'Courier New', Courier, monospace;">
</span></div>
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
<span style="font-family: 'Courier New', Courier, monospace;">wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz</span></div>
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
</div>
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
<span style="font-family: 'Courier New', Courier, monospace;">tar -zxvf spark-1.1.0.tgz</span></div>
</pre>
<br />
<span style="font-family: 'Courier New', Courier, monospace;">2. Build spark for hadoop2</span><br />
<pre style="background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); color: #333333; font-family: Menlo, 'Lucida Console', monospace; font-size: 13px; line-height: 20px; margin-bottom: 10px; padding: 9.5px; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><code style="background: transparent; border-bottom-left-radius: 3px; border-bottom-right-radius: 3px; border-top-left-radius: 3px; border-top-right-radius: 3px; border: 0px; color: inherit; font-family: Menlo, 'Lucida Console', monospace; font-size: 12px; padding: 0px;">cd ~/tools/</code><span style="font-family: 'Courier New', Courier, monospace;">spark-1.1.0</span></pre>
<pre style="background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); color: #333333; font-family: Menlo, 'Lucida Console', monospace; font-size: 13px; line-height: 20px; margin-bottom: 10px; padding: 9.5px; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span style="background-color: transparent; color: inherit; font-size: 12px;">SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly</span></pre>
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">3. Install py4j</span><br />
<pre style="background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); color: #333333; font-family: Menlo, 'Lucida Console', monospace; font-size: 13px; line-height: 20px; margin-bottom: 10px; padding: 9.5px; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span style="color: black; font-family: 'Courier New', Courier, monospace; font-size: small; line-height: normal; white-space: normal;">sudo pip install py4j</span></pre>
<br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">4. Modify ~/.bash_profile by adding two lines</span><br />
<pre style="background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); color: #333333; font-family: Menlo, 'Lucida Console', monospace; font-size: 13px; line-height: 20px; margin-bottom: 10px; padding: 9.5px; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
<span style="font-family: 'Courier New', Courier, monospace;">export SPARK_HOME=$HOME/tools/spark-1.1.0</span></div>
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
<span style="font-family: Courier New, Courier, monospace;"></span></div>
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; white-space: normal;">
<span style="font-family: Courier New, Courier, monospace;">export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH</span></div>
</pre>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">5. source the ~/.bash_profile</span><br />
<pre style="background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); color: #333333; font-size: 13px; line-height: 20px; margin-bottom: 10px; padding: 9.5px; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><div style="color: black; font-size: medium; line-height: normal; white-space: normal;">
<span style="font-family: Courier New, Courier, monospace;">source </span><span style="font-family: 'Courier New', Courier, monospace;">~/.bash_profile</span></div>
</pre>
<span style="font-family: Courier New, Courier, monospace;"><br /></span><span style="font-family: Courier New, Courier, monospace;">6. Test. Start a python shell and type</span><br />
<div>
<pre style="background-color: whitesmoke; border-bottom-left-radius: 4px; border-bottom-right-radius: 4px; border-top-left-radius: 4px; border-top-right-radius: 4px; border: 1px solid rgba(0, 0, 0, 0.14902); color: #333333; font-family: Menlo, 'Lucida Console', monospace; font-size: 13px; line-height: 20px; margin-bottom: 10px; padding: 9.5px; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span style="color: black; font-family: 'Courier New', Courier, monospace; font-size: small; line-height: normal; white-space: normal;">import pyspark</span></pre>
</div>
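<span style="font-family: Courier New, Courier, monospace;">If the import fails, the two variables from step 4 are the first thing to check. Below is a minimal sketch, assuming a POSIX shell; check_spark_env is a hypothetical helper name and is not part of Spark.</span><br />

```shell
# Hypothetical helper: confirm that SPARK_HOME is set and that
# PYTHONPATH contains $SPARK_HOME/python (with or without trailing slash).
check_spark_env() {
    if [ -z "${SPARK_HOME:-}" ]; then
        echo "SPARK_HOME is not set"
        return 1
    fi
    # Wrap PYTHONPATH in colons so every entry can be matched as ":entry:".
    case ":${PYTHONPATH:-}:" in
        *":${SPARK_HOME}/python/:"*|*":${SPARK_HOME}/python:"*)
            echo "ok"
            ;;
        *)
            echo "PYTHONPATH does not contain ${SPARK_HOME}/python"
            return 1
            ;;
    esac
}
```

<span style="font-family: Courier New, Courier, monospace;">Run check_spark_env in the same terminal that will start Python, after sourcing ~/.bash_profile.</span><br />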
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<br />
<span style="font-family: Courier New, Courier, monospace;">AWS AMI for Hadoop DataNode (posted 2014-08-27)</span><br />
<span style="font-family: Courier New, Courier, monospace;">
</span><br />
<span style="font-family: Courier New, Courier, monospace;">##Create an AWS AMI as the snapshot for hadoop datanode</span><br />
<span style="font-family: Courier New, Courier, monospace;">##This image should contain all necessary tools, packages and libraries</span><br />
<span style="font-family: Courier New, Courier, monospace;">##to be used by the pipeline and your hadoop-application</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">##assume that we have already started an Ubuntu 14.04 64-bit PV instance</span><br />
<span style="font-family: Courier New, Courier, monospace;">##Public IP for the instance is 12.34.56.78 and our private key file was saved as </span><br />
<span style="font-family: Courier New, Courier, monospace;">##/home/hadoop/.ssh/aws.pem</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">AWS_IP=12.34.56.78</span><br />
<span style="font-family: Courier New, Courier, monospace;">KEYFILE=/home/hadoop/.ssh/aws.pem</span><br />
<span style="font-family: Courier New, Courier, monospace;">USER=ubuntu</span><br />
<span style="font-family: Courier New, Courier, monospace;">########################</span><br />
<span style="font-family: Courier New, Courier, monospace;">#step1. log in to our AWS instance</span><br />
<span style="font-family: Courier New, Courier, monospace;">########################</span><br />
<span style="font-family: Courier New, Courier, monospace;">ssh -i ${KEYFILE} ${USER}@${AWS_IP}</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">########################</span><br />
<span style="font-family: Courier New, Courier, monospace;">#step2. install basic packages</span><br />
<span style="font-family: Courier New, Courier, monospace;">########################</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install openjdk-7-jdk -y</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install make -y</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install cmake -y</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install gcc -y</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install g++ -y</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install zlib1g-dev -y</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install unzip -y</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install libncurses5-dev -y</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install r-base -y</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install python-dev -y</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install python-dateutil -y</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install python-psutil -y</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install python-pip -y</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install maven2 -y</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install libxml2-dev -y</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install gradle -y</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
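<span style="font-family: Courier New, Courier, monospace;">##The package list above can also be driven from a single function. This is a minimal sketch, not the way the image was actually built.</span><br />
<span style="font-family: Courier New, Courier, monospace;">##install_packages and the DRY_RUN switch are hypothetical names; with DRY_RUN=1 the commands are only printed.</span><br />

```shell
# Hypothetical helper: install each package named on the command line.
# With DRY_RUN=1 it prints the apt-get command instead of running it.
install_packages() {
    for pkg in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "sudo apt-get install $pkg -y"
        else
            sudo apt-get install "$pkg" -y
        fi
    done
}

# Usage (not run here):
#   install_packages openjdk-7-jdk make cmake gcc g++ zlib1g-dev unzip
```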
<span style="font-family: Courier New, Courier, monospace;">#install R packages</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo R</span><br />
<span style="font-family: Courier New, Courier, monospace;">source('http://www.bioconductor.org/biocLite.R')</span><br />
<span style="font-family: Courier New, Courier, monospace;">biocLite('edgeR')</span><br />
<span style="font-family: Courier New, Courier, monospace;">biocLite('DESeq')</span><br />
<span style="font-family: Courier New, Courier, monospace;">biocLite('limma')</span><br />
<span style="font-family: Courier New, Courier, monospace;">#... all other necessary packages</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">########################</span><br />
<span style="font-family: Courier New, Courier, monospace;">#step3. create folder structure</span><br />
<span style="font-family: Courier New, Courier, monospace;">########################</span><br />
<span style="font-family: Courier New, Courier, monospace;">TOOLS_HOME=~/tools</span><br />
<span style="font-family: Courier New, Courier, monospace;">BIO_HOME=~/bioinformatics</span><br />
<span style="font-family: Courier New, Courier, monospace;">APP=$BIO_HOME/app</span><br />
<span style="font-family: Courier New, Courier, monospace;">DATA=$BIO_HOME/data</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">mkdir -p ${TOOLS_HOME}</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p ${APP}</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p ${DATA}</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">########################</span><br />
<span style="font-family: Courier New, Courier, monospace;">#step4. install hadoop under ~/tools/</span><br />
<span style="font-family: Courier New, Courier, monospace;">########################</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd $TOOLS_HOME </span><br />
<span style="font-family: Courier New, Courier, monospace;">wget http://apache.cs.utah.edu/hadoop/common/hadoop-2.5.0/hadoop-2.5.0.tar.gz</span><br />
<span style="font-family: Courier New, Courier, monospace;">tar -zxvf hadoop-2.5.0.tar.gz && rm -f hadoop-2.5.0.tar.gz && cd</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
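<span style="font-family: Courier New, Courier, monospace;">##A sketch of one way to keep the download, extract and cleanup commands pointing at the same file: build every name from one version string.</span><br />
<span style="font-family: Courier New, Courier, monospace;">##hadoop_tarball, hadoop_url and HADOOP_MIRROR are hypothetical names; the mirror URL is the one used in this post.</span><br />

```shell
# Mirror used in this post; any Apache mirror with the same layout should work.
HADOOP_MIRROR=http://apache.cs.utah.edu/hadoop/common

# Hypothetical helper: tarball file name for a given Hadoop version.
hadoop_tarball() {
    echo "hadoop-$1.tar.gz"
}

# Hypothetical helper: full download URL for a given Hadoop version.
hadoop_url() {
    echo "${HADOOP_MIRROR}/hadoop-$1/$(hadoop_tarball "$1")"
}

# Usage (not run here):
#   HADOOP_VERSION=2.5.0
#   wget "$(hadoop_url "$HADOOP_VERSION")"
#   tar -zxvf "$(hadoop_tarball "$HADOOP_VERSION")"
#   rm -f "$(hadoop_tarball "$HADOOP_VERSION")"
```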
<span style="font-family: Courier New, Courier, monospace;">#install s3tools under ~/tools/</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd $TOOLS_HOME </span><br />
<span style="font-family: Courier New, Courier, monospace;">wget https://github.com/s3tools/s3cmd/archive/master.zip</span><br />
<span style="font-family: Courier New, Courier, monospace;">unzip master.zip</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">########################</span><br />
<span style="font-family: Courier New, Courier, monospace;">#step5. install bioinformatics applications</span><br />
<span style="font-family: Courier New, Courier, monospace;">########################</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#BWA</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd $APP</span><br />
<span style="font-family: Courier New, Courier, monospace;">wget http://downloads.sourceforge.net/project/bio-bwa/bwa-0.7.9a.tar.bz2</span><br />
<span style="font-family: Courier New, Courier, monospace;">tar -jxvf bwa-0.7.9a.tar.bz2 && rm -fr bwa-0.7.9a.tar.bz2</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd bwa-0.7.9a && make && cd $APP</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p $APP/bwa/0.7.9a</span><br />
<span style="font-family: Courier New, Courier, monospace;">find bwa-0.7.9a -executable -type f -print0 | xargs -0 -I {} mv {} $APP/bwa/0.7.9a/</span><br />
<span style="font-family: Courier New, Courier, monospace;">rm -fr bwa-0.7.9a</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
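<span style="font-family: Courier New, Courier, monospace;">##The "build, then move every executable into $APP/tool/version" pattern used for BWA comes back for samtools, so it can be wrapped in a function.</span><br />
<span style="font-family: Courier New, Courier, monospace;">##A minimal sketch: collect_executables is a hypothetical helper, and it uses find -exec in place of the xargs pipeline.</span><br />

```shell
# Hypothetical helper: move every executable regular file found under
# the source directory ($1) into the destination directory ($2).
collect_executables() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    # -executable matches files the current user may execute.
    find "$src" -type f -executable -exec mv {} "$dest"/ \;
}

# Usage (not run here):
#   collect_executables bwa-0.7.9a "$APP/bwa/0.7.9a"
```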
<span style="font-family: Courier New, Courier, monospace;">#bowtie1</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd ${APP}</span><br />
<span style="font-family: Courier New, Courier, monospace;">wget http://downloads.sourceforge.net/project/bowtie-bio/bowtie/1.0.1/bowtie-1.0.1-linux-x86_64.zip</span><br />
<span style="font-family: Courier New, Courier, monospace;">unzip bowtie-1.0.1-linux-x86_64.zip && rm bowtie-1.0.1-linux-x86_64.zip</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p ${APP}/bowtie/1.0.1 && mv bowtie-1.0.1/* ${APP}/bowtie/1.0.1/ && rm -fr bowtie-1.0.1/</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#bowtie2</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd ${APP}</span><br />
<span style="font-family: Courier New, Courier, monospace;">wget http://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.2.3/bowtie2-2.2.3-linux-x86_64.zip</span><br />
<span style="font-family: Courier New, Courier, monospace;">unzip bowtie2-2.2.3-linux-x86_64.zip && rm bowtie2-2.2.3-linux-x86_64.zip</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p ${APP}/bowtie/2.2.3 && mv bowtie2-2.2.3/* ${APP}/bowtie/2.2.3/ && rm -fr bowtie2-2.2.3/</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#SNAP</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd ${APP}</span><br />
<span style="font-family: Courier New, Courier, monospace;">curl http://snap.cs.berkeley.edu/downloads/snap-1.0beta.10-linux.tar.gz | tar xvz</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p ${APP}/snap/1.0beta.10/</span><br />
<span style="font-family: Courier New, Courier, monospace;">mv snap-1.0beta.10-linux/* ${APP}/snap/1.0beta.10/ && rm -fr snap-1.0beta.10-linux</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#GSNAP</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd ${APP}</span><br />
<span style="font-family: Courier New, Courier, monospace;">curl http://research-pub.gene.com/gmap/src/gmap-gsnap-2014-06-10.tar.gz | tar xvz</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd gmap-2014-06-10 && ./configure --prefix=${APP}/gmap/2014-06-10/ && make && make install && cd ${APP}</span><br />
<span style="font-family: Courier New, Courier, monospace;">rm -fr gmap-2014-06-10</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#STAR</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd ${APP}</span><br />
<span style="font-family: Courier New, Courier, monospace;">curl https://rna-star.googlecode.com/files/STAR_2.3.0e.Linux_x86_64.tgz | tar xvz</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p ${APP}/star/2.3.0e/</span><br />
<span style="font-family: Courier New, Courier, monospace;">mv STAR_2.3.0e.Linux_x86_64/* ${APP}/star/2.3.0e/ && rm -fr STAR_2.3.0e.Linux_x86_64 </span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#Tophat2</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd ${APP}</span><br />
<span style="font-family: Courier New, Courier, monospace;">curl http://ccb.jhu.edu/software/tophat/downloads/tophat-2.0.11.Linux_x86_64.tar.gz | tar xvz </span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p ${APP}/tophat/2.0.11 && mv tophat-2.0.11.Linux_x86_64/* ${APP}/tophat/2.0.11/ && rm -fr tophat-2.0.11.Linux_x86_64</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#cufflinks</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd ${APP}</span><br />
<span style="font-family: Courier New, Courier, monospace;">curl http://cufflinks.cbcb.umd.edu/downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz | tar xvz </span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p ${APP}/cufflinks/2.2.1 && mv cufflinks-2.2.1.Linux_x86_64/* ${APP}/cufflinks/2.2.1/ && rm -fr cufflinks-2.2.1.Linux_x86_64</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#HTSeq</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd ${APP}</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo apt-get install python-pip -y</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo pip install numpy</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo pip install scipy</span><br />
<span style="font-family: Courier New, Courier, monospace;">curl https://pypi.python.org/packages/source/H/HTSeq/HTSeq-0.6.1p1.tar.gz | tar xvz </span><br />
<span style="font-family: Courier New, Courier, monospace;">cd HTSeq-0.6.1p1 && python setup.py build && sudo python setup.py install && cd -</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo rm -fr HTSeq-0.6.1p1</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#samtools</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd ${APP}</span><br />
<span style="font-family: Courier New, Courier, monospace;">wget http://downloads.sourceforge.net/project/samtools/samtools/0.1.19/samtools-0.1.19.tar.bz2</span><br />
<span style="font-family: Courier New, Courier, monospace;">tar -jxvf samtools-0.1.19.tar.bz2 && cd samtools-0.1.19/ && make && cd ${APP}</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p $APP/samtools/0.1.19/</span><br />
<span style="font-family: Courier New, Courier, monospace;">find samtools-0.1.19 -executable -type f -print0 | xargs -0 -I {} mv {} $APP/samtools/0.1.19/</span><br />
<span style="font-family: Courier New, Courier, monospace;">rm -fr samtools-0.1.19*</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#picard</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd ${APP}</span><br />
<span style="font-family: Courier New, Courier, monospace;">wget http://downloads.sourceforge.net/project/picard/picard-tools/1.114/picard-tools-1.114.zip</span><br />
<span style="font-family: Courier New, Courier, monospace;">unzip picard-tools-1.114.zip</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p ${APP}/picard/1.114/ && mv picard-tools-1.114/* ${APP}/picard/1.114/</span><br />
<span style="font-family: Courier New, Courier, monospace;">rm -fr picard-tools-1.114 picard-tools-1.114.zip</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#bamtools</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd $APP</span><br />
<span style="font-family: Courier New, Courier, monospace;">git clone https://github.com/pezmaster31/bamtools.git bamtools-src</span><br />
<span style="font-family: Courier New, Courier, monospace;">cd $APP/bamtools-src && mkdir build && cd build && cmake .. && make && cd $APP</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p $APP/bamtools/2.3.0/ && mv $APP/bamtools-src/* $APP/bamtools/2.3.0/ && rm -fr $APP/bamtools-src</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">########################</span><br />
<span style="font-family: Courier New, Courier, monospace;">#step6. install bioinformatics annotation files</span><br />
<span style="font-family: Courier New, Courier, monospace;">########################</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#hg19</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p ${DATA}/fasta/hg19/ && cd ${DATA}/fasta/hg19/</span><br />
<span style="font-family: Courier New, Courier, monospace;">for i in {1..22} X Y M; do wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${i}.fa.gz; done</span><br />
<span style="font-family: Courier New, Courier, monospace;">gunzip *.gz</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#create an index and a dict for each chromosome file</span><br />
<span style="font-family: Courier New, Courier, monospace;">for i in *.fa</span><br />
<span style="font-family: Courier New, Courier, monospace;">do</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>j=$(echo $i | cut -d"." -f1)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>echo $j</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>java -jar ${APP}/picard/1.114/CreateSequenceDictionary.jar R=$j.fa O=$j.dict</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>${APP}/samtools/0.1.19/samtools faidx $j.fa</span><br />
<span style="font-family: Courier New, Courier, monospace;">done</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
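<span style="font-family: Courier New, Courier, monospace;">##The loop above derives the base name with echo and cut. A sketch of the same step with shell parameter expansion follows.</span><br />
<span style="font-family: Courier New, Courier, monospace;">##fasta_stem is a hypothetical helper. Unlike cut -d"." -f1 it strips only a trailing ".fa", so names that contain extra dots are kept whole.</span><br />

```shell
# Hypothetical helper: print a file name without its trailing ".fa".
fasta_stem() {
    echo "${1%.fa}"
}

# Usage inside the loop (not run here):
#   for i in *.fa; do j=$(fasta_stem "$i"); echo "$j"; done
```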
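<span style="font-family: Courier New, Courier, monospace;">#build hg19 genome index for BWA</span><br />
<span style="font-family: Courier New, Courier, monospace;">#first concatenate the per-chromosome files into the single hg19.fa that the index builds read</span><br />
<span style="font-family: Courier New, Courier, monospace;">for i in {1..22} X Y M; do cat ${DATA}/fasta/hg19/chr${i}.fa; done > ${DATA}/fasta/hg19/hg19.fa</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p $DATA/index/hg19/bwa/</span><br />
<span style="font-family: Courier New, Courier, monospace;">${APP}/bwa/0.7.9a/bwa index -p ${DATA}/index/hg19/bwa/hg19 ${DATA}/fasta/hg19/hg19.fa</span><br />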
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#build hg19 genome index for novoalign</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p ${DATA}/index/hg19/novoalign</span><br />
<span style="font-family: Courier New, Courier, monospace;">${APP}/novocraft/3.02.05/novoindex -k 14 -s 1 ${DATA}/index/hg19/novoalign/hg19.nix ${DATA}/fasta/hg19/hg19.fa</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#build hg19 genome index for bowtie1</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p $DATA/index/hg19/bowtie1/</span><br />
<span style="font-family: Courier New, Courier, monospace;">$APP/bowtie/1.0.1/bowtie-build $DATA/fasta/hg19/hg19.fa $DATA/index/hg19/bowtie1/hg19</span><br />
<span style="font-family: Courier New, Courier, monospace;">cp $DATA/fasta/hg19/hg19.fa $DATA/index/hg19/bowtie1/</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#build hg19 genome index for bowtie2</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p $DATA/index/hg19/bowtie2/</span><br />
<span style="font-family: Courier New, Courier, monospace;">$APP/bowtie/2.2.3/bowtie2-build $DATA/fasta/hg19/hg19.fa $DATA/index/hg19/bowtie2/hg19</span><br />
<span style="font-family: Courier New, Courier, monospace;">cp $DATA/fasta/hg19/hg19.fa $DATA/index/hg19/bowtie2/</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#build hg19 genome index for SNAP (requires at least 64 GB of memory)</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p ${DATA}/index/snap/hg19</span><br />
<span style="font-family: Courier New, Courier, monospace;">sudo sysctl vm.overcommit_memory=1</span><br />
<span style="font-family: Courier New, Courier, monospace;">${APP}/snap/1.0beta.10/snap index ${DATA}/fasta/hg19/hg19.fa ${DATA}/index/snap/hg19</span><br />
<span style="font-family: Courier New, Courier, monospace;">#${APP}/snap/1.0beta.10/snap paired ${DATA}/index/snap/hg19</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#TODO: build hg19 genome index for GSNAP</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#TODO: build hg19 genome index for STAR</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#dbsnp</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir -p ${DATA}/misc/ && cd ${DATA}/misc/</span><br />
<span style="font-family: Courier New, Courier, monospace;">curl ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All.vcf.gz | gunzip -c > dbsnp_hg19.vcf</span><br />
<span style="font-family: Courier New, Courier, monospace;">for chromosome in {1..22} X Y M; </span><br />
<span style="font-family: Courier New, Courier, monospace;">do</span><br />
<span style="font-family: Courier New, Courier, monospace;">awk -v c="$chromosome" '/^#/{print $0;next} $1 == ("chr" c) || $1 == c {print $0}' dbsnp_hg19.vcf > dbsnp_hg19.chr${chromosome}.vcf</span><br />
<span style="font-family: Courier New, Courier, monospace;">done</span><br />
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
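<span style="font-family: Courier New, Courier, monospace;">##The per-chromosome split needs an exact match on the first column: a prefix match on "chr1" would also pick up chr10 to chr19.</span><br />
<span style="font-family: Courier New, Courier, monospace;">##A minimal sketch of the exact-match version follows. split_vcf_by_chrom and its output prefix argument are hypothetical, and the "chr" prefix on contig names is an assumption about the input file.</span><br />

```shell
# Hypothetical helper: split a VCF into one file per chromosome.
#   $1      input VCF
#   $2      output prefix (files are named <prefix>.chr<N>.vcf)
#   $3...   chromosome names without the "chr" prefix
split_vcf_by_chrom() {
    vcf=$1
    prefix=$2
    shift 2
    for c in "$@"; do
        # Keep every header line; keep a record only when column 1
        # is exactly "chr<N>".
        awk -v c="chr$c" '/^#/ {print; next} $1 == c {print}' "$vcf" > "${prefix}.chr${c}.vcf"
    done
}

# Usage (not run here):
#   split_vcf_by_chrom dbsnp_hg19.vcf dbsnp_hg19 1 2 3 X Y M
```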
<br />
<span style="font-family: Courier New, Courier, monospace;">Extend /home partition under VMWare+Ubuntu (posted 2014-08-27, updated 2014-11-13)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">Say I want to add 512GB of space to an existing Ubuntu guest hosted by VMware Workstation.</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />
<span style="font-family: Courier New, Courier, monospace;">0. Shut down the Ubuntu system, then switch to VMware Workstation.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">Right-click the machine -> "Settings" -> "Hard Disk" in the left panel -> click "Utilities" in the right panel -> "Expand" -> enter the new total size. Since my disk already has 512GB and I want to add 512GB, the total is "1024".</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">1. Check the existing disk partitions and write down the largest number in the End column. Here it is 3221225471.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">$sudo fdisk -l</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> Device Boot Start End Blocks Id System</span><br />
<span style="font-family: Courier New, Courier, monospace;">/dev/sda1 * 2048 499711 248832 83 Linux</span><br />
<span style="font-family: Courier New, Courier, monospace;">/dev/sda2 501758 2147481599 1073489921 5 Extended</span><br />
<span style="font-family: Courier New, Courier, monospace;">/dev/sda3 2147481600 3221225471 536871936 83 Linux</span><br />
<span style="font-family: Courier New, Courier, monospace;">/dev/sda5 501760 2147481599 1073489920 8e Linux LVM</span><br />
<span style="font-family: Courier New, Courier, monospace;"></span><br />
<br />
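<span style="font-family: Courier New, Courier, monospace;">Instead of reading the End column by eye, the next free sector can be computed from the fdisk listing. This is a minimal sketch that assumes the column layout shown above; next_free_sector is a hypothetical helper name.</span><br />

```shell
# Hypothetical helper: read "fdisk -l" style rows on stdin and print
# the largest End value plus one.
next_free_sector() {
    awk '$1 ~ /^\/dev\// {
        # A "*" in the Boot column shifts Start and End right by one field.
        end = ($2 == "*") ? $4 : $3
        if (end + 0 > max) max = end + 0
    }
    END { printf "%.0f\n", max + 1 }'
}

# Usage (not run here):
#   sudo fdisk -l /dev/sda | next_free_sector
```

<span style="font-family: Courier New, Courier, monospace;">The result is printed with "%.0f" because some awk implementations cap "%d" at 32 bits, which is smaller than these sector numbers.</span><br />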
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">2. Add new partition</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">$sudo fdisk /dev/sda</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">Press "n" to create a new partition. When prompted for "Partition type", enter "p".</span><br />
<br />
<span style="font-family: Courier New, Courier, monospace;">When prompted for the first sector, enter 3221225472 (3221225471 + 1).</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">When prompted for the last sector, accept the default (use all remaining space).</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">Then press "w" to write the partition table to disk.</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">3. Reboot</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">$sudo reboot</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">4. Run "sudo fdisk -l" again. This time you should see a new partition "/dev/sda4"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> Device Boot Start End Blocks Id System</span><br />
<span style="font-family: Courier New, Courier, monospace;">/dev/sda1 * 2048 499711 248832 83 Linux</span><br />
<span style="font-family: Courier New, Courier, monospace;">/dev/sda2 501758 2147481599 1073489921 5 Extended</span><br />
<span style="font-family: Courier New, Courier, monospace;">/dev/sda3 2147481600 3221225471 536871936 83 Linux</span><br />
<span style="font-family: Courier New, Courier, monospace;">/dev/sda4 3221225472 4294967294 536870911+ 83 Linux</span><br />
<span style="font-family: Courier New, Courier, monospace;"></span><br />
<span style="font-family: Courier New, Courier, monospace;">/dev/sda5 501760 2147481599 1073489920 8e Linux LVM</span><br />
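<span style="font-family: Courier New, Courier, monospace;">The numbers in the table above can be double-checked with plain shell arithmetic. This is only a sketch: "part_info" is a made-up helper name, and the 512-byte sector size is an assumption inferred from the Blocks column, not something fdisk was asked to confirm.</span><br />

```shell
# part_info PREV_END LAST: hypothetical helper, not part of fdisk.
# PREV_END = last sector of the previous partition (/dev/sda3)
# LAST     = last sector of the new partition (/dev/sda4)
part_info() {
    first=$(( $1 + 1 ))              # first free sector after the previous partition
    sectors=$(( $2 - first + 1 ))    # sector range is inclusive on both ends
    bytes=$(( sectors * 512 ))       # assumes 512-byte sectors
    echo "first_sector=$first sectors=$sectors bytes=$bytes"
}

part_info 3221225471 4294967294
# prints: first_sector=3221225472 sectors=1073741823 bytes=549755813376
```

<span style="font-family: Courier New, Courier, monospace;">The result is one sector short of 512 GiB (549755813888 bytes), so unless the volume group already has other free space, asking for exactly 512G later may not fit; using all free space is the more forgiving form.</span><br />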
<div>
<br /></div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">4. Find a folder under "/dev/" whose name ends with "-vg". It should be named "&lt;machine_name&gt;-vg", e.g. "master-vg" if your hostname is master. The commands below use "cloud-vg".</span><br />
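<span style="font-family: Courier New, Courier, monospace;">One possible way to look for that folder from the shell is sketched below. "list_vg_dirs" is a hypothetical helper written for this note; it only matches on the "-vg" suffix and knows nothing about LVM itself. Running "sudo vgs" should list the volume group names as well.</span><br />

```shell
# list_vg_dirs [DIR]: hypothetical helper that prints the names of the
# sub-folders of DIR (default /dev) whose name ends with "-vg".
list_vg_dirs() {
    dir=${1:-/dev}
    for d in "$dir"/*-vg; do
        # if nothing matches, the pattern stays literal and fails the -d test
        [ -d "$d" ] && basename "$d"
    done
    return 0
}

list_vg_dirs
```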
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">5. Extend the volume group</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">$sudo vgextend cloud-vg /dev/sda4</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">You should see screen outputs like this:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> No physical volume label read from /dev/sda4</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Physical volume "/dev/sda4" successfully created</span><br />
<span style="font-family: Courier New, Courier, monospace;"></span><br />
<span style="font-family: Courier New, Courier, monospace;"> Volume group "cloud-vg" successfully extended</span><br />
<div>
<br /></div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span><span style="font-family: Courier New, Courier, monospace;">6. Extend the logical volume</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#add 512G</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">$sudo lvextend -L+512G /dev/</span><span style="font-family: 'Courier New', Courier, monospace;">cloud</span><span style="font-family: Courier New, Courier, monospace;">-vg/root</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#or add all free space</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18.4799995422363px;">$sudo lvextend -l+100%FREE /dev/</span><span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18.4799995422363px;">cloud</span><span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px; line-height: 18.4799995422363px;">-vg/root</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">You should see screen outputs like this:</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> Extending logical volume root to 1.98 TiB</span><br />
<span style="font-family: Courier New, Courier, monospace;"> Logical volume root successfully resized</span><br />
<div>
<br /></div>
<div>
<br /></div>
<span style="font-family: 'Courier New', Courier, monospace;">7. Final step - resize the file system</span><br />
<span style="font-family: Courier New, Courier, monospace;">$sudo resize2fs /dev/</span><span style="font-family: 'Courier New', Courier, monospace;">cloud</span><span style="font-family: Courier New, Courier, monospace;">-vg/root</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">8. Check out the disk usage now</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">$df -h</span><br />
<div>
<br /></div>
The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com0tag:blogger.com,1999:blog-7515846743446581214.post-47492042748773750802014-08-27T14:35:00.002-07:002014-08-27T14:35:37.836-07:00MongoDB start and stop<span style="font-family: Courier New, Courier, monospace;">#MongoDB</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#create</span><br />
<span style="font-family: Courier New, Courier, monospace;">mkdir ~/mongo && mkdir -p ~/log/mongodb</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#start</span><br />
<span style="font-family: Courier New, Courier, monospace;">mongod --fork --dbpath ~/mongo --logpath ~/log/mongodb/main.log</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#shutdown</span><br />
<span style="font-family: Courier New, Courier, monospace;">mongod --shutdown --dbpath ~/mongo</span><br />
<div>
<br /></div>
The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com0tag:blogger.com,1999:blog-7515846743446581214.post-23262502372744719862014-08-19T11:14:00.003-07:002014-08-19T11:14:52.261-07:00bamsplitter<span style="font-family: Courier New, Courier, monospace;">#!/bin/bash</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#split the BAM file by chromosome on NameNode's local disk</span><br />
<span style="font-family: Courier New, Courier, monospace;">pathSamtools=$1 #the full path to samtools. e.g. /X/Y/Z/samtools</span><br />
<span style="font-family: Courier New, Courier, monospace;">bamInputFile=$2 #input BAM file name. e.g. /A/B/C/d.bam</span><br />
<span style="font-family: Courier New, Courier, monospace;">bamOutputDir=$3 #output folder name for the split BAM files, e.g. E. The full path will be /A/B/C/E/</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#bamInputFile="/A/B/C/d.bam"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#/A/B/C/&lt;bamOutputDir&gt;</span><br />
<span style="font-family: Courier New, Courier, monospace;">bamOutputPath=${bamInputFile%/*}"/"${bamOutputDir}</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#"d.bam"</span><br />
<span style="font-family: Courier New, Courier, monospace;">bamFileName=$(basename "$bamInputFile")</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#"d"</span><br />
<span style="font-family: Courier New, Courier, monospace;">bamFileNameNoExt=${bamFileName%%.*}</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#"/A/B/C/d.bam.bai"</span><br />
<span style="font-family: Courier New, Courier, monospace;">bamFileIndex=${bamInputFile}."bai"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#"/A/B/C/d"</span><br />
<span style="font-family: Courier New, Courier, monospace;">bamFilePathNoExt=${bamInputFile%.*} #strip only the last extension, so a dot in a folder name does not truncate the path</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#"bam"</span><br />
<span style="font-family: Courier New, Courier, monospace;">#bamFileNameExt="${bam##*.}"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">bam=${bamInputFile}</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">numCPU=$(grep -c ^processor /proc/cpuinfo)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">sorted=$(${pathSamtools} view -H ${bamInputFile} | grep SO:coordinate)</span><br />
<span style="font-family: Courier New, Courier, monospace;">if [ -z "$sorted" ];then</span><br />
<span style="font-family: Courier New, Courier, monospace;"> #unsorted</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ${pathSamtools} sort ${bamInputFile} ${bamFilePathNoExt}".sort"</span><br />
<span style="font-family: Courier New, Courier, monospace;"> bam=${bamFilePathNoExt}".sort.bam"</span><br />
<span style="font-family: Courier New, Courier, monospace;"> #create BAM index</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ${pathSamtools} index ${bam}</span><br />
<span style="font-family: Courier New, Courier, monospace;">else #sorted</span><br />
<span style="font-family: Courier New, Courier, monospace;"> if [ ! -f "${bamFileIndex}" ];then #create BAM index if not exist</span><br />
<span style="font-family: Courier New, Courier, monospace;"> ${pathSamtools} index ${bam}</span><br />
<span style="font-family: Courier New, Courier, monospace;"> fi</span><br />
<span style="font-family: Courier New, Courier, monospace;">fi</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#create output folder if not exist</span><br />
<span style="font-family: Courier New, Courier, monospace;">if [ ! -d "${bamOutputPath}" ]; then</span><br />
<span style="font-family: Courier New, Courier, monospace;"> mkdir -p ${bamOutputPath}</span><br />
<span style="font-family: Courier New, Courier, monospace;">fi</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">#split the BAM by chromosome</span><br />
<span style="font-family: Courier New, Courier, monospace;">for i in `${pathSamtools} view -H ${bam} | awk -F"\t" '/@SQ/{print $2}' | cut -d":" -f2`</span><br />
<span style="font-family: Courier New, Courier, monospace;">do</span><br />
<span style="font-family: Courier New, Courier, monospace;">${pathSamtools} view -@ ${numCPU} -h -F 0x4 -q 10 ${bam} $i | \</span><br />
<span style="font-family: Courier New, Courier, monospace;">awk '{if(/^@SQ/ && !/^@SQ\tSN:'$i'\t/) next} {if(/^@/) print $0}{if ($3=="'$i'") print $0}' | \</span><br />
<span style="font-family: Courier New, Courier, monospace;">${pathSamtools} view -@ ${numCPU} -hbS - > ${bamOutputPath}/${bamFileNameNoExt}.$i.bam 2>/dev/null</span><br />
<span style="font-family: Courier New, Courier, monospace;">done</span><br />
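<span style="font-family: Courier New, Courier, monospace;">#The path handling at the top of the script relies on shell parameter expansion. The sketch below only prints the derived names for the example path used in the comments; "show_names" is a hypothetical helper and is not part of the script.</span><br />

```shell
# show_names BAM_INPUT_FILE BAM_OUTPUT_DIR: hypothetical helper that prints
# the names the script derives from its second and third arguments.
show_names() {
    bamInputFile=$1
    bamOutputDir=$2
    bamFileName=$(basename "$bamInputFile")
    echo "outputPath=${bamInputFile%/*}/${bamOutputDir}"   # drop the file name, append the folder
    echo "fileName=${bamFileName}"
    echo "fileNameNoExt=${bamFileName%%.*}"                # drop everything from the first dot
    echo "index=${bamInputFile}.bai"
}

show_names /A/B/C/d.bam E
# prints:
# outputPath=/A/B/C/E
# fileName=d.bam
# fileNameNoExt=d
# index=/A/B/C/d.bam.bai
```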
<div>
<br /></div>
The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com0tag:blogger.com,1999:blog-7515846743446581214.post-86957890947849984592014-08-01T16:37:00.000-07:002014-08-02T13:54:56.505-07:00Size of container<span style="font-family: Courier New, Courier, monospace;">In Hadoop YARN, a container is a sub-unit of a physical DataNode. </span><span style="font-family: 'Courier New', Courier, monospace;">The size of a container greatly affects the performance of MapReduce, especially when the application itself supports multiple threads.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">Say we have one DataNode with 4 cores and 8 GB of memory, and we want to run BWA with the input "A_1.fastq". What are the options? </span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">1) 1 container per DataNode. This container has all 4 cores and 6.4 GB of memory (we do not want to starve the host DataNode).
We then have only one BWA process, run like "bwa mem -t 4 ... A_1.fastq", with 6.4 GB of memory available to it. </span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span><span style="font-family: 'Courier New', Courier, monospace;">2) 4 containers per DataNode, each with 1 core and 1.6 GB of memory. We then have to split "A_1.fastq" into "A_1_1.fastq" ... "A_4_1.fastq" and start
4 parallel BWA processes, run like "bwa mem -t 1 ... A_1_1.fastq", "bwa mem -t 1 ... A_2_1.fastq", etc., with 1.6 GB of memory available per process. </span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">Finally, we have to merge the resulting SAM files.
Since our goal is to optimize the execution, the question is "Which one is faster?"
</span><br />
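<span style="font-family: 'Courier New', Courier, monospace;">The 6.4 GB and 1.6 GB figures in the two options above follow from simple arithmetic, sketched here. "container_mem_gb" is a hypothetical helper, and the 0.8 fraction (leaving 20% of the memory to the host) is an assumption inferred from 6.4 GB out of 8 GB; it is not a YARN default.</span><br />

```shell
# container_mem_gb NODE_MEM_GB CONTAINERS_PER_NODE [FRACTION]
# Hypothetical helper: memory per container when FRACTION of the node's
# memory (assumed 0.8 here) is handed out in equally sized containers.
container_mem_gb() {
    awk -v mem="$1" -v n="$2" -v frac="${3:-0.8}" \
        'BEGIN { printf "%.1f\n", mem * frac / n }'
}

container_mem_gb 8 1   # option 1: prints 6.4
container_mem_gb 8 4   # option 2: prints 1.6
```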
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">Before jumping to the answer, we have to consider: </span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">1) Smaller available memory means the input FASTQ files must be small, otherwise the process will fail. </span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">2) The overhead. Splitting and merging add to the overall running time, and every BWA process has to load the genome index into memory before mapping the reads. </span><br />
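<span style="font-family: 'Courier New', Courier, monospace;">To make the splitting step concrete, here is a minimal sketch of one way to cut a FASTQ file into N pieces. "split_fastq" is a hypothetical helper, not part of BWA or Hadoop, and it assumes every FASTQ record is exactly 4 lines (no wrapped sequence lines). Records are dealt out round-robin, so the pieces end up nearly equal in size.</span><br />

```shell
# split_fastq INPUT N PREFIX
# Hypothetical helper: writes PREFIX_1_1.fastq ... PREFIX_N_1.fastq.
# Assumes each FASTQ record is exactly 4 lines.
split_fastq() {
    awk -v n="$2" -v p="$3" '{
        # record k (0-based) is lines 4k+1 .. 4k+4 and goes to piece (k mod n) + 1
        f = sprintf("%s_%d_1.fastq", p, int((NR - 1) / 4) % n + 1)
        print > f
    }' "$1"
}

# usage (not run here): split_fastq A_1.fastq 4 A
```

<span style="font-family: 'Courier New', Courier, monospace;">The time this takes, plus the time to merge the SAM files afterwards, is the overhead that a single multi-threaded BWA process avoids.</span><br />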
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">Since BWA itself supports multiple threads, it seems like the best way is option (1) - one container per DataNode.
Is it the best solution? No! Why? Because the ApplicationMaster itself will occupy one container. </span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">Assume we have 5 DataNodes, each with one container. When we start a YARN-based MapReduce application, the ResourceManager finds a
container to host the ApplicationMaster. </span><br />
<span style="font-family: 'Courier New', Courier, monospace;">The ApplicationMaster then starts the computing containers, but it occupies a full container itself. As a result, we only have 4 computing containers. This wastes computing resources, since we know the ApplicationMaster does not need that much (4 cores and 6.4 GB of memory).
The figure below shows a node (in the red box) that is running the ApplicationMaster without doing the "real computation".</span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4M3PnS3-A6_jeNo-li-mWCUEO94uxZu_HiuFJvms-bw0HEGbLyMFAokGkfq99AoTwsCUWLUkK3QPNxP4P6NxEkd0zhfheTCLZ0qxz37qSZi4XKPeR9rJ9esPfp5EIMfuCa2GyRvhuOp90/s1600/hadoop_status.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><span style="clear: left; float: left; font-family: Courier New, Courier, monospace; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4M3PnS3-A6_jeNo-li-mWCUEO94uxZu_HiuFJvms-bw0HEGbLyMFAokGkfq99AoTwsCUWLUkK3QPNxP4P6NxEkd0zhfheTCLZ0qxz37qSZi4XKPeR9rJ9esPfp5EIMfuCa2GyRvhuOp90/s320/hadoop_status.png" height="154" width="640" /></span></a></div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">If we "ssh" into that "ApplicationMaster" node, we can see it is running a process named "MRAppMaster".
</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZGsxLaDhjuTNh162nOGtkfUEzm9Toi7fcv2KhcdVAZgmPkN_PfYkiOTcDs7vf0mt7B0scyCw5fAo1_PesgjCL7p_yLDzUqy0HhrXfeuMvYgzR_0HP5y4Y46P3X-0rwi8WPwVjsTaMbrnc/s1600/MRAppMaster.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><span style="font-family: Courier New, Courier, monospace;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZGsxLaDhjuTNh162nOGtkfUEzm9Toi7fcv2KhcdVAZgmPkN_PfYkiOTcDs7vf0mt7B0scyCw5fAo1_PesgjCL7p_yLDzUqy0HhrXfeuMvYgzR_0HP5y4Y46P3X-0rwi8WPwVjsTaMbrnc/s320/MRAppMaster.png" /></span></a></div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<br />
<span style="font-family: Courier New, Courier, monospace;">In other words, we wasted 20% of the computing resources. This is not a problem if you are running a 100-node cluster, since in that case only 1% of the resources are "wasted". </span><span style="font-family: 'Courier New', Courier, monospace;">However, we do not need a big boss if we are a small team. </span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">What about two containers per DataNode? Then we have 2*5 = 10 containers in total, and only 10% of them are wasted. But we run into the multiple-container problem again - the overhead... </span><br />
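<span style="font-family: 'Courier New', Courier, monospace;">The percentages quoted above can be reproduced with a one-line calculation. This is only a sketch: "am_overhead_pct" is a hypothetical helper, and it assumes a single running application (one ApplicationMaster) and equally sized containers.</span><br />

```shell
# am_overhead_pct NODES CONTAINERS_PER_NODE
# Hypothetical helper: share of all containers taken by one ApplicationMaster.
am_overhead_pct() {
    awk -v nodes="$1" -v per="$2" \
        'BEGIN { printf "%.0f\n", 100 / (nodes * per) }'
}

am_overhead_pct 5 1     # 5 nodes, 1 container each: prints 20
am_overhead_pct 100 1   # 100 nodes, 1 container each: prints 1
am_overhead_pct 5 2     # 5 nodes, 2 containers each: prints 10
```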
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<span style="font-family: 'Courier New', Courier, monospace;">This is just one of the trade-off or balancing problems that we encounter here and there.</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span>
<br />The Daddy of Maomao and Doudouhttp://www.blogger.com/profile/10683399617439900283noreply@blogger.com0