Testing Kafka's high throughput with multiple partitions and multiple consumers for a single topic.
What are we going to do:
1. Create 5 brokers
2. Create 3 partitions for a topic
3. Create 2 producers to send messages to one topic, which will distribute the messages evenly across the 3 partitions
4. Create 3 consumers, one for each partition, to consume messages in parallel and in order
What do we need:
1. Ubuntu 14.04
2. kafka_2.11-0.10.1.1
#1. Create configuration files, one per broker
cd /PATH/kafka_2.11-0.10.1.1/config
for i in $(seq 1 5);
do
cp server.properties server.properties${i};
#change id
sed -i "s/broker.id=0/broker.id=${i}/g" server.properties${i};
#change port
port=$(expr ${i} + 9092)
sed -i "s/#listeners=PLAINTEXT:\/\/:9092/listeners=PLAINTEXT:\/\/:${port}/g" server.properties${i};
#change log file
sed -i "s/log.dirs=\/tmp\/kafka-logs/log.dirs=\/tmp\/kafka-logs${i}/g" server.properties${i};
done
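As a sanity check, the same loop can be exercised against a mock server.properties in a throwaway directory. The three-line mock file below is an assumption for illustration only; a real server.properties contains many more settings, and the real one from the Kafka config directory should be used in practice.

```shell
# Build a minimal mock server.properties in a temp directory so the
# real Kafka config is never touched (assumption: GNU sed, as on Ubuntu).
workdir=$(mktemp -d)
cat > "${workdir}/server.properties" <<'EOF'
broker.id=0
#listeners=PLAINTEXT://:9092
log.dirs=/tmp/kafka-logs
EOF

cd "${workdir}"
for i in $(seq 1 5); do
  cp server.properties server.properties${i}
  # unique broker id
  sed -i "s/broker.id=0/broker.id=${i}/g" server.properties${i}
  # unique port: 9093..9097
  port=$(expr ${i} + 9092)
  sed -i "s/#listeners=PLAINTEXT:\/\/:9092/listeners=PLAINTEXT:\/\/:${port}/g" server.properties${i}
  # unique log directory
  sed -i "s/log.dirs=\/tmp\/kafka-logs/log.dirs=\/tmp\/kafka-logs${i}/g" server.properties${i}
done

grep broker.id server.properties3   # broker.id=3
grep listeners server.properties5   # listeners=PLAINTEXT://:9097
```

Each generated file should now have a unique broker.id, listener port, and log directory, which is what lets 5 brokers run on one machine.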
#2. Create a startup script "kafka-start.sh"
#start zookeeper
zookeeper-server-start.sh /PATH/kafka_2.11-0.10.1.1/config/zookeeper.properties &
# start 5 kafka brokers
kafka-server-start.sh /PATH/kafka_2.11-0.10.1.1/config/server.properties1 &
kafka-server-start.sh /PATH/kafka_2.11-0.10.1.1/config/server.properties2 &
kafka-server-start.sh /PATH/kafka_2.11-0.10.1.1/config/server.properties3 &
kafka-server-start.sh /PATH/kafka_2.11-0.10.1.1/config/server.properties4 &
kafka-server-start.sh /PATH/kafka_2.11-0.10.1.1/config/server.properties5 &
####################################
#3. Run the script above to launch the brokers, then type "jps"; you should see 5 Kafka processes
$jps
4928 Kafka
4924 Kafka
4925 Kafka
4926 Kafka
4927 Kafka
4923 QuorumPeerMain
#4. Create a topic "Hello" with 3 partitions
$kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic Hello
#5. Check out the partitions
$kafka-topics.sh --describe --zookeeper localhost:2181 --topic Hello
Topic: Hello Partition: 0 ...
Topic: Hello Partition: 1 ...
Topic: Hello Partition: 2 ...
#6. Open 3 terminals, then start 3 consumers that consume messages only from partitions 0, 1, and 2 respectively
kafka-console-consumer.sh --bootstrap-server localhost:9093 --new-consumer --partition 0 --topic Hello
kafka-console-consumer.sh --bootstrap-server localhost:9093 --new-consumer --partition 1 --topic Hello
kafka-console-consumer.sh --bootstrap-server localhost:9093 --new-consumer --partition 2 --topic Hello
#7. Open 2 terminals to produce some messages
$kafka-console-producer.sh --broker-list localhost:9093,localhost:9094,localhost:9095,localhost:9096,localhost:9097 --topic Hello
m1
m2
m3
$kafka-console-producer.sh --broker-list localhost:9093,localhost:9094,localhost:9095,localhost:9096,localhost:9097 --topic Hello
m4
m5
m6
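Since the console producer sends these messages with no key, the default partitioner spreads them across the 3 partitions. The sketch below merely simulates a round-robin spread with modular arithmetic; the partition each real message lands on is decided by Kafka's partitioner and may differ from this illustration.

```shell
# Hedged illustration of round-robin spreading of keyless messages
# over 3 partitions; not Kafka's actual partitioner.
partitions=3
n=0
for msg in m1 m2 m3 m4 m5 m6; do
  line="partition $((n % partitions)) <- ${msg}"
  echo "${line}"
  n=$((n + 1))
done
# partition 0 <- m1
# partition 1 <- m2
# partition 2 <- m3
# partition 0 <- m4
# ...
```

This is why each of the 3 consumers above, pinned to a single partition, sees roughly a third of the messages, in order within its partition.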
#8. Let Kafka do the load balancing. To do so, we must assign all consumers to one group
8.1. Create a file named "kafka.consumer.group" containing one line:
group.id=group1
8.2. Launch each consumer with that file
$kafka-console-consumer.sh --bootstrap-server localhost:9093 --consumer.config kafka.consumer.group --new-consumer --topic Hello
#9. Now messages will be evenly distributed across all consumers in the group without specifying partitions
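One way to picture the rebalancing: as consumers join group1, the group coordinator reassigns the 3 partitions so each consumer gets a fair share. The loop below prints one plausible round-robin mapping per group size; Kafka's actual assignor (range assignment by default in this version) may map partitions to consumers differently, so treat this as a sketch only.

```shell
# Hedged sketch of partition-to-consumer assignment in one group
# as the group grows from 1 to 3 consumers; not Kafka's real assignor.
partitions=3
for consumers in 1 2 3; do
  echo "--- ${consumers} consumer(s) in group1 ---"
  for p in 0 1 2; do
    line="partition ${p} -> consumer $((p % consumers))"
    echo "${line}"
  done
done
```

With 3 consumers and 3 partitions, each consumer ends up owning exactly one partition, which matches the parallel, per-partition-ordered consumption from step 6.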