Friday, December 20, 2013

How does YARN execute my job

Found some interesting "behind the curtain" stuff that shows how YARN actually executes a job.

In the configuration file "yarn-site.xml", we can specify where to store LocalResource files. Under that directory we can find "usercache", which holds APPLICATION-level resources that are automatically cleaned up after a job finishes, and "filecache", which holds PUBLIC-level resources that are NOT deleted when the job finishes - they are kept as a cache and only removed when disk space becomes a concern.
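The directory in question is set by the "yarn.nodemanager.local-dirs" property. A minimal yarn-site.xml fragment (the path below is just an example matching this cluster's layout, not a required value):

```
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/home/hadoop/localdirs</value>
</property>
```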

In our script, we can add a final command that copies everything in the current working directory back to the master machine, so we can inspect it:

$rsync -arvue ssh * hadoop@master:~/tmp/

If we then have a look at hadoop@master:~/tmp/ we find a few things:


container_tokens
default_container_executor.sh
launch_container.sh
tmp/


Here we find four items: three files and a "tmp" directory. "container_tokens" is about security, so we can leave it for now. Let us have a look at the two scripts:



$cat default_container_executor.sh
#!/bin/bash

echo $$ > /home/hadoop/localdirs/nmPrivate/container_1387575550242_0001_01_000002.pid.tmp

/bin/mv -f /home/hadoop/localdirs/nmPrivate/container_1387575550242_0001_01_000002.pid.tmp /home/hadoop/localdirs/nmPrivate/container_1387575550242_0001_01_000002.pid
exec setsid /bin/bash -c "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/container_1387575550242_0001_01_000002/launch_container.sh"


"default_container_executor.sh" records the container's PID, then creates a new session and calls "launch_container.sh" in the container's working directory on the slave.


$cat launch_container.sh

#!/bin/bash



export NM_HTTP_PORT="8042"
export LOCAL_DIRS="/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001"
export HADOOP_COMMON_HOME="/home/hadoop/hadoop-2.2.0"
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"
export NM_AUX_SERVICE_mapreduce_shuffle="AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
"
export HADOOP_YARN_HOME="/home/hadoop/hadoop-2.2.0"
export HADOOP_TOKEN_FILE_LOCATION="/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/container_1387575550242_0001_01_000002/container_tokens"
export NM_HOST="n1"
export JVM_PID="$$"
export USER="hadoop"
export HADOOP_HDFS_HOME="/home/hadoop/hadoop-2.2.0"
export PWD="/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/container_1387575550242_0001_01_000002"
export CONTAINER_ID="container_1387575550242_0001_01_000002"
export NM_PORT="48810"
export HOME="/home/"
export LOGNAME="hadoop"
export HADOOP_CONF_DIR="/home/hadoop/hadoop-2.2.0/etc/hadoop"
export MALLOC_ARENA_MAX="4"
export LOG_DIRS="/home/hadoop/logs/application_1387575550242_0001/container_1387575550242_0001_01_000002"
ln -sf "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/filecache/12/A_R2.fq" "A_R2.fq"
ln -sf "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/filecache/11/A_R1.fq" "A_R1.fq"
ln -sf "/home/hadoop/localdirs/usercache/hadoop/appcache/application_1387575550242_0001/filecache/10/1.sh" "script.sh"
exec /bin/bash -c "/bin/bash script.sh 1>>stdout.txt 2>>stderr.txt"


OK, we can see lots of stuff. Most lines of the script just define environment variables. A few lines ("ln -sf ...") create symbolic links from the filecache into the current working directory. All these steps prepare for the last command:



exec /bin/bash -c "/bin/bash script.sh 1>>stdout.txt 2>>stderr.txt"


This time our own script "script.sh" is invoked in the current working directory. Inside it we can use the previously defined environment variables such as "LOG_DIRS", just like the original example application "DistributedShell" from hadoop-2.2.0 does.

Since all LocalResource files are also linked into the current directory, our script can operate on these files directly.

#!/bin/bash
echo "Running"
cat A_R1.fq A_R2.fq > A.fq
echo "Moving results back"
rsync -arvue ssh * hadoop@master:/somewhere/to/store/outputs/
echo "Done"






Tuesday, December 17, 2013

The DistributedShell in hadoop 2.0, a.k.a. YARN

[update: 01/05/2014]

The holiday season is over. I finally have some time to make an update and add the links to the code. It is just a prototype for now; lots of things are still under development and testing.

Here are the links to the Client & ApplicationMaster:

https://github.com/home9464/BioinformaticsHadoop/blob/master/Client.java

https://github.com/home9464/BioinformaticsHadoop/blob/master/ApplicationMaster.java



***********************************************
The distributedshell example application in Hadoop release 2.2.0 is a good starting point for playing with YARN. I spent some time over the past few days modifying this application to run a simple BWA alignment job.


In the original application, the parameter "-shell_command" specifies the command you want to run, so you can only run a one-liner like "cat && mouse && dog".


$hadoop jar hadoop-yarn-applications-distributedshell-2.2.0.jar \
org.apache.hadoop.yarn.applications.distributedshell.Client \
-jar hadoop-yarn-applications-distributedshell-2.2.0.jar \
-shell_command '/bin/date' -num_containers 2


For a simple BWA alignment, we need to do:

$/PATH/TO/BWA/bwa aln /PATH/TO/GENOME_INDEX A_1.fastq > A_1.sai
$/PATH/TO/BWA/bwa aln /PATH/TO/GENOME_INDEX A_2.fastq > A_2.sai
$/PATH/TO/BWA/bwa sampe /PATH/TO/GENOME_INDEX A_1.sai A_2.sai A_1.fastq A_2.fastq > A.sam


Apparently we need to put all these commands into a script, e.g. "1.sh".

Since this script is going to be executed on the slave nodes, we have three problems:


  1. What is the full path to the BWA binary executable (bwa) on the slave nodes?
  2. What is the full path to the indexed genome files for BWA (hg19.*) on the slave nodes?
  3. What is the full path to the raw sequencing files (*.fastq) on the slave nodes?


For the 1st and 2nd problems: because we expect to re-use the BWA application and the indexed genomes many times, we should keep them on each slave node and, most importantly, under exactly the same full paths:

/home/hadoop/app/bwa-0.7.5/bwa
/home/hadoop/data/bwa/hg19/hg19.*


To do so we can create the directories on each slave node manually:

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/app/"; done

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/data/"; done


Also we need to set up a working directory "job" on each node to hold output files like the "*.sai" files and the SAM file.

$for i in $(cat hadoop-2.2.0/etc/hadoop/slaves); do ssh hadoop@$i "mkdir -p /home/hadoop/job/"; done

Then sync the BWA app and the indexed genome files to each node:

$rsync -arvue ssh bwa-0.7.5 hadoop@slave1:~/app
$rsync -arvue ssh hg19 hadoop@slave1:~/data


The shell script to be executed will look like:

/home/hadoop/app/bwa-0.7.5/bwa aln /home/hadoop/data/bwa/hg19/hg19 A_1.fastq > /home/hadoop/job/A_1.sai

/home/hadoop/app/bwa-0.7.5/bwa aln /home/hadoop/data/bwa/hg19/hg19 A_2.fastq > /home/hadoop/job/A_2.sai

/home/hadoop/app/bwa-0.7.5/bwa sampe ... > /home/hadoop/job/A.sam


With all that said, we can now make some changes to the demo code. The new command will look like:

$hadoop jar ... -shell_script 1.sh -num_containers 2 -container_memory 8192








Friday, December 6, 2013

Back to work today

After refreshing myself with a 3-week vacation.

Then I learned some news this morning -

After FDA's action, it looks like "gene based disease prediction" still has a long way to go in real health care, let alone the "personalized medicine" that has been imagined for so long.

Monday, October 21, 2013

Implement a hadoop pipeline for NGS data analysis - 4

"Align Fastq to ReferenceGenome and get Alignment"

This is always the first step in an NGS data analysis, assuming we have already done QC on the "Fastq" files. In this command, both "Fastq" and "ReferenceGenome" are inputs, and "Alignment" is the output. The key difference between the two types of input is that every job has its own "Fastq", while many similar jobs reuse the same "ReferenceGenome" (and the application "Align" itself).

For NGS data analysis, we always have some pre-existing files, including:

1. All bioinformatics applications and packages used in the data analysis.

2. All associated data files used by the above applications, which include:

  1. Reference sequence file (and index) 
  2. Indexed reference sequence file for different aligners
  3. Databases like dbSNP, COSMIC, UCSC Gene Table, etc.
  4. Other pre-defined common resource

The total size of these applications and data files can reach tens of GB, depending on how many reference genomes we are going to support in our pipeline.

To achieve the best performance we need to keep a copy of all apps and data on each slave node. We do NOT want to copy tens of GB of "immutable" apps and library files from one place to all slave nodes for every job, and it is very important to keep the apps and library files consistent across all nodes. In other words, we MUST guarantee everything is the same everywhere.

The idea is simple: we maintain a single copy of all bioinformatics applications and library files on the master name node, or another dedicated node, as long as that node can reach all slave nodes with passwordless ssh. Then, before accepting EVERY new job, we synchronize all slave nodes so that each has an identical copy.

Why can we NOT synchronize the slave nodes at a fixed interval instead? Because changes to the backend library files could cause inconsistency while a job is in progress. So, before we pop a job from the job queue and push it to hadoop, we synchronize all slave nodes first.

OK then, why not do the synchronization only once, when we start our framework? Because we need to update applications and library files while the service is running; if we deploy this service 24*7 we need to update these things on-the-fly. Luckily this does not take much time, as most of the work is just comparing file timestamps. No real data transfer happens unless we add/update/remove data on the name node, and even then the transfer is incremental. It is quite simple to get all this done - "rsync" is our friend.
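The per-job synchronization described above boils down to one loop over the slaves file. A minimal sketch (the paths and the slaves file location are assumptions based on the other posts in this series); the rsync commands are only echoed here so they can be inspected before being run for real:

```shell
#!/bin/bash
# sync apps and data to every slave before popping the next job;
# drop the `echo` to actually run the transfers
SLAVES=${SLAVES:-hadoop-2.2.0/etc/hadoop/slaves}
for host in $(cat "$SLAVES"); do
    echo rsync -arvue ssh /home/hadoop/tools/ "hadoop@$host:/hadoop/app/"
    echo rsync -arvue ssh /home/hadoop/data/ "hadoop@$host:/hadoop/data/"
done
```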



For simplicity, we use the name node as the central repository. We have already created the indexed genome hg19 for BWA and put it under "/home/hadoop/data/hg19/bwa". We also put the BWA binary executable under "/home/hadoop/tools/bwa" on the name node.

Now we need to transfer all these files to all slave nodes. There is a script named "slaves.sh" in hadoop's distribution package, which is convenient for executing a command on all slave nodes. However, this synchronization should eventually be done by our application; for demonstration, we can just do it manually to learn the process.

#master is name node.
#n1, n2 and n3 are slave nodes.


#on slave node 
hadoop@n1$ mkdir -p /hadoop/data
hadoop@n1$ mkdir -p /hadoop/app


#on master node 
hadoop@master$ rsync -arvue ssh /home/hadoop/tools/bwa hadoop@n1:/hadoop/app/


hadoop@master$ rsync -arvue ssh /home/hadoop/data/hg19/bwa hadoop@n1:/hadoop/data/



Thursday, October 17, 2013

Implement a hadoop pipeline for NGS data analysis - 3

The baby step.

The WordCount example in the hadoop distribution is an excellent demonstration of how to implement a simple MapReduce application. About 7 years ago, when I read Google's MapReduce paper, I realized it would be an effective solution for many problems in bioinformatics, so I developed a lightweight MapReduce application, ABCGrid (http://abcgrid.cbi.pku.edu.cn/), mainly for BLAST-like sequence analysis.

The WordCount application counts the frequency of words: it reads the text file and tokenizes each line (Map stage), then partitions the results to the Reducers, which count each word (Reduce stage).
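As a rough analogy (not Hadoop code), the same flow can be sketched as a plain shell pipeline: tokenizing is the Map, sort is the shuffle/partition, and uniq -c is the Reduce:

```shell
# map: emit one word per line; shuffle: sort brings equal keys together;
# reduce: uniq -c counts each group
printf 'hadoop yarn hadoop\nmap reduce map\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c
```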

The input files for NGS will be FASTQ files, whose sizes range from hundreds of MB to a few GB depending on the platform, coverage, etc. We can use the FASTQ file name as the key, but first we need to split the FASTQ files. There are a few options:

1) Split the FASTQ manually and save the pieces as files.

  #count lines; suppose we get 1M lines
  $zcat A_1.fastq.gz | wc -l

  Factors to consider:

  1.1 NUM_LINES % 4 == 0
  Every 4 lines represent one read, so we must make sure the split size falls on a read boundary.

  1.2 Size of the raw fastq
  If the fastq is small, we do not need to bother splitting it into too many pieces, because of the overhead.

  1.3 Number of nodes in the cluster, N
  number of splits >= N

  1.4 HDFS block size, M
  size of each split <= M

  Suppose after some calculation we decide to split the big fastq into 100 pieces of 10,000 lines (2,500 reads) each:

  $zcat A_1.fastq.gz | split -l 10000 -a 4 -d - A_1.

  We also have to upload the pieces to HDFS manually:
  $hadoop fs -put A* /user/hdfs/suny/
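Rules 1.1-1.4 above can be turned into a mechanical calculation. A small sketch with made-up numbers (1M lines split into 100 pieces):

```shell
# toy numbers: suppose zcat A_1.fastq.gz | wc -l reported 1,000,000 lines
total_lines=1000000
pieces=100                  # chosen from the node-count / block-size rules

# rule 1.1: a FASTQ record is 4 lines, so the total must divide evenly
[ $(( total_lines % 4 )) -eq 0 ] || echo "truncated FASTQ?"

# lines per piece, rounded up to the next multiple of 4
lines_per_piece=$(( total_lines / pieces ))
lines_per_piece=$(( (lines_per_piece + 3) / 4 * 4 ))
echo "$lines_per_piece lines ($(( lines_per_piece / 4 )) reads) per piece"
```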
2) Split the FASTQ automatically inside the application and save the pieces as files. The split size is determined by the same set of rules as above, so the application has to do the calculation for us:

   2.1 public static int splitFastq(File input, File outputPath)
    {
        ...
        return 10000;
    }
   2.2 public static int copyFastq(File outputPath, Path inputHDFS)
    {
        FileUtils.copy(...);
    }

3) Split the FASTQ automatically inside the application, but do NOT save the pieces as files. Instead, feed the content to the Map function as values. To do this we need to derive our own classes from InputFormat and RecordReader.

Monday, October 14, 2013

Implement a hadoop pipeline for NGS data analysis - 2

The fundamental idea behind any big-data processing is "divide and conquer": we divide the big data into pieces small enough to process in parallel on many computers. The prerequisite is that the processing of these small pieces should be as independent of each other as possible, without any in-process communication.


Suppose a FASTA file has 10 million sequences and we want to run a BLAST search to find all good alignments against the human reference genome hg19.
The question is how we should "divide and conquer" this computation-intensive job. The first thing that comes to mind is dividing the FASTA file into many fragments. Let us assume the optimal size of a fragment is 10 sequences - in other words, we will have 1 million FASTA files, each containing ten sequences. If we submit this job to a cluster of 100 computation nodes (sometimes also called "slaves" or "workers"), then on average each node will process 10,000 fragments. In an ideal situation, we can expect a 100x speedup.

Another approach is to divide the reference data. If the query FASTA file is very small, say only 10 sequences, it does not make sense to split it because of the overhead. In this case we can divide (and index) the reference genome instead, as long as it is dividable. At the very least, the chromosome is a natural and safe unit for any dividing scheme.

Now, back in the NGS era, what file formats are involved in a typical DNASeq data analysis?


  1. FASTQ, sequencing read
  2. SAM/BAM, alignment
  3. BED, genomic regions
  4. VCF, variants






Thursday, October 10, 2013

Implement a hadoop pipeline for NGS data analysis - 1


Setup a hadoop cluster

1. Software


Windows8 + VMWare-Workstation9.0 + Ubuntu-13.04-x64-server + hadoop-2.1.0-beta


Name the first ubuntu installation "master"; this will be the master server (namenode + secondarynamenode),

then snapshot/clone 3 images from "master" and name them "n1", "n2" and "n3". These 3 images will be the slave nodes.


2. Hadoop configuration


Too many steps, so I will not list them one by one here.
Google is your friend. Do not forget to change /etc/hosts and /etc/hostname on n1, n2 and n3. Start hadoop and open a web browser to check whether it works: http://192.168.1.2:8088/cluster
(192.168.1.2 is the IP of my master server.)

You should be able to see 3 active nodes in the Cluster Metrics table.


Get all necessary apps and data


1. install the BWA package



  #get the code
  git clone https://github.com/lh3/bwa.git
  cd bwa

  #install two required libs
  sudo apt-get install gcc
  sudo apt-get install zlib1g-dev

  #compile
  make

  #delete other files, keep only the bwa executable
  find . ! -name "bwa" -exec rm -fr {} \;


2. Download some reference genome data - for testing purposes, we don't need all of it


  wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz
  wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr2.fa.gz
  zcat *.gz > hg19.fasta

3. Build the genome index for BWA

  bwa index -p hg19 hg19.fasta

########################################


#/etc/hosts
192.168.221.128 prime
192.168.221.133 n1
192.168.221.139 n2
192.168.221.140 n3

#/etc/hostname
prime

sudo apt-get install vim
sudo apt-get install openssh-server
sudo apt-get install screen
sudo apt-get install openjdk-7-jdk 
sudo apt-get install eclipse-platform

wget http://apache.cs.utah.edu/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
tar -zxvf hadoop-2.2.0.tar.gz

#.bash_profile
alias hj="hadoop jar"
alias hf="hadoop fs"
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export DISPLAY=192.168.1.133:0.0
export HADOOP_HOME=/home/hadoop/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/hadoop/hadoop-2.2.0/lib/native
export EC2_HOME=$HOME/tools/ec2-api-tools-1.6.11.0
export PATH=$PATH:/home/hadoop/hadoop-2.2.0/bin
export PATH=$PATH:/home/hadoop/hadoop-2.2.0/sbin
export PATH=$PATH:/home/hadoop/tools/bwa:/home/hadoop/tools/snap-0.15.4-linux
export PATH=$PATH:$HOME/tools/ec2-api-tools-1.6.11.0/bin


#hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

#core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://prime:50011/</value>
    <final>true</final>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/dfstmp</value>
  </property>
</configuration>






#hdfs-site.xml

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/dfsname</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/dfsdata</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>prime:50012</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>





#yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/home/hadoop/logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>prime:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>prime:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>prime:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>prime:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>prime:8088</value>
  </property>
</configuration>



#slaves
n1
n2
n3

Sunday, September 8, 2013

hadoop-yarn configuration



1. core-site.xml

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://192.168.221.132:50000</value>
</property>
<property>
  <name>dfs.namenode.rpc-address</name>
  <value>192.168.221.132:54310</value>
</property>
<property>
  <name>dfs.namenode.http-address</name>
  <value>192.168.221.132:50070</value>
</property>

2. hdfs-site.xml

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/hadoop/dfsname</value>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>192.168.221.132:50001</value>
</property>

3. yarn-site.xml

<property>
  <name>yarn.resourcemanager.address</name>
  <value>192.168.221.132:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>192.168.221.132:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>192.168.221.132:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>192.168.221.132:8033</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>192.168.221.132:8088</value>
</property>

4. mapred-site.xml

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>


Sampling (subsetting) NA12878

Based on the discussion from http://www.biostars.org/p/6544/

For testing purposes, we need 10x, 20x and 40x WGS coverage subsets of sample NA12878 from 1KG.
The raw fastq has ~643 million read pairs, which is equivalent to ~43x WGS coverage.


 $ll -h
 -rw-r--r-- 1 user group 54555856768 Apr 27 13:58 ERR262997_1.fastq.gz
 -rw-r--r-- 1 user group 55794552881 Apr 27 13:58 ERR262997_2.fastq.gz

55 GB per fastq file! We need to unzip them anyway:

 $gunzip *.gz
 $ll -h
 -rw-r--r-- 1 user group 173198473181 Apr 27 13:58 ERR262997_1.fastq
 -rw-r--r-- 1 user group 173198473181 Apr 27 13:58 ERR262997_2.fastq


162GB each after unzipping. I am not sure I can handle something this big.

The read length is 101, while the genome size of hg19 is about 3,000,000,000.

To get 10x coverage, we need:



 $echo $(( (10*3000000000)/(101*2) ))
 148514851


About 148 million read pairs, drawn from both paired-end files. Here comes the last and critical step.
Again, I am not sure if it is possible to "paste" two 162GB files.
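The read-pair arithmetic above generalizes to a tiny helper (a sketch; 3,000,000,000 and 101 are the genome size and read length used in this post):

```shell
# read pairs needed for a target coverage:
#   pairs = coverage * genome_size / (read_length * 2)
pairs_for_coverage() {
    coverage=$1; genome_size=$2; read_len=$3
    echo $(( (coverage * genome_size) / (read_len * 2) ))
}

pairs_for_coverage 10 3000000000 101   # 10x coverage
```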



for i in 10 20 40
do
num_reads=$(( (${i}*3000000000)/(101*2) ))
paste ERR262997_1.fastq ERR262997_2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf | head -n ${num_reads} | sed 's/\t\t/\n/g' | awk -v f=ERR262997_${i}x_1.fastq -v r=ERR262997_${i}x_2.fastq '{print $1 > f; print $2 > r}'
done

(Note that the output file names are passed with -v: shell variables do not expand inside a single-quoted awk program.)


Now it is time to cross fingers ...

[update]

It does not work. This morning I got this error:

"shuf: memory exhausted".

I have to split the 162GB fastq files first. The number of lines is 2572389100; divided by 20 that gives 128619455. But that is not a multiple of 4, so to keep every split on a read boundary we should round it up to 128619456.



$splitSize=128619456
$split -l $splitSize -a 4 -d ERR262997_1.fastq output/ERR262997_1.
$split -l $splitSize -a 4 -d ERR262997_2.fastq output/ERR262997_2.

For each pair of file fragments, apply the script above, then merge/cat the results into one fastq.

################################################

mkdir output sample final
splitSize=128619456

split -l $splitSize -a 4 -d ERR262997_1.fastq output/ERR262997_1.
split -l $splitSize -a 4 -d ERR262997_2.fastq output/ERR262997_2.

for i in 10 20 40
do
num_reads=$(( (${i}*3000000000)/(101*2*20) ))

for j in $(seq 0 19)
do
k=$(printf %04d $j)

x=sample/ERR262997_${i}x_1.${k}
y=sample/ERR262997_${i}x_2.${k}
echo $x $y
paste output/ERR262997_1.${k} output/ERR262997_2.${k} | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf | head -n ${num_reads} | sed 's/\t\t/\n/g' | awk -v f=$x -v r=$y '{print $1 > f; print $2 > r}'
done

cat sample/ERR262997_${i}x_1.* > final/ERR262997_${i}x_1.fastq
cat sample/ERR262997_${i}x_2.* > final/ERR262997_${i}x_2.fastq

done





Tuesday, August 13, 2013

Hard trimming FASTQ file

When we plot the base quality distribution (for example, using fastqc), we often observe that the base qualities at the 3' end are pretty poor. Sometimes the qualities at the 5' end are poor too.

For better alignment and variant calling (if any), we may need to trim the fastq files, keeping only the bases in the middle cycles, which generally have good and consistent quality.

One method is trimming by base quality, which can be done with tools like sickle. Or we can simply hard-trim all reads, which means removing N tail (or head) bases and the corresponding quality values.

A typical FASTQ record:



@READ_NAME
NGGAAATGGCGTCTGGCGGCGAGATAATGG
+
#1=DDFFFHGHHHIIGIIJJJJJJIIJGJIIHFDD?BDB

Let us say we have paired-end fastq files "A_1.fastq.gz" and "A_2.fastq.gz".

#trim the 10 tail bases
zcat A_1.fastq.gz | awk --posix '{ if (NR % 2 == 0) { sub(/.{10}$/,""); print} else {print}}' | gzip > A_1.fq.gz

#trim the 10 head bases
zcat A_1.fastq.gz | awk --posix '{ if (NR % 2 == 0) { sub(/^.{10}/,""); print} else {print}}' | gzip > A_1.fq.gz

#trim the 10 head bases and 10 tail bases
zcat A_1.fastq.gz | awk --posix '{ if (NR % 2 == 0) { sub(/^.{10}/,""); sub(/.{10}$/,""); print} else {print}}' | gzip > A_1.fq.gz

Monday, August 5, 2013

Create some artificial SNPs in a reference genome



import random

import numpy

random.seed(13115)

def iter_chromosome(fasta):
    """Yield (header, sequence) pairs from a FASTA file."""
    seq = []
    chrom = ''
    for line in open(fasta):
        line = line.strip()
        if line.startswith('>'):
            if seq:
                yield chrom, ''.join(seq)
                seq = []
            chrom = line
        else:
            seq.append(line)
    yield chrom, ''.join(seq)

def make_snp(ref):
    """Return a random base different from ref."""
    bases = ['A', 'G', 'C', 'T']
    bases.remove(ref)
    return random.choice(bases)

def generate_SNP(fasta, interval=1250):
    bases = set('AGCT')
    snp = []
    fsnp = open('snp.txt', 'w')
    for chrom, seq in iter_chromosome(fasta):
        dup = list(seq)
        N = len(dup)
        num_required = N // interval
        # draw the distances between adjacent SNPs from a normal distribution
        pos = [int(i) for i in numpy.random.normal(interval, interval / 5, num_required)]
        index = 0
        for p in pos:
            index += p
            try:
                ref = dup[index - 1]
                if ref in bases:
                    alt = make_snp(ref)
                    dup[index - 1] = alt
                    snp.append('\t'.join((chrom.lstrip('>'), str(index), ref, alt)))
            except IndexError:
                break
        # print the mutated chromosome (redirect stdout to get the new FASTA)
        print(chrom)
        for i in range(0, len(dup), 50):
            print(''.join(dup[i:i + 50]))
    fsnp.write('\n'.join(snp) + '\n')
    fsnp.close()

fasta = 'hg19.fa'
generate_SNP(fasta, 250)



Split the BAM into smaller BAM files by chromosome (then use hadoop to accelerate the downstream analysis)

Suppose we have a single BAM file "x.sort.bam", the sorted output of an aligner like BWA. To call variants we still need many steps, and executing them one by one on a single server takes a long time. To speed up the downstream analysis, we can "divide and conquer" the BAM file by chromosome - this is parallel processing.

for i in `samtools view -H x.sort.bam | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
do

#remove unmapped and low-quality reads; output one BAM per chromosome
samtools view -bh -F 0x4 -q 1 -o x.$i.bam x.sort.bam $i

#index
samtools index x.$i.bam


#all other steps
java -jar GenomeAnalysisTK.jar \
-R hg19.fasta \
-T BaseRecalibrator \
-knownSites ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.norm.vcf \
-knownSites Mills_and_1000G_gold_standard.indels.b37.sites.norm.vcf \
-knownSites 1000G_phase1.indels.b37.norm.vcf \
-I x.$i.bam \
-o x.$i.grp

java -jar GenomeAnalysisTK.jar \
-R hg19.fasta \
-T PrintReads \
-BQSR x.$i.grp \
-I x.$i.bam \
-o x.$i.recal.bam

...

java -jar GenomeAnalysisTK.jar \
-R hg19.fasta \
-T UnifiedGenotyper --genotype_likelihoods_model BOTH \
-I x.$i.bam -rf BadCigar -dcov 500 -o x.$i.gatk.vcf
done

After indexing the split BAMs, we can let hadoop take over the BAMs and run the remaining steps on the cluster.

for i in `samtools view -H x.sort.bam | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
do

#remove unmapped and low-quality reads; output one BAM per chromosome
samtools view -bh -F 0x4 -q 1 -o x.$i.bam x.sort.bam $i

#index
samtools index x.$i.bam

done

#upload data into hadoop
hadoop fs -put *chr*.ba? $hdfsInput

#run hadoop
hadoop jar xxx.jar MyPipeline $hdfsInput $hdfsOutput ...

Tuesday, May 28, 2013

Setting up NFS


Server: 192.168.52.130
Client: 192.168.52.133

#server
sudo su -
apt-get install nfs-kernel-server rpcbind

#server - create a NFS directory
mkdir /var/nfs/

#server - change owner
chown nobody:nogroup /var/nfs

#server 
vim /etc/exports

/var/nfs        192.168.52.133(rw,sync,no_subtree_check)

#server
exportfs -a

#server
service nfs-kernel-server start 


#client
apt-get install nfs-common rpcbind

#client - local directory for NFS
mkdir -p /var/nfs

#mount NFS from Server to local NFS directory
mount 192.168.52.130:/var/nfs /var/nfs

#check out
df -h

Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/A--vg-root    18G  1.5G   15G   9% /
none                     4.0K     0  4.0K   0% /sys/fs/cgroup
udev                     989M  4.0K  989M   1% /dev
tmpfs                    200M  332K  200M   1% /run
none                     5.0M     0  5.0M   0% /run/lock
none                     999M     0  999M   0% /run/shm
none                     100M     0  100M   0% /run/user
/dev/sda1                228M   30M  187M  14% /boot
192.168.52.130:/var/nfs   18G  969M   16G   6% /var/nfs


#client
vim /etc/fstab

192.168.52.130:/var/nfs  /var/nfs   nfs     auto,noatime,nolock,bg,nfsvers=3,intr,tcp,actimeo=1800 0 0

Thursday, March 21, 2013

Infinite processing time for novoalign

A few days ago a user reported that his simple "@align" job had been running for more than 5 days on the cluster.
That is absolutely abnormal. After a few hours of work I located the 3 causes of this problem.

The first two causes were corrupted input FASTQ files, resulting in incorrect format and size.
The last was novoalign itself - novoalign processes corrupted FASTQ files indefinitely without giving any error message.

Until novoalign adds a feature for validating its input files, we can do a simple validation ourselves.

For paired-end files, we can first compare their sizes (number of reads):

$zcat X1.fq.gz | wc -l

$zcat X2.fq.gz | wc -l  


For a single-end file, check whether the line count is divisible by 4:

$expr `zcat X.fq.gz | wc -l` % 4  
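Both checks can be wrapped into a small validation helper (a sketch; it works on uncompressed files for brevity - pipe through zcat first for .gz inputs):

```shell
# fastq_ok FILE: succeed if the line count is a multiple of 4
fastq_ok() {
    lines=$(wc -l < "$1")
    [ $(( lines % 4 )) -eq 0 ]
}

# pair_ok FILE1 FILE2: succeed if both files are valid and equally sized
pair_ok() {
    fastq_ok "$1" && fastq_ok "$2" &&
        [ "$(wc -l < "$1")" -eq "$(wc -l < "$2")" ]
}
```

Usage: `pair_ok X1.fq X2.fq || echo "corrupted or mismatched input"` before submitting the alignment job.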

TCGA and COSMIC database for annotating mutations

TCGA - The Cancer Genome Atlas: https://tcga-data.nci.nih.gov/tcga/
COSMIC: http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/download.html

Thursday, February 28, 2013

Tomato2 will be released soon

With two months of work, the new Tomato framework is almost finished. This version of Tomato was designed for the Utah Genome Project with some new and important features. Now I am running extensive tests before the release, which is expected within one week. I am still working on the tutorial and wiki pages. http://bioserver.hci.utah.edu/BioInfo/index.php/Software:Tomato2

Tuesday, January 15, 2013

1000g CEU population raw data extraction

Recently I needed to use the variants from 1000G's CEU population as a general background (control). One way is to directly download the pre-identified variants from the ftp site. A better way is to process the CEU raw sequencing data together with our case samples through our own pipelines, keeping everything (apps and parameters) the same.


#ped file has the sample name for CEU population


wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20111108_samples_pedigree/G1K_samples_20111108.ped


#based on the sample info, download the raw sequencing data from 1000g's ftp, save under subdir "CEU_samples"


awk '$2 ~ /^NA/ {print $2}' G1K_samples_20111108.ped | sort | uniq | xargs -I sample_name wget -r -nH --cut-dirs=3 -P CEU_samples "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/"sample_name"/sequence_read" || true


#it will take a while and lots of disk space...


Wednesday, January 2, 2013

EUR_AF>0.05 from the 1000G release with 1092 genomes

#1. download 

$wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz

#2. unzip

$gunzip ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz

#3. size?

$wc -l ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf

39706744


#4. Get the variants whose AF>=0.05 in EUR population

$grep -o ".*EUR_AF=\([0-1]\.[0-9]*\)" ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf | awk -F'EUR_AF=' '{if ($NF >= 0.05) print}' > body.vcf

#5. Header

$head -n 100 ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf | grep '^#' > head.txt

#6. cat

$cat head.txt body.vcf > 1000g.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.EUR_AF_0.05.vcf