Next Generation Sequencing and Data Analysis: February 2014

Saturday, February 15, 2014

Goal

Compare BWA to SNAP

Approach

Preparation

#app

$wget http://snap.cs.berkeley.edu/downloads/snap-0.15.4-linux.tar.gz

#data, simulated from hg19-chr20, paired, 30x average coverage

$wgsim -N 10000000 -r 0.01 -1 100 -2 100 -S11 -e0 chr20.fa A1.fq A2.fq > mut.txt
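As a quick sanity check on the 30x figure, the expected coverage follows from the wgsim parameters: number of read pairs times bases per pair, divided by the reference length. This is only a sketch; it assumes an hg19 chr20 length of 63,025,520 bp, which is not stated in the post, and the function name is illustrative.

```python
def expected_coverage(n_pairs, read1_len, read2_len, ref_len):
    """Average depth = total sequenced bases / reference length."""
    return n_pairs * (read1_len + read2_len) / ref_len

# Assumption: hg19 chr20 is 63,025,520 bp long (not stated in the post).
CHR20_LEN = 63025520

# Values taken from the wgsim command: -N 10000000 -1 100 -2 100
coverage = expected_coverage(10000000, 100, 100, CHR20_LEN)
print("expected coverage: %.1fx" % coverage)
```

Under that assumption the result lands a little above 30x, which agrees with the "30x average coverage" comment.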

Running Time

BWA	SNAP
30m20s	20m01s
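To compare the two timings directly, they can be converted to seconds and divided. A minimal sketch, assuming the times are always written as NNmNNs; the helper name to_seconds is illustrative.

```python
import re

def to_seconds(text):
    """Convert a string such as '30m20s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError("unexpected time format: %r" % text)
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)

bwa = to_seconds("30m20s")    # BWA wall time from the table above
snap = to_seconds("20m01s")   # SNAP wall time from the table above
speedup = bwa / snap
print("BWA %ds, SNAP %ds, speedup %.2fx" % (bwa, snap, speedup))
```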

Mapping metrics

Variants Concordance

Impression (so far)

1) SNAP is 1.5x faster than BWA (memory consumption was not evaluated)

2) SNAP produced more true-positive (TP) and fewer false-positive (FP) alignments than BWA

3) SNAP generated more TP (8512 vs. 740) and FP (2715 vs. 596) variants than BWA
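Raw TP and FP counts are easier to compare as precision, TP / (TP + FP). The sketch below uses only the four numbers quoted in point 3. Recall cannot be computed from them because the number of missed variants (FN) is not given.

```python
def precision(tp, fp):
    """Fraction of called variants that are true positives."""
    return tp / (tp + fp)

# (TP, FP) variant counts quoted in point 3 above
calls = {"SNAP": (8512, 2715), "BWA": (740, 596)}

for aligner, (tp, fp) in calls.items():
    print("%s: TP=%d FP=%d precision=%.3f" % (aligner, tp, fp, precision(tp, fp)))
```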

STAR

#download app
$wget https://rna-star.googlecode.com/files/STAR_2.3.0e.Linux_x86_64.tgz

$tar -zxvf STAR_2.3.0e.Linux_x86_64.tgz

$cd STAR_2.3.0e.Linux_x86_64/

#splice junction data
$wget ftp://ftp2.cshl.edu/gingeraslab/tracks/STARrelease/STARgenomes/SpliceJunctionDatabases/gencode.v14.annotation.gtf.sjdb

#create the dir for index genome
$mkdir hg19

#generate index genome with splice junction annotations
$./STAR --runMode genomeGenerate --genomeDir hg19 --genomeFastaFiles /projects/confidential_sequence/home/sun/data/hg19/hg19.fa --runThreadN 4 --sjdbFileChrStartEnd gencode.v14.annotation.gtf.sjdb --sjdbOverhang 99

$mv hg19 ~/data/star_hg19

$cd ~

$mkdir -p test/STAR; cd ~/test/STAR

#full paths to the input files do NOT work. instead, create soft links in the local folder.
$ln -s .
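The soft-link workaround can also be scripted. A minimal sketch, assuming the reads sit in some other directory: link_inputs and its arguments are hypothetical names, and the default pattern simply mirrors the *.fastq.gz glob used in the run command.

```python
import glob
import os

def link_inputs(src_dir, dest_dir, pattern="*.fastq.gz"):
    """Create a soft link in dest_dir for every file in src_dir matching pattern."""
    linked = []
    for path in sorted(glob.glob(os.path.join(src_dir, pattern))):
        link = os.path.join(dest_dir, os.path.basename(path))
        # Skip names that already exist so the function can be re-run safely.
        if not os.path.lexists(link):
            os.symlink(os.path.abspath(path), link)
        linked.append(link)
    return linked
```

Calling link_inputs with the directory that holds the reads and "." as the destination would leave one link per FASTQ file in the working folder.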

#run
$time STAR_2.3.0e.Linux_x86_64/STAR --genomeDir data/star_hg19 --readFilesIn *.fastq.gz --outFileNamePrefix OUTPUT. --runThreadN 10 --readFilesCommand zcat 1>std.txt 2>err.txt

The output is a SAM file named "OUTPUT.Aligned.out.sam".
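A quick way to look at the result is to count mapped and unmapped records from the FLAG column of the SAM file. This is a rough sketch, not a replacement for a proper tool: it counts alignment records rather than unique reads, and count_mapped is an illustrative name.

```python
def count_mapped(sam_lines):
    """Return (mapped, unmapped) record counts from an iterable of SAM lines."""
    mapped = unmapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])   # FLAG is the second SAM column
        if flag & 0x4:                    # bit 0x4 set means "segment unmapped"
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped
```

For example, opening "OUTPUT.Aligned.out.sam" and passing the file handle to count_mapped returns the two counts as a tuple.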

STAR builds its index as an uncompressed suffix array, so it trades a lot of memory (>30GB) for speed.

Monday, February 10, 2014

Node.js is fun

I have been playing with Node.js for a few days and really love it. I am trying to build a prototype website with workflows.

Use case:

log into the system
upload a dataset in a supported format (fastq, sam/bam, vcf, bed, etc.)
describe the dataset
choose components/nodes; each node is an independent operation
connect components with edges as DAG (directed acyclic graph)
fine tune each component if necessary (add/remove/change parameter settings)
execute the flow
monitor the progress (check, terminate, pause)
check out the output of each component and last result
save the flow for future use, share, publish.
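The "execute the flow" step boils down to running the components in an order that respects the edges of the DAG. A minimal sketch of that ordering using Kahn's algorithm is shown here; the component names are made up for illustration and nothing in it is taken from NoFlo.

```python
from collections import deque

def execution_order(nodes, edges):
    """Kahn's algorithm: order nodes so every edge (src, dst) has src before dst."""
    indegree = {node: 0 for node in nodes}
    children = {node: [] for node in nodes}
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
    # Components with no unfinished inputs are ready to run.
    ready = deque(node for node in nodes if indegree[node] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(nodes):
        raise ValueError("the flow contains a cycle, so it is not a DAG")
    return order

# Hypothetical flow: these component names are invented for the example.
nodes = ["upload_fastq", "align", "sort_bam", "call_variants", "report"]
edges = [("upload_fastq", "align"), ("align", "sort_bam"),
         ("sort_bam", "call_variants"), ("sort_bam", "report"),
         ("call_variants", "report")]
print(execution_order(nodes, edges))
```

A cycle check comes for free: if some components never become ready, the flow is not acyclic and should be rejected before anything runs.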

#download the NoFlo.js flow demo
$git clone https://github.com/noflo/dataflow-noflo.git flow

$cd flow

$npm install

$grunt build

#start a simple http server to serve the contents
#this is the Python 2 module; on Python 3 use: python3 -m http.server
$python -m SimpleHTTPServer

#You may see a log message like this:
#Serving HTTP on 0.0.0.0 port 8000

Now start a browser such as Chrome and go to

"192.168.1.2:8000/demo/"

#192.168.1.2 is my IP address. YMMV.

You will see a very nice dataflow graph like this:

You can add/delete/drag/move nodes and edges. Cool!
