Next Generation Sequencing and Data Analysis: March 2014

##########################################
# Input Data, from the 1000G NA12878 RNASeq
##########################################
FASTQ1=ERR356372_1.fastq.gz
FASTQ2=ERR356372_2.fastq.gz

#######################################
# deFuse
#######################################
#you have to install dependencies include "samtools", "bowtie", "blat", "faToTwoBits", "R"
#(check out if library "ada" was installed) and "gmap"

#download app
#http://sourceforge.net/apps/mediawiki/defuse/index.php?title=DeFuse_Version_0.6.1

#path
DEFUSE=/home/ysun/app/defuse-0.6.1/scripts

#download lib data
cd /home/ysun/data/
$DEFUSE/create_reference_dataset.pl -c defuse_cbbl.config

#the default create_reference_dataset.pl has one bug on creating the gmap indicies.
#you need to remove "$gmap_index_directory" in "make" command from "sub #create_gmap_indices".
#In my system , the total size of all lib files is 48GB.

#unzip the FASTQ. deFuse will not work on zipped FASTQ?
gunzip *.fastq.gz

#let us start with a small dataset with 100K reads for testing

head -n 400000 ERR356372_1.fastq > A_1.fastq
head -n 400000 ERR356372_2.fastq > A_2.fastq

#now start the testing job
$DEFUSE/defuse.pl -c defuse_cbbl.config -1 A_1.fastq -2 A_2.fastq -o output_defuse

#if everything goes well you will see outputs like this:
Importing fastq files
Splitting fastq files
Discordant alignments
Read Stats
Fragment mean 173.540675389827 stddev 59.0075428839599
Read length min 49 max 49
Generating discordant alignment clusters
Remove mitochondrial-genomic clusters
Generating maximum parsimony solution
Selecting fusion clusters
Preparing sequences for local realignment
Performing local realignment
Filtering concordant clusters
Generating spanning alignment regions file
Initializing split read alignments
Calculating split read alignments
Evaluating split reads
Calculating spanning stats
Calculating spanning p-values
Calculating split read pvalues
Creating fastas
Splitting fastas
Breakpoint alignments
Annotating fusions
Coallating fusions
Running adaboost classifier
Filtering fusions
Success

#now let us go full scale

$time $DEFUSE/defuse.pl -c defuse_cbbl.config -1 ERR356372_1.fastq -2 ERR356372_2.fastq -o output_defuse

real 1703m53.046s
user 1642m55.346s
sys 83m2.979s

#Kind of slow

The NA12878 RNASeq dataset is very small with 18 million reads and read length is 49. A rough calculation indicates the average coverage is ~60x. Even so, it took more than one day to get it done!

#What's next
Have a glimpse on the result on deFuse.
Repeat the analysis with Tophat-Fusion and FusionMap.

Supposedly I want to declare a closure nested method in JavaScript like this:

###Sample1
//declaration
function addInternal(x)
{
//return anonymous nested function
return function(y) { console.log(x+y);};
}

//use
var addByTen = addInternal(10);
addByTen(15);

.

or wrap it up as a variable

###Sample2, assign the parameter to outer function addInternal in declaration
var addByTen = (function addInternal(x)
{
//return anonymous nested function
return function(y){console.log(x+y);};
})(10);

//use
addByTen(15);

The corresponding CoffeeScript version is very tight:

###sample1
addInternal = (x) ->
(y) -> console.log x+y

addByTen = addInternal 10
addByTen 15

###sample2
addByTen = ((x) ->
(y) -> console.log x+y ) (10)

addByTen 15

Next Generation Sequencing and Data Analysis

Friday, March 28, 2014

Gene Fusion Detection - deFuse

Learning CoffeeScript

About Me

Blog Archive