Next Generation Sequencing and Data Analysis: August 2010

Tuesday, August 31, 2010

Design of the pipeline

Message queue centralized system and coupled with independent I/O Channel

Friday, August 27, 2010

Tuesday, August 24, 2010

#alignment
novoalign -F ILMFQ -t120 -r0.2 -q5 -d hg19Splices34.index -f 1.txt 2.txt > 84060.novoalign

#transform to MAQ
maq novo2maq 84060.novoalign.map - 84060.novoalign

#Convert the reference sequences to the binary fasta format
maq fasta2bfa ref.fasta ref.bfa

#Build the mapping assembly
maq assemble consensus.cns ref.bfa 84060.novoalign.map 2>assemble.log

#Extract consensus sequences and qualities
maq cns2fq consensus.cns >cns.fq

#Extract list of SNPs
maq cns2snp consensus.cns >cns.snp

will be translated to :

/PATH_TO_NONOALIGN/novoalign -F ILMFQ -t120 -r0.2 -q5 -d /PATH_TO_GENOME_INDEX/hg19Splices34.index -f /PATH_TO_INPUT_FILE/1.txt /PATH_TO_INPUT_FILE/2.txt > /PATH_TO_OUTPUT_FILE/1_84060.novoalign
/PATH_TO_MAQ/maq novo2maq /PATH_TO_OUTPUT_FILE/84060.novoalign.map - /PATH_TO_INPUT_FILE/84060.novoalign
...

Monday, August 23, 2010

query, download, upload and update with rsync


#!/home/ying/py2.6.5/bin/python
"""
Use rsync to synchronize the DATA_HOME(hci-bio2, which raw data was stored) and COMP_HOME(chpc-node, 
which process raw data and return output back to source. All the rsync operations are initiated from chpc-node.

Functions:
1. query: query the source if the new data is available
2. download: download new data ( from source to destination)
3. upload: upload result(output) to source.
4. update: update the source by appending a log file from the destination to show the alignment progress on chpc. "begin, percent-finished, end, error, etc"
      
"""
import os
import time
import operator
import subprocess

DATA_HOME = 'ying@123.123.123.123:~/alignment/' 
COMP_HOME = '/home/ying/alignment/' 
RSYNC_QUERY_PARAMS = '-are ssh'
RYSNC_DOWNLOAD_PARAMS = '-avuze ssh' #archive, verbose, update-only, compress-during-transfer and trasnfer-over-ssh  
RYSNC_UPLOAD_PARAMS = '-avuze ssh' #archive, verbose, update-only, compress-during-transfer and trasnfer-over-ssh  
LOG_FILE_NAME = 'log.txt'

def query():
    #query_cmd = """rsync -are ssh ying@155.100.235.73:~/alignment/ | awk '{print $1 " " $3"#"$4 " " $5}'"""
    cmd = ["rsync",RSYNC_QUERY_PARAMS,DATA_HOME,'|',"""awk '{print $1 " " $3"#"$4 " " $5}'"""]
    query_cmd = ' '.join(cmd)
    #print query_cmd
    p = subprocess.Popen(query_cmd,shell=True,stdout=subprocess.PIPE)
    line = p.communicate()
    fnames = ''.join(line[:-1]).strip().split('\n')
    df = {}
    dt = {}
    for fname in fnames:
        prop,ctime,fn = fname.split(' ')
        if fn != '.' and fn != '..':
            if prop[0] == 'd': #it is a directory
                pass
            if prop[0] == '-': #it is a file
                parent_path = os.path.dirname(fn)
                df.setdefault(parent_path,[]).append(os.path.basename(fn))
                #Transform time to seconds, for comparing which job should be run firstly. First In First Run.   
                #'2010/08/20#16:27:40' -> 1282343260.0
                dt[parent_path] = int(time.mktime(time.strptime(ctime,'%Y/%m/%d#%H:%M:%S'))) 
    jobs = []            
    for k,v in df.items():
        if "begin" in v: #the data is ready
            if not LOG_FILE_NAME in v: #this job is not running
                jobs.append((k,dt[k]))
    #print jobs       
    return sorted(jobs,key=operator.itemgetter(1))

def download(path_to_download):
    """download a whole directory from DATA_HOME to COMP_HOME 
    """
    normalized_path = path_to_download.strip('/') #/A, /A/ or A/ -> A
    source = os.path.join(DATA_HOME,normalized_path)
    cmd = ["rsync", RYSNC_DOWNLOAD_PARAMS, source, COMP_HOME] #download the whole directory to COMP_HOME.
    download_cmd = ' '.join(cmd)
    #print download_cmd
    p = subprocess.Popen(download_cmd,shell=True,stdout=subprocess.PIPE)
    line = p.communicate()
    stdout_message = ''.join(line[:-1]).strip().split('\n')
    #print 'Message:',stdout_message
    update(normalized_path,stdout_message)

def upload(path,file_to_upload):
    """upload a file from COMP_HOME to DATA_HOME
     upload 'file_to_upload' at local path 'folder'.
    :/folder/file_to_upload will be uploaded to :/
    """
    normalized_path = path.strip('/')+"/" #/A, /A/ or A/ -> A/
    source = os.path.join(COMP_HOME,normalized_path,file_to_upload) 
    dest = os.path.join(DATA_HOME,normalized_path)
    cmd = ["rsync",RYSNC_UPLOAD_PARAMS, source,dest]
    upload_cmd = ' '.join(cmd)
    p = subprocess.Popen(upload_cmd,shell=True,stdout=subprocess.PIPE)
    line = p.communicate()
    stdout_message = ''.join(line[:-1]).strip().split('\n')
    #fn_log = log(os.path.join(COMP_HOME,normalized_src),stdout_message)
    update(normalized_path,stdout_message)

def update(path,message):
    fn = os.path.join(COMP_HOME,path,LOG_FILE_NAME)
    ofn = None
    try:
        if os.path.exists(fn): 
            ofn = open(fn,'w+')#append 
        else:
            ofn = open(fn,'w') #new file
        print >>ofn,'-'*50
        print >>ofn,time.asctime()
        for m in message:
            print >>ofn,M
        ofn.close()
        normalized_path = path.strip('/')+"/" #/A, /A/ or A/ -> A/
        source = fn 
        dest = os.path.join(DATA_HOME,normalized_path)
        cmd = ["rsync",RYSNC_UPLOAD_PARAMS, source,dest]
        subprocess.Popen(cmd,shell=False,stdout=subprocess.PIPE)
    except:
        pass
        
def clear():
    pass
#print query()
#download('A')
upload('A','hello.txt')

Thursday, August 19, 2010

Failed to run RSA key on HCI-BIO2

Password is required.

How to: Make a passwordless ssh connection

How to: Make a passwordless ssh connection between chpc(hello@delicatearch.chpc.utah.edu) and hci(world@hci-bio1.hci.utah.edu)

Suppose after we log into chpc , we want to "ssh" to hci without password.

1. Create RSA keys (on chpc).

hello@chpc:MY_HOME>ssh-keygen -t rsa

Generating public/private rsa key pair.

Enter file in which to save the key ($HOME/.ssh/id_rsa):

Enter passphrase (empty for no passphrase): MY_PASSPHRASE

Enter same passphrase again:MY_PASSPHRASE

Your identification has been saved in $HOME/.ssh/id_rsa.

Your public key has been saved in $HOME/.ssh/id_rsa.pub.

The key fingerprint is:

ad:9e:ab:f9:e1:1a:a4:85:16:3b:24:5f:35:b6:76:a7 user@machine

MY_PASSPHRASE is password used to encrypt the keys, not the user's passworld.

Now we have two files under $HOME/.ssh, "id_rsa" is private key, "id_rsa.pub" is public key.

2. Transfer the public key to world@hci's home directory (on chpc)

hello@chpc:$HOME>scp .ssh/id_rsa.pub world@hci-bio1.hci.utah.edu:~/

3. Append the id_rsa.pub to aothorized keys (on hci)

world@hci:$HOME>cat id_rsa.pub >>authorized_keys

4. Make the hci accept RSA key style connection.

By default, this features is off.

We need to modify the following line in /etc/ssh/sshd_config (root privilege is required)

#RSAAuthentication yes

#PubkeyAuthentication yes

RSAAuthentication yes

PubkeyAuthentication yes

Then we need to refresh the ssh service.

>/etc/init.d/sshd restart

5. Make a test (on chpc)

>ssh -2 world@hci-bio1.hci.utah.edu

Enter passphrase for key '$HOME/.ssh/id_rsa':

Here we need to input the passphrase of the private key, it is . If everything goes well, we will see the welcome message:

Welcome to Ubuntu!

6. Now we need to eliminate the "passphrase" step using ssh-agent and ssh-add (on chpc)

ssh-agent are used to buffer the passphrase and keep it in memory, we we do not need input passphrase next time.

#start ssh-agent

>eval `ssh-agent`

#add passphrase

>ssh-add

Enter passphrase for chpc:$HOME/.ssh/id_rsa: MY_PASSPHRASE

Identity added: chpc:$HOME/.ssh/id_rsa

7. Make a test again (on chpc)

>ssh -2 world@hci-bio1.hci.utah.edu

Welcome to Ubuntu!

8. we can test scp (on chpc)

>scp hello.txt world@hci-bio1.hci.utah.edu:~/

To avoid run "eval `ssh-agent`" and "ssh-add" every time after we log into the chpc, we can append eval `ssh-agent` to ~/.bash_profile. So it will start automatically next time. However, we still need to run 'ssh-add' manually after each log in to register the passphrase for security. if passphrase is empty, we do not even need ssh-agent and ssh-add. But this may bring security risk to the private key.

we can also use "keychain" as the frontend of ssh-agent, so we do not have to create ssh-agent for each login. With keychain, only one ssh-agent is in service no matter how many consoles we open.

>wget http://www.funtoo.org/archive/keychain/keychain-2.7.1.tar.bz2

>tar -jxvf keychain-2.7.1.tar.bz2

>cd keychain-2.7.1

>./keychain

Wednesday, August 18, 2010

ssh connection from chpc to bio1

Suppose we want to passwordless ssh connection from machine A(chpc) to machine B(bio1)

1. On machine B

modify the following line in /etc/ssh/sshd_config

(root privilege is required)

#RSAAuthentication yes

#PubkeyAuthentication yes

#AuthorizedKeysFile .ssh/authorized_keys

RSAAuthentication yes

PubkeyAuthentication yes

AuthorizedKeysFile .ssh/authorized_keys

reload configuration

#/etc/init.d/ssh reload

2. On machine A

$ssh-keygen -t rsa

Generating public/private rsa key pair.

Enter file in which to save the key (/MY_HOME/.ssh/id_rsa):

Enter passphrase (empty for no passphrase): THIS_IS_MY_PASSWORD

Enter same passphrase again: THIS_IS_MY_PASSWORD

Your identification has been saved in /MY_HOME/.ssh/id_rsa.

Your public key has been saved in /MY_HOME/.ssh/id_rsa.pub.

The key fingerprint is:

ad:9e:ab:f9:e1:1a:a4:85:16:3b:24:5f:35:b6:76:a7 user@machine

$ls .ssh/

id_rsa id_rsa.pub known_hosts

id_rsa is my private key, and id_rsa.pub is the public key

$scp id_rsa.pub user@machineB:~/.ssh

3.On machine B

$cd .ssh/

$cat id_rsa.pub >>authorized_keys

Now make a test from machine A:

$ssh user@machineB

Enter passphrase for key '/MY_HOME/.ssh/id_rsa': THIS_IS_MY_PASSWORD

user@machineB$ #OK now we are using machineB.

Next, we use ssh-agent and ssh-add to avoid "Enter passphrase for key '/MY_HOME/.ssh/id_rsa'", avoid password completely.

Pipeline to run novoalign - 0818

Since we can run a deamon service at chpc, here is my idea to implement a pipeline

On HCI:
Setup rsync service on /home/alignment@hci-bio1. This directory will served as home to put all read files.

On CHPC:
A script running on chpc's interactive node (delicatearch) as deamon. This script will:

1. monitor /home/alignment/@hci-bio1 using rsync

2. if new files are found in that directory, download the file to chpc

3. split the read file into segments as jobs

4. submit jobs using PBS

5. wait for results by monitoring the local result directory e.g. /home/user/pid

6. re-submit jobs if necessary (some jobs may take exceptional long time to finish)

7. assemble result files

8. transfer result back to hci-bio1

Use case:

I have two files "A_1.fq.gz" and "A_2.fq.gz" from pair-end sequencing read. Now I want to align the read to human genome using novoalign.

1. Put "A_1.fq.gz" and "A_2.fq.gz" to /home/alignment/request/ @ hci-bio1

2. Create a file named "A.params" to list the parameters (gap penalty, genome, etc) for running novoalign.

3. Touch a dummy file named "A.ready". This file is to tell the daemon service on chpc that alignment request that starts with "A" are ready.

Once the job is done, I should be able to find my alignment at /home/alignment/response/

Now it looks like:

/home
|
|---alignment/
|
|---request/
| |
| |---A_1.fq.gz
| |
| |---A_2.fq.gz
| |
| |---A.params
| |
| |---A.ready
|
|---response/
|
|---A.result
|
|---other results

Get the pid of my_process

pidof my_process

ps uxw | grep my_process | grep -v grep | awk '{print $2}

Tuesday, August 17, 2010

rsync?

Analysis

****************************************************

Given a fastaq read file, currently we need to:

1. "scp" this file from working machine to the chpc

2. log into chpc, using a script to split the read file into pieces of small files

3. write a script to run alignment using PBS for each small file

4. Once the individual alignment task is done, the script will send a notification email to user.

5. User log into chpc, get alignment results

6. "scp" alignment results back to working machine

7. "cat" result files for downstream making consensus, variance call, etc.

Goal

****************************************************

Reduce these manual steps as many as possible.

Design(as of 08/17/2010)

****************************************************

1. Run a "rsync" service on working machine (e.g, my desktop Linux machine)

2. Run a client application (cron + shell + rsync or in-house application X which acts as a rsync client) at the chpc as deamon, monitoring any changes happened on the working machine.

3. If observed a specified file, for example "ready.go", then sync the data to the local machine (chpc). This "ready.go" is used as a indicator to avoid incomplete data. For example, if we have two files for pair-end, only after we put all two files into the same path and "touch ready.go",

then the chpc can download the two data files and start alignment. This is for data consistency.

4. Application X will monitor the alignment progress. After all alignments were done and result files are generated, then merge, tar and gzip the result files.

5. Application X will start another rsync operation, upload the .tar.gz result file back to working machine.

6. Continue downstream analysis with the alignment result.

Monday, August 16, 2010

FASTQ Quality Data

python>qa = '!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65.....'

python>quality_score = [ord(q)-33 for q in qa]

python>print quality_score

[0, 9, 7, 7, 7, 7, 9, 9, 9, 10, 8, 8, 4, 4, 4, 10, 10, 8, 7, 4, 4, 4, 4, 8, 13, 16, 9, 9, 9, 12, 10, 9, 8, 8, 9, 9, 20, 20, 34, 34, 37, 29, 29, 29, 29, 29, 29, 34, 34, 34, 34, 34, 34, 34, 21, 20, 13, 13, 13, 13, 13]

SOAP

soap2sam.pl

>wget http://soap.genomics.org.cn/down/soap2sam.tar.gz

soapaligner

>wget http://soap.genomics.org.cn/down/SOAPaligner-v2.20-Linux-x86_64.tar.bz2

#soap's index-builder "2bwt-builder" do not accept wildcard.

>2bwt-builder *.fa #will not work, only build the indexes for the first .fa file. As thus we need to make a single .fa file.

#create a single .fa file

>cat *.fa>> hg19.fa # ~3.2GB

#HG19 has 93 fasta entry

>grep -c '>' hg19.fa

#now build the index files

>2bwt-builder hg19.fa

Parsing FASTA file..

Finished. Parsed 93 sequences.

Elapsed time = 77.96 s

Building Look-Up..

Finished.

Elapsed time = 250.56 s

Building BWT..

Finished constructing BWT in 292 iterations. Elapsed time = 1354.80 s

Saving BWT..

Finished saving BWT. Elapsed time = 1.02 s

Building Reversed BWT..

Finished constructing Reversed BWT in 292 iterations. Elapsed time = 1360.44 s

Saving BWT..

Finished saving BWT. Elapsed time = 7.29 s

Loading BWT...

Finished loading BWT. Elapsed time = 3.86 s

Building SA value...

Finished building SA value. Elapsed time = 1028.20 s

Building High-Occ Hash Table...

Finished.

Elapsed time = 695.12 s

Building SA index...

Finished building SA index. Elapsed time = 14.29 s

Index building is completed.

Total elapsed time = 4793.76 s

Friday, August 13, 2010

Pipeline for Genomic Variants Identification - Part 2

#use Lyon Robison Lab's aligned data

wget http://soap.genomics.org.cn/down/soap2sam.tar.gz soapaligner >wget http://soap.genomics.org.cn/down/SOAPaligner-v2.20-Linux-x86_64.tar.bz2">https://hci-as1.hci.utah.edu/gnomex/gnomexFlex.jsp?launchWindow=AnalysisDetail&analysisNumber=A120

# ~3.86GB
gnomex-analysis-20100813.zip

#after unzip, it goes to 3.7G
>unzip gnomex-analysis-20100813.zip
>du -sh bioinformatics-analysis-A121/
-rw-r--r-- 1 97740 1979 453778525 2010-06-24 09:57 1_84060.novoalign

>pwd
/home/galaxy/tmp/bioinformatics-analysis-A121/84060/ShellScripts

#just check the first fragment(1_84060.novoalign, ~450MB) generate by 1.sh
>less 1.sh
#PBS -l nodes=1:ppn=4,walltime=12:00:00
#PBS -m a
#PBS -M hello@world.com
#PBS -N 1_84060
#PBS -j oe
#PBS -o /XXX/Logs

echo '1_84060'
echo -n 'Start: '; date +%s
/uufs/chpc.utah.edu/common/home/u0028003/BioApps/Novocraft/novocraft/novoalign -F ILMFQ -t120 -r0.2 -q5 -d /uufs/chpc.utah.edu/common/home/u0028003/Genomes/hg19Splices34bpAdaptersNovo.index -f /scratch/serial/u0028003/84060/SplitFastqData1/1.txt /scratch/serial/u0028003/84060/SplitFastqData2/1.txt > /scratch/serial/u0028003/84060/Alignments/1_84060.novoalign

gzip /scratch/serial/u0028003/84060/Alignments/1_84060.novoalign
echo -n 'End: '; date +%s

>gunzip 1_84060.novoalign.gz
>wc -l 1_84060.novoalign

>tail -n 100 2140441 1_84060.novoalign
# Paired Reads: 1000000
# Pairs Aligned: 867741
# Read Sequences: 2000000
# Aligned: 1765701
# Unique Alignment: 1636245
# Gapped Alignment: 5509
# Quality Filter: 27456
# Homopolymer Filter: 13
# Elapsed Time: 345,966s
# Fragment Length Distribution
# From To Count
# 72 77 487
# 78 83 1103
# Mean 201, Std Dev 59.4
# Done.

Pipeline for Genomic Variants Identification - Part 1

Build a pipeline which:

1. Use novoalign to align some data to HG19.

2. Parse(transform) the outputs of novoalign for genomic variants identification:

2.1 SNP call (use MAQ or SOAP)

2.2 Indel

2.3 inversion, duplication etc.

3. Investigate the affects of genomic variance on gene. Synoymous/ Non-Synoymous mutation? using USeq, etc.

4. Visualization of these effects using IGB, UCSC, etc.

***************************************************************************

#download human genome build, build index for novoalign

>pwd #the path for genome build

/home/galaxy/tools/novoalign-2.06.09/genome/HG19

>wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz

>tar -zxvf chromFa.tar.gz

#now we have a list of .fa files for chromosomes.

>rm chromFa.tar.gz

#use novoindex to build a index

>novoindex hg19.novoindex *.fa

#size is ~6.5G

>ll hg19.novoindex

-rwxr--r-- 1 root root 6505422914 2010-08-13 10:50 hg19.novoindex*

#make a test using 25 sequences

>time novoalign -d genome/HG19/hg19.novoindex -f read/test.txt >hello.txt

real 0m47.846s

user 0m0.180s

sys 0m5.220s

#check out the result

>less hello.txt

# novoalign (2.06.09MT - Jun 16 2010 @ 12:36:05) - A short read aligner with qualities.

# Licensed for evaluation, educational, and not-for-profit use only.

# novoalign -d genome/HG19/hg19.novoindex -f read/test.txt

# Interpreting input files as Illumina FASTQ, Cassava Pipeline 1.3.

# Index Build Version: 2.6

# Hash length: 14

# Step size: 3

....

# Read Sequences: 25

# Aligned: 22

# Unique Alignment: 22

# Gapped Alignment: 1

# Quality Filter: 1

# Homopolymer Filter: 0

# Elapsed Time: 0,035s

# Done.

***************************************************************************

#Install MAQ

>pwd

/home/galaxy/tmp

>wget http://sourceforge.net/projects/maq/files/maq/0.7.1/maq-0.7.1.tar.bz2/download

>tar -jxvf maq-0.7.1.tar.bz2

>cd maq-0.7.1

>./configure --prefix=/home/galaxy/tools/maq-0.7.1 && make && make install

***************************************************************************

#Install SOAPsnp

#boost(http://sourceforge.net/projects/boost/) is required to build SOAPsnp

>wget http://sourceforge.net/projects/boost/files/boost/1.43.0/boost_1_43_0.tar.gz/download

>tar -zxvf boost_1_43_0.tar.gz

>cd boost_1_43_0

#move the boost directory to the system path so the gcc can find it.

>mv boost /usr/local/include/

#download SOAPsnp

>wget http://soap.genomics.org.cn/down/SOAPsnp-v1.03.tar.gz

>cd SOAPsnpZ

>make all #this will make a single executable file, soapsnp

>ll soapsnp

-rwxr-xr-x 1 root root 123425 2010-08-13 12:04 soapsnp*

#move it to $PATH

>mv soapsnp /home/galaxy/tools

>soapsnp

SoapSNP

Compulsory Parameters:

-i Input SORTED Soap Result

-d Reference Sequence in fasta format

-o Output consensus file

......

***************************************************************************

#download the sequencing data

>lftp -u USERNAME,PASSWORD ftp://cdts.genomics.org.cn

#do not know which data should be downloaded. So go to GenomEx.

Alignment test: 25 sequences against HG19

>time ./novoalign -d genome/hg19.nix -f read/test.txt >hello.txt

real 0m29.149s

user 0m5.610s

sys 0m2.990s

>less hello.txt

# novoalign (2.06.09MT - Jun 16 2010 @ 12:36:05) - A short read aligner with qualities.

# Licensed for evaluation, educational, and not-for-profit use only.

# novoalign -d genome/hg19.nix -f read/test.txt

# Interpreting input files as Illumina FASTQ, Cassava Pipeline 1.3.

# Index Build Version: 2.6

# Hash length: 13

# Step size: 2

@HWI-EAS240_0001:7:1:1086:16499#0/1 S TGTTTTCCTGNCAGCCCCCACCTTTTTAGTACATGTCTCATCCTACTGCTGTTTNNNGGATTGTGAATGTGTATTN BBBBB@?@@@#::82;8;889:=@BB>@>*?*?+5:;3@7>@(@78@

......

############################### NM

# Read Sequences: 25

# Aligned: 0

# Unique Alignment: 0

# Gapped Alignment: 0

# Quality Filter: 1

# Homopolymer Filter: 0

# Elapsed Time: 0,139s

# Done.

Thursday, August 12, 2010

Heterozygous/Homozygous SNP and Heterozygous/Homozygous Indel

Heterozygous SNP:
GCTA
GCCA

Homozygous SNP:
GCTA
GCTA

Heterozygous Indel:
GCTA
GC*A

Homozygous Indel:
GC*A
GC*A

Wednesday, August 11, 2010

When to Apply Bayes' Theorem

The sample space is partitioned into a set of mutually exclusive events { A1, A2, . . . , An }.
Within the sample space, there exists an event B, for which P(B) > 0.
The analytical goal is to compute a conditional probability of the form: P( Ak | B ).
You know at least one of the two sets of probabilities described below.

P( Ak ∩ B ) for each Ak
P( Ak ) and P( B | Ak ) for each Ak

Friday, August 6, 2010

Novoalign

#Download human genome HG19 from UCSC
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz

hg19EnsKwn45bpSplices.CSnovoindex.gz 5562353922 5304.674MB
hg19EnsKwn45bpSplices.CSnovoindex 7438882817 7094.271MB

>time gunzip hg19EnsKwn45bpSplices.CSnovoindex.gz
real 3m43.081s
user 0m39.760s
sys 1m9.550s

100517_7355X4_s_7_sequence.txt.gz 1755738672 1674.403MB
6686206832 2010-06-30 10:57 100517_7355X4_s_7_sequence.txt

>time gunzip 100517_7355X4_s_7_sequence.txt.gz

>time ./novoalign -d genome/hg19EnsKwn45bpSplices.CSnovoindex -f read/100517_7355X4_s_7_sequence.txt -F ILMFQ >hello.txt

#NovoalignCS
wget http://www.novocraft.com/downloads/download.php?filename=Linux/V2.0/novoalignCSV1.00.09.gcc.tar.gz

hg19EnsKnGn45bpSplices.fasta.gz
167905261 2010-08-03 05:56 hg19EnsKnGn45bpSplices.fasta.gz

>time ./novoindex hg19.nix genome/hg19.fasta
# novoindex (1.0) - Universal k-mer index constructor.
# (C) 2008 NovoCraft
# novoindex hg19.nix genome/hg19.fasta
# Creating 5 indexing threads.
# Building with 13-mer and step of 2 bp.
# novoindex construction dT = 257.0s
# Index memory size 2.383Gbyte.
# Done.

real 4m17.235s
user 2m5.360s
sys 5m50.900s

hg19.nix 2558416377

time ./novoalign -d genome/hg19.nix -f read/100517_7355X4_s_7_sequence.txt -F ILMFQ >hello.txt

>grep -c '@' 100517_7355X4_s_7_sequence.txt
29283087

So this means this file has 29283087 fastq entries (reads)

RNASeq

>wget http://sourceforge.net/projects/useq/files/USeq_6.7.zip/download
>unzip USeq_6.7.zip
>java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

>java -jar RNASeq
**************************************************************************************
** RNASeq: June 2010 **
**************************************************************************************
Example: java -Xmx2G -jar pathTo/USeq/Apps/RNASeq -y eland -v D_rerio_Dec_2008 -t
/Data/PolIIRep1/,/Data/PolIIRep2/ -c /Data/PolIINRep1/,/Data/PolIINRep2/ -s
/Data/Results/WtVsNull -g /Anno/zv8Genes.ucsc

**************************************************************************

>apt-get install r-base-core
>R

R>source("http://www.bioconductor.org/biocLite.R")

R>biocLite("DESeq")

>R CMD INSTALL DESeq_1.1.8.tar.gz

Install Galaxy on Bioserver(bioserver.hci.utah.edu)

#Install R
>wget http://cran.cnr.berkeley.edu/src/base/R-2/R-2.11.1.tar.gz
>tar zxvf R-2.11.1.tar.gz
>configure && make && make install

#Put R into environment

#Start R
>R
>source('http://www.bioconductor.org/biocLite.R')
>biocLite('DESeq')

java -Xmx2G -jar pathTo/USeq/Apps/RNASeq -y eland -v D_rerio_Dec_2008 -t
/Data/PolIIRep1/,/Data/PolIIRep2/ -c /Data/PolIINRep1/,/Data/PolIINRep2/ -s
/Data/Results/WtVsNull -g /Anno/zv8Genes.ucsc

galaxy@bioserver>java -Xmx2G -jar /home/galaxy/tools/USeq_6.7/Apps/RNASeq -y eland -v D_rerio_Dec_2008 -c /home/galaxy/tmp/OverDispersedRNASeqDataset/Novoalignments/Control -t /home/galaxy/tmp/OverDispersedRNASeqDataset/Novoalignments/Treatment -g /home/galaxy/tmp/OverDispersedRNASeqDataset/hg19EnsGenes.ucsc -s /home/galaxy/tmp/hello -r /home/galaxy/tools/R-2.11.1/bin/R

time java -Xmx2G -jar /home/ying/tools/USeq_6.7/Apps/RNASeq -y eland -v D_rerio_Dec_2008 -c /home/ying/tmp/OverDispersedRNASeqDataset/Novoalignments/Control -t /home/ying/tmp/OverDispersedRNASeqDataset/Novoalignments/Treatment -g /home/ying/tmp/OverDispersedRNASeqDataset/hg19EnsGenes.ucsc -s /home/ying/tmp/hello

Framework

Taverna(Taverna: a tool for building and running workflows of services, http://www.taverna.org.uk/)

Pegasys(Pegasys: software for executing and integrating analyses of biological sequences, http://www.bioinformatics.ubc.ca/pegasys/)

WebLab(WebLab: a data-centric, knowledge-sharing bioinformatic platform, http://weblab.cbi.pku.edu.cn/srcDownload.do)

wEMBOSS(wEMBOSS: a web interface for EMBOSS, http://wemboss.sourceforge.net/)

GenePattern

To upload a directory structure to a module, your users would first create the directory structure locally, and then create a zip file to use as input to your module.
You would write a GP module which takes the zip file as an arg. Within the module, you would unzip as a first step and then call RNASeq.
The output files are stored in the server. The GP server allows users to browse the result files in a web browser, to download individual files, to launch subsequent analysis steps on those individual files, and to download all of the results as a zip file. There may be some problems with output files in nested directories. This has been dealt with in the past by writing a summary.html page as a result of the job. The summary page includes relative links to the nested job result files.

To summarize, the answers to your questions are:
1. no, GenePattern does not have the concept of a working space in the server. Although there are some workarounds, by writing helper modules which:
a) upload a file via the web browser and store it unaltered as a job result file
b) upload a zip file via the web browser and store it unzipped as a set of job results files
c) configuring the server to provide access to a subset of the server's file system. Write modules which accept a server file path (string arg) as the input.

2. Yes, you should be able to integrate RNASeq without changing the source code of GenePattern.

I would start by going through the tutorial and the other links which Michael sent.

Regards,

Peter

Requirement:

1. Account
user: register,login,logout,change password,
admin: create/modify/delete user, create/modify/delete group, change password,privilege.

2. Workspace
Workspace is like the home directory of a user. Each user has his own workspace.
Each group has a public workspace. The system has one workspace.
System's workspace are public to all members.
Group's workspace are public to members in this group.
User's workspace are public to him/her only.

User can create folders and files in his and group's workspace.

1. upload/create/modify/delete/share file
2. upload/create/modify/delete/share folder
3. upload/create/modify/delete/share workflow

workspace can be shared(read or read-write) with other groups or users.

3. Global job queue
1. First Come First Serve with priority
2. limited volume
3. queue management(submit, progress, runnung time, cancel, lower/upper priority)

4. History
track all operations and events
shows
expose the history to other user for re-use.

6. Workflow
A workflow is a combination and data and operations. It contains all the information to repeat a series of
operations by another user.

Monday, August 2, 2010

Install Plone4.0 Beta Release 5

#install openssl
>sudo apt-get install libssl-dev

>wget http://launchpad.net/plone/4.0/4.0b5/+download/Plone-4.0b5-UnifiedInstaller.tgz
>tar -zxvf Plone-4.0b5-UnifiedInstaller.tgz
>cd Plone-4.0b5-UnifiedInstaller.tgz

#install Zope as standalone
>./install.sh standalone --target=/home/galaxy/tools/plone4.0b5 --with-python=/usr/bin/python

Username: admin
Password: OBM3GHk5

#change port from 8080 to 8050
>cd /home/galaxy/tools/plone4.0b5/zinstance
>vim buildout.cfg
http-address = 8050

#locate the username and temp password in buildout.cfg
user=admin:OBM3GHk5

#load the change on buildout.cfg
>cd /home/galaxy/tools/plone4.0b5/zinstance
>bin/buildout

Uninstalling unifiedinstaller.
Uninstalling instance.
Installing instance.
Updating zopepy.
Updating zopeskel.
Updating backup.
Updating chown.
chown: Running echo Dummy references to force this to execute after referenced parts
echo /home/galaxy/tools/plone4.0b5/zinstance/var/backups
chmod 600 .installed.cfg
find /home/galaxy/tools/plone4.0b5/zinstance/var -type d -exec chmod 700 {} \;
chmod 744 /home/galaxy/tools/plone4.0b5/zinstance/bin/*
Dummy references to force this to execute after referenced parts
/home/galaxy/tools/plone4.0b5/zinstance/var/backups
Updating osxcontroller.
Installing unifiedinstaller.
*************** PICKED VERSIONS ****************
[versions]

*************** /PICKED VERSIONS ***************

#Start plone
>/home/galaxy/tools/plone4.0b5/zinstance/bin/plonectl start

Next Generation Sequencing and Data Analysis

Tuesday, August 31, 2010

Design of the pipeline

Friday, August 27, 2010

Design of the pipeline

Tuesday, August 24, 2010

Script mapping

Monday, August 23, 2010

query, download, upload and update with rsync

Thursday, August 19, 2010

Failed to run RSA key on HCI-BIO2

How to: Make a passwordless ssh connection

Wednesday, August 18, 2010

ssh connection from chpc to bio1

Pipeline to run novoalign - 0818

Get the pid of my_process

Tuesday, August 17, 2010

rsync?

Monday, August 16, 2010

FASTQ Quality Data

SOAP

Friday, August 13, 2010

Pipeline for Genomic Variants Identification - Part 2

Pipeline for Genomic Variants Identification - Part 1

Alignment test: 25 sequences against HG19

Thursday, August 12, 2010

Heterozygous/Homozygous SNP and Heterozygous/Homozygous Indel

Wednesday, August 11, 2010

When to Apply Bayes' Theorem

Friday, August 6, 2010

Novoalign

RNASeq

Framework

Monday, August 2, 2010

Install Plone4.0 Beta Release 5

About Me

Blog Archive