Wednesday, August 18, 2010

Pipeline to run novoalign - 0818

Since we can run a daemon service at CHPC, here is my idea for implementing a pipeline.


On HCI:
Set up an rsync service on /home/alignment@hci-bio1.
This directory will serve as the home for all read files.

On CHPC:
A script runs on CHPC's interactive node (delicatearch) as a daemon.
This script will:

1. monitor /home/alignment@hci-bio1 using rsync

2. if new files are found in that directory, download them to CHPC

3. split the read file into segments as jobs

4. submit jobs using PBS

5. wait for results by monitoring the local result directory (e.g. /home/user/pid)

6. re-submit jobs if necessary (some jobs may take an exceptionally long time to finish)

7. assemble result files

8. transfer the results back to hci-bio1
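
Step 3 is the part that is easiest to get wrong, so here is a minimal Python sketch of how the split might be done. It assumes the reads are gzipped FASTQ with exactly four lines per record. The function name split_fastq, the chunk naming scheme, and the default chunk size are all made up for illustration and are not part of any existing tool. Because it cuts by record count, running it on "A_1.fq.gz" and "A_2.fq.gz" with the same reads_per_chunk should keep the mates of each pair in chunks with the same number.

```python
import gzip
from itertools import islice
from pathlib import Path


def split_fastq(src, out_dir, reads_per_chunk=1_000_000):
    """Split a gzipped FASTQ file into numbered gzipped chunks.

    Hypothetical helper: assumes 4 lines per FASTQ record.
    Returns the list of chunk paths, in order.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # "A_1.fq.gz" -> "A_1"
    name = src.name
    stem = name[:-len(".fq.gz")] if name.endswith(".fq.gz") else src.stem

    lines_per_chunk = 4 * reads_per_chunk
    chunks = []
    with gzip.open(src, "rt") as reads:
        while True:
            # Holds one chunk in memory at a time; fine for a sketch.
            block = list(islice(reads, lines_per_chunk))
            if not block:
                break
            if len(block) % 4:
                raise ValueError(f"{src}: truncated FASTQ record")
            chunk = out_dir / f"{stem}.part{len(chunks):04d}.fq.gz"
            with gzip.open(chunk, "wt") as out:
                out.writelines(block)
            chunks.append(chunk)
    return chunks
```

Each chunk (or pair of chunks) would then become one PBS job in step 4, and step 7 would concatenate the per-chunk results in chunk order.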


Use case:

I have two files, "A_1.fq.gz" and "A_2.fq.gz", from a paired-end sequencing run. Now I want to align the reads to the human genome using novoalign.

1. Put "A_1.fq.gz" and "A_2.fq.gz" in /home/alignment/request/ @ hci-bio1

2. Create a file named "A.params" that lists the parameters (gap penalty, genome, etc.) for running novoalign.

3. Touch a dummy file named "A.ready". This file tells the daemon service on CHPC that the alignment request whose files start with "A" is ready.

Once the job is done, I should be able to find my alignment at /home/alignment/response/
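
On the daemon side, the ".ready" convention could be checked with something like the sketch below, run against the local copy of the request directory after each rsync. The function name find_ready_requests and the shape of the returned dictionary are my own assumptions. It only treats a request as ready when the marker, the ".params" file, and at least one read file are all present, so a half-uploaded request is skipped.

```python
from pathlib import Path


def find_ready_requests(request_dir):
    """Return {prefix: {"reads": [...], "params": Path}} for complete requests.

    Hypothetical helper: a request "A" is complete when A.ready, A.params
    and at least one of A_1.fq.gz / A_2.fq.gz exist in request_dir.
    """
    request_dir = Path(request_dir)
    ready = {}
    for marker in sorted(request_dir.glob("*.ready")):
        prefix = marker.stem  # "A.ready" -> "A"
        params = request_dir / f"{prefix}.params"
        reads = sorted(request_dir.glob(f"{prefix}_[12].fq.gz"))
        if params.is_file() and reads:
            ready[prefix] = {"reads": reads, "params": params}
    return ready
```

Using a separate marker file means the daemon never has to guess whether a large read file has finished uploading.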

The directory layout now looks like:

/home
|
|---alignment/
    |
    |---request/
    |   |
    |   |---A_1.fq.gz
    |   |
    |   |---A_2.fq.gz
    |   |
    |   |---A.params
    |   |
    |   |---A.ready
    |
    |---response/
        |
        |---A.result
        |
        |---other results
