Tuesday, August 17, 2010

rsync?

Analysis
****************************************************
Given a FASTQ read file, we currently need to:
1. "scp" this file from the working machine to the chpc
2. log into the chpc and use a script to split the read file into smaller files
3. write a script to run the alignment for each small file using PBS
4. once an individual alignment task is done, the script sends a notification email to the user
5. the user logs into the chpc and gets the alignment results
6. "scp" the alignment results back to the working machine
7. "cat" the result files for downstream steps: making consensus, variant calling, etc.
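
As an illustration of the splitting in step 2, here is a minimal sketch. It assumes the read file is plain FASTQ with exactly four lines per record (no wrapped sequence lines); the function name `split_fastq` and the `.partNNNN.fastq` naming are made up for this example and are not the script actually used on the chpc.

```python
import itertools
from pathlib import Path


def split_fastq(src, out_dir, reads_per_chunk):
    """Split a FASTQ file into numbered chunk files.

    Assumes four lines per record, so chunks are cut on record boundaries.
    Returns the chunk paths in order.
    """
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines_per_chunk = 4 * reads_per_chunk
    chunks = []
    with open(src) as fh:
        for index in itertools.count():
            # Read the next block of whole records; an empty block means EOF.
            block = list(itertools.islice(fh, lines_per_chunk))
            if not block:
                break
            chunk_path = out_dir / f"{src.stem}.part{index:04d}.fastq"
            chunk_path.write_text("".join(block))
            chunks.append(chunk_path)
    return chunks
```

Each chunk can then be given to its own PBS job, and concatenating the chunks in order gives back the original file.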

Goal
****************************************************
Eliminate as many of these manual steps as possible.


Design (as of 08/17/2010)
****************************************************
1. Run an "rsync" service on the working machine (e.g., my desktop Linux machine).

2. Run a client application (cron + shell + rsync, or an in-house application X that acts as an rsync client) as a daemon at the chpc, monitoring the working machine for changes.

3. When a specified marker file, for example "ready.go", appears, sync the data to the local machine (chpc). The "ready.go" file serves as an indicator that the data is complete. For example, if we have two files for paired-end reads, the chpc downloads the two data files and starts alignment only after we have put both files into the same path and run "touch ready.go". This ensures data consistency.
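
The "ready.go" check could look roughly like the sketch below. It assumes the client already has a list of file names from the watched path (for example, parsed from a remote listing); the function name `select_batch` is hypothetical. Nothing is returned for download until the marker file is present.

```python
def select_batch(names, sentinel="ready.go"):
    """Return the data files to download, or [] if the batch is not ready.

    `names` is the list of file names seen in the watched path. The
    sentinel itself is not a data file, so it is left out of the result.
    """
    if sentinel not in names:
        # The uploader has not finished yet: do not touch partial data.
        return []
    return sorted(name for name in names if name != sentinel)
```

On the working machine the order matters: copy all data files first and only then "touch ready.go", otherwise the check protects nothing.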

4. Application X will monitor the alignment progress. After all alignments are done and the result files are generated, it will merge, tar and gzip the result files.
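
A rough sketch of the merge, tar and gzip part, using only the Python standard library. It assumes the per-chunk result files can simply be concatenated in sorted name order (the same thing "cat" does today); the function name `merge_and_pack` and the file names in the usage line are placeholders.

```python
import shutil
import tarfile
from pathlib import Path


def merge_and_pack(result_files, merged_path, archive_path):
    """Concatenate result files in sorted order, then tar and gzip the merge."""
    merged_path = Path(merged_path)
    with open(merged_path, "wb") as out:
        for part in sorted(Path(p) for p in result_files):
            with open(part, "rb") as src:
                # Stream each part so large result files are never held in memory.
                shutil.copyfileobj(src, out)
    # "w:gz" writes a gzip-compressed tar archive in one step.
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(merged_path), arcname=merged_path.name)
    return Path(archive_path)
```

Usage would be something like `merge_and_pack(chunk_results, "merged.out", "results.tar.gz")`, called once application X sees that every alignment job has finished.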

5. Application X will then start another rsync operation to upload the .tar.gz result file back to the working machine.
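
The upload could be driven from application X as in the sketch below. It assumes the rsync service on the working machine exposes a writable module; the host and module names are placeholders, and the helper names `build_upload_command` and `upload` are made up for this example. The `host::module/` form is rsync's syntax for talking to an rsync daemon.

```python
import subprocess


def build_upload_command(archive, host, module):
    """Build the rsync command that pushes one archive to a daemon module."""
    # -a keeps timestamps and permissions; --partial keeps a partially
    # transferred file so an interrupted upload can be resumed.
    return ["rsync", "-a", "--partial", str(archive), f"{host}::{module}/"]


def upload(archive, host, module):
    """Run the upload and raise CalledProcessError if rsync fails."""
    subprocess.run(build_upload_command(archive, host, module), check=True)
```

Note that rsync daemon modules are read-only by default, so the module on the working machine would need "read only = false" in its rsyncd.conf for this upload to be accepted.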

6. Continue downstream analysis with the alignment result.


