This is always the first step of an NGS data analysis (we assume QC has already been done on the "Fastq" files). In this command, both "Fastq" and "ReferenceGenome" are inputs, and "Alignment" is the output. The key difference between the two types of inputs is that every job has its own "Fastq", while many similar jobs reuse the same "ReferenceGenome", as well as the same "Align" application.
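To make the "Align" step concrete, here is a hedged sketch of what such a command could look like with BWA for a single sample; the file names are placeholders rather than anything from our pipeline, and the "mem" subcommand assumes a reasonably recent BWA version.
#illustrative only: align a pair of FASTQ files against an indexed reference with BWA
bwa mem hg19.fa sample_R1.fastq sample_R2.fastq > sample.sam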
To run such jobs on the cluster, every slave node needs access to two kinds of shared resources:
1. All bioinformatics applications and packages used in the data analysis.
2. All associated data files used by the above applications, including:
- Reference sequence files (e.g., FASTA)
- Index files built from the reference sequences for different aligners
- Databases such as dbSNP, COSMIC, the UCSC Gene Table, etc.
- Other pre-defined common resources
The total size of these applications and data files can reach tens of gigabytes, depending on how many reference genomes we are going to support in our pipeline.
To achieve the best performance we need to keep a copy of all apps and data on each slave node. We do NOT want to copy tens of gigabytes of "immutable" apps and library files from one place to all slave nodes for every job. It is also very important to keep the apps and library files consistent across all nodes; in other words, we MUST guarantee that everything is identical.

The idea is simple: we maintain a single copy of all bioinformatics applications and library files on the master name node or another dedicated node, as long as that node can reach all slave nodes via key-based (passwordless) SSH. Then, before accepting EVERY new job, we synchronize all slave nodes so that each one holds an up-to-date duplicate.

Why can't we synchronize the slave nodes at a fixed interval instead? Because changes to the backend library files could cause inconsistency while a job is being processed. So, before we pop a job from the job queue and push it to Hadoop, we synchronize all slave nodes first. Then why not do the synchronization only once, when the framework starts? Because we need to be able to update applications and library files while the service is running: if the service is deployed 24*7, these updates have to happen on the fly. Fortunately, the synchronization does not take much time, since most of the work is just comparing file timestamps. Real data transfer only happens when we add, update, or remove data on the name node, and even then the transfer is incremental. It is quite simple to get all of this done - "rsync" is our friend.
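As a sketch of this per-job synchronization logic (the helpers pop_next_job and submit_to_hadoop are hypothetical placeholders for the framework's own job queue and Hadoop submission code; host names and paths follow the setup used later in this section):
#sync every slave, then dispatch the job; rsync only transfers files that actually changed
while job=$(pop_next_job); do            #hypothetical: returns non-zero when the queue is empty
    for node in n1 n2 n3; do
        rsync -arque ssh /home/hadoop/tools/ hadoop@${node}:/hadoop/app/
        rsync -arque ssh /home/hadoop/data/ hadoop@${node}:/hadoop/data/
    done
    submit_to_hadoop "$job"              #hypothetical submission step
done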
For simplicity, we use the name node as the central repository. We have already created the BWA index for hg19 and put it under "/home/hadoop/data/hg19/bwa" on the name node. We also put the BWA binary executable under "/home/hadoop/tools/bwa" on the name node.
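For reference, here is a hedged sketch of how that index could have been built on the name node; the FASTA file name hg19.fa is an assumption, as is the exact location of the bwa executable inside "/home/hadoop/tools/bwa".
#build the BWA index for hg19 (illustrative; file name assumed)
hadoop@master$ cd /home/hadoop/data/hg19/bwa
hadoop@master$ /home/hadoop/tools/bwa/bwa index hg19.fa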
Now we need to transfer all these files to all slave nodes. Hadoop's distribution package includes a script named "slaves.sh" that makes it convenient to execute a command on every slave node (a quick example follows). However, in our framework this synchronization should be done by the application itself; for demonstration, we will also walk through the process manually to learn what it involves.
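For instance, the target directories could be created on every slave in one shot. This is only a hedged sketch: depending on the Hadoop version the script lives under bin/ or sbin/, it reads the host list from the slaves configuration file, and $HADOOP_HOME is assumed to be set.
#hypothetical one-liner: run mkdir on every host listed in the slaves file
hadoop@master$ $HADOOP_HOME/bin/slaves.sh mkdir -p /hadoop/data /hadoop/app

The manual, per-node version of the same steps looks like this: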
#master is name node.
#n1, n2 and n3 are slave nodes.
#on slave node n1: create the target directories for apps and data
hadoop@n1$ mkdir -p /hadoop/data
hadoop@n1$ mkdir -p /hadoop/app
#on master node: push the BWA binary and the hg19 index to n1
#(-a archive mode, -r recursive, -v verbose, -u skip files that are newer on the receiver, -e ssh)
hadoop@master$ rsync -arvue ssh /home/hadoop/tools/bwa hadoop@n1:/hadoop/app/
hadoop@master$ rsync -arvue ssh /home/hadoop/data/hg19/bwa hadoop@n1:/hadoop/data/
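The same two commands then need to be run for n2 and n3 as well. A small loop on the master (reusing the host names above) keeps every slave up to date in one pass; because rsync compares timestamps and sizes, re-running it when nothing has changed transfers no data.
#on master node: create directories and synchronize all slaves in one pass
for node in n1 n2 n3; do
    ssh hadoop@${node} "mkdir -p /hadoop/app /hadoop/data"
    rsync -arvue ssh /home/hadoop/tools/bwa hadoop@${node}:/hadoop/app/
    rsync -arvue ssh /home/hadoop/data/hg19/bwa hadoop@${node}:/hadoop/data/
done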