This is always the first step of an NGS data analysis, assuming we have already run QC on the "Fastq" files. In this command, both "Fastq" and "ReferenceGenome" are inputs, and "Alignment" is the output. The key difference between the two types of input is that every job has its own "Fastq", while many similar jobs reuse the same "ReferenceGenome", as well as the same application, "Align".
1. All bioinformatics applications and packages used in the data analysis.
2. All associated data files used by the above applications, including:
- Reference sequence file (and index)
- Indexed reference sequence file for different aligners
- Databases like dbSNP, COSMIC, UCSC Gene Table, etc.
- Other predefined common resources
The total size of these applications and data files could be as much as tens of GB, depending on how many reference genomes the pipeline supports.
To achieve the best performance, we need to keep a copy of all apps and data on each slave node. We do NOT want to copy tens of GB of "immutable" apps and library files from one place to all slave nodes for every job. It is also very important to keep all apps and library files consistent across nodes. In other words, we MUST guarantee that everything is the same on every node.

The idea is simple: we maintain a single copy of all bioinformatics applications and library files on the master name node, or on another dedicated node, as long as that node can reach all slave nodes over passwordless (key-based) ssh. Then, before accepting EVERY new job, we synchronize all slave nodes to make sure each one holds an identical copy.

Why can we NOT synchronize the slave nodes at a fixed interval? Because a change to the backend library files may cause inconsistency while a job is being processed. So, before we pop a job from the job queue and push it to Hadoop, we should synchronize all slave nodes first.

OK, again, why don't we just synchronize once, when we start the framework? Because we need to update applications and library files while the service is running. If we are going to run this service 24x7, we need to update these things on the fly. Luckily this does not take much time, since most of the work is just comparing file sizes and timestamps. No data is actually transferred unless we add, update, or remove data on the name node, and even then the transfer is incremental. It is quite simple to get all of this done: "rsync" is our friend.
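The "synchronize first, then submit" ordering described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `SLAVES`, `run`, `sync_node`, `submit_job` and `process_job` are made up for this sketch and are not part of Hadoop, the paths are the ones used later in this post, and the script defaults to a dry run that only prints the commands it would execute.

```shell
#!/bin/sh
# Sketch: synchronize every slave before each job is handed to Hadoop.
# SLAVES, sync_node, submit_job and process_job are illustrative names (assumptions).
SLAVES="n1 n2 n3"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = really run them

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Copy tools and data from the master copy to one slave node.
sync_node() {
    run rsync -arvue ssh /home/hadoop/tools/bwa "hadoop@$1:/hadoop/app/" &&
    run rsync -arvue ssh /home/hadoop/data/hg19/bwa "hadoop@$1:/hadoop/data/"
}

# Placeholder: the real Hadoop submission would go here.
submit_job() {
    echo "submitting $1"
}

# Submit a job only after every slave has been synchronized successfully.
process_job() {
    for node in $SLAVES; do
        sync_node "$node" || {
            echo "sync failed on $node, job $1 not submitted" >&2
            return 1
        }
    done
    submit_job "$1"
}

process_job "job-001"
```

If any node fails to synchronize, the job is not submitted, which keeps a job from running against a mix of old and new library files.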
For simplicity, we use the name node as the central repository. We have already created the indexed hg19 genome for BWA and put it under "/home/hadoop/data/hg19/bwa". We also put the BWA executable under "/home/hadoop/tools/bwa" on the name node.
Now we need to transfer all these files to all slave nodes. The Hadoop distribution ships a script named "slaves.sh", which is convenient for executing a command on all slave nodes. However, this synchronization should eventually be done by our application. For demonstration, we do it manually here to learn the process.
# master is the name node.
# n1, n2 and n3 are slave nodes.
# On the slave node: create the target directories.
hadoop@n1$ mkdir -p /hadoop/data
hadoop@n1$ mkdir -p /hadoop/app
# On the master node: push the BWA tools and the indexed hg19 genome to n1.
hadoop@master$ rsync -arvue ssh /home/hadoop/tools/bwa hadoop@n1:/hadoop/app/
hadoop@master$ rsync -arvue ssh /home/hadoop/data/hg19/bwa hadoop@n1:/hadoop/data/
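The commands above only cover n1. A rough sketch of repeating them for every slave might look like the following. It assumes the slave host names are listed one per line in a plain text file (the same layout Hadoop uses for its slaves file); `push_to_slaves` and `run` are hypothetical helper names. The `--delete` option is not in the original commands: it is added here on the assumption that files removed on the name node should also disappear from the slaves. The script defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch: repeat the manual steps for every slave listed in a hosts file.
# push_to_slaves and run are illustrative names (assumptions), not Hadoop tools.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = really run them

# Run a command, or just print it when DRY_RUN=1.
# stdin is redirected so ssh cannot swallow the host list being read in the loop.
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@" </dev/null; fi
}

push_to_slaves() {
    slaves_file="$1"
    # One host name per line; blank lines and '#' comments are skipped.
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$slaves_file" |
    while read -r node; do
        # Create the target directories on the slave.
        run ssh "hadoop@$node" mkdir -p /hadoop/app /hadoop/data
        # Push tools and data; --delete also removes files that no longer exist on the master.
        run rsync -arvue ssh --delete /home/hadoop/tools/bwa "hadoop@$node:/hadoop/app/"
        run rsync -arvue ssh --delete /home/hadoop/data/hg19/bwa "hadoop@$node:/hadoop/data/"
    done
}

# Example (dry run): list the three slaves in a temporary file and print the commands.
tmp=$(mktemp)
printf 'n1\nn2\nn3\n' > "$tmp"
push_to_slaves "$tmp"
rm -f "$tmp"
```

Running it with `DRY_RUN=0` on the name node would execute the commands for real, provided passwordless ssh to each slave is already set up.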