Monday, October 21, 2013

Implement a hadoop pipeline for NGS data analysis - 4

"Align Fastq to ReferenceGenome and get Alignment"

This is always the first step of an NGS data analysis, assuming we have already done the QC on the "Fastq" files. In this command, both "Fastq" and "ReferenceGenome" are inputs, and "Alignment" is the output. The key difference between the two types of input is that every job has its own "Fastq", while many similar jobs reuse the same "ReferenceGenome", as well as the same application, "Align".

For NGS data analysis, we always have some pre-existing files, including:

1. All bioinformatics applications and packages used in the data analysis.

2. All associated data files used by the applications above, which include:

  1. Reference sequence file (and index)
  2. Indexed reference sequence files for the different aligners
  3. Databases like dbSNP, COSMIC, UCSC Gene Table, etc.
  4. Other pre-defined common resources

The total size of these applications and data files can be tens of GB, depending on how many reference genomes we are going to support in our pipeline.

To achieve the best performance we need to keep a copy of all apps and data on each slave node. We do NOT want to copy tens of GB of "immutable" apps and library files from one place to all slave nodes for every job. It is also very important to keep all apps and library files consistent across all nodes. In other words, we MUST guarantee everything is the same.

The idea is simple: we maintain a single copy of all bioinformatics applications and library files on the master name node, or on another dedicated node, as long as that node can reach all slave nodes over passwordless (key-based) ssh. Then, before accepting EVERY new job, we synchronize all slave nodes to make sure each one has an identical copy.

Why can we NOT synchronize the slave nodes at a fixed interval? Because a change to the backend library files may cause inconsistency if a job is being processed at that moment. So, before we pop a job from the job queue and push it to hadoop, we should synchronize all slave nodes first. OK, then why not just do the synchronization once, when we start our framework? Because we need to update applications and library files while the service is running. If we are going to deploy this service 24/7, we need to update these things on the fly.

Luckily this does not take much time, as most of the work is just comparing file timestamps. No real data transfer happens unless we add, update or remove data on the name node. Even then, the transfer is an incremental update. It is quite simple to get all this done: "rsync" is our friend.
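The "synchronize first, then submit" rule described above can be sketched as a small shell gate. This is only an illustration under assumptions: the function names sync_node and dispatch_job, the SLAVES list and the SUBMIT placeholder are hypothetical and not part of the pipeline itself, and the --delete option is added here so that files removed from the central copy are also removed on the slaves.

```shell
#!/bin/sh
# Sketch only: node names, function names and the SUBMIT command are assumptions.
SLAVES="${SLAVES:-n1 n2 n3}"
RSYNC="${RSYNC:-rsync}"
SUBMIT="${SUBMIT:-echo submit}"   # placeholder for the real job-submission command

# Push the central copy of the apps and data to one slave node.
# -a keeps timestamps and permissions, so unchanged files are skipped;
# --delete also propagates removals from the central copy.
sync_node() {
    node="$1"
    $RSYNC -a --delete -e ssh /home/hadoop/tools/bwa "hadoop@${node}:/hadoop/app/" &&
    $RSYNC -a --delete -e ssh /home/hadoop/data/hg19/bwa "hadoop@${node}:/hadoop/data/"
}

# Synchronize every slave, then submit the job only if all syncs succeeded.
dispatch_job() {
    job="$1"
    for node in $SLAVES; do
        sync_node "$node" || {
            echo "sync failed on $node; job $job not submitted" >&2
            return 1
        }
    done
    $SUBMIT "$job"
}
```

Calling dispatch_job with a job identifier runs the sync for each node in turn and only then invokes the submission command. Setting RSYNC=echo prints the rsync arguments instead of transferring anything, which is a cheap way to check the paths before touching a real cluster.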



For simplicity, we use the name node as the central repository. We have already created the indexed hg19 genome for BWA and put it under "/home/hadoop/data/hg19/bwa". We also put the BWA binary executable under "/home/hadoop/tools/bwa" on the name node.

Now we need to transfer all these files to all slave nodes. There is a script named "slaves.sh" in hadoop's distribution package, which makes it convenient to execute a command on all slave nodes. However, this synchronization should be done inside our application. For demonstration, we can just do it manually to learn the process.

#master is the name node.
#n1, n2 and n3 are slave nodes.

#on slave node
hadoop@n1$ mkdir -p /hadoop/data
hadoop@n1$ mkdir -p /hadoop/app

#on master node
hadoop@master$ rsync -arvue ssh /home/hadoop/tools/bwa hadoop@n1:/hadoop/app/
hadoop@master$ rsync -arvue ssh /home/hadoop/data/hg19/bwa hadoop@n1:/hadoop/data/
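The commands above cover n1 only. A minimal sketch of repeating them for n1, n2 and n3 from the master follows. The function name provision_slaves and the RUN variable are hypothetical helpers introduced for illustration, and creating the directories over ssh is an assumption that relies on the same passwordless ssh access that rsync needs.

```shell
#!/bin/sh
# Sketch only: repeat the manual steps for every slave node.
SLAVES="${SLAVES:-n1 n2 n3}"
RUN="${RUN:-}"    # set RUN=echo to print the commands instead of running them

provision_slaves() {
    for node in $SLAVES; do
        # Create the target directories on the slave.
        $RUN ssh "hadoop@${node}" mkdir -p /hadoop/data /hadoop/app || return 1
        # Copy the BWA executable and the BWA index for hg19.
        $RUN rsync -auv -e ssh /home/hadoop/tools/bwa "hadoop@${node}:/hadoop/app/" || return 1
        $RUN rsync -auv -e ssh /home/hadoop/data/hg19/bwa "hadoop@${node}:/hadoop/data/" || return 1
    done
}
```

Running provision_slaves with RUN=echo lists the nine commands without executing them. With RUN left empty, the commands run for real and the loop stops at the first node that fails.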


