Monday, October 14, 2013

Implement a Hadoop pipeline for NGS data analysis - 2

The fundamental idea behind any big data processing is "divide and conquer": we divide the big data into small pieces - small enough to process in parallel on many computers. The prerequisite is that the processing of these small pieces should be as independent of each other as possible, without any in-process communication.
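As a toy illustration of this pattern (not specific to NGS data), the sketch below splits a list into independent chunks and processes them in parallel with Python's multiprocessing module; the chunk size, worker count, and per-chunk work are placeholders chosen only for the example.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Placeholder work: each chunk is handled on its own,
    # with no communication or shared state between workers.
    return sum(chunk)

if __name__ == "__main__":
    big_data = list(range(1000000))
    # "Divide": cut the data into small, independent pieces.
    chunks = [big_data[i:i + 10000] for i in range(0, len(big_data), 10000)]
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_chunk, chunks)
    # "Conquer": combine the partial results at the end.
    print(sum(partial_results))
```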


Suppose a FASTA file has 10 million sequences, and we want to run a BLAST search against the human reference genome hg19 to find all good alignments.
The question is how to "divide and conquer" this computation-intensive job. The first thing that comes to mind is to divide the FASTA file into many fragments. Let us assume the optimal size of a fragment is 10 sequences - in other words, we will have 1 million FASTA files, each containing ten sequences. If we submit this job to a cluster consisting of 100 computation nodes (sometimes also called "slaves" or "workers"), then on average each node will process 10,000 fragments. In an ideal situation, we can expect a 100x speedup. A rough sketch of the splitting step is shown below.
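The following is a minimal sketch of splitting a query FASTA file into fragments of N sequences each, so the fragments can be BLASTed in parallel. The input/output file names, the fragment size of 10, and the naming scheme are assumptions made only for illustration; it also assumes a well-formed FASTA file that begins with a ">" header line.

```python
def split_fasta(path, seqs_per_fragment=10, prefix="fragment"):
    """Write `path` into many small FASTA files of `seqs_per_fragment` sequences each."""
    fragment_index = 0
    seq_count = 0
    out = None
    with open(path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # Start a new fragment every `seqs_per_fragment` sequences.
                if seq_count % seqs_per_fragment == 0:
                    if out:
                        out.close()
                    out = open(f"{prefix}_{fragment_index:07d}.fa", "w")
                    fragment_index += 1
                seq_count += 1
            out.write(line)
    if out:
        out.close()
    return fragment_index

if __name__ == "__main__":
    n = split_fasta("query.fa", seqs_per_fragment=10)
    print(f"wrote {n} fragments")
```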

Another approach is to divide the reference data. If the query FASTA file is very small, say only 10 sequences, it does not make sense to split it because of the overhead. In this case, we can divide (and index) the reference genome instead, as long as it is divisible. At the very least, the chromosome is a natural and safe unit for any splitting method.
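A minimal sketch of splitting a reference FASTA by chromosome might look like the following; the file names ("hg19.fa" and the per-chromosome outputs) are assumptions, and a real pipeline would typically also build a BLAST database (e.g. with makeblastdb) from each piece.

```python
def split_by_chromosome(reference_path):
    """Write each FASTA record (e.g. chr1, chr2, ...) to its own file."""
    out = None
    with open(reference_path) as ref:
        for line in ref:
            if line.startswith(">"):
                # New chromosome record: open a new output file named after it.
                if out:
                    out.close()
                chrom = line[1:].split()[0]
                out = open(f"{chrom}.fa", "w")
            out.write(line)
    if out:
        out.close()

if __name__ == "__main__":
    split_by_chromosome("hg19.fa")
```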

Now, coming back to the NGS era, what file formats are involved in a typical DNA-Seq data analysis?


  1. FASTQ, sequencing reads
  2. SAM/BAM, alignments
  3. BED, genomic regions
  4. VCF, variants





