Thursday, October 17, 2013

Implement a hadoop pipeline for NGS data analysis - 3

The baby step.

The WordCount example in the Hadoop distribution provides an excellent demonstration of how to implement a simple MapReduce application. About seven years ago, when I read Google's MapReduce paper, I realized it would be an effective solution for many problems in bioinformatics, so I developed a lightweight MapReduce application, ABCGrid (http://abcgrid.cbi.pku.edu.cn/), mainly for BLAST-like sequence analysis.

The WordCount application counts the frequency of words: it reads the text file and tokenizes each line into (word, 1) pairs (Map stage); the framework then partitions the pairs among the Reducers, and each Reducer sums up the count of every word it receives (Reduce stage).
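
For reference, here is a condensed sketch of the Mapper and Reducer from the standard WordCount example shipped with Hadoop (trimmed to the two classes; the driver/job setup is omitted):

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCount {

      // Map stage: tokenize each input line and emit (word, 1)
      public static class TokenizerMapper
              extends Mapper<Object, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private final Text word = new Text();

          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  context.write(word, one);
              }
          }
      }

      // Reduce stage: sum the counts collected for each word
      public static class IntSumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                  sum += val.get();
              }
              result.set(sum);
              context.write(key, result);
          }
      }
  }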

The input for NGS analysis will be FASTQ files, whose sizes range from hundreds of MB to a few GB, depending on the platform, coverage, etc. We can use the FASTQ file name as the key, but first we need to split the FASTQ files.
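
For reference, a FASTQ record is four lines: a header starting with '@', the sequence, a '+' separator, and the per-base qualities. An illustrative (made-up) record:

  @READ_001
  GATTTGGGGTTCAAAGCAGT
  +
  !''*((((***+))%%%++)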

1) Split the FASTQ manually, and save the pieces as files.
   
  #count the lines; suppose we get 1,000,000 (1M) lines
  $zcat A_1.fastq.gz | wc -l

  1.1 NUM_LINES % 4 == 0
  #every 4 lines represent one read, so the split size (in lines)
  #must be a multiple of 4, or a read will be cut across pieces

  1.2 size of the raw fastq
  #if the fastq is small, we do not need to bother splitting it
  #into too many pieces, because of the overhead

  1.3 number of nodes in the cluster, N
  #number of splits >= N

  1.4 HDFS block size, M
  #size of each split <= M

  Suppose that after this calculation we decide to split the big fastq into 100 pieces, each holding 10,000 lines (i.e. 2,500 reads); a small sketch of the calculation itself is given at the end of this step.

  $zcat A_1.fastq.gz | split -l 10000 -a 4 -d - A_1.

  We also have to upload the pieces to HDFS manually:
  $hadoop fs -put A* /user/hdfs/suny/
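
  The calculation referred to above is small; here is a minimal Java sketch, assuming we know the total line count, the number of nodes, the HDFS block size, and a rough average of bytes per line (the method name and parameters are purely illustrative):

  // Choose a lines-per-split value that (1.1) is a multiple of 4,
  // (1.3) yields at least numNodes pieces, and (1.4) keeps each
  // piece at or below the HDFS block size.
  public static long chooseLinesPerSplit(long totalLines, int numNodes,
                                         long blockSizeBytes, long bytesPerLine) {
      long maxLinesByBlock = blockSizeBytes / bytesPerLine;   // rule 1.4
      long maxLinesByNodes = totalLines / numNodes;           // rule 1.3
      long lines = Math.min(maxLinesByBlock, maxLinesByNodes);
      lines -= lines % 4;                                     // rule 1.1
      return Math.max(lines, 4);
  }

  Rule 1.2 remains a judgment call on top of this: for a small fastq we may simply skip the splitting.
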
2) Split the FASTQ in the application automatically, and save the pieces as files. The split size is still governed by the same set of rules as above, so the calculation still has to be done, this time inside the application (see the sketch after the skeletons below).

   2.1 public static int splitFastq(File input, File outputPath)
    {
       // split the fastq into pieces under outputPath
       ...
       return 10000;   // lines per piece
    }
   2.2 public static void copyFastq(File outputPath, Path inputHDFS)
    {
       // upload the pieces to HDFS
       FileUtil.copy(...);
    }
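
   A more complete, self-contained sketch of what 2.1 and 2.2 might look like, assuming gzipped input, plain-text pieces, and a default FileSystem configured to point at HDFS; apart from the two method names above, everything here is my own assumption:

  import java.io.*;
  import java.util.zip.GZIPInputStream;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class FastqSplitter {

      // 2.1: split a gzipped fastq into pieces of 10,000 lines each,
      // written under outputPath as part-0000, part-0001, ...
      public static int splitFastq(File input, File outputPath) throws IOException {
          final int linesPerSplit = 10000;   // e.g. the value chosen in step 1
          outputPath.mkdirs();
          try (BufferedReader in = new BufferedReader(new InputStreamReader(
                  new GZIPInputStream(new FileInputStream(input))))) {
              int piece = 0, lines = 0;
              PrintWriter out = newPiece(outputPath, piece);
              String line;
              while ((line = in.readLine()) != null) {
                  out.println(line);
                  if (++lines == linesPerSplit) {   // multiple of 4, so reads stay intact
                      out.close();
                      out = newPiece(outputPath, ++piece);
                      lines = 0;
                  }
              }
              out.close();
          }
          return linesPerSplit;
      }

      private static PrintWriter newPiece(File dir, int index) throws IOException {
          return new PrintWriter(new BufferedWriter(
                  new FileWriter(new File(dir, String.format("part-%04d", index)))));
      }

      // 2.2: upload every piece to HDFS (the programmatic "hadoop fs -put")
      public static void copyFastq(File outputPath, Path inputHDFS) throws IOException {
          FileSystem fs = FileSystem.get(new Configuration());
          for (File piece : outputPath.listFiles()) {
              fs.copyFromLocalFile(new Path(piece.getAbsolutePath()), inputHDFS);
          }
      }
  }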

3) Split the FASTQ in the application automatically, but do NOT save the pieces as files. Instead, feed the content to the Map function as the value. To do this we need to derive our own classes from InputFormat and RecordReader; a sketch is given below.
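
   A minimal sketch of such a pair, assuming one Map record per read: the InputFormat disables splitting so a 4-line record never straddles an input split, and the RecordReader wraps Hadoop's LineRecordReader to group 4 lines into one value (the class names are my own):

  import java.io.IOException;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.JobContext;
  import org.apache.hadoop.mapreduce.RecordReader;
  import org.apache.hadoop.mapreduce.TaskAttemptContext;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

  public class FastqInputFormat extends FileInputFormat<LongWritable, Text> {

      @Override
      protected boolean isSplitable(JobContext context, Path file) {
          // keep each fastq piece in one split so a 4-line read is never cut in half
          return false;
      }

      @Override
      public RecordReader<LongWritable, Text> createRecordReader(
              InputSplit split, TaskAttemptContext context) {
          return new FastqRecordReader();
      }

      public static class FastqRecordReader extends RecordReader<LongWritable, Text> {
          private final LineRecordReader lineReader = new LineRecordReader();
          private final LongWritable key = new LongWritable();
          private final Text value = new Text();
          private long recordNo = 0;

          @Override
          public void initialize(InputSplit split, TaskAttemptContext context)
                  throws IOException, InterruptedException {
              lineReader.initialize(split, context);
          }

          @Override
          public boolean nextKeyValue() throws IOException, InterruptedException {
              // one fastq record = 4 consecutive lines (@id, sequence, +, quality)
              StringBuilder read = new StringBuilder();
              for (int i = 0; i < 4; i++) {
                  if (!lineReader.nextKeyValue()) {
                      return false;   // end of file (or truncated record)
                  }
                  if (i > 0) read.append('\n');
                  read.append(lineReader.getCurrentValue().toString());
              }
              key.set(recordNo++);
              value.set(read.toString());
              return true;
          }

          @Override
          public LongWritable getCurrentKey() { return key; }

          @Override
          public Text getCurrentValue() { return value; }

          @Override
          public float getProgress() throws IOException, InterruptedException {
              return lineReader.getProgress();
          }

          @Override
          public void close() throws IOException {
              lineReader.close();
          }
      }
  }

   In the job driver this would be plugged in with job.setInputFormatClass(FastqInputFormat.class).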
