Thursday, July 24, 2014

Counting lines in "fastq.gz"

Before splitting a "fastq.gz" file into fragments, we need to count how many lines the raw file contains, so that we can calculate how many lines go into each fragment.
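As a rough sketch of that calculation: a FASTQ record normally spans 4 lines, so the lines per fragment should be a multiple of 4 or a read would be cut in half. The helper name "lines_per_fragment" and the choice of 8 fragments below are only illustrative assumptions, not something from the original workflow.

```shell
# Hypothetical helper: lines per fragment, rounded up to whole 4-line records.
# $1 = total line count of the raw file, $2 = desired number of fragments
lines_per_fragment() {
    total_lines=$1
    n_fragments=$2
    records=$(( total_lines / 4 ))                                   # 4 lines per read
    records_per_fragment=$(( (records + n_fragments - 1) / n_fragments ))  # ceiling division
    echo $(( records_per_fragment * 4 ))
}

# 20000000 lines = 5000000 reads; split 8 ways = 625000 reads per fragment
lines_per_fragment 20000000 8   # prints 2500000
```

Rounding up means the last fragment may be slightly smaller than the others, but no read is ever split across two files.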

If the raw file is big, the counting itself takes a long time because the file has to be decompressed first.

The common approach is to pipe "zcat" into "wc -l". Is it the fastest?



#1. use zcat - 9.381s 
 $time zcat A_R1.fq.gz | wc -l
 20000000

 real    0m9.381s
 user    0m7.423s
 sys     0m1.919s


#2. use zgrep - 16.258s
 $time zgrep -Ec "$" A_R1.fq.gz
 20000000

 real    0m16.258s
 user    0m13.109s
 sys     0m0.490s


#3. use pigz - 7.227s
 $sudo apt-get install pigz
 $time pigz -d -p 1 -c A_R1.fq.gz  | wc -l
 20000000

 real    0m7.227s
 user    0m5.544s
 sys     0m0.386s

#4. use pigz with 4 threads - 5.973s
 $time pigz -d -p 4 -c A_R1.fq.gz  | wc -l
 20000000

 real    0m5.973s
 user    0m5.599s
 sys     0m2.200s

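By default, pigz uses all available processors, so "-p" is not necessary. Note that pigz cannot parallelize the decompression itself; the extra threads only handle reading, writing, and checksum calculation, which is why the gain from 1 to 4 threads is modest. The above command can be simplified to: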

 $pigz -dc A_R1.fq.gz  | wc -l


Clearly, our winner is pigz: 5.973s with 4 threads, versus 9.381s for zcat and 16.258s for zgrep.

I am going to write another post on splitting the "fastq.gz" file with only one goal: make it as fast as possible. The steps are:

 1. Count lines
 2. Determine how many lines go into each split fastq file
 3. Unzip and split the paired "fastq.gz" files
 4. Zip the split "fastq" files again
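
Until that post is written, here is one possible sketch of the four steps for a single file. It is not a tuned implementation: it assumes bash and GNU split (for the "--filter" option), and the function name "split_fastq_gz", the "part." prefix, and the fragment count are made up for illustration. It falls back to gzip when pigz is not installed.

```shell
# Use pigz when installed, otherwise fall back to gzip (both accept -d and -c).
GZ=$(command -v pigz || command -v gzip)

# Hypothetical helper: split_fastq_gz INPUT.fq.gz N_FRAGMENTS OUTPUT_PREFIX
split_fastq_gz() {
    local in=$1 n_fragments=$2 prefix=$3
    local total records per

    # 1. Count lines
    total=$("$GZ" -dc "$in" | wc -l)

    # 2. Lines per fragment, rounded up to whole 4-line records
    records=$(( total / 4 ))
    per=$(( (records + n_fragments - 1) / n_fragments * 4 ))

    # 3 + 4. Decompress, split, and recompress each piece on the fly.
    # GNU split sets $FILE to the name of each output piece.
    "$GZ" -dc "$in" |
        split -l "$per" -d --filter="$GZ -c > \$FILE.fq.gz" - "$prefix"
}
```

For example, "split_fastq_gz A_R1.fq.gz 8 A_R1.part." would write A_R1.part.00.fq.gz, A_R1.part.01.fq.gz, and so on. Running it with the same fragment count on the mate file (assumed here to be named A_R2.fq.gz) should keep the pairs aligned, since both files of a pair have the same number of lines.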


Reference:
http://superuser.com/questions/135329/count-lines-in-a-compressed-file


