If the raw file is large, counting its lines takes a long time because the file has to be decompressed first.
The common approach is to pipe "zcat" into "wc". But is it the fastest?
#1. use zcat - 9.381s
 $time zcat A_R1.fq.gz | wc -l
 20000000
 real    0m9.381s
 user    0m7.423s
 sys     0m1.919s
#2. use zgrep - 16.258s
 $time zgrep -Ec "$" A_R1.fq.gz
 20000000
 real    0m16.258s
 user    0m13.109s
 sys     0m0.490s
#3. use pigz - 7.227s
 $sudo apt-get install pigz
 $time pigz -d -p 1 -c A_R1.fq.gz  | wc -l
 20000000
 real    0m7.227s
 user    0m5.544s
 sys     0m0.386s
#4. use pigz with 4 threads - 5.973s
 $time pigz -d -p 4 -c A_R1.fq.gz  | wc -l
 20000000
 real    0m5.973s
 user    0m5.599s
 sys     0m2.200s
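By default, pigz uses all available processors, so "-p" is not necessary. Note that pigz cannot parallelize the decompression itself; the extra threads only handle reading, writing, and checksum calculation, which likely explains the modest gain from 4 threads. The command above can be simplified to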
 $pigz -dc A_R1.fq.gz  | wc -l
Clearly, our winner is pigz: 5.973s with 4 threads, versus 9.381s for zcat and 16.258s for zgrep.
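Since pigz is not installed everywhere, here is a minimal sketch of a helper that prefers pigz and falls back to plain gzip. The function name "count_gz_lines" is made up for illustration; it is not part of pigz or gzip.

```shell
#!/bin/sh
# count_gz_lines is a hypothetical helper name, not part of pigz or gzip.
# Prints the number of lines in a gzip-compressed file.
# Usage: count_gz_lines A_R1.fq.gz
count_gz_lines() {
    if command -v pigz >/dev/null 2>&1; then
        # pigz is available: use it for decompression
        pigz -dc "$1" | wc -l | tr -d ' '
    else
        # fall back to gzip, which is equivalent to zcat here
        gzip -dc "$1" | wc -l | tr -d ' '
    fi
}
```

The "tr -d ' '" only strips the padding that some "wc" implementations print in front of the number, so the result can be used directly in shell arithmetic.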
I am going to write another post about splitting a "fastq.gz" file with only one goal: doing it as fast as possible. The plan:
 1. Count lines
 2. Determine how many lines go into each split fastq file
 3. Unzip and split the paired "fastq.gz" files
 4. Zip the split "fastq" files again.
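Until that post exists, the four steps could be sketched roughly as follows. This is only an illustration under stated assumptions, not the final method: the function names, the chunk count, and the output prefix are all hypothetical, and the input is assumed to be a non-empty FASTQ file with exactly 4 lines per record.

```shell
#!/bin/sh
# Hypothetical helpers; names, chunk count and prefix are illustrative only.

# Step 2: lines per chunk, rounded up to a multiple of 4 so that
# no 4-line FASTQ record is cut in half.
lines_per_chunk() {
    total_lines=$1
    chunks=$2
    records=$(( total_lines / 4 ))
    per_chunk=$(( (records + chunks - 1) / chunks ))
    echo $(( per_chunk * 4 ))
}

# Steps 1, 3 and 4 for a single file: count, decompress + split, recompress.
# Usage: split_fastq_gz A_R1.fq.gz 4 A_R1_part_
split_fastq_gz() {
    input=$1; chunks=$2; prefix=$3
    # Prefer pigz, fall back to gzip if it is not installed
    if command -v pigz >/dev/null 2>&1; then tool=pigz; else tool=gzip; fi
    total=$("$tool" -dc "$input" | wc -l | tr -d ' ')
    n=$(lines_per_chunk "$total" "$chunks")
    # split names the pieces <prefix>aa, <prefix>ab, ...
    "$tool" -dc "$input" | split -l "$n" - "$prefix"
    # Compress every piece again
    "$tool" "$prefix"*
}
```

For paired files, calling "split_fastq_gz" once for R1 and once for R2 with the same chunk count should keep the pieces in sync, assuming both files hold the same number of records in the same order.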
Reference:
http://superuser.com/questions/135329/count-lines-in-a-compressed-file
 