If the raw file is quite big, counting lines takes a long time because the file has to be decompressed first.
The common approach is to pipe "zcat" into "wc". Is it the fastest?
#1. use zcat - 9.381s
$time zcat A_R1.fq.gz | wc -l
20000000
real 0m9.381s
user 0m7.423s
sys 0m1.919s
#2. use zgrep - 16.258s
$time zgrep -Ec "$" A_R1.fq.gz
20000000
real 0m16.258s
user 0m13.109s
sys 0m0.490s
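As an aside, the zgrep trick works because the pattern "$" (the end-of-line anchor) matches every line, so "-c" ends up counting all lines; zgrep simply applies this after decompression. A quick sanity check with plain grep:

```shell
# "$" matches the end of every line, so grep -c "$" counts lines.
printf 'a\nb\nc\n' | grep -c "$"   # prints 3
```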
#3. use pigz - 7.227s
$sudo apt-get install pigz
$time pigz -d -p 1 -c A_R1.fq.gz | wc -l
20000000
real 0m7.227s
user 0m5.544s
sys 0m0.386s
#4. use pigz with 4 threads - 5.973s
$time pigz -d -p 4 -c A_R1.fq.gz | wc -l
20000000
real 0m5.973s
user 0m5.599s
sys 0m2.200s
By default, pigz uses all available processors, so "-p" is not necessary. The above command can be simplified to
$pigz -dc A_R1.fq.gz | wc -l
Clearly, our winner is pigz.
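Since each FASTQ record spans exactly four lines, the read count follows from the line count by dividing by four. A minimal sketch on a tiny synthetic file (gzip is used here so the example runs anywhere; "pigz -dc" is the faster drop-in replacement on real data):

```shell
# Each FASTQ record is exactly 4 lines, so reads = lines / 4.
# Demo on a one-record synthetic file; on real data replace
# "gzip -dc" with "pigz -dc" for speed.
printf '@read1\nACGT\n+\nIIII\n' | gzip > tiny.fq.gz
lines=$(gzip -dc tiny.fq.gz | wc -l)
reads=$((lines / 4))
echo "reads: $reads"   # prints "reads: 1"
rm -f tiny.fq.gz
```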
I am going to write another post on splitting "fastq.gz" files with only one goal - as fast as possible. The steps will be:
1. Count lines
2. Determine how many lines per split fastq file
3. Unzip and split the paired "fastq.gz" files
4. Zip the split "fastq" files again.
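The steps above can be sketched as a short pipeline. This is only a guess at how that post might look; the file names, the chunk size, and the use of gzip (so the demo runs without pigz installed) are all illustrative. The one real constraint is that the chunk size must be a multiple of 4, or records get torn across files:

```shell
# Sketch of the split workflow on a tiny synthetic FASTQ (4 records).
# On real data, use "pigz -dc" / "pigz" instead of gzip for speed.
printf '@r%s\nACGT\n+\nIIII\n' 1 2 3 4 | gzip > demo_R1.fq.gz

# Steps 1-3: decompress and split into chunks of 8 lines (2 records);
# the line count per chunk must be a multiple of 4.
gzip -dc demo_R1.fq.gz | split -l 8 - demo_R1.part_

# Step 4: re-compress each chunk.
for f in demo_R1.part_*; do
    gzip "$f"
done
ls demo_R1.part_*.gz
```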
Reference:
http://superuser.com/questions/135329/count-lines-in-a-compressed-file