If the raw file is quite big, counting lines takes a long time because the file has to be decompressed first.
The common approach is to pipe "zcat" into "wc". Is it the fastest?
#1. use zcat - 9.381s
$time zcat A_R1.fq.gz | wc -l
20000000
real 0m9.381s
user 0m7.423s
sys 0m1.919s
#2. use zgrep - 16.258s
$time zgrep -Ec "$" A_R1.fq.gz
20000000
real 0m16.258s
user 0m13.109s
sys 0m0.490s
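As an aside, the zgrep trick works because the pattern "$" (the end-of-line anchor) matches every line, so "-c" ends up counting all lines; zgrep simply applies this after decompression. A quick sanity check with plain grep:

```shell
# "$" matches the end of every line, so grep -c "$" counts lines.
printf 'a\nb\nc\n' | grep -c "$"   # prints 3
```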
#3. use pigz - 7.227s
$sudo apt-get install pigz
$time pigz -d -p 1 -c A_R1.fq.gz | wc -l
20000000
real 0m7.227s
user 0m5.544s
sys 0m0.386s
#4. use pigz with 4 threads - 5.973s
$time pigz -d -p 4 -c A_R1.fq.gz | wc -l
20000000
real 0m5.973s
user 0m5.599s
sys 0m2.200s
By default, pigz uses all available processors, so "-p" is not necessary. The above command can be simplified to
$pigz -dc A_R1.fq.gz | wc -l
Clearly, our winner is pigz.
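Since each FASTQ record spans exactly four lines, the read count follows from the line count by dividing by four. A minimal sketch on a tiny synthetic file (gzip is used here so the example runs anywhere; "pigz -dc" is the faster drop-in replacement on real data):

```shell
# Each FASTQ record is exactly 4 lines, so reads = lines / 4.
# Demo on a one-record synthetic file; on real data replace
# "gzip -dc" with "pigz -dc" for speed.
printf '@read1\nACGT\n+\nIIII\n' | gzip > tiny.fq.gz
lines=$(gzip -dc tiny.fq.gz | wc -l)
reads=$((lines / 4))
echo "reads: $reads"   # prints "reads: 1"
rm -f tiny.fq.gz
```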
I am going to write another post on splitting "fastq.gz" files with only one goal - as fast as possible. The steps will be:
1. Count lines
2. Determine how many lines per split fastq file
3. Unzip and split the paired "fastq.gz" files
4. Zip the split "fastq" files again.
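The steps above can be sketched as a short pipeline. This is only a guess at how that post might look; the file names, the chunk size, and the use of gzip (so the demo runs without pigz installed) are all illustrative. The one real constraint is that the chunk size must be a multiple of 4, or records get torn across files:

```shell
# Sketch of the split workflow on a tiny synthetic FASTQ (4 records).
# On real data, use "pigz -dc" / "pigz" instead of gzip for speed.
printf '@r%s\nACGT\n+\nIIII\n' 1 2 3 4 | gzip > demo_R1.fq.gz

# Steps 1-3: decompress and split into chunks of 8 lines (2 records);
# the line count per chunk must be a multiple of 4.
gzip -dc demo_R1.fq.gz | split -l 8 - demo_R1.part_

# Step 4: re-compress each chunk.
for f in demo_R1.part_*; do
    gzip "$f"
done
ls demo_R1.part_*.gz
```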
Reference:
http://superuser.com/questions/135329/count-lines-in-a-compressed-file