Tuesday, August 13, 2013

Hard trimming FASTQ files

When we plot the base quality distribution (for example, with FastQC), we often see that the base qualities at the 3' tail of the reads are poor. Sometimes the qualities at the 5' head are poor too.

For better alignment and variant calling (if any), we may need to trim the FASTQ files and keep only the bases from the middle cycles, which generally have good and consistent qualities.

One method is to trim by base quality, which can be done with tools like sickle. Alternatively, we can hard trim all reads, which means removing a fixed number N of bases from the tail (or head) of every read, together with the corresponding quality characters.

A typical FASTQ record (four lines: read name, bases, a "+" separator, and one quality character per base):

@READ_NAME
NGGAAATGGCGTCTGGCGGCGAGATAATGG
+
#1=DDFFFHGHHHIIGIIJJJJJJIIJGJI
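
Because every base has exactly one quality character, lines 2 and 4 of a record must have the same length, both before and after trimming. The following is a minimal sketch of a check for that, assuming strict 4-line records (no wrapped sequence lines); `fastq_check` is a hypothetical helper name, not part of any existing tool.

```shell
# fastq_check: hypothetical helper, not from any existing tool.
# Reads uncompressed FASTQ on stdin; assumes strict 4-line records.
# Prints the record count and how many records have a bases/qualities length mismatch.
fastq_check() {
    awk '
        NR % 4 == 2 { seqlen = length($0) }                    # line 2: bases
        NR % 4 == 0 { n++; if (length($0) != seqlen) bad++ }   # line 4: qualities
        END { printf "records=%d mismatched=%d\n", n, bad }
    '
}

# Example: zcat A_1.fastq.gz | fastq_check
```

A non-zero "mismatched" count would suggest the file is damaged or does not follow the 4-line layout, in which case the line-based trimming in this post should not be applied as is.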

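Let us say we have paired-end FASTQ files "A_1.fastq.gz" and "A_2.fastq.gz". The commands below are shown for A_1.fastq.gz; repeat the same command for A_2.fastq.gz. Hard trimming keeps every read, so the order of the pairs is preserved.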

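# The three commands below are alternatives; pick one.
# In each 4-line record, the even-numbered lines (2 and 4) hold the bases and the qualities.
# --posix is a gawk option; older gawk versions need it to accept the {10} interval syntax.
# Reads shorter than 10 bases do not match the pattern and are left unchanged.

# trim the 10 tail bases
zcat A_1.fastq.gz | awk --posix 'NR % 2 == 0 { sub(/.{10}$/, "") } { print }' | gzip > A_1.fq.gz

# trim the 10 head bases
zcat A_1.fastq.gz | awk --posix 'NR % 2 == 0 { sub(/^.{10}/, "") } { print }' | gzip > A_1.fq.gz

# trim the 10 head bases and the 10 tail bases
zcat A_1.fastq.gz | awk --posix 'NR % 2 == 0 { sub(/^.{10}/, ""); sub(/.{10}$/, "") } { print }' | gzip > A_1.fq.gz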
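
If the number of bases to trim changes often, the same idea can be wrapped in a small function. The sketch below is one possible variant, not the commands above: `hard_trim` is a hypothetical helper name, it uses `substr` instead of a regular expression (so it does not depend on interval support), and it assumes strict 4-line records. Unlike the `sub()` version, a read that is not longer than head + tail is emitted as an empty line, which some downstream tools may not accept.

```shell
# hard_trim HEAD TAIL: hypothetical helper, not from any existing tool.
# Reads uncompressed FASTQ on stdin and writes the trimmed FASTQ to stdout.
# Assumes strict 4-line records (no wrapped sequence lines).
hard_trim() {
    awk -v head="$1" -v tail="$2" '
        NR % 2 == 0 {
            # Lines 2 and 4 of each record: bases and qualities.
            keep = length($0) - head - tail
            # Reads that are too short become empty lines.
            $0 = (keep > 0) ? substr($0, head + 1, keep) : ""
        }
        { print }
    '
}

# Example with the file names used in this post:
# for mate in 1 2; do
#     zcat "A_${mate}.fastq.gz" | hard_trim 10 10 | gzip > "A_${mate}.fq.gz"
# done
```

Passing 0 for either argument trims only the other end, so the one function covers all three cases.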
