Thursday, March 21, 2013

Infinite processing time for novoalign

A few days ago a user reported that his simple "@align" job had been running for more than 5 days on the cluster.
That is clearly abnormal. After a few hours of work I tracked the problem down to 3 causes.

The first two causes came from a corrupted input FASTQ file, which had both an incorrect format and an incorrect size.
The last came from novoalign itself: it keeps processing a corrupted FASTQ file indefinitely without giving any error message.

Until novoalign adds a feature to validate its input files, we can do a simple validation ourselves.

For paired-end files, first compare the sizes of the two files. The line counts (4 lines per read) should be identical.

$ zcat X1.fq.gz | wc -l

$ zcat X2.fq.gz | wc -l
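The comparison can also be wrapped in a small shell function, so it can run before a job is submitted. This is only a minimal sketch: `count_lines` and `check_pair` are hypothetical helper names (not part of novoalign), and it assumes gzip-compressed FASTQ files with 4 lines per read.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the number of lines in a gzipped file.
count_lines() {
    gzip -dc "$1" | wc -l
}

# Hypothetical helper: succeed only if both mate files have the same
# number of lines (and therefore the same number of reads).
check_pair() {
    local n1 n2
    n1=$(count_lines "$1")
    n2=$(count_lines "$2")
    if [ "$n1" -ne "$n2" ]; then
        echo "MISMATCH: $1 has $n1 lines, $2 has $n2 lines" >&2
        return 1
    fi
    # Assumes 4 lines per read; integer division drops any remainder.
    echo "OK: $(( n1 / 4 )) read pairs"
}
```

For example, `check_pair X1.fq.gz X2.fq.gz && echo "safe to align"` would print the final message only when the two counts agree.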


For a single-end file, check that the number of lines is divisible by 4. The command below should print 0.

$ expr `zcat X.fq.gz | wc -l` % 4
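A line count that is a multiple of 4 does not prove the records are well formed. As a rough sketch of a slightly stricter check, the hypothetical function `check_fastq` below uses awk to verify that every header starts with "@", every separator line starts with "+", and the quality string is as long as the sequence. It assumes gzip-compressed, unwrapped FASTQ (exactly 4 lines per read) and stops at the first problem it finds.

```shell
#!/usr/bin/env bash

# Hypothetical helper: return 0 if the gzipped FASTQ file looks well formed,
# non-zero otherwise. Assumes exactly 4 lines per read (no wrapped sequences).
check_fastq() {
    gzip -dc "$1" | awk '
        # Line 1 of each record: header must start with "@".
        NR % 4 == 1 && substr($0, 1, 1) != "@" {
            printf "line %d: header does not start with @\n", NR
            bad = 1; exit
        }
        # Line 2: remember the sequence length.
        NR % 4 == 2 { seqlen = length($0) }
        # Line 3: separator must start with "+".
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            printf "line %d: separator does not start with +\n", NR
            bad = 1; exit
        }
        # Line 4: quality string must match the sequence length.
        NR % 4 == 0 && length($0) != seqlen {
            printf "line %d: quality length differs from sequence length\n", NR
            bad = 1; exit
        }
        END {
            # A partial last record means the file was truncated.
            if (!bad && NR % 4 != 0) {
                printf "truncated record: %d lines in total\n", NR
                bad = 1
            }
            exit (bad ? 1 : 0)
        }
    '
}
```

Running `check_fastq X.fq.gz || echo "do not align this file"` would flag a file before it is handed to the aligner.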

TCGA and COSMIC database for annotating mutations

TCGA - The Cancer Genome Atlas: https://tcga-data.nci.nih.gov/tcga/

COSMIC: http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/download.html