Monday, October 18, 2010

A SNP Call pipeline

SNP and MicroIndel detection

Novoalign detects mismatches and short indels in short reads from next-generation sequencing platforms. The recipes below contain command line usage to generate list of single nucleotide polymorphisms SNPs and Indels for further analysis.

Generate sorted alignments in BAM format

Inititally we need to align our file(s) to the reference genome:

novoalign -d hg18.nix -f reads1.fastq -o SAM 2> stats.txt > novo.sam

samtools view -bS novo.sam | samtools sort - reads1

For paired-end or mate-pair reads adjust the parameters accordingly. If we have multiple alignment files these all need to be sorted individually and merged with either samtools or Picard.

Remove PCR duplicates

It is highly recommended to remove PCR duplicates before proceeding with the SNP and Indel calls. We use samtools as an example to remove PCR duplicates.

#remove PCR duplicates from single-end reads

samtools rmdup -s reads1.bam reads1.sorted.bam

#For Paired reads

samtools rmdup reads1.bam

SNP and Microindel calling

In this example we use the samtools pileup SNP caller and accept reads with a minimum mapping quality of 20:

#The hg18.fasta reference genome sequence is required. is also required

samtools view -u -q 20 reads1.bam | samtools pileup -vc -f hg18.fasta - | varFilter -D 1000 -G 50 > variations.list

Filtering and final list generation

A simple awk script can be used to filter our list of variations where the indel SNP quality is greater than 50 or 60 respectively:

awk '($3=="*"&&$6>=50)||($3!="*"&&$6>30)' variations.list > variations.filtered.list

Exporting to other formats

The samtools distribution contains a helper script called that may be used to convert to the Variant Call Format (VCF): variations.filtered.list -r hg18.fasta > variations.vcf

Thursday, October 7, 2010

Walltime estimation

Input: ~2.5G *2 (pair-end)
Genome: ~4G

250MB reads and 4GB genome per node: ~4 hours