Next Generation Sequencing and Data Analysis: GNUMAP and novoalign benchmark

Data:

7550X1_1_1.txt.gz and 7550X1_1_2.txt.gz: paired-end, 25,000 read pairs, 101 bp per read.

chr20.fa, human genome chromosome 20, build hg19.
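Before benchmarking, it can help to confirm the read count and read length of each input file. The following is a minimal sketch, not part of the original benchmark: the function name `fastq_summary` is hypothetical, and it assumes plain four-line FASTQ records (no wrapped sequence lines).

```python
import gzip
import sys
from collections import Counter


def fastq_summary(path):
    """Return (number of reads, Counter of read lengths) for a FASTQ file.

    Assumes every record is exactly four lines; gzip is used when the
    file name ends in ".gz".
    """
    opener = gzip.open if path.endswith(".gz") else open
    lengths = Counter()
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each record holds the sequence
                lengths[len(line.rstrip("\r\n"))] += 1
    return sum(lengths.values()), lengths


if __name__ == "__main__":
    # Example: python fastq_summary.py 7550X1_1_1.txt.gz 7550X1_1_2.txt.gz
    for path in sys.argv[1:]:
        n_reads, lengths = fastq_summary(path)
        print(path, n_reads, dict(lengths))
```

For the files described above, one would expect each file to report 25,000 reads of length 101, if the description is accurate.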

Command:

$gnumap-plain -g chr20.fa -o example.output -a .9 -v 1 --illumina 7550X1_1_1.txt.gz 7550X1_1_2.txt.gz

Output:

This is GNUMAP, Version 2.2.0, for public and private use.

Command Line Arguments: gnumap-plain -g chr20.fa -o example.output -a .9 -v 1 --illumina 7550X1_1_1.txt,7550X1_1_2.txt

Parameters:

Verbose: 1

Genome file(s): chr20.fa

Output file: example.output

Sequence file(s): 7550X1_1_1.txt,7550X1_1_2.txt

Align score: 0.9

Number of threads: 1

Mer size: 10

Using jump size of 5

Using Default Alignment Scores

Gap score: -4

Maximum Gaps: 3

Hashing the genome.

[0/1] gen_size=0, my_start=0, my_end=0

chr20.fa

Hashing Genome...

Reading: chr20

..............................................

Size of genome: 63025520

[0/-] Converting to Vector...

[0/-] Trying to create hash of size 1048576

[0/-] Finished create hash.

[0/-] Stats: Total hashes is 59499396, Longest hash is 69516, shortest is 1, and average is 56.743046

[0/-] Trying to create a new genome with a size of 63025520...Success!

[0/-] Trying to malloc 7878190 elements for positions array...Success!

[0/-] Finished Vector Conversion

Time to hash: 12 seconds

Matching 2 file(s):

[-/0] Matching 25000 sequences of: 7550X1_1_1.txt

Reads per processor: 128

[0/0] 33% reads complete

[0/0] 66% reads complete

[0/0] 98% reads complete

[-/0] Matching 25000 sequences of: 7550X1_1_2.txt

[0/0] 0% reads complete

[0/0] 33% reads complete

[0/0] 66% reads complete

[0/0] 98% reads complete

[0/-] Time since start: 4890.28

[0/-] Printing output.

Finished printing to example.output.sgr

#Finished.

# Total Time: 4890.38 seconds.

# Found 49152 sequences.

# Sequences matched: 13921

# Sequences not matched: 35231

# Output written to example.output

Total wait time is 0.000000
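The summary lines at the end of the GNUMAP log can be turned into a match rate and a throughput figure. The sketch below copies those four lines verbatim from the output above; the regular expressions and the function name `parse_summary` are my own assumptions about the log layout, not something GNUMAP documents.

```python
import re

# Summary lines copied from the GNUMAP output above.
SUMMARY = """\
# Total Time: 4890.38 seconds.
# Found 49152 sequences.
# Sequences matched: 13921
# Sequences not matched: 35231
"""

# Hypothetical patterns, written to fit the lines shown in this post.
PATTERNS = {
    "total_seconds": (r"Total Time:\s*(\d+(?:\.\d+)?)", float),
    "found": (r"Found\s+(\d+)\s+sequences", int),
    "matched": (r"Sequences matched:\s*(\d+)", int),
    "unmatched": (r"Sequences not matched:\s*(\d+)", int),
}


def parse_summary(text):
    """Extract the numeric fields from a GNUMAP summary block."""
    stats = {}
    for key, (pattern, cast) in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            stats[key] = cast(match.group(1))
    return stats


stats = parse_summary(SUMMARY)
matched_fraction = stats["matched"] / stats["found"]
reads_per_second = stats["found"] / stats["total_seconds"]

print(f"matched fraction: {matched_fraction:.1%}")
print(f"throughput: {reads_per_second:.2f} reads/second")
```

With these numbers, roughly 28% of the sequences GNUMAP counted were matched, at about 10 reads per second on a single thread. Note that the log reports 49152 sequences found, not the 50000 in the two input files.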

The same data with novoalign (for fairness, the license file was removed to disable multithreading, and unzipped FASTQ files were used):

$novoalign -o SAM $'@RG\tID:YING\tPL:ILLUMINA\tLB:LB_TEST\tSM:7851\tCN:HCI' -F ILMFQ -d chr20.nix -f 7550X1_1_1.txt 7550X1_1_2.txt >7550x1.sam

# novoalign (2.07.05 - Nov 29 2010 @ 13:34:51) - A short read aligner with qualities.

# Licensed for evaluation, educational, and not-for-profit use only.

# novoalign -o SAM @RG ID:YING PL:ILLUMINA LB:LB_TEST SM:7851 CN:HCI -F ILMFQ -d chr20.nix -f 7550X1_1_1.txt 7550X1_1_2.txt

# Interpreting input files as Illumina FASTQ, Cassava Pipeline 1.3.

# Index Build Version: 2.7

# Hash length: 14

# Step size: 1

# Paired Reads: 25000

# Pairs Aligned: 368

# Read Sequences: 50000

# Aligned: 767

# Unique Alignment: 749

# Gapped Alignment: 18

# Quality Filter: 490

# Homopolymer Filter: 32

# Elapsed Time: 59.363 (sec.)

# CPU Time: 0.9 (min.)
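To put the two runs side by side, the reported figures can be reduced to an aligned fraction and a wall-clock ratio. This is a rough sketch using only the numbers printed in the two logs above; the `Run` class is hypothetical. The comparison is not exact: the GNUMAP time includes 12 seconds of genome hashing while novoalign read a prebuilt index (chr20.nix), and GNUMAP counted 49152 sequences where novoalign counted 50000.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Figures taken from one aligner's log (hypothetical container)."""
    name: str
    reads: int       # sequences the tool reports having processed
    aligned: int     # sequences the tool reports as matched/aligned
    seconds: float   # wall-clock time reported by the tool

    @property
    def aligned_fraction(self):
        return self.aligned / self.reads

    @property
    def reads_per_second(self):
        return self.reads / self.seconds


gnumap = Run("GNUMAP 2.2.0", reads=49152, aligned=13921, seconds=4890.38)
novoalign = Run("novoalign 2.07.05", reads=50000, aligned=767, seconds=59.363)

for run in (gnumap, novoalign):
    print(f"{run.name}: {run.aligned_fraction:.2%} aligned, "
          f"{run.reads_per_second:.1f} reads/second")

# Ratio of wall-clock times, GNUMAP over novoalign.
time_ratio = gnumap.seconds / novoalign.seconds
print(f"GNUMAP took {time_ratio:.1f}x as long as novoalign")
```

On these figures novoalign finished about 82 times faster, but it also reported far fewer aligned reads (about 1.5% versus about 28%), so the two tools are clearly applying very different acceptance criteria with the settings used here.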

Monday, March 14, 2011
