Friday, August 19, 2011

Merge the SAM/BAM file

Suppose I have two alignments in SAM (1.sam and 2.sam). Since these two alignments are two lanes from the same sample, I should merge these two alignments into one.


>wc -l 1.sam
10000

>wc -l 2.sam
10000

So both the SAM files has 10000 lines (include the header)

First using picard's "MergeSamFiles.jar"

>java -jar MergeSamFiles.jar I=1.sam I=2.sam O=12.sam
>wc -l 12.sam
19973

The missing 27 lines are actually duplicated header lines and they were removed by PICARD automatically. However, there is a subtle bug here.

>less 12.sam

@HD VN:1.0 GO:none SO:coordinate
@SQ SN:chr1 LN:249250621 AS:hg19.nov.illumina.nix
@SQ SN:chr2 LN:243199373 AS:hg19.nov.illumina.nix
@SQ SN:chr3 LN:198022430 AS:hg19.nov.illumina.nix
@SQ SN:chr4 LN:191154276 AS:hg19.nov.illumina.nix
@SQ SN:chr5 LN:180915260 AS:hg19.nov.illumina.nix
@SQ SN:chr6 LN:171115067 AS:hg19.nov.illumina.nix
@SQ SN:chr7 LN:159138663 AS:hg19.nov.illumina.nix
@SQ SN:chr8 LN:146364022 AS:hg19.nov.illumina.nix
@SQ SN:chr9 LN:141213431 AS:hg19.nov.illumina.nix
@SQ SN:chr10 LN:135534747 AS:hg19.nov.illumina.nix
@SQ SN:chr11 LN:135006516 AS:hg19.nov.illumina.nix
@SQ SN:chr12 LN:133851895 AS:hg19.nov.illumina.nix
@SQ SN:chr13 LN:115169878 AS:hg19.nov.illumina.nix
@SQ SN:chr14 LN:107349540 AS:hg19.nov.illumina.nix
@SQ SN:chr15 LN:102531392 AS:hg19.nov.illumina.nix
@SQ SN:chr16 LN:90354753 AS:hg19.nov.illumina.nix
@SQ SN:chr17 LN:81195210 AS:hg19.nov.illumina.nix
@SQ SN:chr18 LN:78077248 AS:hg19.nov.illumina.nix
@SQ SN:chr19 LN:59128983 AS:hg19.nov.illumina.nix
@SQ SN:chr20 LN:63025520 AS:hg19.nov.illumina.nix
@SQ SN:chr21 LN:48129895 AS:hg19.nov.illumina.nix
@SQ SN:chr22 LN:51304566 AS:hg19.nov.illumina.nix
@SQ SN:chrX LN:155270560 AS:hg19.nov.illumina.nix
@SQ SN:chrY LN:59373566 AS:hg19.nov.illumina.nix
@SQ SN:chrM LN:16571 AS:hg19.nov.illumina.nix
@RG ID:TOMATO PL:ILLUMINA LB:LIBTMP SM:SAMPLE
@PG ID:novoalign VN:V2.07.10 CL:novoalign
@PG ID:novoalign.1 VN:V2.07.10 CL:novoalign



The last two lines are both @PG which differ the ID. Before merging, they are both "novoalign", after merging, the duplicated ID was renamed as "novoalign.1" (if merging 3 files, you will have "novoalign.2" as well). When you check the individual read alignment, you will see the PG:Z field referenced the PG line. When you call variances later, the genotyper application will process merged file as two files which is not what you want.


To fix this:

1. Keep the first @PG line and delete all other @PG lines
2. Replace all "novoalign.X" to "novoalign".

Or you can use samtools.

>java -jar SortSam.jar INPUT=1.sam OUTPUT=1.bam SO=unsorted
>java -jar SortSam.jar INPUT=2.sam OUTPUT=2.bam SO=unsorted
>samtools merge 3.bam 1.bam 2.bam
>samtools view -h -o 3.sam 3.bam
>less 3.sam

Samtools only use the first SAM's header in merged SAM, so it is OK in this case.