Wednesday, June 1, 2011

Ambiguous Code in SOLiD Color Space

An Example from a SOLiD Color Space aligned SAM file:

CS:Z:T3.3300220113.2323222031301323303202033330300301210


Here "CS:Z" field gives the color read sequence "T3.33". The first character after the "CS:Z:" is always "T", after that, all valid code should be range from 0 to 3.

0 -this base is same as previous base
1 - this base is the transversion of previous base
2 - this base is the transition of previous base
3 - this base is the complement of previous base

However sometimes we can observe ambiguous code like "." or "4". You may notice the 3rd base is not in 0~3, it is a "." - an ambiguous code indicates an ambiguous base, could be any base of A, C, G and T.

This ambiguous code will fail the GATK's "CountCovariates" tool. It will complain


Error Caused by: org.broadinstitute.sting.utils.exceptions.UserException$MalformedBAM: SAM/BAM file SAMFileReader{/home/u0592675/A232/441_10xF3.mate.dup.sort.realign.bam}
is malformed: Unrecognized color space in SOLID read, color = . Unfortunately this bam file can not be recalibrated without full color space information because of potential reference bias.



In a simple test, around 1% (58297 out of 5275136) alignments has at least one ambiguous code. So here we have two options to eliminate this problem:

1. make a script to change all ambiguous codes into random valid codes. It looks like:

m = re.compile("\.")
m.sub(random.choice(('0','1','2','3')),CS_Z_STRING)

where CS_Z_STRING is like above example.

However, this may bring a serious problem because all bases execpt for the 1st base are depending on previous base. If we choose a wrong code which probability is 3/4, then all the following bases will be wrong.

2. delete the affected alignments

Considering the small percentage (1%) of ambiguous code affected alignments, it may be wise to delete those troublesome lines.

m1 = re.compile("CS:Z:T[0-3]*?[^0-3\s]")
def delete_ambiguous_line(fin,fout):
fo = file(fout,'w')
i = j = 0
for line in file(fin):
line = line.strip()
if line.startswith('@'):
nline = line
else:
if m1.search(line):
j += 1
else:
i += 1
nline = line
print >>fo,nline
print "Fix ambiguous code - Total: %d, Deleted: %d, Percent of Deleted: %d%%" % (i+j,j,j*100/(i+j))
fo.close()



1 comment:

  1. Good post! This helped us on our way to the following solution:

    Newer versions of GATK include two flags for the count covariate and table recal steps (--solid_nocall_strategy and --solid_recal_mode) which you can use to handle missing colourspace calls. The default no call strategy is to throw an exception. It would be nice if the exception mentioned the flag...

    http://www.broadinstitute.org/gsa/gatkdocs/release/org_broadinstitute_sting_gatk_walkers_recalibration_TableRecalibrationWalker.html#--solid_nocall_strategy

    ReplyDelete