Monday, March 7, 2011

Loss of function annotation

From the 1000 Genomes Project:

Functional annotation of SNPs, short indels and large structural variants was determined
with reference to the GENCODE v3b annotation release (Harrow, Denoeud et al. 2006).
Coding SNPs were mapped on to transcripts annotated as “protein_coding” and
containing an annotated START codon, and classified as synonymous, non-synonymous,
nonsense (stop codon-introducing), stop codon-disrupting or splice site-disrupting
(canonical splice sites). Transcripts labeled as NMD (predicted to be subject to nonsense-mediated
decay) were not used. Small deletions predicted to cause a frameshift and
large deletions predicted to disrupt gene function were also analysed.
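
The coding-SNP classes named above can be illustrated with a minimal Python sketch. It is not the project's annotation pipeline: the function name `classify_coding_snp` and its arguments are hypothetical, it assumes the standard genetic code, and it takes a codon already read on the coding strand. Splice-site classification is not covered.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snp(ref_codon, offset, alt_base):
    """Classify a single-base change at `offset` (0, 1 or 2) of `ref_codon`.

    Hypothetical helper: the codon is assumed to be given on the coding strand.
    """
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"          # stop codon-introducing
    if ref_aa == "*":
        return "stop-disrupting"   # stop codon lost
    return "non-synonymous"
```

For example, `classify_coding_snp("TGG", 2, "A")` turns a tryptophan codon into TGA and is reported as nonsense.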

Nonsense and splice-disrupting SNPs were flagged as likely representing reference error
or annotation artefacts if the inferred loss-of-function (LOF) allele was also the ancestral
state, or if the reference (non-LOF) allele was not observed in any individual in that
population, and were excluded from the per-individual counts in Table 2. Splice-disrupting
SNPs in non-canonical splice sites were also discarded. We did not consider the frameshift
status of splice-disrupting SNPs due to the challenges of inferring the effects of
removal of splice-donor and acceptor sites on exon structure, but rather treated all such
SNPs as likely to affect gene function.
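
The two flagging rules described above can be written as a short predicate. This is a sketch under stated assumptions: the function and argument names are hypothetical, and the ancestral allele is passed as `None` when it is unknown.

```python
def is_likely_artefact(lof_allele, ancestral_allele, ref_allele_count):
    """Return True if an LOF SNP looks like a reference or annotation artefact.

    lof_allele        -- the allele inferred to cause loss of function
    ancestral_allele  -- inferred ancestral state, or None if unknown (assumption)
    ref_allele_count  -- copies of the reference (non-LOF) allele observed
                         in the population being analysed
    """
    lof_is_ancestral = ancestral_allele is not None and lof_allele == ancestral_allele
    ref_never_observed = ref_allele_count == 0
    return lof_is_ancestral or ref_never_observed
```

A SNP for which this returns True would be left out of the per-individual counts.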

We classified large deletions as gene-disrupting if they fulfilled the following criteria:

1. Removal of >50% of the coding sequence; or
2. Removal of the gene’s transcriptional start site or start codon; or
3. Removal of an odd number of internal splice sites; or
4. Removal of one or more internal coding exons that would be predicted to generate
a frameshift.
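
The four criteria can be combined as in the sketch below. The `DeletionEffect` record and its fields are hypothetical and are assumed to have been computed beforehand from the gene model. Criterion 4 is approximated here by testing whether the removed internal coding length is a multiple of three, which is an assumption rather than something stated in the text.

```python
from dataclasses import dataclass


@dataclass
class DeletionEffect:
    """Hypothetical summary of how one deletion overlaps one gene."""
    coding_fraction_removed: float        # 0.0 to 1.0
    removes_tss_or_start_codon: bool
    internal_splice_sites_removed: int
    removed_internal_coding_bp: int       # coding bases in removed internal exons


def is_gene_disrupting(effect):
    if effect.coding_fraction_removed > 0.5:          # criterion 1
        return True
    if effect.removes_tss_or_start_codon:             # criterion 2
        return True
    if effect.internal_splice_sites_removed % 2 == 1: # criterion 3
        return True
    # Criterion 4 (assumed test): removed coding length not divisible by 3.
    return effect.removed_internal_coding_bp % 3 != 0
```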

For large deletions with imprecise breakpoints, we conservatively required that the
deletions defined by both the inner and outer confidence intervals would have the same
predicted effect on gene function. For cases with microhomology at the breakpoint, we
treated the breakpoint as falling to the right-hand side of the region of microhomology.
We did not perform functional annotation for large duplications due to the challenges of
inferring functional consequences. The numbers stated in the text should thus be
regarded as a lower bound for the number of observed loss-of-function variants per
individual genome.
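
The conservative treatment of imprecise breakpoints can be sketched as follows. Everything here is illustrative: intervals are half-open `(start, end)` tuples, and the toy predictor `removes_most_of_cds` stands in for a full gene-disruption test by checking only the >50% coding sequence rule against a single coding interval.

```python
def removes_most_of_cds(deletion, cds):
    """Toy predictor: does the deletion remove more than half of one CDS interval?"""
    start, end = deletion
    cds_start, cds_end = cds
    overlap = max(0, min(end, cds_end) - max(start, cds_start))
    return overlap > 0.5 * (cds_end - cds_start)


def conservative_call(inner, outer, cds):
    """Call the deletion disrupting only if the inner and outer
    confidence intervals both predict disruption."""
    return removes_most_of_cds(inner, cds) and removes_most_of_cds(outer, cds)
```

A deletion whose outer interval is disrupting but whose inner interval is not is left uncalled, which is one reason the reported counts are a lower bound.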

However, it should also be noted that the proportion of false positive calls in the LOF class
due to sequencing and annotation errors is expected to be substantially higher than the
genome-wide average. This effect is expected as LOF sites show a low level of true
polymorphism due to selective constraint, meaning that a uniform error rate across the
genome will result in a higher proportion of false positive calls compared to other (more
variable) sites.

Enrichment of false positive calls in LOF variants is most evident in the CHB+JPT
samples, which showed a higher per-individual number of LOF SNPs than other
populations despite a comparable number of synonymous variants (Supplementary Table
11), as well as an unusual peak in the derived allele frequency spectrum (Supplementary
Figure 13). This is likely due to a mild elevation in genome-wide false positive rates for
SNPs in this population compared to other samples, which is then highly enriched at
functionally constrained sites.
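
The argument about uniform error rates can be made concrete with a small calculation. The function name and the rates used in the usage note are illustrative assumptions, not values from the project.

```python
def false_positive_fraction(error_rate, true_variant_rate):
    """Fraction of calls that are false, given per-site rates.

    error_rate         -- false calls per site (assumed uniform genome-wide)
    true_variant_rate  -- true variants per site in the class of interest
    """
    return error_rate / (error_rate + true_variant_rate)
```

With a hypothetical error rate of 1e-5 per site, a class with 1e-3 true variants per site has about 1% false calls, while a constrained class with 1e-4 true variants per site has about 9%. A mild rise in the genome-wide error rate therefore has a larger relative effect on the constrained class.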

To lower the number of false positive indel calls we applied more stringent filters to the
subset of indels that were called in the genome-wide set and were predicted to fall into the
LOF class. The stringent filter requires that the range of positions where an indel would
yield the same alternative haplotype sequence as the original called indel (for instance, in
a repeat, the deletion of any repeat unit would give the same alternative haplotype), plus 4
bases of reference sequence on both sides of this region, was covered by at least one
read on the forward strand, and at least one read on the reverse strand, with at most one
mismatch between the read and the alternative haplotype sequence resulting from the
indel (regardless of base-qualities). This filter removed an excess of 1 bp frameshift
insertions seen in CHB+JPT with respect to CEU in the less stringently filtered genome-wide
indel call set, although it is expected to remove a significant number of true positive
calls as well. The indels that pass these stringent filters have been annotated in the
project’s VCF files.
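
The window that the stringent filter inspects can be sketched for the deletion case. This is a simplified illustration, not the project's implementation: coordinates are 0-based and half-open, reads are hypothetical `(start, end, strand, mismatches_vs_alt)` tuples, and a read is assumed to "cover" the window only if it spans it entirely.

```python
def equivalent_deletion_window(ref, start, length, flank=4):
    """Span of all placements of a deletion that give the same
    alternative haplotype, padded by `flank` reference bases."""
    left = start
    # Shifting left by one is equivalent when the base entering the
    # deletion equals the base leaving it.
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    right = start
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    return max(0, left - flank), min(len(ref), right + length + flank)


def passes_strand_filter(window, reads):
    """Require at least one forward and one reverse read spanning the
    window with at most one mismatch to the alternative haplotype."""
    w_start, w_end = window
    strands = {
        strand
        for (start, end, strand, mismatches) in reads
        if start <= w_start and end >= w_end and mismatches <= 1
    }
    return {"+", "-"} <= strands
```

In the reference `CCCCCGGACACACTTCCCCC`, deleting any one `AC` or `CA` unit of the repeat gives the same haplotype, so the window covers the whole repeat plus four bases on each side.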

Experimental validation and manual reannotation of identified LOF variants is currently
ongoing (manuscript in preparation).

For extrapolating the functional variants identified per individual in the exon project to the
whole genome (Table 2) we used the ratio of the total coding sequence and splice sites in
the exon-capture target regions (1,423,559 bp and 7,513, respectively) to the
corresponding numbers for the GENCODE v3b annotation set as a whole (35,676,620 bp
and 384,439, respectively).
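
The scaling implied by these figures can be worked through directly. The sketch assumes a simple linear extrapolation, with coding variants scaled by the coding-sequence ratio and splice-site variants by the splice-site ratio. The function name is hypothetical; the four constants are the values quoted above.

```python
EXON_TARGET_CODING_BP = 1_423_559
EXON_TARGET_SPLICE_SITES = 7_513
GENCODE_CODING_BP = 35_676_620
GENCODE_SPLICE_SITES = 384_439


def extrapolate(count_in_targets, target_size, genome_size):
    """Scale a per-individual count from the exon-capture targets
    to the whole annotation set (assumed linear scaling)."""
    return count_in_targets * genome_size / target_size


coding_factor = extrapolate(1, EXON_TARGET_CODING_BP, GENCODE_CODING_BP)
splice_factor = extrapolate(1, EXON_TARGET_SPLICE_SITES, GENCODE_SPLICE_SITES)
```

Under this assumption each coding variant seen in the target regions corresponds to roughly 25 genome-wide, and each splice-site variant to roughly 51.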

The coordinates and predicted functional consequences of all of the LOF variants
identified in the project are available on the 1000 Genomes FTP site.

