Next Generation Sequencing and Data Analysis: Loss of function annotation

From the 1000 Genomes Project:

Functional annotation of SNPs, short indels and large structural variants was determined

with reference to the GENCODE v3b annotation release (Harrow, Denoeud et al. 2006).

Coding SNPs were mapped on to transcripts annotated as “protein_coding” and

containing an annotated START codon, and classified as synonymous, non-synonymous,

nonsense (stop codon-introducing), stop codon-disrupting or splice site-disrupting

(canonical splice sites). Transcripts labeled as NMD (predicted to be subject to nonsense-mediated

decay) were not used. Small deletions predicted to cause a frameshift and

large deletions predicted to disrupt gene function were also analysed.
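As an illustration of the coding SNP classes listed above, here is a minimal Python sketch that compares the reference and alternate codon under the standard genetic code. The function name and its arguments are hypothetical, and the sketch leaves out transcript mapping and splice-site handling, which the project's pipeline would also have needed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_coding_snp(ref_codon, pos_in_codon, alt_base):
    """Classify a single-base change within one codon (hypothetical helper).

    ref_codon    -- reference codon on the coding strand, e.g. "TGG"
    pos_in_codon -- 0-based position of the SNP within the codon (0, 1 or 2)
    alt_base     -- alternate base on the coding strand
    """
    alt_codon = ref_codon[:pos_in_codon] + alt_base + ref_codon[pos_in_codon + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop codon-disrupting"
    return "non-synonymous"


print(classify_coding_snp("TGG", 2, "A"))  # Trp codon changed to TGA
```
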

Nonsense and splice-disrupting SNPs were flagged as likely representing reference error

or annotation artefacts if the inferred loss-of-function (LOF) allele was also the ancestral

state, or if the reference (non-LOF) allele was not observed in any individual in that

population, and were excluded from the per-individual counts in Table 2. Splice-disrupting

SNPs in non-canonical splice sites were also discarded. We did not consider the frameshift

status of splice-disrupting SNPs due to the challenges of inferring the effects of

removal of splice-donor and acceptor sites on exon structure, but rather treated all such

SNPs as likely to affect gene function.
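The two flagging rules for nonsense and splice-disrupting SNPs can be sketched as a small predicate. This is only an illustration: the function and parameter names are assumptions, and how the ancestral state and the per-population allele observations were obtained is not covered here.

```python
def is_likely_artefact(lof_allele, ref_allele, ancestral_allele, observed_alleles):
    """Return True if a candidate LOF SNP should be flagged (hypothetical helper).

    lof_allele       -- the allele inferred to cause loss of function
    ref_allele       -- the reference (non-LOF) allele
    ancestral_allele -- inferred ancestral allele, or None if unknown
    observed_alleles -- set of alleles seen in any individual of the population
    """
    # Rule 1: the inferred LOF allele is also the ancestral state.
    lof_is_ancestral = ancestral_allele is not None and lof_allele == ancestral_allele
    # Rule 2: the reference allele is never observed in this population.
    ref_never_seen = ref_allele not in observed_alleles
    return lof_is_ancestral or ref_never_seen
```
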

We classified large deletions as gene-disrupting if they fulfilled the following criteria:

1. Removal of >50% of the coding sequence; or

2. Removal of the gene’s transcriptional start site or start codon; or

3. Removal of an odd number of internal splice sites; or

4. Removal of one or more internal coding exons that would be predicted to generate

a frameshift.
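The four criteria above could be expressed roughly as follows. This is a sketch under stated assumptions, not the project's code: the `Gene` structure, the half-open coordinates, and the reading of criterion 4 (fully deleted internal coding exons whose combined length is not a multiple of 3) are all choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    """Hypothetical gene model; intervals are half-open (start, end)."""
    tss: int             # transcriptional start site position
    start_codon: tuple   # (start, end) of the start codon
    coding_exons: list   # sorted (start, end) coding portions of exons
    splice_sites: list   # positions of internal splice sites


def overlap(a, b):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def is_gene_disrupting(gene, deletion):
    d_start, d_end = deletion
    # 1. More than 50% of the coding sequence removed.
    cds_len = sum(e - s for s, e in gene.coding_exons)
    removed = sum(overlap(ex, deletion) for ex in gene.coding_exons)
    if removed > 0.5 * cds_len:
        return True
    # 2. Transcriptional start site or start codon removed.
    if d_start <= gene.tss < d_end or overlap(gene.start_codon, deletion) > 0:
        return True
    # 3. Odd number of internal splice sites removed.
    n_sites = sum(d_start <= p < d_end for p in gene.splice_sites)
    if n_sites % 2 == 1:
        return True
    # 4. Whole internal coding exons removed, with a frameshifting total length
    #    (assumed interpretation of "predicted to generate a frameshift").
    lost = [ex for ex in gene.coding_exons[1:-1]
            if d_start <= ex[0] and ex[1] <= d_end]
    if lost and sum(e - s for s, e in lost) % 3 != 0:
        return True
    return False
```

For example, with coding exons of 50, 100, 60 and 100 bp, a deletion that removes only the 100 bp internal exon is called disrupting under criterion 4, while one that removes only the 60 bp internal exon is not, because it takes out an even number of splice sites and keeps the reading frame.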

For large deletions with imprecise breakpoints, we conservatively required that the

deletions defined by both the inner and outer confidence intervals would have the same

predicted effect on gene function. For cases with microhomology at the breakpoint, we

treated the breakpoint as falling to the right-hand side of the region of microhomology.

We did not perform functional annotation for large duplications due to the challenges of

inferring functional consequences. The numbers stated in the text should thus be

regarded as a lower bound for the number of observed loss-of-function variants per

individual genome.
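The conservative rule for imprecise breakpoints can be sketched as a wrapper around any per-deletion predictor. The names below are hypothetical, and the sketch assumes the simplest reading of the rule: a deletion counts as gene-disrupting only if both the inner and the outer interval are predicted to be disrupting.

```python
def conservative_call(inner, outer, predict):
    """Call a deletion disrupting only if both breakpoint estimates agree.

    inner, outer -- (start, end) deletions implied by the inner and outer
                    confidence intervals of the breakpoints
    predict      -- any function mapping a (start, end) deletion to True/False
    """
    return bool(predict(inner)) and bool(predict(outer))


# Toy predictor for illustration only: "disrupting" if position 1000 is deleted.
def removes_site(deletion):
    return deletion[0] <= 1000 < deletion[1]


print(conservative_call((900, 1100), (850, 1150), removes_site))
print(conservative_call((1010, 1100), (950, 1150), removes_site))
```
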

However, it should also be noted that the proportion of false positive calls in the LOF class

due to sequencing and annotation errors is expected to be substantially higher than the

genome-wide average. This effect is expected as LOF sites show a low level of true

polymorphism due to selective constraint, meaning that a uniform error rate across the

genome will result in a higher proportion of false positive calls compared to other (more

variable) sites.
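The argument above can be made concrete with a small calculation. The rates used here are purely illustrative assumptions, not values from the project: with the same per-site error rate, the share of calls that are errors rises sharply when true variants are rarer.

```python
def false_positive_fraction(true_variant_rate, error_rate):
    """Approximate fraction of calls that are errors, given per-site rates."""
    return error_rate / (true_variant_rate + error_rate)


# Illustrative numbers only: one uniform error rate, two site classes.
ERROR_RATE = 1e-5
typical = false_positive_fraction(1e-3, ERROR_RATE)      # variable sites
constrained = false_positive_fraction(1e-5, ERROR_RATE)  # LOF-like sites

print(f"typical sites: {typical:.1%}, constrained sites: {constrained:.1%}")
```

Under these made-up rates, about 1% of calls at typical sites are errors, against about half of the calls at constrained sites.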

Enrichment of false positive calls in LOF variants is most evident in the CHB+JPT

samples, which showed a higher per-individual number of LOF SNPs than other

populations despite a comparable number of synonymous variants (Supplementary Table

11), as well as an unusual peak in the derived allele frequency spectrum (Supplementary

Figure 13). This is likely due to a mild elevation in genome-wide false positive rates for

SNPs in this population compared to other samples, which is then highly enriched at

functionally constrained sites.

To lower the number of false positive indel calls we applied more stringent filters to the

subset of indels that were called in the genome-wide set and were predicted to fall into the

LOF class. The stringent filter requires that the range of positions where an indel would

yield the same alternative haplotype sequence as the original called indel (for instance, in

a repeat, the deletion of any repeat unit would give the same alternative haplotype), plus 4

bases of reference sequence on both sides of this region, was covered by at least one

read on the forward strand, and at least one read on the reverse strand, with at most one

mismatch between the read and the alternative haplotype sequence resulting from the

indel (regardless of base-qualities). This filter removed an excess of 1 bp frameshift

insertions seen in CHB+JPT with respect to CEU in the less stringently filtered genome-wide

indel call set, although it is expected to remove a significant number of true positive

calls as well. The indels that pass these stringent filters have been annotated in the

project’s VCF files.
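A rough sketch of the stringent filter, restricted to deletions for simplicity, is given below. All names are hypothetical. It assumes 0-based half-open coordinates, that "covered" means a single read spans the whole window, and that the number of mismatches between each read and the alternative haplotype has already been computed elsewhere.

```python
def equivalent_deletion_window(ref, start, length, flank=4):
    """Window spanning every placement of a deletion that yields the same
    alternative haplotype, padded by `flank` reference bases on each side."""
    # Shift left while deleting one base earlier gives the same haplotype.
    left = start
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    # Shift right while deleting one base later gives the same haplotype.
    right = start
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    return max(0, left - flank), min(len(ref), right + length + flank)


def passes_strand_filter(window, reads, max_mismatches=1):
    """Require at least one spanning read per strand with few mismatches.

    reads -- iterable of (start, end, strand, mismatches_vs_alt_haplotype)
    """
    w_start, w_end = window
    strands = {
        strand
        for r_start, r_end, strand, mismatches in reads
        if r_start <= w_start and r_end >= w_end and mismatches <= max_mismatches
    }
    return {"+", "-"} <= strands


ref = "GATTGACACACGTTGCA"
window = equivalent_deletion_window(ref, start=7, length=2)  # delete one "AC"
print(window, passes_strand_filter(window, [(0, 16, "+", 0), (1, 17, "-", 1)]))
```

In the example, deleting any of the three "AC" units gives the same haplotype, so the window covers the whole repeat plus 4 flanking bases on each side.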

Experimental validation and manual reannotation of identified LOF variants are currently

ongoing (manuscript in preparation).

For extrapolating the functional variants identified per individual in the exon project to the

whole genome (Table 2) we used the ratio of the total coding sequence and splice sites in

the exon-capture target regions (1,423,559 bp and 7,513, respectively) to the

corresponding numbers for the GENCODE v3b annotation set as a whole (35,676,620 bp

and 384,439, respectively).
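The extrapolation amounts to a simple rescaling. The sketch below uses the four numbers quoted above; the function name, and the assumption that coding variants are scaled by the coding-sequence ratio and splice variants by the splice-site ratio, are illustrative readings of the text rather than the project's documented procedure.

```python
# Figures quoted in the text.
EXON_TARGET_CDS_BP = 1_423_559
EXON_TARGET_SPLICE_SITES = 7_513
GENCODE_CDS_BP = 35_676_620
GENCODE_SPLICE_SITES = 384_439

# Scaling factors from exon-capture targets to the whole annotation set.
CODING_FACTOR = GENCODE_CDS_BP / EXON_TARGET_CDS_BP              # roughly 25-fold
SPLICE_FACTOR = GENCODE_SPLICE_SITES / EXON_TARGET_SPLICE_SITES  # roughly 51-fold


def extrapolate(count_in_targets, kind):
    """Scale a per-individual count from the exon targets to the whole genome.

    kind -- "coding" for variants in coding sequence, "splice" for splice sites
    """
    factor = CODING_FACTOR if kind == "coding" else SPLICE_FACTOR
    return count_in_targets * factor


print(f"coding factor: {CODING_FACTOR:.2f}, splice factor: {SPLICE_FACTOR:.2f}")
```
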

The coordinates and predicted functional consequences of all of the LOF variants

identified in the project are available on the 1000 Genomes FTP site.

Monday, March 7, 2011
