Next Generation Sequencing and Data Analysis: March 2011

Thursday, March 31, 2011

phased genotype

A phased genotype indicates that for heterozygous loci it is known which chromosome each variant

sequence belongs to. That is, if one variant has a Variant_seq=A,C attribute and another variant on the same landmark

feature (chromosome) has a Variant_seq=T,G attribute then the A and T occur together on one copy of the chromosome and the C and G

occur together on the other copy of that chromosome.

Thursday, March 24, 2011

VCF manipulation (to be edited)

#how many raw snps?

$awk '!/#/ {print $0}' 7550X1.snps.raw.vcf | wc -l

3492007

#filter dbSNP member and retain PASSed SNPs.

$awk '!/#/ {print $0}' 7550X1.snps.raw.vcf | awk '$3 !~ "rs" {print $0}' | awk '$7 ~ "PASS" {print $0}' > 7550x1.nodbsnp.pass.vcf

#how many novel SNPs?

$wc -l 7550x1.nodbsnp.pass.vcf

312776

$vcf-validator 7550x1.nodbsnp.pass.vcf

#error , no header

$head -n 22 7550X1.snps.raw.vcf > head.txt

$cat head.txt 7550x1.nodbsnp.pass.vcf > 7550x1.vcf

#have to build index!

$bgzip 7550x1.vcf

$tabix -p vcf 7550x.vcf.gz

#the commom snps between x1 and x4 (affected)

$vcf-isec -n =2 7550x1.vcf.gz 7550x4.vcf.gz > x1x4.vcf

$wc -l x1x4.vcf

130480 x1x4.vcf

Tuesday, March 22, 2011

Append .jar to USeq applications

FILES=USeq_7.6.2/Apps/*

for f in $FILES

echo "Processing $f file..."

# take action on each file. $f store current file name

#cat $f

mv $f $f'.jar'

done

Monday, March 14, 2011

GNUMAP and novoalign benchmark

Data:

7550X1_1_1.txt.gz and 7550X1_1_2.txt.gz, pair-end, 25000 reads, 101bp per read.

chr20.fa, human genome chromosome 20, build hg19.

Command:

$gnumap-plain -g chr20.fa -o example.output -a .9 -v 1 --illumina 7550X1_1_1.txt.gz 7550X1_1_2.txt.gz

Output:

This is GNUMAP, Version 2.2.0, for public and private use.

Command Line Arguments: gnumap-plain -g chr20.fa -o example.output -a .9 -v 1 --illumina 7550X1_1_1.txt,7550X1_1_2.txt

Parameters:

Verbose: 1

Genome file(s): chr20.fa

Output file: example.output

Sequence file(s): 7550X1_1_1.txt,7550X1_1_2.txt

Align score: 0.9

Number of threads: 1

Mer size: 10

Using jump size of 5

Using Default Alignment Scores

Gap score: -4

Maximum Gaps: 3

Hashing the genome.

[0/1] gen_size=0, my_start=0, my_end=0

chr20.fa

Hashing Genome...

Reading: chr20

..............................................

Size of genome: 63025520

[0/-] Converting to Vector...

[0/-] Trying to create hash of size 1048576

[0/-] Finished create hash.

[0/-] Stats: Total hashes is 59499396, Longest hash is 69516, shortest is 1, and average is 56.743046

[0/-] Trying to create a new genome with a size of 63025520...Success!

[0/-] Trying to malloc 7878190 elements for positions array...Success!

[0/-] Finished Vector Conversion

Time to hash: 12 seconds

Matching 2 file(s):

[-/0] Matching 25000 sequences of: 7550X1_1_1.txt

Reads per processor: 128

[0/0] 33% reads complete

[0/0] 66% reads complete

[0/0] 98% reads complete

[-/0] Matching 25000 sequences of: 7550X1_1_2.txt

[0/0] 0% reads complete

[0/0] 33% reads complete

[0/0] 66% reads complete

[0/0] 98% reads complete

[0/-] Time since start: 4890.28

[0/-] Printing output.

Finished printing to example.output.sgr

#Finished.

# Total Time: 4890.38 seconds.

# Found 49152 sequences.

# Sequences matched: 13921

# Sequences not matched: 35231

# Output written to example.output

Total wait time is 0.000000

The same data using novoalign(for fairness, remove license file to avoid threading and using unzipped fastq files):

$novoalign -o SAM $'@RG\tID:YING\tPL:ILLUMINA\tLB:LB_TEST\tSM:7851\tCN:HCI' -F ILMFQ -d chr20.nix -f 7550X1_1_1.txt 7550X1_1_2.txt >7550x1.sam

# novoalign (2.07.05 - Nov 29 2010 @ 13:34:51) - A short read aligner with qualities.

# Licensed for evaluation, educational, and not-for-profit use only.

# novoalign -o SAM @RG ID:YING PL:ILLUMINA LB:LB_TEST SM:7851 CN:HCI -F ILMFQ -d chr20.nix -f 7550X1_1_1.txt 7550X1_1_2.txt

# Interpreting input files as Illumina FASTQ, Cassava Pipeline 1.3.

# Index Build Version: 2.7

# Hash length: 14

# Step size: 1

# Paired Reads: 25000

# Pairs Aligned: 368

# Read Sequences: 50000

# Aligned: 767

# Unique Alignment: 749

# Gapped Alignment: 18

# Quality Filter: 490

# Homopolymer Filter: 32

# Elapsed Time: 59.363 (sec.)

# CPU Time: 0.9 (min.)

Monday, March 7, 2011

Loss of function annotation

From 1000 genome project:

Functional annotation of SNPs, short indels and large structural variants was determined

with reference to the GENCODE v3b annotation release (Harrow, Denoeud et al. 2006).

Coding SNPs were mapped on to transcripts annotated as “protein_coding” and

containing an annotated START codon, and classified as synonymous, non-synonymous,

nonsense (stop codon-introducing), stop codon-disrupting or splice site-disrupting

(canonical splice sites). Transcripts labeled as NMD (predicted to be subject to nonsensemediated

decay) were not used. Small deletions predicted to cause a frame-shift and

large deletions predicted to disrupt gene function were also analysed.

Nonsense and splice-disrupting SNPs were flagged as likely representing reference error

or annotation artefacts if the inferred loss-of-function (LOF) allele was also the ancestral

state, or if the reference (non-LOF) allele was not observed in any individual in that

population, and were excluded from the per-individual counts in Table 2. Splice-disrupting

SNPs in non-canonical splice sites were also discarded. We did not consider the frameshift

status of splice-disrupting SNPs due to the challenges of inferring the effects of

removal of splice-donor and acceptor sites on exon structure, but rather treated all such

SNPs as likely to affect gene function.

We classified large deletions as gene-disrupting if they fulfilled the following criteria:

1. Removal of >50% of the coding sequence; or

2. Removal of the gene’s transcriptional start site or start codon; or

3. Removal of an odd number of internal splice sites; or

4. Removal of one or more internal coding exons that would be predicted to generate

a frameshift.

For large deletions with imprecise breakpoints, we conservatively required that the

deletions defined by both the inner and outer confidence intervals would have the same

predicted effect on gene function. For cases with microhomology at the break-point we

treated the breakpoint as falling to the right-hand side of the region of microhomology.

We did not perform functional annotation for large duplications due to the challenges of

inferring functional consequences. The numbers stated in the text should thus be

regarded as a lower bound for the number of observed loss-of-function variants per

individual genome.

However, it should also be noted that the proportion of false positive calls in the LOF class

due to sequencing and annotation errors is expected to be substantially higher than the

genome-wide average. This effect is expected as LOF sites show a low level of true

polymorphism due to selective constraint, meaning that a uniform error rate across the

genome will result in a higher proportion of false positive calls compared to other (more

variable) sites.

Enrichment of false positive calls in LOF variants is most evident in the CHB+JPT

samples, which showed a higher per-individual number of LOF SNPs than other

populations despite a comparable number of synonymous variants (Supplementary Table

11), as well as an unusual peak in the derived allele frequency spectrum (Supplementary

Figure 13). This is likely due to a mild elevation in genome-wide false positive rates for

SNPs in this population compared to other samples, which is then highly enriched at

functionally constrained sites.

To lower the number of false positive indel calls we applied more stringent filters to the

subset of indels that were called in the genome-wide set and were predicted to fall into the

LOF class. The stringent filter requires that the range of positions where an indel would

yield the same alternative haplotype sequence as the original called indel (for instance, in

a repeat, the deletion of any repeat unit would give the same alternative haplotype), plus 4

bases of reference sequence on both sides of this region, was covered by at least one

read on the forward strand, and at least one read on the reverse strand, with at most one

mismatch between the read and the alternative haplotype sequence resulting from the

indel (regardless of base-qualities). This filter removed an excess of 1 bp frameshift

insertions seen in CHB+JPT with respect to CEU in the less stringently filtered genomewide

indel call set, although it is expected to remove a significant number of true positive

calls as well. The indels that pass these stringent filters have been annotated in the

project’s VCF files.

Experimental validation and manual reannotation of identified LOF variants is currently

ongoing (manuscript in preparation).

For extrapolating the functional variants identified per individual in the exon project to the

whole genome (Table 2) we used the ratio of the total coding sequence and splice sites in

the exon-capture target regions (1,423,559 bp and 7,513, respectively) to the

corresponding numbers for the GENCODE v3b annotation set as a whole (35,676,620 bp

and 384,439, respectively).

The coordinates and predicted functional consequences of all of the LOF variants

identified in the project are available on the 1000 Genomes FTP site.

Next Generation Sequencing and Data Analysis