Thursday, March 31, 2011

phased genotype

A phased genotype indicates that for heterozygous loci it is known which chromosome each variant
sequence belongs to. That is, if one variant has a Variant_seq=A,C attribute and another variant on the same landmark
feature (chromosome) has a Variant_seq=T,G attribute then the A and T occur together on one copy of the chromosome and the C and G
occur together on the other copy of that chromosome.

Thursday, March 24, 2011

VCF manipulation (to be edited)

#how many raw snps?
$awk '!/#/ {print $0}' 7550X1.snps.raw.vcf | wc -l


#filter dbSNP member and retain PASSed SNPs.
$awk '!/#/ {print $0}' 7550X1.snps.raw.vcf | awk '$3 !~ "rs" {print $0}' | awk '$7 ~ "PASS" {print $0}' > 7550x1.nodbsnp.pass.vcf

#how many novel SNPs?
$wc -l 7550x1.nodbsnp.pass.vcf


$vcf-validator 7550x1.nodbsnp.pass.vcf

#error , no header

$head -n 22 7550X1.snps.raw.vcf > head.txt
$cat head.txt 7550x1.nodbsnp.pass.vcf > 7550x1.vcf

#have to build index!

$bgzip 7550x1.vcf
$tabix -p vcf 7550x.vcf.gz

#the commom snps between x1 and x4 (affected)
$vcf-isec -n =2 7550x1.vcf.gz 7550x4.vcf.gz > x1x4.vcf

$wc -l x1x4.vcf

130480 x1x4.vcf

Tuesday, March 22, 2011

Append .jar to USeq applications

for f in $FILES
echo "Processing $f file..."
# take action on each file. $f store current file name
#cat $f
mv $f $f'.jar'

Monday, March 14, 2011

GNUMAP and novoalign benchmark

7550X1_1_1.txt.gz and 7550X1_1_2.txt.gz, pair-end, 25000 reads, 101bp per read.

chr20.fa, human genome chromosome 20, build hg19.

$gnumap-plain -g chr20.fa -o example.output -a .9 -v 1 --illumina 7550X1_1_1.txt.gz 7550X1_1_2.txt.gz


This is GNUMAP, Version 2.2.0, for public and private use.

Command Line Arguments: gnumap-plain -g chr20.fa -o example.output -a .9 -v 1 --illumina 7550X1_1_1.txt,7550X1_1_2.txt
Verbose: 1
Genome file(s): chr20.fa
Output file: example.output
Sequence file(s): 7550X1_1_1.txt,7550X1_1_2.txt
Align score: 0.9
Number of threads: 1
Mer size: 10
Using jump size of 5
Using Default Alignment Scores
Gap score: -4
Maximum Gaps: 3

Hashing the genome.
[0/1] gen_size=0, my_start=0, my_end=0
Hashing Genome...
Reading: chr20
Size of genome: 63025520
[0/-] Converting to Vector...
[0/-] Trying to create hash of size 1048576
[0/-] Finished create hash.
[0/-] Stats: Total hashes is 59499396, Longest hash is 69516, shortest is 1, and average is 56.743046
[0/-] Trying to create a new genome with a size of 63025520...Success!
[0/-] Trying to malloc 7878190 elements for positions array...Success!
[0/-] Finished Vector Conversion

Time to hash: 12 seconds
Matching 2 file(s):

[-/0] Matching 25000 sequences of: 7550X1_1_1.txt
Reads per processor: 128
[0/0] 33% reads complete
[0/0] 66% reads complete
[0/0] 98% reads complete

[-/0] Matching 25000 sequences of: 7550X1_1_2.txt
[0/0] 0% reads complete
[0/0] 33% reads complete
[0/0] 66% reads complete
[0/0] 98% reads complete
[0/0] 98% reads complete

[0/-] Time since start: 4890.28

[0/-] Printing output.
Finished printing to example.output.sgr

# Total Time: 4890.38 seconds.
# Found 49152 sequences.
# Sequences matched: 13921
# Sequences not matched: 35231
# Output written to example.output
Total wait time is 0.000000

The same data using novoalign(for fairness, remove license file to avoid threading and using unzipped fastq files):

$novoalign -o SAM $'@RG\tID:YING\tPL:ILLUMINA\tLB:LB_TEST\tSM:7851\tCN:HCI' -F ILMFQ -d chr20.nix -f 7550X1_1_1.txt 7550X1_1_2.txt >7550x1.sam

# novoalign (2.07.05 - Nov 29 2010 @ 13:34:51) - A short read aligner with qualities.
# (C) 2008 NovoCraft
# Licensed for evaluation, educational, and not-for-profit use only.
# novoalign -o SAM @RG ID:YING PL:ILLUMINA LB:LB_TEST SM:7851 CN:HCI -F ILMFQ -d chr20.nix -f 7550X1_1_1.txt 7550X1_1_2.txt
# Interpreting input files as Illumina FASTQ, Cassava Pipeline 1.3.
# Index Build Version: 2.7
# Hash length: 14
# Step size: 1
# Paired Reads: 25000
# Pairs Aligned: 368
# Read Sequences: 50000
# Aligned: 767
# Unique Alignment: 749
# Gapped Alignment: 18
# Quality Filter: 490
# Homopolymer Filter: 32
# Elapsed Time: 59.363 (sec.)
# CPU Time: 0.9 (min.)

Monday, March 7, 2011

Loss of function annotation

From 1000 genome project:

Functional annotation of SNPs, short indels and large structural variants was determined
with reference to the GENCODE v3b annotation release (Harrow, Denoeud et al. 2006).
Coding SNPs were mapped on to transcripts annotated as “protein_coding” and
containing an annotated START codon, and classified as synonymous, non-synonymous,
nonsense (stop codon-introducing), stop codon-disrupting or splice site-disrupting
(canonical splice sites). Transcripts labeled as NMD (predicted to be subject to nonsensemediated
decay) were not used. Small deletions predicted to cause a frame-shift and
large deletions predicted to disrupt gene function were also analysed.

Nonsense and splice-disrupting SNPs were flagged as likely representing reference error
or annotation artefacts if the inferred loss-of-function (LOF) allele was also the ancestral
state, or if the reference (non-LOF) allele was not observed in any individual in that
population, and were excluded from the per-individual counts in Table 2. Splice-disrupting
SNPs in non-canonical splice sites were also discarded. We did not consider the frameshift
status of splice-disrupting SNPs due to the challenges of inferring the effects of
removal of splice-donor and acceptor sites on exon structure, but rather treated all such
SNPs as likely to affect gene function.

We classified large deletions as gene-disrupting if they fulfilled the following criteria:

1. Removal of >50% of the coding sequence; or
2. Removal of the gene’s transcriptional start site or start codon; or
3. Removal of an odd number of internal splice sites; or
4. Removal of one or more internal coding exons that would be predicted to generate
a frameshift.

For large deletions with imprecise breakpoints, we conservatively required that the
deletions defined by both the inner and outer confidence intervals would have the same
predicted effect on gene function. For cases with microhomology at the break-point we
treated the breakpoint as falling to the right-hand side of the region of microhomology.
We did not perform functional annotation for large duplications due to the challenges of
inferring functional consequences. The numbers stated in the text should thus be
regarded as a lower bound for the number of observed loss-of-function variants per
individual genome.

However, it should also be noted that the proportion of false positive calls in the LOF class
due to sequencing and annotation errors is expected to be substantially higher than the
genome-wide average. This effect is expected as LOF sites show a low level of true
polymorphism due to selective constraint, meaning that a uniform error rate across the
genome will result in a higher proportion of false positive calls compared to other (more
variable) sites.
Enrichment of false positive calls in LOF variants is most evident in the CHB+JPT
samples, which showed a higher per-individual number of LOF SNPs than other
populations despite a comparable number of synonymous variants (Supplementary Table
11), as well as an unusual peak in the derived allele frequency spectrum (Supplementary
Figure 13). This is likely due to a mild elevation in genome-wide false positive rates for
SNPs in this population compared to other samples, which is then highly enriched at
functionally constrained sites.
To lower the number of false positive indel calls we applied more stringent filters to the
subset of indels that were called in the genome-wide set and were predicted to fall into the
LOF class. The stringent filter requires that the range of positions where an indel would
yield the same alternative haplotype sequence as the original called indel (for instance, in
a repeat, the deletion of any repeat unit would give the same alternative haplotype), plus 4
bases of reference sequence on both sides of this region, was covered by at least one
read on the forward strand, and at least one read on the reverse strand, with at most one
mismatch between the read and the alternative haplotype sequence resulting from the
indel (regardless of base-qualities). This filter removed an excess of 1 bp frameshift
insertions seen in CHB+JPT with respect to CEU in the less stringently filtered genomewide
indel call set, although it is expected to remove a significant number of true positive
calls as well. The indels that pass these stringent filters have been annotated in the
project’s VCF files.
Experimental validation and manual reannotation of identified LOF variants is currently
ongoing (manuscript in preparation).
For extrapolating the functional variants identified per individual in the exon project to the
whole genome (Table 2) we used the ratio of the total coding sequence and splice sites in
the exon-capture target regions (1,423,559 bp and 7,513, respectively) to the
corresponding numbers for the GENCODE v3b annotation set as a whole (35,676,620 bp
and 384,439, respectively).

The coordinates and predicted functional consequences of all of the LOF variants
identified in the project are available on the 1000 Genomes FTP site.