Wednesday, December 29, 2010

Short Read Micro re-Aligner

http://sourceforge.net/apps/mediawiki/srma/index.php?title=Main_Page

#sync

git clone git://srma.git.sourceforge.net/gitroot/srma/srma srma

cd srma

cd c-code

#samtools must be present as a subfolder to compile srma.

svn co https://samtools.svn.sourceforge.net/svnroot/samtools/trunk/samtools samtools

cd samtools;make

cd ..

sh autogen.sh && ./configure && make

#if goes well, you will see a binary "srma"

./srma

Friday, December 10, 2010

Common indicators of variant quality

– Number of variants

– Transition/transversion ratio

– Hardy-Weinberg Equilibrium violations

- Presence in variant databases

– % in dbSNP build

– Concordance to Hapmap (II, II+III consensus, III)

– Concordance to validation data (i.e. array-based genotyping)

- Visualization (IGV*)

– Examination of alignments, local sequence context, error covariates, etc.

7550 Benchmark

/DataDisk/Analysis/Sun/7550R/simu

novoalign -o SAM $'@RG\tID:YING\tPL:ILLUMINA\tLB:LB_TEST\tSM:7550R\tCN:HCI' -F ILMFQ -d hg19all.nix -f 7550.chr20.1.fq 7550.chr20.2.fq >7550.chr20.sam 2>7550.chr20.log

jj $picard/SortSam.jar INPUT=7550.chr20.sam OUTPUT=7550.chr20.sorted.bam SO=coordinate TMP_DIR=.

jj $picard/ValidateSamFile.jar I=7550.chr20.sorted.bam O=7550.chr20.validation

#"more 7550.chr20.validation" -> no errors found

jj $picard/BuildBamIndex.jar I=7550.chr20.sorted.nodup.bam O=7550.chr20.sorted.nodup.bam.bai

jj $picard/MarkDuplicates.jar I=7550.chr20.sorted.bam O=7550.chr20.sorted.nodup.bam M=7550.chr20.duplicate

gatk -T RealignerTargetCreator -R chr20.fa -I 7550.chr20.sorted.nodup.bam -D dbsnp_131_chr20.rod -o 7550.chr20.intervals

gatk -T IndelRealigner -R chr20.fa -I 7550.chr20.sorted.nodup.bam -targetIntervals 7550.chr20.intervals -o 7550.chr20.realigned.bam

gatk -T CountCovariates -R chr20.fa -l INFO -D dbsnp_131_chr20.rod -I 7550.chr20.realigned.bam -cov ReadGroupCovariate -cov QualityScoreCovariate -cov MappingQualityCovariate -recalFile 7550.chr20.recal_data.csv

gatk -T TableRecalibration -R chr20.fa -l INFO -I 7550.chr20.realigned.bam -recalFile 7550.chr20.recal_data.csv -o 7550.chr20.realigned.recal.bam

jj $picard/FixMateInformation.jar I=7550.chr20.realigned.recal.bam O=7550.chr20.realigned.recal.fixed.bam SO=coordinate VALIDATION_STRINGENCY=SILENT TMP_DIR=.

gatk -T UnifiedGenotyper -R chr20.fa -I 7550.chr20.realigned.recal.fixed.bam -D dbsnp_131_chr20.rod -o 7550.chr20.snps.raw.vcf -stand_call_conf 50.0 -stand_emit_conf 20.0 -dcov 300

Thursday, December 9, 2010

HapMap

The HapMap is a catalog of common genetic variants that occur in human beings. It describes what these variants are, where they occur in our DNA, and how they are distributed among people within populations and among populations in different parts of the world. The International HapMap Project is not using the information in the HapMap to establish connections between particular genetic variants and diseases. Rather, the Project is designed to provide information that other researchers can use to link genetic variants to the risk for specific illnesses, which will lead to new methods of preventing, diagnosing, and treating disease.
The DNA in our cells contains long chains of four chemical building blocks -- adenine, thymine, cytosine, and guanine, abbreviated A, T, C, and G. More than 6 billion of these chemical bases, strung together in 23 pairs of chromosomes, exist in a human cell. (See http://www.dnaftb.org/dnaftb/ for basic information about genetics.) These genetic sequences contain information that influences our physical traits, our likelihood of suffering from disease, and the responses of our bodies to substances that we encounter in the environment.
The genetic sequences of different people are remarkably similar. When the chromosomes of two humans are compared, their DNA sequences can be identical for hundreds of bases. But at about one in every 1,200 bases, on average, the sequences will differ (Figure 1). One person might have an A at that location, while another person has a G, or a person might have extra bases at a given location or a missing segment of DNA. Each distinct "spelling" of a chromosomal region is called an allele, and a collection of alleles in a person's chromosomes is known as a genotype.
Differences in individual bases are by far the most common type of genetic variation. These genetic differences are known as single nucleotide polymorphisms, or SNPs (pronounced "snips"). By identifying most of the approximately 10 million SNPs estimated to occur commonly in the human genome, the International HapMap Project is identifying the basis for a large fraction of the genetic diversity in the human species.
For geneticists, SNPs act as markers to locate genes in DNA sequences. Say that a spelling change in a gene increases the risk of suffering from high blood pressure, but researchers do not know where in our chromosomes that gene is located. They could compare the SNPs in people who have high blood pressure with the SNPs of people who do not. If a particular SNP is more common among people with hypertension, that SNP could be used as a pointer to locate and identify the gene involved in the disease.
However, testing all of the 10 million common SNPs in a person's chromosomes would be extremely expensive. The development of the HapMap will enable geneticists to take advantage of how SNPs and other genetic variants are organized on chromosomes. Genetic variants that are near each other tend to be inherited together. For example, all of the people who have an A rather than a G at a particular location in a chromosome can have identical genetic variants at other SNPs in the chromosomal region surrounding the A. These regions of linked variants are known as haplotypes.

In many parts of our chromosomes, just a handful of haplotypes are found in humans. [See The Origins of Haplotypes.] In a given population, 55 percent of people may have one version of a haplotype, 30 percent may have another, 8 percent may have a third, and the rest may have a variety of less common haplotypes. The International HapMap Project is identifying these common haplotypes in four populations from different parts of the world. It also is identifying "tag" SNPs that uniquely identify these haplotypes. By testing an individual's tag SNPs (a process known as genotyping), researchers will be able to identify the collection of haplotypes in a person's DNA. The number of tag SNPs that contain most of the information about the patterns of genetic variation is estimated to be about 300,000 to 600,000, which is far fewer than the 10 million common SNPs.

Once the information on tag SNPs from the HapMap is available, researchers will be able to use them to locate genes involved in medically important traits. Consider the researcher trying to find genetic variants associated with high blood pressure. Instead of determining the identity of all SNPs in a person's DNA, the researcher would genotype a much smaller number of tag SNPs to determine the collection of haplotypes present in each subject. The researcher could focus on specific candidate genes that may be associated with a disease, or even look across the entire genome to find chromosomal regions that may be associated with a disease. If people with high blood pressure tend to share a particular haplotype, variants contributing to the disease might be somewhere within or near that haplotype.

Transition and transcersion

DNA substitution mutations are of two types. Transitions are interchanges of two-ring purines (A G) or of one-ring pyrimidines (C T): they therefore involve bases of similar shape. Transversions are interchanges of purine for pyrimidine bases, which therefore involve exchange of one-ring and two-ring structures.

Although there are twice as many possible transversions, because of the molecular mechanisms by which they are generated, transition mutations are generated at higher frequency than transversions. As well, transitions are less likely to result in amino acid substitutions (due to "wobble"), and are therefore more likely to persist as "silent substitutions" in populations as single nucleotide polymorphisms (SNPs).

Wednesday, December 8, 2010

make a simulation for 7550R

#train parameters from the raw fastq file (101bp, pair-end)

>maq simutrain 7550.simupars.dat 7550X1_101122_SN141_0314_B80R4NABXX_1_1.txt

#we want to make test on hg19 Chr20 only. Chr20 has ~63 million(63025520) bp , to get a 30x coverage with 101bp read length, we need 30*63000000/101 = 18712871 reads. (rounds to 19 million reads) .

>maq simulate -d 200 -s 20 -N 19000000 7550.chr20.1.fq 7550.chr20.2.fq chr20.fa 7550.simupars.dat > 7550.chr20.snps

#do alignment

>novoalign -o SAM $'@RG\tID:YING\tPL:ILLUMINA\tLB:LB_TEST\tSM:6966R\tCN:HCI' -F ILMFQ -d chr20.ndx -f 7550.chr20.1.fq 7550.chr20.2.fq > 7550.chr20.sam 2>7550.chr20.log

Monday, December 6, 2010

VCF

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40

clustered SNP

clustered SNP calls are suspicious and are often caused by indels. For example, finding 3 SNPs in a 10bp window.

Thursday, December 2, 2010

Read group



For example, to create a paired end same file with fully formatted read groups, you can run:

bwa sampe -i MARK -m DEPRISTO -l LIBRARY -p ILLUMINA -h Homo_sapiens_assembly18.fasta first.fastq.sai secondfastq.sai first.fastq second.fastq > pe.sam
This will result in a file that looks like:

@HD     VN:1.0  SO:unsorted
@SQ     SN:chrM LN:16571
@SQ     SN:chr1 LN:247249719
@SQ     SN:chr2 LN:242951149
[More @SQ lines removed for clarity]
@RG     ID:MARK SM:DEPRISTO     LB:LIBRARY      PL:ILLUMINA
61CC3AAXX100125:6:48:11631:21212        81      chrM    8       37      76M     chr17   21944837        0       GGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGAT    99:<->=D>IF7.:7,H

Next Generation Sequencing and Data Analysis