Next Generation Sequencing and Data Analysis: 2012

Tuesday, November 27, 2012

Tuesday, October 2, 2012

bamtools can be used to calculate the statistics of alignment using BWA. i.e. the number of reads that mapped to the reference genome #install cmake wget http://www.cmake.org/files/v2.8/cmake-2.8.9.tar.gz tar -zxvf cmake-2.8.9.tar.gz cd cmake-2.8.9 ./configure;make sudo make install #install zlib wget http://zlib.net/zlib-1.2.7.tar.gz tar -zxvf zlib-1.2.7.tar.gz cd zlib-1.2.7 ./configure;make sudo make install #install bamtools https://github.com/pezmaster31/bamtools/wiki/Building-and-installing git clone git://github.com/pezmaster31/bamtools.git cd bamtools; mkdir build; cd build cmake ..; make cd ../bin ./bamtools

Thursday, August 23, 2012

dbSNP137 for hg19


 1. wget ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz

 2. gunzip 00-All.vcf.gz

 3. awk '/^#/ {print $0}' 00-All.vcf > head.txt

 4. sed -i 's/chrMT/chrM/g' head.txt

 5. awk '/^#/ {next}{print $0}' 00-All.vcf |  sed 's/^/chr/' > 1.vcf

 6. sed -i 's/chrMT/chrM/g' 1.vcf
 
 7. cat head.txt 1.vcf > hg19.dbsnp.vcf 
 
 8. IGVTools/igvtools index hg19.dbsnp.vcf

 9. awk '/^#/ {next}{print $1}' hg19.dbsnp.vcf | sort |uniq

Tuesday, June 5, 2012

fastq version


Reference: http://en.wikipedia.org/wiki/FASTQ_format

Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126

Illumina 1.3+ format can encode a Phred quality score from 0 to 62 using ASCII 64 to 126

So we can take few samples and see the range of scores. If any scores fall into the range of 33 ~ 63 then

it would be CASAVA 1.3~1.7(Illumina format). Otherwise it would be CASAVA 1.8 (Sanger format, which is gaining popularity)

For BWA, use "bwa aln -I" for Illumina format. For Sanger format, remove the "-I".

For Novoalign, use "novoalign -F ILMFQ" for Illumina format. For Sanger format, use "novoalign -F STDFQ" or "novoalign -F ILM1.8".

Friday, May 25, 2012

Show all bases with IGV


Today someone asked me if I can make a revised version of snpview to support "Show all bases".

Manual operations shows that it required to right-click the alignment and select "Show all bases" from popped menu.

http://www.broadinstitute.org/igv/PortCommands

The supported socket commands for IGV does not include this operation. (In fact it only supports a very few commands)

Since IGV is open resource I decide to check out the source code to see if I can change the code then build a "Show all bases by default" IGV.

1. Find which source file is responsible for this option

>grep -r -n "Show all bases" IGVDistribution_2.1.14/src/

src/org/broad/igv/sam/AlignmentTrack.java:1533:            final JMenuItem item = new JCheckBoxMenuItem("Show all bases");

2. OK let us check this file "src/org/broad/igv/sam/AlignmentTrack.java"

vim +1533 src/org/broad/igv/sam/AlignmentTrack.java


The context shows:
*************************************************************
public void addShowAllBasesMenuItem() {
            // Change track height by attribute
            final JMenuItem item = new JCheckBoxMenuItem("Show all bases");

            if (renderOptions.colorOption == ColorOption.BISULFITE || renderOptions.colorOption == ColorOption.NOMESEQ) {
                //    item.setEnabled(false);
            } else {
                item.setSelected(renderOptions.showAllBases);
            }
            item.addActionListener(new ActionListener() {
                public void actionPerformed(ActionEvent aEvt) {
                    renderOptions.showAllBases = item.isSelected();
                    refresh();
                }
            });
            add(item);
        }
*************************************************************


It looks like "renderOptions.showAllBases" controls the switch on/off.

3. Check the  "renderOptions.showAllBases"
grep -r -n "renderOptions.showAllBases" IGVDistribution_2.1.14/src/

src/org/broad/igv/sam/AlignmentTrack.java:1538:                item.setSelected(renderOptions.showAllBases);
src/org/broad/igv/sam/AlignmentTrack.java:1542:                    renderOptions.showAllBases = item.isSelected();
src/org/broad/igv/sam/AlignmentRenderer.java:556:                if (renderOptions.showMismatches || renderOptions.showAllBases) {
src/org/broad/igv/sam/AlignmentRenderer.java:635:        boolean showAllBases = renderOptions.showAllBases &&



Notice the last line defines a boolean variable "boolean showAllBases ..."

Is this what I am looking for?

4. Further investigation
vim +635 src/org/broad/igv/sam/AlignmentRenderer.java


The context shows:
*************************************************************
// Disable showAllBases in bisulfite mode
        boolean showAllBases = renderOptions.showAllBases &&
                !(colorOption == ColorOption.BISULFITE || colorOption == ColorOption.NOMESEQ);

*************************************************************

Now it is quite clear: "showAllBases" controls the on/off of "Show all bases".

5. Now let us make a dirty patch

*************************************************************
// Disable showAllBases in bisulfite mode
//        boolean showAllBases = renderOptions.showAllBases &&
//                !(colorOption == ColorOption.BISULFITE || colorOption == ColorOption.NOMESEQ);

boolean showAllBases = true;

*************************************************************

We just set "true" to it for always "on"! Like I said, it is dirty because it completely ignore all options there.
It may have potential problem, but at now, let us make a test.

6. Go back to the source and build

>ant


We will got a new "igv.jar". rename it as "igv_showallbases.jar" to discriminate the original one.
>mv igv.jar igv_showallbases.jar

7. Test!

java -jar igv_showallbases.jar

It works! Now by default all bases in alignment are showed. 

The user is happy because he got what he wants!

Tuesday, May 22, 2012

microRNA differential expression

Goal: 
Find the microRNAs that are differentially expressed (DE) between two groups. Each group consists of 6 ~ 7 biological samples which go through illumina HiSeq2000 sequencer. Single-end with 50bp length.

Method:
1. download gene table for microRNA from miRbase
>wget ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff3


2. convert it from GFF into BED format(and filter non-microRNA)
>awk '/^#/{print;next;} $3 == "miRNA" {print "chr"$1"\t"$4"\t"$5"\t"$9}' hsa.gff3 > hsa.bed


3. Align reads to hg19 whole genome using novoalign
>novoalign -r None -d hg19.nix -f A_1.fq.gz A_2.fq.gz | gzip > A.sam.gz
>java -jar ~/picard/SortSam.jar INPUT=A.sam.gz OUTPUT=A.bam CREATE_INDEX=true SO=coordinate VERBOSITY=ERROR QUIET=true
>mv A.bai A.bam.bai 


4. Counting with bedtools
>~/BEDTools-Version-2.16.2/bin/bedtools multicov -bams A.bam -bed hsa.bed | awk '$4 !~ 0 {print $0}' > A.txt

(to be continued)

Wednesday, March 14, 2012

Aligner and Genotyper

Recently I ran into a problem:

By viewing the alignment BAM file in IGV, we can observe a significant SNP at a position. All reads that mapped to that regions shows this single mutation.

However, with GATK's UnifiedGenotyper, this SNP has a very low QUAL value (QUAL=7).

Not sure what caused this confusion, so at now I am trying to use two mapping applications novoalign and BWA as aligner, and two genotyper application UnifiedGenotyper and samtool/mpileup to call variances.

Ignore an error

I have a long problem when scripting the PBS script to cluster.

To stop a pipeline whenever any errors happened, we can trap it

#!/bin/bash

failclean()

{

echo 'catched!'

exit 1

}

trap 'failclean' ERR TERM

rm -fr *

touch b

echo 'hello,world' > 1.txt

Idonotexist

echo 'OK!'

We will get 'catched!' as output because the command 'Idonotexist' will be fail.

In the real case, the first step should be cleaning all existing file and folder to release all local disk space on each cluster node.

"rm -fr *"

However, this command may throw an error because the file permission

"rm: cannot remove abc.txt: Permission denied"

In fact, if we can not delete other files, it is fine. We just want to use as many disk space as possible.

However, with "trap" the above command will be caught by "trap" which in turn fail your whole script.

So, our questions is: how to ignore and supress any errors when using "rm -fr *".

The solution is:

"rm -fr * 2>/dev/null || true"

Now the final demo script looks like this:

#!/bin/bash

failclean()

{

echo 'catched!'

exit 1

}

trap 'failclean' ERR TERM

rm -fr * 2>dev/null || true

touch b

echo 'hello,world' > 1.txt

Idonotexist

echo 'OK!'

Tuesday, March 13, 2012

Failed: Install and use VAT

wget http://homes.gersteinlab.org/people/lh372/VAT/libbios-1.0.0.tar.gz

tar -zxvf libbios-1.0.0.tar.gz

cd libbios-1.0.0

./configure --prefix=/home/app/VAT/out/libbios-1.0.0;make;make install

wget ftp://ftp.gnu.org/gnu/gsl/gsl-1.15.tar.gz

tar -zxvf gsl-1.15.tar.gz

cd gsl-1.15

./configure --prefix=/home/app/VAT/out/gsl-1.15;;make;make install

#wget http://jpeglib.googlecode.com/files/jpeglib-0.2.0.zip

#hg clone https://bitbucket.org/pierrejoye/gd-libgd

#cd gd-libgd

#./configure --prefix=/home/app/VAT/out/libgd

#blatSuite34

wget http://hgwdev.cse.ucsc.edu/~kent/exe/linux/blatSuite.34.zip

unzip blatSuite.34.zip -d /home/app/VAT/out/blatSuite-34

wget http://libs3.ischo.com.s3.amazonaws.com/libs3-2.0.tar.gz

tar -zxvf libs3-2.0.tar.gz

cd libs3-2.0

make DESTDIR=/home/app/VAT/out/libs3 install

export CPPFLAGS="-I/home/app/VAT/out/gsl-1.15/include -I/home/app/VAT/out/libbios-1.0.0/include -I/home/app/VAT/out/libs3/include -ggdb"

export LDFLAGS="-L/home/app/VAT/out/gsl-1.15/lib -L/home/app/VAT/out/libbios-1.0.0/lib -L/home/app/VAT/out/libs3/lib"

wget https://s3.amazonaws.com/vat-releases/vat-2.0.0.tar.gz

tar -zxvf vat-2.0.0.tar.gz

cd vat-2.0.0

#vim default.vatrc

#TABIX_DIR /home/app/tabix

#VAT_EXEC_DIR /home/app/VAT/out/vat/bin

./configure --prefix=/home/app/VAT/out/vat;make;make install

export VAT_CONFIG_FILE=/home/app/.vatrc

cd /home/app/test

wget ftp://ftp.sanger.ac.uk/pub/gencode/release_11/gencode.v11.annotation.gtf.gz

gunzip gencode.v11.annotation.gtf.gz

awk '/\t(HAVANA|ENSEMBL)\tCDS\t/ {print}' gencode.v11.annotation.gtf | grep 'gene_type "protein_coding"' | grep 'transcript_type "protein_coding"' > gencode.v11.annotation.cds.gtpc.ttpc.gtf

/home/app/VAT/out/vat/bin/gencode2interval < gencode.v11.annotation.cds.gtpc.ttpc.gtf > gencode.v11.annotation.cds.gtpc.ttpc.interval

wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit

./faToTwoBit -noMask /tomato/data/hg19.fasta hg19.2bit

export LD_LIBRARY_PATH=/home/app/VAT/out/libbios-1.0.0/lib:/home/app/VAT/out/libs3/lib:/home/app/VAT/out/gsl-1.15/lib

/home/app/VAT/out/vat/bin/interval2sequences hg19.2bit gencode.v11.annotation.cds.gtpc.ttpc.interval exonic > gencode.v11.annotation.cds.gtpc.ttpc.exonic.fasta

#twoBitReadSeqFrag in chr6 start (53371710) >= end (53371710)

#Segmentation fault

/home/app/VAT/out/vat/bin/interval2sequences hg19.2bit gencode.v11.annotation.cds.gtpc.ttpc.interval genomic> gencode.v11.annotation.cds.gtpc.ttpc.genomic.fasta

#pass

/home/app/VAT/out/vat/bin/snpMapper gencode.v11.annotation.cds.gtpc.ttpc.interval gencode.v11.annotation.cds.gtpc.ttpc.genomic.fasta 1

#Segmentation fault

/home/app/VAT/out/vat/bin/indelMapper gencode.v11.annotation.cds.gtpc.ttpc.interval gencode.v11.annotation.cds.gtpc.ttpc.genomic.fasta 2

Thursday, March 1, 2012

R, DESeq and qvalue installation

wget http://cran.cnr.berkeley.edu/src/base/R-2/R-2.14.2.tar.gz
tar -zxvf R-2.14.2.tar.gz
cd R-2.14.2;./configure --prefix=/home/R-2.14.2;make;make install
cd /home/R-2.14.2/bin

wget http://genomics.princeton.edu/storeylab/qvalue/qvalue_1.1.tar.gz
./R CMD INSTALL qvalue_1.1.tar.gz
rm -f qvalue_1.1.tar.gz

source("http://www.bioconductor.org/biocLite.R")
biocLite("DESeq")

phastConsElements46way and avsift

wget ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/phastConsElements46way.txt.gz
gunzip phastConsElements46way.txt.gz
mv phastConsElements46way.txt hg19_phastConsElements46way.txt
mv hg19_phastConsElements46way.txt /tomato/data
/tomato/app/annovar/convert2annovar.pl x1.pass.vcf -format vcf4 > x1.annovar
/tomato/app/annovar/annotate_variation.pl -regionanno --buildver hg19 -dbtype mce46way x1.annovar /tomato/data/

/tomato/app/annovar/annotate_variation.pl -downdb -buildver hg19 avsift .
mv hg19_avsift.txt* /tomato/data
/tomato/app/annovar/convert2annovar.pl x1.pass.vcf -format vcf4 > x1.annovar
/tomato/app/annovar/annotate_variation.pl -filter -buildver hg19 -dbtype avsift x1.annovar /tomato/data/

Monday, February 27, 2012

Intersecting VCF

We can use vcftools (together with bgzip and tabix) to intersect VCF files. The first 4 columns for each VCF file(chr, position, ref and alt) can be easily overlapped. However it is undefined on how to "intersecting scores" in VCF files because it is not a simple arithmetic operation. To generate a VCF format compatible file, vcftools used the scores from the first VCF file only when intersecting multiple VCF files. This is simple but not a perfect solution.

Another possible solution is to merge the BAM firstly before calling variances. Therefore we will have only one VCF file after calling variances.

Is this perfect solution? Maybe. Merging alignment BAM files from different samples (individuals) is not well studied. We need a well known dataset to benchmark these two methods.

Tuesday, February 21, 2012

MarkDuplicate is very necessary

Recently I did an analysis which basically comparing the positions of nucleosome in two samples to see if the pattern differs.

The target region is only 1500bp in length. HiSeq generated big raw read sequence files. After aligning to hg19, I can see many alignemnts are incorrect because they were mapped to other chromosomes and positions.

The first analysis without MarkDuplicate from Picard in BAM files result in huge peaks of depth of coverage which basically drowns any useful signals. After removing duplicates, the peaks were totally removed and I got strong signals.

Later I found another problem in GATK's DepthOfCoverage which reports 1000 as maximum depth of coverage. For example, even the actual depth of coverage on two positions are 5000 and 1020 respectively, GATK reports the same 1000.

As an alternative solution, samtools can report the actual depth of coverage without truncating.

samtools mpileup -ABr chr1:12345-123456 -d 1000000 x1.bam | awk '{print $4}' > x1.cov
samtools mpileup -ABr chr1:12345-123456 -d 1000000 x2.bam | awk '{print $4}' > x2.cov

Extracting the 4th column that has the depth of coverage.

The final step is plotting with R.


x1=as.numeric(unlist(read.table('x1.cov')))
x2=as.numeric(unlist(read.table('x2.cov')))
pdf(file="1.pdf") 
op = par( mfrow = c( 3, 1 ))
barplot(x1)
title("X1")
barplot(x2)
title("X2")
op = par(op) 
dev.off() 
pdf(file="2.pdf") 
plot(x1,type='l',col='green',xlab='chr1:12345-123456',ylab='Depth of Coverage')
lines(x2,type='l',col='red')
legend(10, 8000, c("X1", "X2"), col = c("green","red"), text.col = "green4", lty = c(3,4), pch = c(3,4), merge = TRUE, bg = 'gray90')
dev.off()

Remove any variances that were in CEU population


#use CEU as common variance database, not dbSNP
import sys
def filter_ceu(target_file,background_file='CEU.all.ids'):
    x = []
    for line in file(background_file):
        x.append(line.strip())
    mx = {}.fromkeys(x)
    for line in file(target_file):
        line = line.strip()
        if line.startswith('#'):
            pass
            #print line
        else:
            toks = line.split('\t')
            if not mx.has_key(toks[2]):
                print line
            else:
                pass
    
if __name__=='__main__':
    filter_ceu(sys.argv[1])

Tuesday, January 3, 2012

1000Genome Population

CHB Han Chinese Han Chinese in Beijing, China
CHS Southern Han Chinese Han Chinese South
CDX Dai Chinese Chinese Dai in Xishuangbanna, China
CHD Denver Chinese Chinese in Denver, Colorado (pilot 3 only)
JPT Japanese Japanese in Tokyo, Japan
KHV Kinh Vietnamese Kinh in Ho Chi Minh City, Vietnam

CEU CEPH Utah residents (CEPH) with Northern and Western European ancestry
TSI Tuscan Toscani in Italia
GBR British British in England and Scotland
FIN Finnish Finnish in Finland
IBS Spanish Iberian populations in Spain

YRI Yoruba Yoruba in Ibadan, Nigeria
LWK Luhya Luhya in Webuye, Kenya
GWD Gambian Gambian in Western Division, The Gambia
GHN Ghanaian Ghanaian in Navrongo, Ghana
MAB Malawian Malawian in Blantyre, Malawi

ASW African-American SW African Ancestry in Southwest US
AJM African-American MS African American in Jackson, Mississippi
ACB African-Caribbean African Caribbean in Barbados

MXL Mexican-American Mexican Ancestry in Los Angeles, California
CLM Colombian Colombian in Medellin, Colombia
PEL Peruvian Peruvian in Lima, Peru
PUR Puerto Rican Puerto Rican in Puerto Rico

Next Generation Sequencing and Data Analysis