Wednesday, January 2, 2013

EUR_AF>0.05 from 1000G release with 1029 genomes

#1. download 

$wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz

#2. unzip

$gunzip ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz

#3. size?

$wc -l ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz

39706744 


#4. Get the variants whose AF>=0.05 in EUR population

$grep -o ".*EUR_AF=\([0-1]\.[0-9]*\)" ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf | awk -F'EUR_AF=' '{if ($NF >= 0.05) print}' > body.vcf

#5. Header

$head -n 100 ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf | grep '^#'
> head.txt

#6. cat

$cat head.txt body.vcf > 1000g.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.EUR_AF_0.05.vcf

Tuesday, October 2, 2012

install bamtools

bamtools can be used to calculate the statistics of alignment using BWA. i.e. the number of reads that mapped to the reference genome #install cmake wget http://www.cmake.org/files/v2.8/cmake-2.8.9.tar.gz tar -zxvf cmake-2.8.9.tar.gz cd cmake-2.8.9 ./configure;make sudo make install #install zlib wget http://zlib.net/zlib-1.2.7.tar.gz tar -zxvf zlib-1.2.7.tar.gz cd zlib-1.2.7 ./configure;make sudo make install #install bamtools https://github.com/pezmaster31/bamtools/wiki/Building-and-installing git clone git://github.com/pezmaster31/bamtools.git cd bamtools; mkdir build; cd build cmake ..; make cd ../bin ./bamtools

Thursday, August 23, 2012

dbSNP137 for hg19


 1. wget ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz

 2. gunzip 00-All.vcf.gz

 3. awk '/^#/ {print $0}' 00-All.vcf > head.txt

 4. sed -i 's/chrMT/chrM/g' head.txt

 5. awk '/^#/ {next}{print $0}' 00-All.vcf |  sed 's/^/chr/' > 1.vcf

 6. sed -i 's/chrMT/chrM/g' 1.vcf
 
 7. cat head.txt 1.vcf > hg19.dbsnp.vcf 
 
 8. IGVTools/igvtools index hg19.dbsnp.vcf

 9. awk '/^#/ {next}{print $1}' hg19.dbsnp.vcf | sort |uniq

Tuesday, June 5, 2012

fastq version


Reference: http://en.wikipedia.org/wiki/FASTQ_format

Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126

Illumina 1.3+ format can encode a Phred quality score from 0 to 62 using ASCII 64 to 126

So we can take few samples and see the range of scores. If any scores fall into the range of 33 ~ 63 then

it would be CASAVA 1.3~1.7(Illumina format). Otherwise it would be CASAVA 1.8 (Sanger format, which is gaining popularity)

For BWA, use "bwa aln -I" for Illumina format. For Sanger format, remove the "-I".

For Novoalign, use "novoalign -F ILMFQ" for Illumina format. For Sanger format, use "novoalign -F STDFQ" or "novoalign -F ILM1.8".

Friday, May 25, 2012

Show all bases with IGV


Today someone asked me if I can make a revised version of snpview to support "Show all bases".

Manual operations shows that it required to right-click the alignment and select "Show all bases" from popped menu.

http://www.broadinstitute.org/igv/PortCommands

The supported socket commands for IGV does not include this operation. (In fact it only supports a very few commands)

Since IGV is open resource I decide to check out the source code to see if I can change the code then build a "Show all bases by default" IGV.

1. Find which source file is responsible for this option

>grep -r -n "Show all bases" IGVDistribution_2.1.14/src/

src/org/broad/igv/sam/AlignmentTrack.java:1533:            final JMenuItem item = new JCheckBoxMenuItem("Show all bases");

2. OK let us check this file "src/org/broad/igv/sam/AlignmentTrack.java"

vim +1533 src/org/broad/igv/sam/AlignmentTrack.java


The context shows:
*************************************************************
public void addShowAllBasesMenuItem() {
            // Change track height by attribute
            final JMenuItem item = new JCheckBoxMenuItem("Show all bases");

            if (renderOptions.colorOption == ColorOption.BISULFITE || renderOptions.colorOption == ColorOption.NOMESEQ) {
                //    item.setEnabled(false);
            } else {
                item.setSelected(renderOptions.showAllBases);
            }
            item.addActionListener(new ActionListener() {
                public void actionPerformed(ActionEvent aEvt) {
                    renderOptions.showAllBases = item.isSelected();
                    refresh();
                }
            });
            add(item);
        }
*************************************************************


It looks like "renderOptions.showAllBases" controls the switch on/off.

3. Check the  "renderOptions.showAllBases"
grep -r -n "renderOptions.showAllBases" IGVDistribution_2.1.14/src/

src/org/broad/igv/sam/AlignmentTrack.java:1538:                item.setSelected(renderOptions.showAllBases);
src/org/broad/igv/sam/AlignmentTrack.java:1542:                    renderOptions.showAllBases = item.isSelected();
src/org/broad/igv/sam/AlignmentRenderer.java:556:                if (renderOptions.showMismatches || renderOptions.showAllBases) {
src/org/broad/igv/sam/AlignmentRenderer.java:635:        boolean showAllBases = renderOptions.showAllBases &&



Notice the last line defines a boolean variable "boolean showAllBases ..."

Is this what I am looking for?

4. Further investigation
vim +635 src/org/broad/igv/sam/AlignmentRenderer.java


The context shows:
*************************************************************
// Disable showAllBases in bisulfite mode
        boolean showAllBases = renderOptions.showAllBases &&
                !(colorOption == ColorOption.BISULFITE || colorOption == ColorOption.NOMESEQ);

*************************************************************

Now it is quite clear: "showAllBases" controls the on/off of "Show all bases".

5. Now let us make a dirty patch

*************************************************************
// Disable showAllBases in bisulfite mode
//        boolean showAllBases = renderOptions.showAllBases &&
//                !(colorOption == ColorOption.BISULFITE || colorOption == ColorOption.NOMESEQ);

boolean showAllBases = true;

*************************************************************

We just set "true" to it for always "on"! Like I said, it is dirty because it completely ignore all options there.
It may have potential problem, but at now, let us make a test.

6. Go back to the source and build

>ant


We will got a new "igv.jar". rename it as "igv_showallbases.jar" to discriminate the original one.
>mv igv.jar igv_showallbases.jar

7. Test!

java -jar igv_showallbases.jar

It works! Now by default all bases in alignment are showed. 

The user is happy because he got what he wants!

Tuesday, May 22, 2012

microRNA differential expression

Goal: 
Find the microRNAs that are differentially expressed (DE) between two groups. Each group consists of 6 ~ 7 biological samples which go through illumina HiSeq2000 sequencer. Single-end with 50bp length.

Method:
1. download gene table for microRNA from miRbase
>wget ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff3


2. convert it from GFF into BED format(and filter non-microRNA)
>awk '/^#/{print;next;} $3 == "miRNA" {print "chr"$1"\t"$4"\t"$5"\t"$9}' hsa.gff3 > hsa.bed


3. Align reads to hg19 whole genome using novoalign
>novoalign -r None -d hg19.nix -f A_1.fq.gz A_2.fq.gz | gzip > A.sam.gz
>java -jar ~/picard/SortSam.jar INPUT=A.sam.gz OUTPUT=A.bam CREATE_INDEX=true SO=coordinate VERBOSITY=ERROR QUIET=true
>mv A.bai A.bam.bai 


4. Counting with bedtools
>~/BEDTools-Version-2.16.2/bin/bedtools multicov -bams A.bam -bed hsa.bed | awk '$4 !~ 0 {print $0}' > A.txt

(to be continued)