#1. download
$wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz
#2. unzip
$gunzip ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz
#3. size?
$wc -l ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz
39706744
#4. Get the variants whose AF>=0.05 in EUR population
$grep -o ".*EUR_AF=\([0-1]\.[0-9]*\)" ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf | awk -F'EUR_AF=' '{if ($NF >= 0.05) print}' > body.vcf
#5. Header
$head -n 100 ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf | grep '^#'
> head.txt
#6. cat
$cat head.txt body.vcf > 1000g.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.EUR_AF_0.05.vcf
Wednesday, January 2, 2013
EUR_AF>0.05 from 1000G release with 1029 genomes
Tuesday, November 27, 2012
Tuesday, October 2, 2012
install bamtools
bamtools can be used to calculate the statistics of alignment using BWA. i.e. the number of reads that mapped to the reference genome
#install cmake
wget http://www.cmake.org/files/v2.8/cmake-2.8.9.tar.gz
tar -zxvf cmake-2.8.9.tar.gz
cd cmake-2.8.9
./configure;make
sudo make install
#install zlib
wget http://zlib.net/zlib-1.2.7.tar.gz
tar -zxvf zlib-1.2.7.tar.gz
cd zlib-1.2.7
./configure;make
sudo make install
#install bamtools https://github.com/pezmaster31/bamtools/wiki/Building-and-installing
git clone git://github.com/pezmaster31/bamtools.git
cd bamtools; mkdir build; cd build
cmake ..; make
cd ../bin
./bamtools
Thursday, August 23, 2012
dbSNP137 for hg19
1. wget ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz
2. gunzip 00-All.vcf.gz
3. awk '/^#/ {print $0}' 00-All.vcf > head.txt
4. sed -i 's/chrMT/chrM/g' head.txt
5. awk '/^#/ {next}{print $0}' 00-All.vcf | sed 's/^/chr/' > 1.vcf
6. sed -i 's/chrMT/chrM/g' 1.vcf
7. cat head.txt 1.vcf > hg19.dbsnp.vcf
8. IGVTools/igvtools index hg19.dbsnp.vcf
9. awk '/^#/ {next}{print $1}' hg19.dbsnp.vcf | sort |uniq
Tuesday, June 5, 2012
fastq version
Reference: http://en.wikipedia.org/wiki/FASTQ_format Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126 Illumina 1.3+ format can encode a Phred quality score from 0 to 62 using ASCII 64 to 126 So we can take few samples and see the range of scores. If any scores fall into the range of 33 ~ 63 then it would be CASAVA 1.3~1.7(Illumina format). Otherwise it would be CASAVA 1.8 (Sanger format, which is gaining popularity) For BWA, use "bwa aln -I" for Illumina format. For Sanger format, remove the "-I". For Novoalign, use "novoalign -F ILMFQ" for Illumina format. For Sanger format, use "novoalign -F STDFQ" or "novoalign -F ILM1.8".
Friday, May 25, 2012
Show all bases with IGV
Today someone asked me if I can make a revised version of snpview to support "Show all bases".
Manual operations shows that it required to right-click the alignment and select "Show all bases" from popped menu.
http://www.broadinstitute.org/igv/PortCommands
The supported socket commands for IGV does not include this operation. (In fact it only supports a very few commands)
Since IGV is open resource I decide to check out the source code to see if I can change the code then build a "Show all bases by default" IGV.
1. Find which source file is responsible for this option
>grep -r -n "Show all bases" IGVDistribution_2.1.14/src/
src/org/broad/igv/sam/AlignmentTrack.java:1533: final JMenuItem item = new JCheckBoxMenuItem("Show all bases");
2. OK let us check this file "src/org/broad/igv/sam/AlignmentTrack.java"
vim +1533 src/org/broad/igv/sam/AlignmentTrack.java
The context shows:
*************************************************************
public void addShowAllBasesMenuItem() {
// Change track height by attribute
final JMenuItem item = new JCheckBoxMenuItem("Show all bases");
if (renderOptions.colorOption == ColorOption.BISULFITE || renderOptions.colorOption == ColorOption.NOMESEQ) {
// item.setEnabled(false);
} else {
item.setSelected(renderOptions.showAllBases);
}
item.addActionListener(new ActionListener() {
public void actionPerformed(ActionEvent aEvt) {
renderOptions.showAllBases = item.isSelected();
refresh();
}
});
add(item);
}
*************************************************************
It looks like "renderOptions.showAllBases" controls the switch on/off.
3. Check the "renderOptions.showAllBases"
grep -r -n "renderOptions.showAllBases" IGVDistribution_2.1.14/src/
src/org/broad/igv/sam/AlignmentTrack.java:1538: item.setSelected(renderOptions.showAllBases);
src/org/broad/igv/sam/AlignmentTrack.java:1542: renderOptions.showAllBases = item.isSelected();
src/org/broad/igv/sam/AlignmentRenderer.java:556: if (renderOptions.showMismatches || renderOptions.showAllBases) {
src/org/broad/igv/sam/AlignmentRenderer.java:635: boolean showAllBases = renderOptions.showAllBases &&
Notice the last line defines a boolean variable "boolean showAllBases ..."
Is this what I am looking for?
4. Further investigation
vim +635 src/org/broad/igv/sam/AlignmentRenderer.java
The context shows:
*************************************************************
// Disable showAllBases in bisulfite mode
boolean showAllBases = renderOptions.showAllBases &&
!(colorOption == ColorOption.BISULFITE || colorOption == ColorOption.NOMESEQ);
*************************************************************
Now it is quite clear: "showAllBases" controls the on/off of "Show all bases".
5. Now let us make a dirty patch
*************************************************************
// Disable showAllBases in bisulfite mode
// boolean showAllBases = renderOptions.showAllBases &&
// !(colorOption == ColorOption.BISULFITE || colorOption == ColorOption.NOMESEQ);
boolean showAllBases = true;
*************************************************************
We just set "true" to it for always "on"! Like I said, it is dirty because it completely ignore all options there.
It may have potential problem, but at now, let us make a test.
6. Go back to the source and build
>ant
We will got a new "igv.jar". rename it as "igv_showallbases.jar" to discriminate the original one.
>mv igv.jar igv_showallbases.jar
7. Test!
java -jar igv_showallbases.jar
It works! Now by default all bases in alignment are showed.
The user is happy because he got what he wants!
Tuesday, May 22, 2012
microRNA differential expression
Goal:
Find the microRNAs that are differentially expressed (DE) between two groups. Each group consists of 6 ~ 7 biological samples which go through illumina HiSeq2000 sequencer. Single-end with 50bp length.
Method:
1. download gene table for microRNA from miRbase
>wget ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff3
2. convert it from GFF into BED format(and filter non-microRNA)
>awk '/^#/{print;next;} $3 == "miRNA" {print "chr"$1"\t"$4"\t"$5"\t"$9}' hsa.gff3 > hsa.bed
3. Align reads to hg19 whole genome using novoalign
>novoalign -r None -d hg19.nix -f A_1.fq.gz A_2.fq.gz | gzip > A.sam.gz
>java -jar ~/picard/SortSam.jar INPUT=A.sam.gz OUTPUT=A.bam CREATE_INDEX=true SO=coordinate VERBOSITY=ERROR QUIET=true
>mv A.bai A.bam.bai
4. Counting with bedtools
>~/BEDTools-Version-2.16.2/bin/bedtools multicov -bams A.bam -bed hsa.bed | awk '$4 !~ 0 {print $0}' > A.txt
(to be continued)
Subscribe to:
Posts (Atom)