Monday, October 31, 2011

Resource Bundle

ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/1.2

Thursday, October 27, 2011

Known SNPs

Three VCF files of known SNPs from the GATK resource bundle, renamed as follows:

1000G_omni2.5.hg19.sites.vcf -> hg19.1000G_omni2.5.vcf
hapmap_3.3.hg19.sites.vcf -> hg19.hapmap_3.3.vcf
dbsnp_132.hg19.vcf -> hg19.dbsnp_132.vcf

Also copy the corresponding ".idx" files to DATA_PATH.
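
The copy and rename can be scripted. A minimal sketch, assuming the bundle files sit in one directory; the function name and its two arguments are my own, only the file names and DATA_PATH come from the notes above. It uses cp -p so each index keeps its timestamp relative to its VCF.

```shell
# Sketch: copy the three bundle VCFs (and their .idx files) under the new names.
# copy_known_snps is a hypothetical helper, not part of GATK.
copy_known_snps() {
    bundle_dir=$1   # directory holding the downloaded bundle files
    data_path=$2    # destination, e.g. "$DATA_PATH"
    # "old name:new name" pairs, as listed above
    for pair in \
        "1000G_omni2.5.hg19.sites.vcf:hg19.1000G_omni2.5.vcf" \
        "hapmap_3.3.hg19.sites.vcf:hg19.hapmap_3.3.vcf" \
        "dbsnp_132.hg19.vcf:hg19.dbsnp_132.vcf"
    do
        old=${pair%%:*}
        new=${pair##*:}
        cp -p "$bundle_dir/$old" "$data_path/$new" || return 1
        # copy the index too, if it exists
        if [ -f "$bundle_dir/$old.idx" ]; then
            cp -p "$bundle_dir/$old.idx" "$data_path/$new.idx" || return 1
        fi
    done
}
```

Usage would be something like: copy_known_snps /path/to/bundle "$DATA_PATH"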

Friday, October 14, 2011

I am running out of disk space

Again.

Multiple hits? All or None?

#-r All
57G Oct 14 09:44 1.sam
65G Oct 14 10:06 2.sam
65G Oct 14 10:26 3.sam
45G Oct 14 10:46 4.sam
56G Oct 14 11:09 5.sam


#-r None
56G Oct 11 02:10 1.sam
64G Oct 11 02:32 2.sam
63G Oct 11 02:53 3.sam
44G Oct 11 03:07 4.sam
54G Oct 11 03:25 5.sam
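
To compare the two runs, the first column of each listing can be totalled. A minimal sketch, assuming the listing lines are piped in exactly as shown above (size in G in the first column); sum_gb is a hypothetical helper name.

```shell
# Sketch: total the size column of lines like "57G Oct 14 09:44 1.sam".
# Assumes every size is given in G, as in the listings above.
sum_gb() {
    awk '{ sub(/G$/, "", $1); total += $1 } END { print total "G" }'
}
```

For the listings above this gives 288G for -r All and 281G for -r None, so reporting all hits costs about 7G more across the five files.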

Thursday, October 13, 2011

pre-processing of masked FASTA for novoalign

1. Convert lower case to upper case
sed -i 's/\(.*\)/\U\1/' hg19.fasta

>chr1
NNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNN
ggggggttttTGaaaaaaaCCC

becomes

>CHR1
NNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNN
GGGGGGTTTTTGAAAAAAACCC

2. Convert '>CHR' back to '>chr' (step 1 also upper-cased the header lines)
sed -i -e 's/>CHR/>chr/g' hg19.fasta

3. Convert N to n
sed -i -e 's/N/n/g' hg19.masked.fasta

4. Mask these regions: build the index with -m so the lower-case sequence is masked
novoindex -m hg19.nix hg19.fasta
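
The sed commands in steps 1 to 3 run on every line, header lines included. Step 2 only restores the '>chr' prefix, so a sequence name containing other lower-case letters or an N would come out changed. A minimal sketch of steps 1 to 3 in a single pass that leaves header lines alone; prep_fasta and the output file name are hypothetical, and this is an alternative to the commands above, not what was actually run.

```shell
# Sketch: upper-case the sequence lines, then turn N into n,
# while passing header lines (starting with ">") through unchanged.
prep_fasta() {
    awk '
        /^>/ { print; next }                    # header: keep as-is
        { seq = toupper($0)                     # step 1: lower case to upper case
          gsub(/N/, "n", seq)                   # step 3: N to n
          print seq }
    ' "$1" > "$2"
}
```

Usage would be something like: prep_fasta hg19.fasta hg19.prepped.fasta, followed by the novoindex -m command from step 4 on the new file.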