Monday, November 14, 2011

prepare GATK resource for pipeline

Three VCF files of known SNPs from the GATK resource bundle

1000G_omni2.5.hg19.sites.vcf -> hg19.1000G_omni2.5.vcf
hapmap_3.3.hg19.sites.vcf -> hg19.hapmap_3.3.vcf
dbsnp_132.hg19.vcf -> hg19.dbsnp_132.vcf

Also copy the corresponding ".idx" files to DATA_PATH


mv 1000G_omni2.5.hg19.sites.vcf hg19.1000G_omni2.5.vcf
mv hapmap_3.3.hg19.sites.vcf hg19.hapmap_3.3.vcf
mv dbsnp_132.hg19.vcf hg19.dbsnp_132.vcf

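GATK rejects the dbSNP file because its contig order differs from the reference (chrM is first in the VCF but last in the reference):

##### ERROR MESSAGE: Input files /scratch/local/4/hg19.dbsnp_132.vcf and reference have incompatible contigs: Order of contigs differences, which is unsafe.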
##### ERROR /scratch/local/4/hg19.dbsnp_132.vcf contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]
##### ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]


To fix this, move the chrM records to the end of the file:

1. Download "dbsnp_132.hg19.vcf.gz" from "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/1.2/hg19/"

2. gunzip dbsnp_132.hg19.vcf.gz

3. awk '/^chrM/ {print $0}' dbsnp_132.hg19.vcf > m.vcf

4. awk '/^chrM/ {next} {print $0}' dbsnp_132.hg19.vcf > wom.vcf

5. cat m.vcf >> wom.vcf

6. mv wom.vcf hg19.dbsnp_132.vcf

7. vcf-validator hg19.dbsnp_132.vcf
The header tag contig already exists, ignoring.
...same as above...
Warning: Duplicate entries, for example chr1:120995

8. vcf-validator dbsnp_132.hg19.vcf
The header tag contig already exists, ignoring.
...same as above...
Warning: Duplicate entries, for example chrM:16189


9. /home/u0592675/vcftools_0.1.7/bin/vcf-validator hg19.dbsnp_132.vcf

10. /home/u0592675/vcftools_0.1.7/bin/vcf-validator dbsnp_132.hg19.vcf
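
Steps 3 to 6 above can also be done in a single awk pass, without the temporary m.vcf and wom.vcf files. This is only a sketch: the function name move_chrM_last is made up for illustration, and it assumes the chrM records fit in memory.

```shell
# Hypothetical helper: move all chrM records to the end of a VCF in one pass.
# Usage: move_chrM_last input.vcf > output.vcf
move_chrM_last() {
    awk -F'\t' '
        BEGIN        { n = 0 }
        /^#/         { print; next }           # header lines pass through unchanged
        $1 == "chrM" { held[n++] = $0; next }  # hold chrM records in memory
                     { print }                 # all other records keep their order
        END          { for (i = 0; i < n; i++) print held[i] }
    ' "$1"
}
```

For example: move_chrM_last dbsnp_132.hg19.vcf > hg19.dbsnp_132.vcf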



awk '/^chr/ {print $1}' /tomato/data/hg19.dbsnp_132.vcf | sort |uniq
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY
chrM

awk '/^chr/ {print $1}' hg19.1000G_omni2.5.vcf | sort |uniq
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY


awk '/^chr/ {print $1}' hg19.hapmap_3.3.vcf | sort |uniq
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX


1.
awk '/^chrM/ {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > m.vcf
awk '/^chrM/ {next} {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > wom.vcf
cat m.vcf >> wom.vcf
rm m.vcf
mv wom.vcf hg19.dbsnp_132.vcf

2.
vim hg19.dbsnp_132.vcf
In the header section, delete the ##contig lines for all non-mapped contigs
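
The manual header edit in step 2 could be scripted roughly as below. This is a sketch under two assumptions that are mine, not from the notes above: "non-mapped" is taken to mean contigs that have no records in the file, and the header lines have the form ##contig=<ID=name,...>. The function name is also made up.

```shell
# Hypothetical helper: drop ##contig header lines whose ID has no records.
# Assumes header lines of the form ##contig=<ID=name,...>.
# Usage: drop_unused_contig_headers input.vcf > output.vcf
drop_unused_contig_headers() {
    awk -F'\t' '
        # Pass 1: remember every contig that has at least one record.
        NR == FNR { if ($0 !~ /^#/) used[$1] = 1; next }
        # Pass 2: skip ##contig lines for contigs never seen in pass 1.
        /^##contig=/ {
            if (match($0, /ID=[^,>]+/)) {
                id = substr($0, RSTART + 3, RLENGTH - 3)
                if (!(id in used)) next
            }
        }
        { print }
    ' "$1" "$1"
}
```

The file is read twice, so the input has to be a regular file rather than a pipe.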

hg19.hapmap_3.3.vcf


hg push ssh://hiseq@hci-bio2.hci.utah.edu:/home/hiseq/tomato/
hg clone http://hci-bio2.hci.utah.edu:8011 tomato
