Recently I need to use the variants from 1000G's CEU as general background(control).
One way is to directly download the pre-identified variants from the ftp site.
Or the better way is to process CEU raw sequencing data and our case samples with our own pipelines. Keep everything (apps and parameters) the same.
#ped file has the sample name for CEU population
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20111108_samples_pedigree/G1K_samples_20111108.ped
#based on the sample info, download the raw sequencing data from 1000g's ftp, save under subdir "CEU_samples"
awk '$2 ~ /^NA/ {print $2}' G1K_samples_20111108.ped | sort | uniq | xargs -I sample_name wget -r -nH --cut-dirs=3 -P CEU_samples "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/"sample_name"/sequence_read" || true
#it will take a while and lots of disk space...
Tuesday, January 15, 2013
Wednesday, January 2, 2013
EUR_AF>0.05 from 1000G release with 1029 genomes
#1. download $wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz #2. unzip $gunzip ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz #3. size? $wc -l ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz 39706744 #4. Get the variants whose AF>=0.05 in EUR population $grep -o ".*EUR_AF=\([0-1]\.[0-9]*\)" ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf | awk -F'EUR_AF=' '{if ($NF >= 0.05) print}' > body.vcf #5. Header $head -n 100 ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf | grep '^#' > head.txt #6. cat $cat head.txt body.vcf > 1000g.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.EUR_AF_0.05.vcf
Subscribe to:
Posts (Atom)