Tuesday, January 15, 2013

1000g CEU population raw data extraction

Recently I need to use the variants from 1000G's CEU as general background(control). One way is to directly download the pre-identified variants from the ftp site. Or the better way is to process CEU raw sequencing data and our case samples with our own pipelines. Keep everything (apps and parameters) the same.


#ped file has the sample name for CEU population


wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20111108_samples_pedigree/G1K_samples_20111108.ped


#based on the sample info, download the raw sequencing data from 1000g's ftp, save under subdir "CEU_samples"


awk '$2 ~ /^NA/ {print $2}' G1K_samples_20111108.ped | sort | uniq | xargs -I sample_name wget -r -nH --cut-dirs=3 -P CEU_samples "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/"sample_name"/sequence_read" || true


#it will take a while and lots of disk space...


Wednesday, January 2, 2013

EUR_AF>0.05 from 1000G release with 1029 genomes

#1. download 

$wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz

#2. unzip

$gunzip ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz

#3. size?

$wc -l ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz

39706744 


#4. Get the variants whose AF>=0.05 in EUR population

$grep -o ".*EUR_AF=\([0-1]\.[0-9]*\)" ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf | awk -F'EUR_AF=' '{if ($NF >= 0.05) print}' > body.vcf

#5. Header

$head -n 100 ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf | grep '^#'
> head.txt

#6. cat

$cat head.txt body.vcf > 1000g.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.EUR_AF_0.05.vcf