Sunday, September 8, 2013

sampling(subset) NA12878

Based on the discussion from http://www.biostars.org/p/6544/

For testing purpose, we need to get 10x, 20x and 40x WGS coverage using sample NA12878 from 1KG.
The raw fastq has ~643 million paired reads, which is equivalent to ~43x WGS coverage.


 $ll -h
 -rw-r--r-- 1 user group 54555856768 Apr 27 13:58 ERR262997_1.fastq.gz
 -rw-r--r-- 1 user group 55794552881 Apr 27 13:58 ERR262997_2.fastq.gz

55 GB each fastq file! We need to unzip it anyway

 $gunzip *.gz
 $ll -h
 -rw-r--r-- 1 user group 173198473181 Apr 27 13:58 ERR262997_1.fastq
 -rw-r--r-- 1 user group 173198473181 Apr 27 13:58 ERR262997_2.fastq


162GB each after unzipping. I am not sure I can handle something like this big.

Read length is 101 while genome size of hg19  is 3,000,000,000

to get a 10x coverage, we need:



 $echo $(( (10*3000000000)/(101*2) ))
 148514851


About 148 million reads, from both pair-end files. Here comes the last and critical step.
Again, I am not sure if it is possible to "paste" two 162GB files.



for i in 10 20 40
do
num_reads=$(( (${i}*3000000000)/(101*2) ))
paste ERR262997_1.fastq ERR262997_2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");}  else { printf("\t\t");} }' | shuf  | head -n ${num_reads}| sed 's/\t\t/\n/g' | awk '{print $1 >  "ERR262997_${i}x_1.fastq"; print $2 > "ERR262997_${i}x_2.fastq"}'
done


Now it is time to cross fingers ...

[update]

It does not work. This morning I got this error:

"shuf: memory exhausted".

I have to split the 162GB fastq file firstly. Number of lines is 2572389100, divided by 20, we got  128619455.



$splitSize=128619455
$split -l $splitSize -a 4 -d ERR262997_1.fastq output/ERR262997_1.
$split -l $splitSize -a 4 -d ERR262997_2.fastq output/ERR262997_2.

For each file fragment, applying above script then merge/cat the results into one fastq.

################################################

mkdir sample final
splitSize=128619455

split -l $splitSize -a 4 -d ERR262997_1.fastq output/ERR262997_1.
split -l $splitSize -a 4 -d ERR262997_2.fastq output/ERR262997_2.

for i in 10 20 40
do
num_reads=$(( (${i}*3000000000)/(101*2*20) ))

for j in $(seq 0 19)
do
k=$(printf %04d $j)

x=sample/ERR262997_${i}x_1.${k}
y=sample/ERR262997_${i}x_2.${k}
echo $x $y
paste ERR262997_1.${k} ERR262997_2.${k} | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf  | head -n ${num_reads}| sed 's/\t\t/\n/g' | awk -v f=$x -v r=$y '{print $1 > f; print $2 > r}'
done

cat `ls sample/ERR262997_${i}x_1.*` > final/ERR262997_${i}x_1.fastq
cat `ls sample/ERR262997_${i}x_2.*` > final/ERR262997_${i}x_2.fastq

done