For testing purposes, we need to generate 10x, 20x, and 40x WGS coverage from sample NA12878 of the 1000 Genomes Project (1KG).
The raw fastq has ~643 million read pairs, which is equivalent to ~43x WGS coverage.
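A quick sanity check on that estimate (assuming 101 bp paired reads and a ~3 Gb genome):
$echo $(( (643000000*2*101)/3000000000 ))
43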
$ll -h
-rw-r--r-- 1 user group 54555856768 Apr 27 13:58 ERR262997_1.fastq.gz
-rw-r--r-- 1 user group 55794552881 Apr 27 13:58 ERR262997_2.fastq.gz
~55 GB for each gzipped fastq file! We need to unzip them anyway:
$gunzip *.gz
$ll -h
-rw-r--r-- 1 user group 173198473181 Apr 27 13:58 ERR262997_1.fastq
-rw-r--r-- 1 user group 173198473181 Apr 27 13:58 ERR262997_2.fastq
~162 GB each after unzipping. I am not sure I can handle files this big.
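(As an aside, if disk space is the real constraint, the decompression could be skipped entirely: paste reads fine from bash process substitution, e.g. paste <(zcat ERR262997_1.fastq.gz) <(zcat ERR262997_2.fastq.gz) | ... ; a sketch I have not timed on files this size.)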
The read length is 101 bp and the hg19 genome is ~3,000,000,000 bp. Since each read pair contributes 2 x 101 bp, to get 10x coverage we need:
$echo $(( (10*3000000000)/(101*2) ))
148514851
About 148.5 million read pairs, drawn in sync from the two paired-end files. Here comes the last and critical step. Again, I am not sure it is possible to "paste" two 162 GB files.
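The trick below flattens each read pair onto a single line, so shuf shuffles whole records instead of individual lines. A toy run on two hypothetical one-read fastq files shows what the paste/awk step produces (tabs shown as <tab>):
$printf '@r1\nACGT\n+\nIIII\n' > toy_1.fastq
$printf '@r1\nTGCA\n+\nJJJJ\n' > toy_2.fastq
$paste toy_1.fastq toy_2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }'
@r1<tab>@r1<tab><tab>ACGT<tab>TGCA<tab><tab>+<tab>+<tab><tab>IIII<tab>JJJJ
Mates are joined by a single tab and consecutive fastq lines by a double tab, which the sed 's/\t\t/\n/g' later undoes. Note that the final awk must split on tabs only (-F '\t'), because real fastq header lines contain spaces.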
for i in 10 20 40
do
num_reads=$(( (${i}*3000000000)/(101*2) ))
paste ERR262997_1.fastq ERR262997_2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf | head -n ${num_reads} | sed 's/\t\t/\n/g' | awk -F '\t' -v f=ERR262997_${i}x_1.fastq -v r=ERR262997_${i}x_2.fastq '{print $1 > f; print $2 > r}'
done
Now it is time to cross my fingers ...
[update]
It did not work. This morning I got this error:
"shuf: memory exhausted".
I have to split the 162 GB fastq files first. Each one has 2,572,389,100 lines; divided by 20 that would be 128,619,455 lines per chunk, but a fastq record is 4 lines and 128,619,455 is not a multiple of 4, which would break records across chunk boundaries. Round up to 128,619,456.
$mkdir output
$splitSize=128619456
$split -l $splitSize -a 4 -d ERR262997_1.fastq output/ERR262997_1.
$split -l $splitSize -a 4 -d ERR262997_2.fastq output/ERR262997_2.
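Since 128,619,456 is a multiple of 4, every chunk now starts on a record boundary. A quick check on the first chunk:
$wc -l output/ERR262997_1.0000
128619456 output/ERR262997_1.0000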
################################################
# split each fastq into 20 chunks, each holding whole 4-line records
mkdir -p output sample final
splitSize=128619456
split -l $splitSize -a 4 -d ERR262997_1.fastq output/ERR262997_1.
split -l $splitSize -a 4 -d ERR262997_2.fastq output/ERR262997_2.
for i in 10 20 40
do
# per-chunk quota: read pairs needed for ${i}x coverage, spread over the 20 chunks
num_reads=$(( (${i}*3000000000)/(101*2*20) ))
for j in $(seq 0 19)
do
k=$(printf %04d $j)
x=sample/ERR262997_${i}x_1.${k}
y=sample/ERR262997_${i}x_2.${k}
echo $x $y
paste output/ERR262997_1.${k} output/ERR262997_2.${k} | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf | head -n ${num_reads} | sed 's/\t\t/\n/g' | awk -F '\t' -v f=$x -v r=$y '{print $1 > f; print $2 > r}'
done
cat sample/ERR262997_${i}x_1.* > final/ERR262997_${i}x_1.fastq
cat sample/ERR262997_${i}x_2.* > final/ERR262997_${i}x_2.fastq
done
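A final sanity check (assuming the loop finished cleanly): count the read pairs that actually landed in each downsampled file. They should come out near 148.5M, 297M, and 594M for 10x, 20x, and 40x respectively.
for i in 10 20 40
do
echo "${i}x: $(( $(wc -l < final/ERR262997_${i}x_1.fastq) / 4 )) read pairs"
done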