Next Generation Sequencing and Data Analysis: September 2013

Based on the discussion from http://www.biostars.org/p/6544/

For testing purpose, we need to get 10x, 20x and 40x WGS coverage using sample NA12878 from 1KG.
The raw fastq has ~643 million paired reads, which is equivalent to ~43x WGS coverage.

$ll -h
-rw-r--r-- 1 user group 54555856768 Apr 27 13:58 ERR262997_1.fastq.gz
-rw-r--r-- 1 user group 55794552881 Apr 27 13:58 ERR262997_2.fastq.gz

55 GB each fastq file! We need to unzip it anyway

$gunzip *.gz
$ll -h
-rw-r--r-- 1 user group 173198473181 Apr 27 13:58 ERR262997_1.fastq
-rw-r--r-- 1 user group 173198473181 Apr 27 13:58 ERR262997_2.fastq

162GB each after unzipping. I am not sure I can handle something like this big.

Read length is 101 while genome size of hg19 is 3,000,000,000

to get a 10x coverage, we need:

$echo $(( (10*3000000000)/(101*2) ))
148514851

About 148 million reads, from both pair-end files. Here comes the last and critical step.
Again, I am not sure if it is possible to "paste" two 162GB files.

for i in 10 20 40
do
num_reads=$(( (${i}*3000000000)/(101*2) ))
paste ERR262997_1.fastq ERR262997_2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf | head -n ${num_reads}| sed 's/\t\t/\n/g' | awk '{print $1 > "ERR262997_${i}x_1.fastq"; print $2 > "ERR262997_${i}x_2.fastq"}'
done

Now it is time to cross fingers ...

[update]

It does not work. This morning I got this error:

"shuf: memory exhausted".

I have to split the 162GB fastq file firstly. Number of lines is 2572389100, divided by 20, we got 128619455.

$splitSize=128619455
$split -l $splitSize -a 4 -d ERR262997_1.fastq output/ERR262997_1.
$split -l $splitSize -a 4 -d ERR262997_2.fastq output/ERR262997_2.

For each file fragment, applying above script then merge/cat the results into one fastq.

################################################

mkdir sample final
splitSize=128619455

split -l $splitSize -a 4 -d ERR262997_1.fastq output/ERR262997_1.
split -l $splitSize -a 4 -d ERR262997_2.fastq output/ERR262997_2.

for i in 10 20 40
do
num_reads=$(( (${i}*3000000000)/(101*2*20) ))

for j in $(seq 0 19)
do
k=$(printf %04d $j)

x=sample/ERR262997_${i}x_1.${k}
y=sample/ERR262997_${i}x_2.${k}
echo $x $y
paste ERR262997_1.${k} ERR262997_2.${k} | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf | head -n ${num_reads}| sed 's/\t\t/\n/g' | awk -v f=$x -v r=$y '{print $1 > f; print $2 > r}'
done

cat `ls sample/ERR262997_${i}x_1.*` > final/ERR262997_${i}x_1.fastq
cat `ls sample/ERR262997_${i}x_2.*` > final/ERR262997_${i}x_2.fastq

done

Next Generation Sequencing and Data Analysis

Sunday, September 8, 2013

sampling(subset) NA12878

About Me

Blog Archive