Tuesday, August 13, 2013

Hard trimming FASTQ files

When we plot the base quality distribution (for example, with FastQC), we often observe that the base qualities at the 3' end of reads are pretty poor. Sometimes the qualities at the 5' end are poor as well.

For better alignment and downstream variant calling, we may need to trim the FASTQ files and keep only the bases from the middle cycles, which generally have good and consistent qualities.

One method is trimming by base quality, which can be done with tools like sickle. Alternatively, we can hard trim all reads, which simply means removing N tail (or head) bases and the corresponding quality characters.
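
To make the two options concrete, here is a naive Python sketch of quality-based tail trimming. This is not sickle's actual algorithm (sickle uses a sliding window), and the quality cutoff and Phred+33 encoding are assumptions:

def trim_tail_by_quality(seq, qual, cutoff=20, offset=33):
    # drop 3' bases while their quality stays below cutoff (Phred+33)
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < cutoff:
        end -= 1
    return seq[:end], qual[:end]

# e.g. trim_tail_by_quality('ACGTACGT', 'IIIIII##') -> ('ACGTAC', 'IIIIII')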

A typical FASTQ record:

@READ_NAME
NGGAAATGGCGTCTGGCGGCGAGATAATGG
+
#1=DDFFFHGHHHIIGIIJJJJJJIIJGJI

Let us say we have paired-end FASTQ files "A_1.fastq.gz" and "A_2.fastq.gz". Each record takes four lines (name, sequence, "+", qualities), so the sequence and quality strings always land on even line numbers; the awk one-liners below rely on this to trim both at once. The same commands apply to "A_2.fastq.gz".

#trim the 10 tail bases
zcat A_1.fastq.gz | awk --posix '{ if (NR % 2 == 0) { sub(/.{10}$/,""); print} else {print}}' | gzip > A_1.fq.gz

#trim the 10 head bases
zcat A_1.fastq.gz | awk --posix '{ if (NR % 2 == 0) { sub(/^.{10}/,""); print} else {print}}' | gzip > A_1.fq.gz

#trim the 10 head bases and 10 tail bases
zcat A_1.fastq.gz | awk --posix '{ if (NR % 2 == 0) { sub(/^.{10}/,""); sub(/.{10}$/,""); print} else {print}}' | gzip > A_1.fq.gz
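
The same hard trim can also be done with a small Python script. A minimal sketch, assuming well-formed 4-line records (the function name and defaults are mine, not from any library):

import gzip

def hard_trim(in_path, out_path, head=0, tail=0):
    # remove head leading and tail trailing bases (and the matching
    # quality characters) from every read in a gzipped FASTQ file
    fin = gzip.open(in_path, 'rb')
    fout = gzip.open(out_path, 'wb')
    for i, line in enumerate(fin):
        line = line.rstrip(b'\n')
        if i % 4 in (1, 3):  # line 2 = sequence, line 4 = quality
            line = line[head:len(line) - tail]
        fout.write(line + b'\n')
    fout.close()
    fin.close()

# trim the 10 head bases and 10 tail bases, like the last awk one-liner
hard_trim('A_1.fastq.gz', 'A_1.fq.gz', head=10, tail=10)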

Monday, August 5, 2013

Create some artificial SNPs on a reference genome
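
The script below reads a reference FASTA, substitutes a random alternative base roughly every "interval" positions (the gap between sites is drawn from a normal distribution), writes the mutated genome to stdout, and records the introduced sites in "snp.txt".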



import random

import numpy

# seed both RNGs so the mutated genome is reproducible
# (numpy.random drives the SNP positions, random drives the alt alleles)
random.seed(13115)
numpy.random.seed(13115)

def iter_chromosome(fasta):
    # yield (header, sequence) pairs from a FASTA file
    seq = []
    name = ''
    for line in open(fasta):
        line = line.strip()
        if line.startswith('>'):
            if seq:
                yield name, ''.join(seq)
                seq = []
            name = line
        else:
            seq.append(line)
    if seq:
        yield name, ''.join(seq)

def make_snp(ref):
    # pick a random alternative base different from the reference base
    x = ['A', 'G', 'C', 'T']
    x.remove(ref)
    return random.choice(x)

def generate_SNP(fasta, interval=1250):
    bases = ('A', 'G', 'C', 'T')  # skip N and soft-masked (lower-case) bases
    snp = []
    fsnp = open('snp.txt', 'w')
    for name, seq in iter_chromosome(fasta):
        dup = list(seq)
        num_required = len(dup) // interval
        # gaps between consecutive SNPs, drawn from a normal distribution
        # centered on interval with standard deviation interval/5
        gaps = [int(g) for g in numpy.random.normal(interval, interval / 5.0, num_required)]
        index = 0
        for g in gaps:
            index += g
            try:
                ref = dup[index - 1]
                if ref in bases:
                    alt = make_snp(ref)
                    dup[index - 1] = alt
                    snp.append('\t'.join((name.lstrip('>'), str(index), ref, alt)))
            except IndexError:
                break
        # write the mutated chromosome to stdout, 50 bases per line
        print(name)
        for i in range(0, len(dup), 50):
            print(''.join(dup[i:i + 50]))
    fsnp.write('\n'.join(snp) + '\n')
    fsnp.close()

fasta = 'hg19.fa'
generate_SNP(fasta, 250)
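
To run it, redirect stdout to capture the mutated reference, for example: python make_snp.py > hg19.mut.fa (the script and output names here are just examples). The list of introduced SNPs ends up in "snp.txt".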



Split the BAM into smaller BAM files by chromosome (then use Hadoop to accelerate the downstream analysis)

Suppose we have a single BAM file "x.sort.bam", the sorted output of an aligner such as BWA. Calling variants from it still requires many steps, and executing them one by one on a single server takes a long time. To speed up the downstream analysis, we can "divide and conquer": split the BAM by chromosome and process the pieces in parallel.

for i in `samtools view -H x.sort.bam | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
do

#remove unmapped (-F 0x4) and zero-MAPQ (-q 1) reads; output one BAM per chromosome
samtools view -bh -F 0x4 -q 1 -o x.$i.bam x.sort.bam $i

#index
samtools index x.$i.bam


#all other steps
java -jar GenomeAnalysisTK.jar \
-R hg19.fasta \
-T BaseRecalibrator \
-knownSites ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.norm.vcf \
-knownSites Mills_and_1000G_gold_standard.indels.b37.sites.norm.vcf \
-knownSites 1000G_phase1.indels.b37.norm.vcf \
-I x.$i.bam \
-o x.$i.grp

java -jar GenomeAnalysisTK.jar \
-R hg19.fasta \
-T PrintReads \
-BQSR x.$i.grp \
-I x.$i.bam \
-o x.$i.recal.bam

...

java -jar GenomeAnalysisTK.jar \
-R hg19.fasta \
-T UnifiedGenotyper --genotype_likelihoods_model BOTH \
-I x.$i.bam -rf BadCigar -dcov 500 -o x.$i.gatk.vcf
done
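
Note that the loop above still processes the chromosomes one at a time. On a single multi-core server, the per-chromosome work can already run concurrently. Here is a minimal Python sketch using multiprocessing; the pool size is arbitrary, the worker only shows the split-and-index step, and "x.sort.bam" is assumed to already have an index so region queries work:

import subprocess
from multiprocessing import Pool

BAM = 'x.sort.bam'

def chromosomes(bam):
    # parse chromosome names from the @SQ lines of the BAM header
    header = subprocess.check_output(['samtools', 'view', '-H', bam])
    names = []
    for line in header.decode().splitlines():
        if line.startswith('@SQ'):
            for field in line.split('\t'):
                if field.startswith('SN:'):
                    names.append(field[3:])
    return names

def process_chromosome(name):
    out = 'x.%s.bam' % name
    # split: keep mapped (-F 0x4), MAPQ >= 1 (-q 1) reads for this chromosome
    subprocess.check_call(['samtools', 'view', '-bh', '-F', '0x4', '-q', '1',
                           '-o', out, BAM, name])
    subprocess.check_call(['samtools', 'index', out])
    # ... the per-chromosome GATK steps would go here ...
    return out

if __name__ == '__main__':
    pool = Pool(8)  # one worker per core; adjust to the server
    print(pool.map(process_chromosome, chromosomes(BAM)))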

After indexing the split BAMs, we can let Hadoop take over the BAMs and run the remaining steps on the cluster.

for i in `samtools view -H x.sort.bam | awk -F"\t" '/@SQ/{print $2}' |  cut -d":" -f2`
do

#remove unmapped (-F 0x4) and zero-MAPQ (-q 1) reads; output one BAM per chromosome
samtools view -bh -F 0x4 -q 1 -o x.$i.bam x.sort.bam $i

#index
samtools index x.$i.bam

done

#upload the per-chromosome BAMs and their indexes into Hadoop (the ba? glob matches both .bam and .bai)
hadoop fs -put *chr*.ba? $hdfsInput

#run hadoop
hadoop jar xxx.jar MyPipeline $hdfsInput $hdfsOutput ...