Wednesday, December 29, 2010

Short Read Micro re-Aligner

http://sourceforge.net/apps/mediawiki/srma/index.php?title=Main_Page

#sync

git clone git://srma.git.sourceforge.net/gitroot/srma/srma srma

cd srma

cd c-code

#samtools must be present as a subfolder to compile srma.

svn co https://samtools.svn.sourceforge.net/svnroot/samtools/trunk/samtools samtools

cd samtools;make

cd ..

sh autogen.sh && ./configure && make

#if goes well, you will see a binary "srma"

./srma

Friday, December 10, 2010

Common indicators of variant quality

– Number of variants

– Transition/transversion ratio

– Hardy-Weinberg Equilibrium violations

- Presence in variant databases

– % in dbSNP build

– Concordance to Hapmap (II, II+III consensus, III)

– Concordance to validation data (i.e. array-based genotyping)

- Visualization (IGV*)

– Examination of alignments, local sequence context, error covariates, etc.

7550 Benchmark

/DataDisk/Analysis/Sun/7550R/simu

novoalign -o SAM $'@RG\tID:YING\tPL:ILLUMINA\tLB:LB_TEST\tSM:7550R\tCN:HCI' -F ILMFQ -d hg19all.nix -f 7550.chr20.1.fq 7550.chr20.2.fq >7550.chr20.sam 2>7550.chr20.log

jj $picard/SortSam.jar INPUT=7550.chr20.sam OUTPUT=7550.chr20.sorted.bam SO=coordinate TMP_DIR=.

jj $picard/ValidateSamFile.jar I=7550.chr20.sorted.bam O=7550.chr20.validation

#"more 7550.chr20.validation" -> no errors found

jj $picard/BuildBamIndex.jar I=7550.chr20.sorted.nodup.bam O=7550.chr20.sorted.nodup.bam.bai

jj $picard/MarkDuplicates.jar I=7550.chr20.sorted.bam O=7550.chr20.sorted.nodup.bam M=7550.chr20.duplicate

gatk -T RealignerTargetCreator -R chr20.fa -I 7550.chr20.sorted.nodup.bam -D dbsnp_131_chr20.rod -o 7550.chr20.intervals

gatk -T IndelRealigner -R chr20.fa -I 7550.chr20.sorted.nodup.bam -targetIntervals 7550.chr20.intervals -o 7550.chr20.realigned.bam

gatk -T CountCovariates -R chr20.fa -l INFO -D dbsnp_131_chr20.rod -I 7550.chr20.realigned.bam -cov ReadGroupCovariate -cov QualityScoreCovariate -cov MappingQualityCovariate -recalFile 7550.chr20.recal_data.csv

gatk -T TableRecalibration -R chr20.fa -l INFO -I 7550.chr20.realigned.bam -recalFile 7550.chr20.recal_data.csv -o 7550.chr20.realigned.recal.bam

jj $picard/FixMateInformation.jar I=7550.chr20.realigned.recal.bam O=7550.chr20.realigned.recal.fixed.bam SO=coordinate VALIDATION_STRINGENCY=SILENT TMP_DIR=.

gatk -T UnifiedGenotyper -R chr20.fa -I 7550.chr20.realigned.recal.fixed.bam -D dbsnp_131_chr20.rod -o 7550.chr20.snps.raw.vcf -stand_call_conf 50.0 -stand_emit_conf 20.0 -dcov 300

Thursday, December 9, 2010

HapMap

The HapMap is a catalog of common genetic variants that occur in human beings. It describes what these variants are, where they occur in our DNA, and how they are distributed among people within populations and among populations in different parts of the world. The International HapMap Project is not using the information in the HapMap to establish connections between particular genetic variants and diseases. Rather, the Project is designed to provide information that other researchers can use to link genetic variants to the risk for specific illnesses, which will lead to new methods of preventing, diagnosing, and treating disease.
The DNA in our cells contains long chains of four chemical building blocks -- adenine, thymine, cytosine, and guanine, abbreviated A, T, C, and G. More than 6 billion of these chemical bases, strung together in 23 pairs of chromosomes, exist in a human cell. (See http://www.dnaftb.org/dnaftb/ for basic information about genetics.) These genetic sequences contain information that influences our physical traits, our likelihood of suffering from disease, and the responses of our bodies to substances that we encounter in the environment.
The genetic sequences of different people are remarkably similar. When the chromosomes of two humans are compared, their DNA sequences can be identical for hundreds of bases. But at about one in every 1,200 bases, on average, the sequences will differ (Figure 1). One person might have an A at that location, while another person has a G, or a person might have extra bases at a given location or a missing segment of DNA. Each distinct "spelling" of a chromosomal region is called an allele, and a collection of alleles in a person's chromosomes is known as a genotype.
Differences in individual bases are by far the most common type of genetic variation. These genetic differences are known as single nucleotide polymorphisms, or SNPs (pronounced "snips"). By identifying most of the approximately 10 million SNPs estimated to occur commonly in the human genome, the International HapMap Project is identifying the basis for a large fraction of the genetic diversity in the human species.
For geneticists, SNPs act as markers to locate genes in DNA sequences. Say that a spelling change in a gene increases the risk of suffering from high blood pressure, but researchers do not know where in our chromosomes that gene is located. They could compare the SNPs in people who have high blood pressure with the SNPs of people who do not. If a particular SNP is more common among people with hypertension, that SNP could be used as a pointer to locate and identify the gene involved in the disease.
However, testing all of the 10 million common SNPs in a person's chromosomes would be extremely expensive. The development of the HapMap will enable geneticists to take advantage of how SNPs and other genetic variants are organized on chromosomes. Genetic variants that are near each other tend to be inherited together. For example, all of the people who have an A rather than a G at a particular location in a chromosome can have identical genetic variants at other SNPs in the chromosomal region surrounding the A. These regions of linked variants are known as haplotypes.

In many parts of our chromosomes, just a handful of haplotypes are found in humans. [See The Origins of Haplotypes.] In a given population, 55 percent of people may have one version of a haplotype, 30 percent may have another, 8 percent may have a third, and the rest may have a variety of less common haplotypes. The International HapMap Project is identifying these common haplotypes in four populations from different parts of the world. It also is identifying "tag" SNPs that uniquely identify these haplotypes. By testing an individual's tag SNPs (a process known as genotyping), researchers will be able to identify the collection of haplotypes in a person's DNA. The number of tag SNPs that contain most of the information about the patterns of genetic variation is estimated to be about 300,000 to 600,000, which is far fewer than the 10 million common SNPs.

Once the information on tag SNPs from the HapMap is available, researchers will be able to use them to locate genes involved in medically important traits. Consider the researcher trying to find genetic variants associated with high blood pressure. Instead of determining the identity of all SNPs in a person's DNA, the researcher would genotype a much smaller number of tag SNPs to determine the collection of haplotypes present in each subject. The researcher could focus on specific candidate genes that may be associated with a disease, or even look across the entire genome to find chromosomal regions that may be associated with a disease. If people with high blood pressure tend to share a particular haplotype, variants contributing to the disease might be somewhere within or near that haplotype.

Transition and transcersion

DNA substitution mutations are of two types. Transitions are interchanges of two-ring purines (A G) or of one-ring pyrimidines (C T): they therefore involve bases of similar shape. Transversions are interchanges of purine for pyrimidine bases, which therefore involve exchange of one-ring and two-ring structures.

Although there are twice as many possible transversions, because of the molecular mechanisms by which they are generated, transition mutations are generated at higher frequency than transversions. As well, transitions are less likely to result in amino acid substitutions (due to "wobble"), and are therefore more likely to persist as "silent substitutions" in populations as single nucleotide polymorphisms (SNPs).

Wednesday, December 8, 2010

make a simulation for 7550R

#train parameters from the raw fastq file (101bp, pair-end)

>maq simutrain 7550.simupars.dat 7550X1_101122_SN141_0314_B80R4NABXX_1_1.txt

#we want to make test on hg19 Chr20 only. Chr20 has ~63 million(63025520) bp , to get a 30x coverage with 101bp read length, we need 30*63000000/101 = 18712871 reads. (rounds to 19 million reads) .

>maq simulate -d 200 -s 20 -N 19000000 7550.chr20.1.fq 7550.chr20.2.fq chr20.fa 7550.simupars.dat > 7550.chr20.snps

#do alignment

>novoalign -o SAM $'@RG\tID:YING\tPL:ILLUMINA\tLB:LB_TEST\tSM:6966R\tCN:HCI' -F ILMFQ -d chr20.ndx -f 7550.chr20.1.fq 7550.chr20.2.fq > 7550.chr20.sam 2>7550.chr20.log

Monday, December 6, 2010

VCF

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40

clustered SNP

clustered SNP calls are suspicious and are often caused by indels. For example, finding 3 SNPs in a 10bp window.

Thursday, December 2, 2010

Read group



For example, to create a paired end same file with fully formatted read groups, you can run:

bwa sampe -i MARK -m DEPRISTO -l LIBRARY -p ILLUMINA -h Homo_sapiens_assembly18.fasta first.fastq.sai secondfastq.sai first.fastq second.fastq > pe.sam
This will result in a file that looks like:

@HD     VN:1.0  SO:unsorted
@SQ     SN:chrM LN:16571
@SQ     SN:chr1 LN:247249719
@SQ     SN:chr2 LN:242951149
[More @SQ lines removed for clarity]
@RG     ID:MARK SM:DEPRISTO     LB:LIBRARY      PL:ILLUMINA
61CC3AAXX100125:6:48:11631:21212        81      chrM    8       37      76M     chr17   21944837        0       GGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGAT    99:<->=D>IF7.:7,H

Tuesday, November 30, 2010

dbSNP file naming convention

>ls ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606

dr-xr-xr-x 2 ftp anonymous 4096 Feb 28 2010 1000Genomes

dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 14:52 ASN1_bin

dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 14:55 ASN1_flat

dr-xr-xr-x 2 ftp anonymous 4096 Nov 29 16:39 ASN1_flat_b131

dr-xr-xr-x 3 ftp anonymous 4096 Feb 28 2010 GWAS_arrays

lr--r--r-- 1 ftp anonymous 17 Oct 28 20:17 README_human_9606_B132.txt -> /snp/00readme.txt

dr-xr-xr-x 3 ftp anonymous 4096 Sep 30 18:55 VCF

dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 15:07 XML

dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 15:08 chr_rpts

dr-xr-xr-x 7 ftp anonymous 4096 Sep 28 20:29 database

dr-xr-xr-x 2 ftp anonymous 167936 Nov 3 15:33 gene_report

dr-xr-xr-x 2 ftp anonymous 4096 Feb 28 2010 genome_report

dr-xr-xr-x 3 ftp anonymous 4096 Oct 14 17:39 genotype

dr-xr-xr-x 27 ftp anonymous 4096 Oct 14 18:20 genotype_by_gene

dr-xr-xr-x 3 ftp anonymous 4096 Feb 28 2010 haplotypes

dr-xr-xr-x 5 ftp anonymous 4096 Nov 15 19:05 misc

dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 15:14 rs_fasta

dr-xr-xr-x 15 ftp anonymous 4096 Apr 5 2010 ss_fasta

dr-xr-xr-x 2 ftp anonymous 45056 Nov 17 21:14 viewBatch

****************************************************

example:

14-12162-MKK.vcf.gz

14 -> Chromosome 14

12162 -> dbSNP population ID (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?pop=12162)

MKK -> Three letter population identifier (http://ccr.coriell.org/sections/collections/NHGRI/?SsId=11)

Friday, November 12, 2010

What is base calling?

Base-calling converts raw or processed data from a sequencing instrument into sequences and quality scores.

Base-calling usually refers to the conversion of intensity data into sequences and quality scores. Intensity information is extracted from images by the image analysis.

Base-calling has two aspects: Identifying the base-call and assigning a confidence estimate to the call

Quality scores quantify the probability that a base-call is correct (or wrong)

Base quality scores: Individual bases have quality scores which reflect the likelihood of the base being correct/incorrect

Alignment scores: Probability than an alignment to a given position in the reference genome is correct

Allele scores, SNP scores: Probability that a given allele, SNP was observed (often conditional on the alignment being correct)

Base and alignment scores: are single read scores; SNP scores are consensus scores

Consensus calls use information from multiple reads

Phred scores: A base quality score assigned by the phred software (or a program based on the phred which is the most prominent base-calling program for capillary sequencing.)

A quality score expressed on a logarithmic scale:

Q = -10 log10( probability of an error )

Example: Q20 = 1% error probability

The Phred method assigns quality scores to a base-call based on observed properties of the base (predictors)

Phred is a two-step process:

1. Training: Given a set of reads, labels as to which bases are correct, and a set of quality statistics for each base, produce a model that can predict error rates for unseen bases

2. Application: Given new reads and quality statistics, predict the quality for each of the bases.

Phred is essentially a big lookup table!

What goes into a quality table?

Quality predictors are numbers correlated with the quality of a base call, and attempt to quantify concepts such as:

–

“Is the signal for the called base much brighter than the others?”

–

“Did the spot get suspiciously dim, compared to the beginning?”

–

“Does the signal look clean in the next few cycles, and the previous few cycles?

A training data set

–

One of the fundamental questions is: How similar is the training data set to the data set the phred table is applied? Does it make sense for my data?

–

Run-to-run variance, type of sample etc.; need diversity in training

An alignment method and reference

–

phred is inherently based on the assumption that alignments are correct andthe reference is well known

–

Makes it intrinsically very hard to provide accurate scores for low quality bases!

–

Need high-quality references for training –what are the highest qualities we can get for references?

Monday, November 8, 2010

Pileup format

samtools

chr1 808922 G A 21 116 53 9 aAaaaaaAA 9=;:@?9:,

chr1 ---> name of this sequence is chr1

808922 ---> coordinate is 808922 (1-based)

G ---> reference base is 'G'

A ---> consensus base is 'A'

21 ---> consensus quality is 21

116 ---> SNP quality is 116

53 ---> maximum mapping quality is 53

9 ---> the number of reads covering the site is 9

aAaaaaaAA ---> read bases are 'aAaaaaaAA'

9=;:@?9:, --->base qualities are '9=;:@?9:,'

consensus quality (21) is the Phred-scaled probability that the consensus is wrong. SNP quality (116) is the Phred-scaled probability that the consensus is identical to the reference.

For SNP calling, SNP quality is of more importance.

For example, to further filter the SNP calls:

samtools.pl varFilter raw.pileup | awk '$6>=20' > final.pileup

SNP call

novoalign -o SAM -F ILMFQ -t120 -r0.2 -q5 -d hg19.ndx -f 7746x2_s_6_1_sequence.txt.gz 7746x2_s_6_2_sequence.txt.gz | grep chr > 7746x2.sam

samtools view -bS 7746x2.sam | samtools sort - 7746x2

samtools rmdup 7746x2.bam 7746x2.sorted.bam

samtools index 7746x2.sorted.bam

samtools view -u -q 20 7746x2.sorted.bam | samtools pileup -vcf /DataDisk/Analysis/Sun/fa/hg19/chr1.fa - | samtools.pl varFilter -D 100 -G 50 > 1.var

awk '($3=="*"&&$6>=50)||($3!="*"&&$6>30)' 1.var > 1.var.filtered

Tuesday, November 2, 2010

delete lines from a file

sed -e '2d' hello.txt # delete line #2 from hello.txt

sed -e '1,36d' hello.txt # delete line 1-36 from hello.txt

awk '!(NR>=1 && NR<=36){print $0}' hello.txt # delete line 1-36 from hello.txt

soft clipped

Clipped alignment. In Smith-Waterman alignment, a sequence may not be aligned from the first residue to the last one. Subsequences at the ends may be clipped off. We introduce operation ʻSʼ to describe (softly) clipped alignment. Here is an example. Suppose the clipped alignment is:

REF:AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGATCGACTGTCAGCTAGTCAGACTAGTCGATCGATGTG

READ: gggGTGTAACC-GACTAGgggg

where on the read sequence, bases in uppercase are matches and bases in lowercase are clipped off. The CIGAR for this alignment is: 3S8M1D6M4S.

Monday, October 18, 2010

A SNP Call pipeline

SNP and MicroIndel detection

Novoalign detects mismatches and short indels in short reads from next-generation sequencing platforms. The recipes below contain command line usage to generate list of single nucleotide polymorphisms SNPs and Indels for further analysis.

Generate sorted alignments in BAM format

Inititally we need to align our file(s) to the reference genome:

novoalign -d hg18.nix -f reads1.fastq -o SAM 2> stats.txt > novo.sam

samtools view -bS novo.sam | samtools sort - reads1

For paired-end or mate-pair reads adjust the parameters accordingly. If we have multiple alignment files these all need to be sorted individually and merged with either samtools or Picard.

Remove PCR duplicates

It is highly recommended to remove PCR duplicates before proceeding with the SNP and Indel calls. We use samtools as an example to remove PCR duplicates.

#remove PCR duplicates from single-end reads

samtools rmdup -s reads1.bam reads1.sorted.bam

#For Paired reads

samtools rmdup reads1.bam reads1.pe.sorted.bam

SNP and Microindel calling

In this example we use the samtools pileup SNP caller and accept reads with a minimum mapping quality of 20:

#The hg18.fasta reference genome sequence is required. samtools.pl is also required

samtools view -u -q 20 reads1.bam | samtools pileup -vc -f hg18.fasta - | samtools.pl varFilter -D 1000 -G 50 > variations.list

Filtering and final list generation

A simple awk script can be used to filter our list of variations where the indel SNP quality is greater than 50 or 60 respectively:

awk '($3=="*"&&$6>=50)||($3!="*"&&$6>30)' variations.list > variations.filtered.list

Exporting to other formats

The samtools distribution contains a helper script called sam2vcf.pl that may be used to convert to the Variant Call Format (VCF):

sam2vcf.pl variations.filtered.list -r hg18.fasta > variations.vcf

Thursday, October 7, 2010

Walltime estimation

Input: ~2.5G *2 (pair-end)

Genome: ~4G

250MB reads and 4GB genome per node: ~4 hours

Tuesday, September 21, 2010

Simulate data for pipeline

Organsim, CE6

#download

wget http://sourceforge.net/projects/maq/files/maq-data/20080929/calib-36.dat.gz/download

#unzip

gzip -d calib-36.dat.gz

#simulate reads

maq simulate ce6_1.fq ce6_2.fq ce6.fa calib-36.dat

#make indexed genome

novoindex ce6.ndx ce6.fa

#Align reads to ce6 genome

time novoalign -d ce6.ndx -d ce6_1.fq ce6_2.fq | grep chr > A.ce6.nal

real 1m43.108s

user 6m39.372s

sys 0m4.652s

#Format to MAQ

novo2maq A.map - A.ce6.nal

#Convert the reference sequences to the binary fasta format

#maq fasta2bfa ref.fasta ref.bfa

#Build the mapping assembly

maq assemble A.cns ref.bfa A.map 2>assemble.log

#Extract consensus sequences and qualities

maq cns2fq A.cns >A.cns.fq

#Extract list of SNPs

maq cns2snp A.cns >A.snp

Wednesday, September 15, 2010

update potato source code

#!/bin/sh

ssh hiseq@hci-bio2.hci.utah.edu "rsync -arue ssh ying@155.100.235.73:~/workspace/potato/src/*.py potato"

rsync -arue ssh hiseq@hci-bio2.hci.utah.edu:~/potato/*.py .

Tuesday, September 14, 2010

delete all jobs with name=hello

qstat -au u0592675 | grep 'hello' | awk '{print $1}' | awk -F '.' '{print $1}' | xargs qdel

Friday, September 10, 2010

Supported Genome Builds

Human (hg19, Feb_2009, GRCh37)

single-end, HiSeq2000, Rong Mao Lab

Human (hg18, Mar_2006, NCBI Build 36.1)

DNA-methylation, pair-end, Hiseq2000, Brad Cairns Lab

Chip-seq, single-end, GAIIx, Brad Cairns Lab

C_elegans (CE6, May_2008, WS190)

Small RNA sequencing, single-end, GAIIx, Brenda Bass Lab

A_Thaliana (TAIR8, Mar_2008)

Single-end, HiSeq2000, Jason Stajich Lab

M_musculus (mm9, Jul_2007, NCBI Build 37)

ChIP-Seq, single-end, HiSeq2000, Anne Moon Lab

DNA-methylation, pair-end, HiSeq2000, Brad Cairns Lab

D_rerio (Zv8, Jun_2008)

ChIP-Seq, single-end, HiSeq2000, Brad Cairns Lab

Friday, September 3, 2010

No available computational resource at all

1. I submitted a PBS job last Friday which only asked one node, one CPU and five seconds of walltime, now seven days passed, the job is still in the waiting queue. I can not make any tests on the cluster at all.

2. I tried to use Amazon's Cloud Server to do the alignment. However, to do so will require the purchasing of this service. We do not have a public account.

Torque and S3cmd

tmp> wget http://www.clusterresources.com/downloads/torque/torque-2.5.1.tar.gz

tmp> tar -zxvf torque-2.5.1.tar.gz

tmp> cd torque-2.5.1

torque-2.5.1> ./configure && make && make install

torque-2.5.1> ./torque-package-mom-linux-x86_64.sh --install

torque-2.5.1> ./torque-package-client-linux-x86_64.sh --install

torque-2.5.1> cp contrib/init.d/pbs_mom /etc/init.d/pbs_mom

torque-2.5.1> update-rc.d pbs_mom defaults

Modify "/etc/ld.so.conf"

###############################

include /etc/ld.so.conf.d/*.conf

/usr/local/lib

###############################

>ldconfig

torque-2.5.1> ./torque.setup root

initializing TORQUE (admin: root@ubuntu.localdomain)

Max open servers: 4

set server operators += root@ubuntu.localdomain

Max open servers: 4

set server managers += root@ubuntu.localdomain

torque-2.5.1> qmgr -c 'p s'

# Create queues and set their attributes.

# Create and define queue batch

create queue batch

set queue batch queue_type = Execution

set queue batch resources_default.nodes = 1

set queue batch resources_default.walltime = 01:00:00

set queue batch enabled = True

set queue batch started = True

# Set server attributes.

set server scheduling = True

set server acl_hosts = ubuntu

set server managers = root@ubuntu.localdomain

set server operators = root@ubuntu.localdomain

set server default_queue = batch

set server log_events = 511

set server mail_from = adm

set server scheduler_iteration = 600

set server node_check_rate = 150

set server tcp_timeout = 6

set server mom_job_sync = True

set server keep_completed = 300

$TORQUEHOME is /var/spool/torque/

S3CMD

>wget http://sourceforge.net/projects/s3tools/files/s3cmd/0.9.9.91/s3cmd-0.9.9.91.tar.gz/download

>sudo python setup.py install

>s3cmd --configure

Enter new values or accept defaults in brackets with Enter.

Refer to user manual for detailed description of all options.

Access key and Secret key are your identifiers for Amazon S3

Access Key: 1qaz2wsx

Secret Key: 1qaz2wsx

Encryption password is used to protect your files from reading

by unauthorized persons while in transfer to S3

Wednesday, September 1, 2010

Amazon Cloud

Amazon S3 is a reasonably priced data storage service. Ideal for off-site backups, archiving and other data storage needs. Check out About Amazon S3 section to find out more.

S3cmd is a command line tool for uploading, retrieving and managing data in Amazon S3. It is best suited for power users who don't fear command line. It is also ideal for scripts, automated backups triggered from cron, etc.

Tuesday, August 31, 2010

Design of the pipeline

Message queue centralized system and coupled with independent I/O Channel

Friday, August 27, 2010

Design of the pipeline

what is the hardware environment?

Tuesday, August 24, 2010

Script mapping

#alignment
novoalign -F ILMFQ -t120 -r0.2 -q5 -d hg19Splices34.index -f 1.txt 2.txt > 84060.novoalign

#transform to MAQ
maq novo2maq 84060.novoalign.map - 84060.novoalign

#Convert the reference sequences to the binary fasta format
maq fasta2bfa ref.fasta ref.bfa

#Build the mapping assembly
maq assemble consensus.cns ref.bfa 84060.novoalign.map 2>assemble.log

#Extract consensus sequences and qualities
maq cns2fq consensus.cns >cns.fq

#Extract list of SNPs
maq cns2snp consensus.cns >cns.snp

will be translated to :

/PATH_TO_NONOALIGN/novoalign -F ILMFQ -t120 -r0.2 -q5 -d /PATH_TO_GENOME_INDEX/hg19Splices34.index -f /PATH_TO_INPUT_FILE/1.txt /PATH_TO_INPUT_FILE/2.txt > /PATH_TO_OUTPUT_FILE/1_84060.novoalign
/PATH_TO_MAQ/maq novo2maq /PATH_TO_OUTPUT_FILE/84060.novoalign.map - /PATH_TO_INPUT_FILE/84060.novoalign
...

Monday, August 23, 2010

query, download, upload and update with rsync


#!/home/ying/py2.6.5/bin/python
"""
Use rsync to synchronize the DATA_HOME(hci-bio2, which raw data was stored) and COMP_HOME(chpc-node, 
which process raw data and return output back to source. All the rsync operations are initiated from chpc-node.

Functions:
1. query: query the source if the new data is available
2. download: download new data ( from source to destination)
3. upload: upload result(output) to source.
4. update: update the source by appending a log file from the destination to show the alignment progress on chpc. "begin, percent-finished, end, error, etc"
      
"""
import os
import time
import operator
import subprocess

DATA_HOME = 'ying@123.123.123.123:~/alignment/' 
COMP_HOME = '/home/ying/alignment/' 
RSYNC_QUERY_PARAMS = '-are ssh'
RYSNC_DOWNLOAD_PARAMS = '-avuze ssh' #archive, verbose, update-only, compress-during-transfer and trasnfer-over-ssh  
RYSNC_UPLOAD_PARAMS = '-avuze ssh' #archive, verbose, update-only, compress-during-transfer and trasnfer-over-ssh  
LOG_FILE_NAME = 'log.txt'

def query():
    #query_cmd = """rsync -are ssh ying@155.100.235.73:~/alignment/ | awk '{print $1 " " $3"#"$4 " " $5}'"""
    cmd = ["rsync",RSYNC_QUERY_PARAMS,DATA_HOME,'|',"""awk '{print $1 " " $3"#"$4 " " $5}'"""]
    query_cmd = ' '.join(cmd)
    #print query_cmd
    p = subprocess.Popen(query_cmd,shell=True,stdout=subprocess.PIPE)
    line = p.communicate()
    fnames = ''.join(line[:-1]).strip().split('\n')
    df = {}
    dt = {}
    for fname in fnames:
        prop,ctime,fn = fname.split(' ')
        if fn != '.' and fn != '..':
            if prop[0] == 'd': #it is a directory
                pass
            if prop[0] == '-': #it is a file
                parent_path = os.path.dirname(fn)
                df.setdefault(parent_path,[]).append(os.path.basename(fn))
                #Transform time to seconds, for comparing which job should be run firstly. First In First Run.   
                #'2010/08/20#16:27:40' -> 1282343260.0
                dt[parent_path] = int(time.mktime(time.strptime(ctime,'%Y/%m/%d#%H:%M:%S'))) 
    jobs = []            
    for k,v in df.items():
        if "begin" in v: #the data is ready
            if not LOG_FILE_NAME in v: #this job is not running
                jobs.append((k,dt[k]))
    #print jobs       
    return sorted(jobs,key=operator.itemgetter(1))

def download(path_to_download):
    """download a whole directory from DATA_HOME to COMP_HOME 
    """
    normalized_path = path_to_download.strip('/') #/A, /A/ or A/ -> A
    source = os.path.join(DATA_HOME,normalized_path)
    cmd = ["rsync", RYSNC_DOWNLOAD_PARAMS, source, COMP_HOME] #download the whole directory to COMP_HOME.
    download_cmd = ' '.join(cmd)
    #print download_cmd
    p = subprocess.Popen(download_cmd,shell=True,stdout=subprocess.PIPE)
    line = p.communicate()
    stdout_message = ''.join(line[:-1]).strip().split('\n')
    #print 'Message:',stdout_message
    update(normalized_path,stdout_message)

def upload(path,file_to_upload):
    """upload a file from COMP_HOME to DATA_HOME
     upload 'file_to_upload' at local path 'folder'.
    :/folder/file_to_upload will be uploaded to :/
    """
    normalized_path = path.strip('/')+"/" #/A, /A/ or A/ -> A/
    source = os.path.join(COMP_HOME,normalized_path,file_to_upload) 
    dest = os.path.join(DATA_HOME,normalized_path)
    cmd = ["rsync",RYSNC_UPLOAD_PARAMS, source,dest]
    upload_cmd = ' '.join(cmd)
    p = subprocess.Popen(upload_cmd,shell=True,stdout=subprocess.PIPE)
    line = p.communicate()
    stdout_message = ''.join(line[:-1]).strip().split('\n')
    #fn_log = log(os.path.join(COMP_HOME,normalized_src),stdout_message)
    update(normalized_path,stdout_message)

def update(path,message):
    fn = os.path.join(COMP_HOME,path,LOG_FILE_NAME)
    ofn = None
    try:
        if os.path.exists(fn): 
            ofn = open(fn,'w+')#append 
        else:
            ofn = open(fn,'w') #new file
        print >>ofn,'-'*50
        print >>ofn,time.asctime()
        for m in message:
            print >>ofn,M
        ofn.close()
        normalized_path = path.strip('/')+"/" #/A, /A/ or A/ -> A/
        source = fn 
        dest = os.path.join(DATA_HOME,normalized_path)
        cmd = ["rsync",RYSNC_UPLOAD_PARAMS, source,dest]
        subprocess.Popen(cmd,shell=False,stdout=subprocess.PIPE)
    except:
        pass
        
def clear():
    pass
#print query()
#download('A')
upload('A','hello.txt')

Thursday, August 19, 2010

Failed to run RSA key on HCI-BIO2

Password is required.

How to: Make a passwordless ssh connection

How to: Make a passwordless ssh connection between chpc(hello@delicatearch.chpc.utah.edu) and hci(world@hci-bio1.hci.utah.edu)

Suppose after we log into chpc , we want to "ssh" to hci without password.

1. Create RSA keys (on chpc).

hello@chpc:MY_HOME>ssh-keygen -t rsa

Generating public/private rsa key pair.

Enter file in which to save the key ($HOME/.ssh/id_rsa):

Enter passphrase (empty for no passphrase): MY_PASSPHRASE

Enter same passphrase again:MY_PASSPHRASE

Your identification has been saved in $HOME/.ssh/id_rsa.

Your public key has been saved in $HOME/.ssh/id_rsa.pub.

The key fingerprint is:

ad:9e:ab:f9:e1:1a:a4:85:16:3b:24:5f:35:b6:76:a7 user@machine

MY_PASSPHRASE is password used to encrypt the keys, not the user's passworld.

Now we have two files under $HOME/.ssh, "id_rsa" is private key, "id_rsa.pub" is public key.

2. Transfer the public key to world@hci's home directory (on chpc)

hello@chpc:$HOME>scp .ssh/id_rsa.pub world@hci-bio1.hci.utah.edu:~/

3. Append the id_rsa.pub to aothorized keys (on hci)

world@hci:$HOME>cat id_rsa.pub >>authorized_keys

4. Make the hci accept RSA key style connection.

By default, this features is off.

We need to modify the following line in /etc/ssh/sshd_config (root privilege is required)

#RSAAuthentication yes

#PubkeyAuthentication yes

RSAAuthentication yes

PubkeyAuthentication yes

Then we need to refresh the ssh service.

>/etc/init.d/sshd restart

5. Make a test (on chpc)

>ssh -2 world@hci-bio1.hci.utah.edu

Enter passphrase for key '$HOME/.ssh/id_rsa':

Here we need to input the passphrase of the private key, it is . If everything goes well, we will see the welcome message:

Welcome to Ubuntu!

6. Now we need to eliminate the "passphrase" step using ssh-agent and ssh-add (on chpc)

ssh-agent are used to buffer the passphrase and keep it in memory, we we do not need input passphrase next time.

#start ssh-agent

>eval `ssh-agent`

#add passphrase

>ssh-add

Enter passphrase for chpc:$HOME/.ssh/id_rsa: MY_PASSPHRASE

Identity added: chpc:$HOME/.ssh/id_rsa

7. Make a test again (on chpc)

>ssh -2 world@hci-bio1.hci.utah.edu

Welcome to Ubuntu!

8. we can test scp (on chpc)

>scp hello.txt world@hci-bio1.hci.utah.edu:~/

To avoid run "eval `ssh-agent`" and "ssh-add" every time after we log into the chpc, we can append eval `ssh-agent` to ~/.bash_profile. So it will start automatically next time. However, we still need to run 'ssh-add' manually after each log in to register the passphrase for security. if passphrase is empty, we do not even need ssh-agent and ssh-add. But this may bring security risk to the private key.

we can also use "keychain" as the frontend of ssh-agent, so we do not have to create ssh-agent for each login. With keychain, only one ssh-agent is in service no matter how many consoles we open.

>wget http://www.funtoo.org/archive/keychain/keychain-2.7.1.tar.bz2

>tar -jxvf keychain-2.7.1.tar.bz2

>cd keychain-2.7.1

>./keychain

Wednesday, August 18, 2010

ssh connection from chpc to bio1

Suppose we want to passwordless ssh connection from machine A(chpc) to machine B(bio1)

1. On machine B

modify the following line in /etc/ssh/sshd_config

(root privilege is required)

#RSAAuthentication yes

#PubkeyAuthentication yes

#AuthorizedKeysFile .ssh/authorized_keys

RSAAuthentication yes

PubkeyAuthentication yes

AuthorizedKeysFile .ssh/authorized_keys

reload configuration

#/etc/init.d/ssh reload

2. On machine A

$ssh-keygen -t rsa

Generating public/private rsa key pair.

Enter file in which to save the key (/MY_HOME/.ssh/id_rsa):

Enter passphrase (empty for no passphrase): THIS_IS_MY_PASSWORD

Enter same passphrase again: THIS_IS_MY_PASSWORD

Your identification has been saved in /MY_HOME/.ssh/id_rsa.

Your public key has been saved in /MY_HOME/.ssh/id_rsa.pub.

The key fingerprint is:

ad:9e:ab:f9:e1:1a:a4:85:16:3b:24:5f:35:b6:76:a7 user@machine

$ls .ssh/

id_rsa id_rsa.pub known_hosts

id_rsa is my private key, and id_rsa.pub is the public key

$scp id_rsa.pub user@machineB:~/.ssh

3.On machine B

$cd .ssh/

$cat id_rsa.pub >>authorized_keys

Now make a test from machine A:

$ssh user@machineB

Enter passphrase for key '/MY_HOME/.ssh/id_rsa': THIS_IS_MY_PASSWORD

user@machineB$ #OK now we are using machineB.

Next, we use ssh-agent and ssh-add to avoid "Enter passphrase for key '/MY_HOME/.ssh/id_rsa'", avoid password completely.

Pipeline to run novoalign - 0818

Since we can run a deamon service at chpc, here is my idea to implement a pipeline

On HCI:
Setup rsync service on /home/alignment@hci-bio1. This directory will served as home to put all read files.

On CHPC:
A script running on chpc's interactive node (delicatearch) as deamon. This script will:

1. monitor /home/alignment/@hci-bio1 using rsync

2. if new files are found in that directory, download the file to chpc

3. split the read file into segments as jobs

4. submit jobs using PBS

5. wait for results by monitoring the local result directory e.g. /home/user/pid

6. re-submit jobs if necessary (some jobs may take exceptional long time to finish)

7. assemble result files

8. transfer result back to hci-bio1

Use case:

I have two files "A_1.fq.gz" and "A_2.fq.gz" from pair-end sequencing read. Now I want to align the read to human genome using novoalign.

1. Put "A_1.fq.gz" and "A_2.fq.gz" to /home/alignment/request/ @ hci-bio1

2. Create a file named "A.params" to list the parameters (gap penalty, genome, etc) for running novoalign.

3. Touch a dummy file named "A.ready". This file is to tell the daemon service on chpc that alignment request that starts with "A" are ready.

Once the job is done, I should be able to find my alignment at /home/alignment/response/

Now it looks like:

/home
|
|---alignment/
|
|---request/
| |
| |---A_1.fq.gz
| |
| |---A_2.fq.gz
| |
| |---A.params
| |
| |---A.ready
|
|---response/
|
|---A.result
|
|---other results