Wednesday, December 29, 2010

install srma

Short Read Micro re-Aligner



#sync
git clone git://srma.git.sourceforge.net/gitroot/srma/srma srma
cd srma
cd c-code

#samtools must be present as a subfolder to compile srma.
svn co https://samtools.svn.sourceforge.net/svnroot/samtools/trunk/samtools samtools
cd samtools;make
cd ..
sh autogen.sh && ./configure && make

#if goes well, you will see a binary "srma"
./srma

Friday, December 10, 2010

Common indicators of variant quality

– Number of variants
– Transition/transversion ratio
– Hardy-Weinberg Equilibrium violations
- Presence in variant databases
– % in dbSNP build
– Concordance to Hapmap (II, II+III consensus, III)
– Concordance to validation data (i.e. array-based genotyping)
- Visualization (IGV*)
– Examination of alignments, local sequence context, error covariates, etc.

7550 Benchmark

/DataDisk/Analysis/Sun/7550R/simu

novoalign -o SAM $'@RG\tID:YING\tPL:ILLUMINA\tLB:LB_TEST\tSM:7550R\tCN:HCI' -F ILMFQ -d hg19all.nix -f 7550.chr20.1.fq 7550.chr20.2.fq >7550.chr20.sam 2>7550.chr20.log

jj $picard/SortSam.jar INPUT=7550.chr20.sam OUTPUT=7550.chr20.sorted.bam SO=coordinate TMP_DIR=.

jj $picard/ValidateSamFile.jar I=7550.chr20.sorted.bam O=7550.chr20.validation

#"more 7550.chr20.validation" -> no errors found

jj $picard/BuildBamIndex.jar I=7550.chr20.sorted.nodup.bam O=7550.chr20.sorted.nodup.bam.bai

jj $picard/MarkDuplicates.jar I=7550.chr20.sorted.bam O=7550.chr20.sorted.nodup.bam M=7550.chr20.duplicate

gatk -T RealignerTargetCreator -R chr20.fa -I 7550.chr20.sorted.nodup.bam -D dbsnp_131_chr20.rod -o 7550.chr20.intervals

gatk -T IndelRealigner -R chr20.fa -I 7550.chr20.sorted.nodup.bam -targetIntervals 7550.chr20.intervals -o 7550.chr20.realigned.bam

gatk -T CountCovariates -R chr20.fa -l INFO -D dbsnp_131_chr20.rod -I 7550.chr20.realigned.bam -cov ReadGroupCovariate -cov QualityScoreCovariate -cov MappingQualityCovariate -recalFile 7550.chr20.recal_data.csv

gatk -T TableRecalibration -R chr20.fa -l INFO -I 7550.chr20.realigned.bam -recalFile 7550.chr20.recal_data.csv -o 7550.chr20.realigned.recal.bam

jj $picard/FixMateInformation.jar I=7550.chr20.realigned.recal.bam O=7550.chr20.realigned.recal.fixed.bam SO=coordinate VALIDATION_STRINGENCY=SILENT TMP_DIR=.

gatk -T UnifiedGenotyper -R chr20.fa -I 7550.chr20.realigned.recal.fixed.bam -D dbsnp_131_chr20.rod -o 7550.chr20.snps.raw.vcf -stand_call_conf 50.0 -stand_emit_conf 20.0 -dcov 300

Thursday, December 9, 2010

HapMap

The HapMap is a catalog of common genetic variants that occur in human beings. It describes what these variants are, where they occur in our DNA, and how they are distributed among people within populations and among populations in different parts of the world. The International HapMap Project is not using the information in the HapMap to establish connections between particular genetic variants and diseases. Rather, the Project is designed to provide information that other researchers can use to link genetic variants to the risk for specific illnesses, which will lead to new methods of preventing, diagnosing, and treating disease.
The DNA in our cells contains long chains of four chemical building blocks -- adenine, thymine, cytosine, and guanine, abbreviated A, T, C, and G. More than 6 billion of these chemical bases, strung together in 23 pairs of chromosomes, exist in a human cell. (See http://www.dnaftb.org/dnaftb/ for basic information about genetics.) These genetic sequences contain information that influences our physical traits, our likelihood of suffering from disease, and the responses of our bodies to substances that we encounter in the environment.
The genetic sequences of different people are remarkably similar. When the chromosomes of two humans are compared, their DNA sequences can be identical for hundreds of bases. But at about one in every 1,200 bases, on average, the sequences will differ (Figure 1). One person might have an A at that location, while another person has a G, or a person might have extra bases at a given location or a missing segment of DNA. Each distinct "spelling" of a chromosomal region is called an allele, and a collection of alleles in a person's chromosomes is known as a genotype.
Differences in individual bases are by far the most common type of genetic variation. These genetic differences are known as single nucleotide polymorphisms, or SNPs (pronounced "snips"). By identifying most of the approximately 10 million SNPs estimated to occur commonly in the human genome, the International HapMap Project is identifying the basis for a large fraction of the genetic diversity in the human species.
For geneticists, SNPs act as markers to locate genes in DNA sequences. Say that a spelling change in a gene increases the risk of suffering from high blood pressure, but researchers do not know where in our chromosomes that gene is located. They could compare the SNPs in people who have high blood pressure with the SNPs of people who do not. If a particular SNP is more common among people with hypertension, that SNP could be used as a pointer to locate and identify the gene involved in the disease.
However, testing all of the 10 million common SNPs in a person's chromosomes would be extremely expensive. The development of the HapMap will enable geneticists to take advantage of how SNPs and other genetic variants are organized on chromosomes. Genetic variants that are near each other tend to be inherited together. For example, all of the people who have an A rather than a G at a particular location in a chromosome can have identical genetic variants at other SNPs in the chromosomal region surrounding the A. These regions of linked variants are known as haplotypes.



In many parts of our chromosomes, just a handful of haplotypes are found in humans. [See The Origins of Haplotypes.] In a given population, 55 percent of people may have one version of a haplotype, 30 percent may have another, 8 percent may have a third, and the rest may have a variety of less common haplotypes. The International HapMap Project is identifying these common haplotypes in four populations from different parts of the world. It also is identifying "tag" SNPs that uniquely identify these haplotypes. By testing an individual's tag SNPs (a process known as genotyping), researchers will be able to identify the collection of haplotypes in a person's DNA. The number of tag SNPs that contain most of the information about the patterns of genetic variation is estimated to be about 300,000 to 600,000, which is far fewer than the 10 million common SNPs.

Once the information on tag SNPs from the HapMap is available, researchers will be able to use them to locate genes involved in medically important traits. Consider the researcher trying to find genetic variants associated with high blood pressure. Instead of determining the identity of all SNPs in a person's DNA, the researcher would genotype a much smaller number of tag SNPs to determine the collection of haplotypes present in each subject. The researcher could focus on specific candidate genes that may be associated with a disease, or even look across the entire genome to find chromosomal regions that may be associated with a disease. If people with high blood pressure tend to share a particular haplotype, variants contributing to the disease might be somewhere within or near that haplotype.

Transition and transcersion

Ts vs Tv


DNA substitution mutations are of two types. Transitions are interchanges of two-ring purines (A G) or of one-ring pyrimidines (C T): they therefore involve bases of similar shape. Transversions are interchanges of purine for pyrimidine bases, which therefore involve exchange of one-ring and two-ring structures.

Although there are twice as many possible transversions, because of the molecular mechanisms by which they are generated, transition mutations are generated at higher frequency than transversions. As well, transitions are less likely to result in amino acid substitutions (due to "wobble"), and are therefore more likely to persist as "silent substitutions" in populations as single nucleotide polymorphisms (SNPs).

Wednesday, December 8, 2010

make a simulation for 7550R

#train parameters from the raw fastq file (101bp, pair-end)

>maq simutrain 7550.simupars.dat 7550X1_101122_SN141_0314_B80R4NABXX_1_1.txt


#we want to make test on hg19 Chr20 only. Chr20 has ~63 million(63025520) bp , to get a 30x coverage with 101bp read length, we need 30*63000000/101 = 18712871 reads. (rounds to 19 million reads) .

>maq simulate -d 200 -s 20 -N 19000000 7550.chr20.1.fq 7550.chr20.2.fq chr20.fa 7550.simupars.dat > 7550.chr20.snps

#do alignment
>novoalign -o SAM $'@RG\tID:YING\tPL:ILLUMINA\tLB:LB_TEST\tSM:6966R\tCN:HCI' -F ILMFQ -d chr20.ndx -f 7550.chr20.1.fq 7550.chr20.2.fq > 7550.chr20.sam 2>7550.chr20.log

Monday, December 6, 2010

Thursday, December 2, 2010

Read group



For example, to create a paired end same file with fully formatted read groups, you can run:

bwa sampe -i MARK -m DEPRISTO -l LIBRARY -p ILLUMINA -h Homo_sapiens_assembly18.fasta first.fastq.sai secondfastq.sai first.fastq second.fastq > pe.sam
This will result in a file that looks like:

@HD VN:1.0 SO:unsorted
@SQ SN:chrM LN:16571
@SQ SN:chr1 LN:247249719
@SQ SN:chr2 LN:242951149
[More @SQ lines removed for clarity]
@RG ID:MARK SM:DEPRISTO LB:LIBRARY PL:ILLUMINA
61CC3AAXX100125:6:48:11631:21212 81 chrM 8 37 76M chr17 21944837 0 GGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGAT 99:<->=D>IF7.:7,H

Tuesday, November 30, 2010

dbSNP file naming convention

>ls ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606

dr-xr-xr-x 2 ftp anonymous 4096 Feb 28 2010 1000Genomes
dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 14:52 ASN1_bin
dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 14:55 ASN1_flat
dr-xr-xr-x 2 ftp anonymous 4096 Nov 29 16:39 ASN1_flat_b131
dr-xr-xr-x 3 ftp anonymous 4096 Feb 28 2010 GWAS_arrays
lr--r--r-- 1 ftp anonymous 17 Oct 28 20:17 README_human_9606_B132.txt -> /snp/00readme.txt
dr-xr-xr-x 3 ftp anonymous 4096 Sep 30 18:55 VCF
dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 15:07 XML
dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 15:08 chr_rpts
dr-xr-xr-x 7 ftp anonymous 4096 Sep 28 20:29 database
dr-xr-xr-x 2 ftp anonymous 167936 Nov 3 15:33 gene_report
dr-xr-xr-x 2 ftp anonymous 4096 Feb 28 2010 genome_report
dr-xr-xr-x 3 ftp anonymous 4096 Oct 14 17:39 genotype
dr-xr-xr-x 27 ftp anonymous 4096 Oct 14 18:20 genotype_by_gene
dr-xr-xr-x 3 ftp anonymous 4096 Feb 28 2010 haplotypes
dr-xr-xr-x 5 ftp anonymous 4096 Nov 15 19:05 misc
dr-xr-xr-x 2 ftp anonymous 4096 Nov 15 15:14 rs_fasta
dr-xr-xr-x 15 ftp anonymous 4096 Apr 5 2010 ss_fasta
dr-xr-xr-x 2 ftp anonymous 45056 Nov 17 21:14 viewBatch

****************************************************

example:
14-12162-MKK.vcf.gz
14 -> Chromosome 14
12162 -> dbSNP population ID (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?pop=12162)
MKK -> Three letter population identifier (http://ccr.coriell.org/sections/collections/NHGRI/?SsId=11)

Friday, November 12, 2010

What is base calling?

Base-calling converts raw or processed data from a sequencing instrument into sequences and quality scores.

Base-calling usually refers to the conversion of intensity data into sequences and quality scores. Intensity information is extracted from images by the image analysis.

Base-calling has two aspects: Identifying the base-call and assigning a confidence estimate to the call

Quality scores quantify the probability that a base-call is correct (or wrong)

Base quality scores: Individual bases have quality scores which reflect the likelihood of the base being correct/incorrect

Alignment scores: Probability than an alignment to a given position in the reference genome is correct

Allele scores, SNP scores: Probability that a given allele, SNP was observed (often conditional on the alignment being correct)

Base and alignment scores: are single read scores; SNP scores are consensus scores

Consensus calls use information from multiple reads


Phred scores: A base quality score assigned by the phred software (or a program based on the phred which is the most prominent base-calling program for capillary sequencing.)

A quality score expressed on a logarithmic scale:

Q = -10 log10( probability of an error )

Example: Q20 = 1% error probability

The Phred method assigns quality scores to a base-call based on observed properties of the base (predictors)

Phred is a two-step process:

1. Training: Given a set of reads, labels as to which bases are correct, and a set of quality statistics for each base, produce a model that can predict error rates for unseen bases

2. Application: Given new reads and quality statistics, predict the quality for each of the bases.
Phred is essentially a big lookup table!

What goes into a quality table?

Quality predictors are numbers correlated with the quality of a base call, and attempt to quantify concepts such as:
“Is the signal for the called base much brighter than the others?”
“Did the spot get suspiciously dim, compared to the beginning?”
“Does the signal look clean in the next few cycles, and the previous few cycles?
A training data set
One of the fundamental questions is: How similar is the training data set to the data set the phred table is applied? Does it make sense for my data?
Run-to-run variance, type of sample etc.; need diversity in training
An alignment method and reference
phred is inherently based on the assumption that alignments are correct andthe reference is well known
Makes it intrinsically very hard to provide accurate scores for low quality bases!
Need high-quality references for training –what are the highest qualities we can get for references?



Monday, November 8, 2010

Pileup format

samtools

chr1 808922 G A 21 116 53 9 aAaaaaaAA 9=;:@?9:,


chr1 ---> name of this sequence is chr1
808922 ---> coordinate is 808922 (1-based)
G ---> reference base is 'G'
A ---> consensus base is 'A'
21 ---> consensus quality is 21
116 ---> SNP quality is 116
53 ---> maximum mapping quality is 53
9 ---> the number of reads covering the site is 9
aAaaaaaAA ---> read bases are 'aAaaaaaAA'
9=;:@?9:, --->base qualities are '9=;:@?9:,'


consensus quality (21) is the Phred-scaled probability that the consensus is wrong. SNP quality (116) is the Phred-scaled probability that the consensus is identical to the reference.
For SNP calling, SNP quality is of more importance.

For example, to further filter the SNP calls:
samtools.pl varFilter raw.pileup | awk '$6>=20' > final.pileup

SNP call

novoalign -o SAM -F ILMFQ -t120 -r0.2 -q5 -d hg19.ndx -f 7746x2_s_6_1_sequence.txt.gz 7746x2_s_6_2_sequence.txt.gz | grep chr > 7746x2.sam

samtools view -bS 7746x2.sam | samtools sort - 7746x2

samtools rmdup 7746x2.bam 7746x2.sorted.bam

samtools index 7746x2.sorted.bam

samtools view -u -q 20 7746x2.sorted.bam | samtools pileup -vcf /DataDisk/Analysis/Sun/fa/hg19/chr1.fa - | samtools.pl varFilter -D 100 -G 50 > 1.var
awk '($3=="*"&&$6>=50)||($3!="*"&&$6>30)' 1.var > 1.var.filtered

Tuesday, November 2, 2010

delete lines from a file

sed -e '2d' hello.txt # delete line #2 from hello.txt
sed -e '1,36d' hello.txt # delete line 1-36 from hello.txt
awk '!(NR>=1 && NR<=36){print $0}' hello.txt # delete line 1-36 from hello.txt

soft clipped

Clipped alignment. In Smith-Waterman alignment, a sequence may not be aligned from the first residue to the last one. Subsequences at the ends may be clipped off. We introduce operation ʻSʼ to describe (softly) clipped alignment. Here is an example. Suppose the clipped alignment is:
REF:AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGATCGACTGTCAGCTAGTCAGACTAGTCGATCGATGTG
READ: gggGTGTAACC-GACTAGgggg

where on the read sequence, bases in uppercase are matches and bases in lowercase are clipped off. The CIGAR for this alignment is: 3S8M1D6M4S.

Monday, October 18, 2010

A SNP Call pipeline

SNP and MicroIndel detection

Novoalign detects mismatches and short indels in short reads from next-generation sequencing platforms. The recipes below contain command line usage to generate list of single nucleotide polymorphisms SNPs and Indels for further analysis.

Generate sorted alignments in BAM format

Inititally we need to align our file(s) to the reference genome:

novoalign -d hg18.nix -f reads1.fastq -o SAM 2> stats.txt > novo.sam

samtools view -bS novo.sam | samtools sort - reads1

For paired-end or mate-pair reads adjust the parameters accordingly. If we have multiple alignment files these all need to be sorted individually and merged with either samtools or Picard.

Remove PCR duplicates

It is highly recommended to remove PCR duplicates before proceeding with the SNP and Indel calls. We use samtools as an example to remove PCR duplicates.

#remove PCR duplicates from single-end reads

samtools rmdup -s reads1.bam reads1.sorted.bam

#For Paired reads

samtools rmdup reads1.bam reads1.pe.sorted.bam

SNP and Microindel calling

In this example we use the samtools pileup SNP caller and accept reads with a minimum mapping quality of 20:

#The hg18.fasta reference genome sequence is required. samtools.pl is also required

samtools view -u -q 20 reads1.bam | samtools pileup -vc -f hg18.fasta - | samtools.pl varFilter -D 1000 -G 50 > variations.list

Filtering and final list generation

A simple awk script can be used to filter our list of variations where the indel SNP quality is greater than 50 or 60 respectively:

awk '($3=="*"&&$6>=50)||($3!="*"&&$6>30)' variations.list > variations.filtered.list

Exporting to other formats

The samtools distribution contains a helper script called sam2vcf.pl that may be used to convert to the Variant Call Format (VCF):

sam2vcf.pl variations.filtered.list -r hg18.fasta > variations.vcf

Thursday, October 7, 2010

Walltime estimation

Input: ~2.5G *2 (pair-end)
Genome: ~4G

250MB reads and 4GB genome per node: ~4 hours


Tuesday, September 21, 2010

Simulate data for pipeline

Organsim, CE6

#download
wget http://sourceforge.net/projects/maq/files/maq-data/20080929/calib-36.dat.gz/download

#unzip
gzip -d calib-36.dat.gz

#simulate reads
maq simulate ce6_1.fq ce6_2.fq ce6.fa calib-36.dat

#make indexed genome
novoindex ce6.ndx ce6.fa

#Align reads to ce6 genome
time novoalign -d ce6.ndx -d ce6_1.fq ce6_2.fq | grep chr > A.ce6.nal

real 1m43.108s
user 6m39.372s
sys 0m4.652s

#Format to MAQ
novo2maq A.map - A.ce6.nal

#Convert the reference sequences to the binary fasta format
#maq fasta2bfa ref.fasta ref.bfa

#Build the mapping assembly
maq assemble A.cns ref.bfa A.map 2>assemble.log

#Extract consensus sequences and qualities
maq cns2fq A.cns >A.cns.fq

#Extract list of SNPs
maq cns2snp A.cns >A.snp


Wednesday, September 15, 2010

update potato source code

#!/bin/sh
ssh hiseq@hci-bio2.hci.utah.edu "rsync -arue ssh ying@155.100.235.73:~/workspace/potato/src/*.py potato"
rsync -arue ssh hiseq@hci-bio2.hci.utah.edu:~/potato/*.py .

Tuesday, September 14, 2010

delete all jobs with name=hello

qstat -au u0592675 | grep 'hello' | awk '{print $1}' | awk -F '.' '{print $1}' | xargs qdel

Friday, September 10, 2010

Supported Genome Builds

Human (hg19, Feb_2009, GRCh37)

single-end, HiSeq2000, Rong Mao Lab

Human (hg18, Mar_2006, NCBI Build 36.1)

DNA-methylation, pair-end, Hiseq2000, Brad Cairns Lab

Chip-seq, single-end, GAIIx, Brad Cairns Lab

C_elegans (CE6, May_2008, WS190)

Small RNA sequencing, single-end, GAIIx, Brenda Bass Lab

A_Thaliana (TAIR8, Mar_2008)

Single-end, HiSeq2000, Jason Stajich Lab

M_musculus (mm9, Jul_2007, NCBI Build 37)

ChIP-Seq, single-end, HiSeq2000, Anne Moon Lab

DNA-methylation, pair-end, HiSeq2000, Brad Cairns Lab

D_rerio (Zv8, Jun_2008)

ChIP-Seq, single-end, HiSeq2000, Brad Cairns Lab

Friday, September 3, 2010

No available computational resource at all

1. I submitted a PBS job last Friday which only asked one node, one CPU and five seconds of walltime, now seven days passed, the job is still in the waiting queue. I can not make any tests on the cluster at all.

2. I tried to use Amazon's Cloud Server to do the alignment. However, to do so will require the purchasing of this service. We do not have a public account.





Torque and S3cmd

tmp> wget http://www.clusterresources.com/downloads/torque/torque-2.5.1.tar.gz
tmp> tar -zxvf torque-2.5.1.tar.gz
tmp> cd torque-2.5.1
torque-2.5.1> ./configure && make && make install
torque-2.5.1> ./torque-package-mom-linux-x86_64.sh --install
torque-2.5.1> ./torque-package-client-linux-x86_64.sh --install
torque-2.5.1> cp contrib/init.d/pbs_mom /etc/init.d/pbs_mom
torque-2.5.1> update-rc.d pbs_mom defaults

Modify "/etc/ld.so.conf"
###############################
include /etc/ld.so.conf.d/*.conf
/usr/local/lib
###############################

>ldconfig

torque-2.5.1> ./torque.setup root
initializing TORQUE (admin: root@ubuntu.localdomain)
Max open servers: 4
set server operators += root@ubuntu.localdomain
Max open servers: 4
set server managers += root@ubuntu.localdomain


torque-2.5.1> qmgr -c 'p s'
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = ubuntu
set server managers = root@ubuntu.localdomain
set server operators = root@ubuntu.localdomain
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300

$TORQUEHOME is /var/spool/torque/


S3CMD


>wget http://sourceforge.net/projects/s3tools/files/s3cmd/0.9.9.91/s3cmd-0.9.9.91.tar.gz/download
>sudo python setup.py install

>s3cmd --configure

Enter new values or accept defaults in brackets with Enter.
Refer to user manual for detailed description of all options.

Access key and Secret key are your identifiers for Amazon S3
Access Key: 1qaz2wsx
Secret Key: 1qaz2wsx

Encryption password is used to protect your files from reading
by unauthorized persons while in transfer to S3

Wednesday, September 1, 2010

Amazon Cloud

Amazon S3 is a reasonably priced data storage service. Ideal for off-site backups, archiving and other data storage needs. Check out About Amazon S3 section to find out more.

S3cmd is a command line tool for uploading, retrieving and managing data in Amazon S3. It is best suited for power users who don't fear command line. It is also ideal for scripts, automated backups triggered from cron, etc.

Tuesday, August 31, 2010

Design of the pipeline

Message queue centralized system and coupled with independent I/O Channel

Friday, August 27, 2010

Tuesday, August 24, 2010

Script mapping

#alignment
novoalign -F ILMFQ -t120 -r0.2 -q5 -d hg19Splices34.index -f 1.txt 2.txt > 84060.novoalign

#transform to MAQ
maq novo2maq 84060.novoalign.map - 84060.novoalign

#Convert the reference sequences to the binary fasta format
maq fasta2bfa ref.fasta ref.bfa

#Build the mapping assembly
maq assemble consensus.cns ref.bfa 84060.novoalign.map 2>assemble.log

#Extract consensus sequences and qualities
maq cns2fq consensus.cns >cns.fq

#Extract list of SNPs
maq cns2snp consensus.cns >cns.snp


will be translated to :


/PATH_TO_NONOALIGN/novoalign -F ILMFQ -t120 -r0.2 -q5 -d /PATH_TO_GENOME_INDEX/hg19Splices34.index -f /PATH_TO_INPUT_FILE/1.txt /PATH_TO_INPUT_FILE/2.txt > /PATH_TO_OUTPUT_FILE/1_84060.novoalign
/PATH_TO_MAQ/maq novo2maq /PATH_TO_OUTPUT_FILE/84060.novoalign.map - /PATH_TO_INPUT_FILE/84060.novoalign
...

Monday, August 23, 2010

query, download, upload and update with rsync


#!/home/ying/py2.6.5/bin/python
"""
Use rsync to synchronize the DATA_HOME(hci-bio2, which raw data was stored) and COMP_HOME(chpc-node,
which process raw data and return output back to source. All the rsync operations are initiated from chpc-node.

Functions:
1. query: query the source if the new data is available
2. download: download new data ( from source to destination)
3. upload: upload result(output) to source.
4. update: update the source by appending a log file from the destination to show the alignment progress on chpc. "begin, percent-finished, end, error, etc"

"""
import os
import time
import operator
import subprocess

DATA_HOME = 'ying@123.123.123.123:~/alignment/'
COMP_HOME = '/home/ying/alignment/'
RSYNC_QUERY_PARAMS = '-are ssh'
RYSNC_DOWNLOAD_PARAMS = '-avuze ssh' #archive, verbose, update-only, compress-during-transfer and trasnfer-over-ssh
RYSNC_UPLOAD_PARAMS = '-avuze ssh' #archive, verbose, update-only, compress-during-transfer and trasnfer-over-ssh
LOG_FILE_NAME = 'log.txt'

def query():
#query_cmd = """rsync -are ssh ying@155.100.235.73:~/alignment/ | awk '{print $1 " " $3"#"$4 " " $5}'"""
cmd = ["rsync",RSYNC_QUERY_PARAMS,DATA_HOME,'|',"""awk '{print $1 " " $3"#"$4 " " $5}'"""]
query_cmd = ' '.join(cmd)
#print query_cmd
p = subprocess.Popen(query_cmd,shell=True,stdout=subprocess.PIPE)
line = p.communicate()
fnames = ''.join(line[:-1]).strip().split('\n')
df = {}
dt = {}
for fname in fnames:
prop,ctime,fn = fname.split(' ')
if fn != '.' and fn != '..':
if prop[0] == 'd': #it is a directory
pass
if prop[0] == '-': #it is a file
parent_path = os.path.dirname(fn)
df.setdefault(parent_path,[]).append(os.path.basename(fn))
#Transform time to seconds, for comparing which job should be run firstly. First In First Run.
#'2010/08/20#16:27:40' -> 1282343260.0
dt[parent_path] = int(time.mktime(time.strptime(ctime,'%Y/%m/%d#%H:%M:%S')))
jobs = []
for k,v in df.items():
if "begin" in v: #the data is ready
if not LOG_FILE_NAME in v: #this job is not running
jobs.append((k,dt[k]))
#print jobs
return sorted(jobs,key=operator.itemgetter(1))

def download(path_to_download):
"""download a whole directory from DATA_HOME to COMP_HOME
"""
normalized_path = path_to_download.strip('/') #/A, /A/ or A/ -> A
source = os.path.join(DATA_HOME,normalized_path)
cmd = ["rsync", RYSNC_DOWNLOAD_PARAMS, source, COMP_HOME] #download the whole directory to COMP_HOME.
download_cmd = ' '.join(cmd)
#print download_cmd
p = subprocess.Popen(download_cmd,shell=True,stdout=subprocess.PIPE)
line = p.communicate()
stdout_message = ''.join(line[:-1]).strip().split('\n')
#print 'Message:',stdout_message
update(normalized_path,stdout_message)

def upload(path,file_to_upload):
"""upload a file from COMP_HOME to DATA_HOME
upload 'file_to_upload' at local path 'folder'.
:/folder/file_to_upload will be uploaded to :/
"""
normalized_path = path.strip('/')+"/" #/A, /A/ or A/ -> A/
source = os.path.join(COMP_HOME,normalized_path,file_to_upload)
dest = os.path.join(DATA_HOME,normalized_path)
cmd = ["rsync",RYSNC_UPLOAD_PARAMS, source,dest]
upload_cmd = ' '.join(cmd)
p = subprocess.Popen(upload_cmd,shell=True,stdout=subprocess.PIPE)
line = p.communicate()
stdout_message = ''.join(line[:-1]).strip().split('\n')
#fn_log = log(os.path.join(COMP_HOME,normalized_src),stdout_message)
update(normalized_path,stdout_message)

def update(path,message):
fn = os.path.join(COMP_HOME,path,LOG_FILE_NAME)
ofn = None
try:
if os.path.exists(fn):
ofn = open(fn,'w+')#append
else:
ofn = open(fn,'w') #new file
print >>ofn,'-'*50
print >>ofn,time.asctime()
for m in message:
print >>ofn,M
ofn.close()
normalized_path = path.strip('/')+"/" #/A, /A/ or A/ -> A/
source = fn
dest = os.path.join(DATA_HOME,normalized_path)
cmd = ["rsync",RYSNC_UPLOAD_PARAMS, source,dest]
subprocess.Popen(cmd,shell=False,stdout=subprocess.PIPE)
except:
pass

def clear():
pass
#print query()
#download('A')
upload('A','hello.txt')

Thursday, August 19, 2010

Failed to run RSA key on HCI-BIO2

Password is required.


How to: Make a passwordless ssh connection

How to: Make a passwordless ssh connection between chpc(hello@delicatearch.chpc.utah.edu) and hci(world@hci-bio1.hci.utah.edu)

Suppose after we log into chpc , we want to "ssh" to hci without password.

1. Create RSA keys (on chpc).

hello@chpc:MY_HOME>ssh-keygen -t rsa

Generating public/private rsa key pair.
Enter file in which to save the key ($HOME/.ssh/id_rsa):
Enter passphrase (empty for no passphrase): MY_PASSPHRASE
Enter same passphrase again:MY_PASSPHRASE
Your identification has been saved in $HOME/.ssh/id_rsa.
Your public key has been saved in $HOME/.ssh/id_rsa.pub.
The key fingerprint is:
ad:9e:ab:f9:e1:1a:a4:85:16:3b:24:5f:35:b6:76:a7 user@machine


MY_PASSPHRASE is password used to encrypt the keys, not the user's passworld.

Now we have two files under $HOME/.ssh, "id_rsa" is private key, "id_rsa.pub" is public key.


2. Transfer the public key to world@hci's home directory (on chpc)
hello@chpc:$HOME>scp .ssh/id_rsa.pub world@hci-bio1.hci.utah.edu:~/


3. Append the id_rsa.pub to aothorized keys (on hci)
world@hci:$HOME>cat id_rsa.pub >>authorized_keys

4. Make the hci accept RSA key style connection.

By default, this features is off.
We need to modify the following line in /etc/ssh/sshd_config (root privilege is required)

#RSAAuthentication yes
#PubkeyAuthentication yes

to

RSAAuthentication yes
PubkeyAuthentication yes

Then we need to refresh the ssh service.
>/etc/init.d/sshd restart

5. Make a test (on chpc)
>ssh -2 world@hci-bio1.hci.utah.edu
Enter passphrase for key '$HOME/.ssh/id_rsa':

Here we need to input the passphrase of the private key, it is . If everything goes well, we will see the welcome message:

Welcome to Ubuntu!


6. Now we need to eliminate the "passphrase" step using ssh-agent and ssh-add (on chpc)

ssh-agent are used to buffer the passphrase and keep it in memory, we we do not need input passphrase next time.

#start ssh-agent
>eval `ssh-agent`

#add passphrase
>ssh-add

Enter passphrase for chpc:$HOME/.ssh/id_rsa: MY_PASSPHRASE
Identity added: chpc:$HOME/.ssh/id_rsa

7. Make a test again (on chpc)
>ssh -2 world@hci-bio1.hci.utah.edu
Welcome to Ubuntu!

8. we can test scp (on chpc)
>scp hello.txt world@hci-bio1.hci.utah.edu:~/

To avoid run "eval `ssh-agent`" and "ssh-add" every time after we log into the chpc, we can append eval `ssh-agent` to ~/.bash_profile. So it will start automatically next time. However, we still need to run 'ssh-add' manually after each log in to register the passphrase for security. if passphrase is empty, we do not even need ssh-agent and ssh-add. But this may bring security risk to the private key.

we can also use "keychain" as the frontend of ssh-agent, so we do not have to create ssh-agent for each login. With keychain, only one ssh-agent is in service no matter how many consoles we open.

>wget http://www.funtoo.org/archive/keychain/keychain-2.7.1.tar.bz2
>tar -jxvf keychain-2.7.1.tar.bz2
>cd keychain-2.7.1
>./keychain

Wednesday, August 18, 2010

ssh connection from chpc to bio1

Suppose we want to passwordless ssh connection from machine A(chpc) to machine B(bio1)


1. On machine B
modify the following line in /etc/ssh/sshd_config
(root privilege is required)

#RSAAuthentication yes
#PubkeyAuthentication yes
#AuthorizedKeysFile .ssh/authorized_keys

to

RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys


reload configuration

#/etc/init.d/ssh reload

2. On machine A

$ssh-keygen -t rsa

Generating public/private rsa key pair.
Enter file in which to save the key (/MY_HOME/.ssh/id_rsa):
Enter passphrase (empty for no passphrase): THIS_IS_MY_PASSWORD
Enter same passphrase again: THIS_IS_MY_PASSWORD
Your identification has been saved in /MY_HOME/.ssh/id_rsa.
Your public key has been saved in /MY_HOME/.ssh/id_rsa.pub.
The key fingerprint is:
ad:9e:ab:f9:e1:1a:a4:85:16:3b:24:5f:35:b6:76:a7 user@machine


$ls .ssh/
id_rsa id_rsa.pub known_hosts

id_rsa is my private key, and id_rsa.pub is the public key

$scp id_rsa.pub user@machineB:~/.ssh


3.On machine B

$cd .ssh/

$cat id_rsa.pub >>authorized_keys

Now make a test from machine A:

$ssh user@machineB

Enter passphrase for key '/MY_HOME/.ssh/id_rsa': THIS_IS_MY_PASSWORD
user@machineB$ #OK now we are using machineB.


Next, we use ssh-agent and ssh-add to avoid "Enter passphrase for key '/MY_HOME/.ssh/id_rsa'", avoid password completely.

Pipeline to run novoalign - 0818

Since we can run a deamon service at chpc, here is my idea to implement a pipeline


On HCI:
Setup rsync service on /home/alignment@hci-bio1.
This directory will served as home to put all read files.

On CHPC:
A script running on chpc's interactive node (delicatearch) as deamon.
This script will:

1. monitor /home/alignment/@hci-bio1 using rsync

2. if new files are found in that directory, download the file to chpc

3. split the read file into segments as jobs

4. submit jobs using PBS

5. wait for results by monitoring the local result directory e.g. /home/user/pid

6. re-submit jobs if necessary (some jobs may take exceptional long time to finish)

7. assemble result files

8. transfer result back to hci-bio1


Use case:

I have two files "A_1.fq.gz" and "A_2.fq.gz" from pair-end sequencing read. Now I want to align the read to human genome using novoalign.

1. Put "A_1.fq.gz" and "A_2.fq.gz" to /home/alignment/request/ @ hci-bio1

2. Create a file named "A.params" to list the parameters (gap penalty, genome, etc) for running novoalign.

3. Touch a dummy file named "A.ready". This file is to tell the daemon service on chpc that alignment request that starts with "A" are ready.

Once the job is done, I should be able to find my alignment at /home/alignment/response/

Now it looks like:

/home
|
|---alignment/
|
|---request/
| |
| |---A_1.fq.gz
| |
| |---A_2.fq.gz
| |
| |---A.params
| |
| |---A.ready
|
|---response/
|
|---A.result
|
|---other results