Friday, December 16, 2011


Describes an individual who has only one member of a chromosome pair or chromosome segment rather than the usual two;
refers in particular to X-linked genes in males who under usual circumstances have only one X chromosome.

Friday, December 2, 2011

Minus and plus strand


X - 15353623 15353645

SO this mapped to the minus strand

To match the plus strand, first reverse the sequence to:


Then compliment the reversed sequence to:


Then it will be mapped to:

X + 15353623 15353645

Monday, November 14, 2011

Trac and Mercurial

#mkdir tool; cd tool

tar -zxvf Python-2.7.2.tgz
cd Python-2.7.2
./configure --prefix=/home/u0592675/tool/python-2.7.2
make install

tar -zxvf httpd-2.2.21.tar.gz
cd httpd-2.2.21
./configure --prefix=/home/u0592675/tool/apache-2.2.21
make install

tar -zxvf mod_wsgi-3.3.tar.gz
cd mod_wsgi-3.3
./configure --with-apxs=/home/u0592675/tool/apache-2.2.21/bin/apxs --with-python=/home/u0592675/tool/python_2.7.2/bin/python
make install


trac-admin /home/ying/mercurial_database/tomato initenv
trac-admin /home/ying/mercurial_database/tomato deploy /home/ying/web/tomato_trac

{'mod_wsgi.listener_port': '80',
'HTTP_COOKIE': 'trac_form_token=0b8daa13a5ec1c5f4819083c; trac_session=80547f6793f2d13cbfea5b1c; trac_form_token=a30953232455e34b94f82aa5; trac_session=d5bdc2b3269ef9a4001807dd',
'mod_wsgi.listener_host': '',
'SERVER_SOFTWARE': 'Apache/2.2.21 (Unix) mod_wsgi/3.3 Python/2.7.2+',
'SCRIPT_NAME': '/tomato',
'mod_wsgi.handler_script': '',
'PATH_INFO': '/',
'HTTP_ACCEPT_CHARSET': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.106 Safari/535.2', 'HTTP_CONNECTION': 'keep-alive', 'SERVER_NAME': '', 'REMOTE_ADDR': '', 'mod_wsgi.request_handler': 'wsgi-script', 'wsgi.url_scheme': 'http', 'PATH_TRANSLATED': '/home/ying/tool/apache-2.2.21/htdocs/index.html', 'SERVER_PORT': '80', 'wsgi.multiprocess': True, 'mod_wsgi.input_chunked': '0', 'SERVER_ADDR': '', 'DOCUMENT_ROOT': '/home/ying/tool/apache-2.2.21/htdocs', 'mod_wsgi.process_group': '', 'SCRIPT_FILENAME': '/home/ying/trac/cgi-bin/trac.wsgi', 'SERVER_ADMIN': '', 'wsgi.input': , 'HTTP_HOST': '', 'wsgi.multithread': False, 'mod_wsgi.callable_object': 'application', 'REQUEST_URI': '/tomato/', 'HTTP_ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'wsgi.version': (1, 1), 'GATEWAY_INTERFACE': 'CGI/1.1', 'wsgi.run_once': False, 'wsgi.errors': , 'REMOTE_PORT': '6064', 'HTTP_ACCEPT_LANGUAGE': 'en-US,en;q=0.8', 'mod_wsgi.version': (3, 3), 'mod_wsgi.application_group': '',
'mod_wsgi.script_reloading': '1',
'wsgi.file_wrapper': ,
'HTTP_ACCEPT_ENCODING': 'gzip,deflate,sdch'}

[Tue Nov 08 11:15:24 2011] [error]

prepare GATK resource for pipeline

Three VCF Files for known SNPs from GATK resource

1000G_omni2.5.hg19.sites.vcf -> hg19.1000G_omni2.5.vcf
hapmap_3.3.hg19.sites.vcf -> hg19.hapmap_3.3.vcf
dbsnp_132.hg19.vcf -> hg19.dbsnp_132.vcf

Also copy corresponding ".idx" files to DATA_PATH

mv 1000G_omni2.5.hg19.sites.vcf hg19.1000G_omni2.5.vcf
mv hapmap_3.3.hg19.sites.vcf hg19.hapmap_3.3.vcf
mv dbsnp_132.hg19.vcf hg19.dbsnp_132.vcf

##### ERROR MESSAGE: Input files /scratch/local/4/hg19.dbsnp_132.vcf and reference have incompatible contigs: Order of contigs differences, which is unsafe.
##### ERROR /scratch/local/4/hg19.dbsnp_132.vcf contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]
##### ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]

1. get "dbsnp_132.hg19.vcf.gz" from ""

2. gunzip dbsnp_132.hg19.vcf.gz

3. awk '/^chrM/ {print $0}' dbsnp_132.hg19.vcf > m.vcf

4. awk '/^chrM/ {next} {print $0}' dbsnp_132.hg19.vcf > wom.vcf

5. cat m.vcf >> wom.vcf

6. mv wom.vcf hg19.dbsnp_132.vcf

7. vcf-validator hg19.dbsnp_132.vcf
The header tag contig already exists, ignoring.
...same as above...
Warning: Duplicate entries, for example chr1:120995

8. vcf-validator dbsnp_132.hg19.vcf
The header tag contig already exists, ignoring.
...same as above...
Warning: Duplicate entries, for example chrM:16189

8. /home/u0592675/vcftools_0.1.7/bin/vcf-validator hg19.dbsnp_132.vcf

10 ./home/u0592675/vcftools_0.1.7/bin/vcf-validator dbsnp_132.hg19.vcf

awk '/^chr/ {print $1}' /tomato/data/hg19.dbsnp_132.vcf | sort |uniq

awk '/^chr/ {print $1}' hg19.1000G_omni2.5.vcf | sort |uniq

awk '/^chr/ {print $1}' hg19.hapmap_3.3.vcf | sort |uniq

awk '/^chrM/ {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > m.vcf
awk '/^chrM/ {next} {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > wom.vcf
cat m.vcf >> wom.vcf
rm m.vcf
mv wom.vcf hg19.dbsnp_132.vcf

vim hg19.dbsnp_132.vcf
delete all non-mapped contigs in header section ##contig


hg push ssh://
hg clone tomato

Thursday, November 3, 2011

Got new genome fasta file?

#1. hg19.fasta.idx
samtools faidx hg19.fasta

#2. hg19.dict
java -Xmx2g -jar ~/tool/picard/CreateSequenceDictionary.jar R=hg19.fasta O=hg19.dict

Tuesday, November 1, 2011


hg push ssh://hiseq@bio2:/home/hiseq/tomato/
hg clone http://bio2:1234 tomato

prepare dbsnp132

awk '/^chrM/ {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > m.vcf
awk '/^chrM/ {next} {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > wom.vcf
cat m.vcf >> wom.vcf
rm m.vcf
mv wom.vcf hg19.dbsnp_132.vcf

vim hg19.dbsnp_132.vcf
delete all non-mapped contigs in header section ##contig

Monday, October 31, 2011

Resource Bundle

Thursday, October 27, 2011

Known SNPs

Three VCF Files for known SNPs from GATK resource

1000G_omni2.5.hg19.sites.vcf -> hg19.1000G_omni2.5.vcf
hapmap_3.3.hg19.sites.vcf -> hg19.hapmap_3.3.vcf
dbsnp_132.hg19.vcf -> hg19.dbsnp_132.vcf

Also copy corresponding ".idx" files to DATA_PATH

Friday, October 14, 2011

I am running out of disk space

again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again

Multiple hits? All or None?

#-r All
57G Oct 14 09:44 1.sam
65G Oct 14 10:06 2.sam
65G Oct 14 10:26 3.sam
45G Oct 14 10:46 4.sam
56G Oct 14 11:09 5.sam

#-r None
56G Oct 11 02:10 1.sam
64G Oct 11 02:32 2.sam
63G Oct 11 02:53 3.sam
44G Oct 11 03:07 4.sam
54G Oct 11 03:25 5.sam

Thursday, October 13, 2011

pre-processing of masked FASTA for novoalign

1. Convert lower case to upper case
sed -i 's/\(.*\)/\U\1/' hg19.fasta

> chr1

Will be

> CHR1

2. covert '>CHR' to '>chr'
sed -i -e 's/>CHR/>chr/g' hg19.fasta

3. convert N to n
sed -i -e 's/N/n/g' hg19.masked.fasta

4. mask these UTR
novoindex -m hg19.nix hg19.fasta

Monday, September 26, 2011

Keyless ssh connection between A and B

ssh-keygen -t rsa

Then scp the "" to B

cat >> .ssh/authorized_keys
chmod 644 .ssh/authorized_keys

Now connection from A to B is keyless.

Friday, September 16, 2011

ACL Setup

On the job side, I need a root-like user who has write permission on any sub-folders under PATH_JOB while the individual user has only 744 like permissions on his/her own job folders. However, Unix/Linux does not have fine control over file/directory by default, so you can not assign permissions to different users. ACL (Access Control Lists) can be used to solve this problem.

Here is a simple test on my VM_Ubuntu_10.10:

1. List partitions, the /dev/sda1 is the root directory "/"
# blkid
/dev/sda1: UUID="xxxx" TYPE="ext4"
/dev/sda5: UUID="yyyy" TYPE="swap"

2. Modify the /etc/fstab to support ACL by inserting "acl" into options.
#vim /etc/fstab
UUID=xxxx / ext4 errors=remount-ro,acl 0 1

3. Remount the root directory
#mount / -o remount

4. Make a test
#setfacl -m u:abc:rw hello
#getfacl hello

# file: hello.txt
# owner: root
# group: root

5. Now user "abc" can read and write "hello"!

rw-rw----+ 1 root root 0 2011-09-16 11:41 hello.txt

Note the new "+" character appears the end of permission field. This shows this file/folder supports ACL

Friday, August 19, 2011

Merge the SAM/BAM file

Suppose I have two alignments in SAM (1.sam and 2.sam). Since these two alignments are two lanes from the same sample, I should merge these two alignments into one.

>wc -l 1.sam

>wc -l 2.sam

So both the SAM files has 10000 lines (include the header)

First using picard's "MergeSamFiles.jar"

>java -jar MergeSamFiles.jar I=1.sam I=2.sam O=12.sam
>wc -l 12.sam

The missing 27 lines are actually duplicated header lines and they were removed by PICARD automatically. However, there is a subtle bug here.

>less 12.sam

@HD VN:1.0 GO:none SO:coordinate
@SQ SN:chr1 LN:249250621 AS:hg19.nov.illumina.nix
@SQ SN:chr2 LN:243199373 AS:hg19.nov.illumina.nix
@SQ SN:chr3 LN:198022430 AS:hg19.nov.illumina.nix
@SQ SN:chr4 LN:191154276 AS:hg19.nov.illumina.nix
@SQ SN:chr5 LN:180915260 AS:hg19.nov.illumina.nix
@SQ SN:chr6 LN:171115067 AS:hg19.nov.illumina.nix
@SQ SN:chr7 LN:159138663 AS:hg19.nov.illumina.nix
@SQ SN:chr8 LN:146364022 AS:hg19.nov.illumina.nix
@SQ SN:chr9 LN:141213431 AS:hg19.nov.illumina.nix
@SQ SN:chr10 LN:135534747 AS:hg19.nov.illumina.nix
@SQ SN:chr11 LN:135006516 AS:hg19.nov.illumina.nix
@SQ SN:chr12 LN:133851895 AS:hg19.nov.illumina.nix
@SQ SN:chr13 LN:115169878 AS:hg19.nov.illumina.nix
@SQ SN:chr14 LN:107349540 AS:hg19.nov.illumina.nix
@SQ SN:chr15 LN:102531392 AS:hg19.nov.illumina.nix
@SQ SN:chr16 LN:90354753 AS:hg19.nov.illumina.nix
@SQ SN:chr17 LN:81195210 AS:hg19.nov.illumina.nix
@SQ SN:chr18 LN:78077248 AS:hg19.nov.illumina.nix
@SQ SN:chr19 LN:59128983 AS:hg19.nov.illumina.nix
@SQ SN:chr20 LN:63025520 AS:hg19.nov.illumina.nix
@SQ SN:chr21 LN:48129895 AS:hg19.nov.illumina.nix
@SQ SN:chr22 LN:51304566 AS:hg19.nov.illumina.nix
@SQ SN:chrX LN:155270560 AS:hg19.nov.illumina.nix
@SQ SN:chrY LN:59373566 AS:hg19.nov.illumina.nix
@SQ SN:chrM LN:16571 AS:hg19.nov.illumina.nix
@PG ID:novoalign VN:V2.07.10 CL:novoalign
@PG ID:novoalign.1 VN:V2.07.10 CL:novoalign

The last two lines are both @PG which differ the ID. Before merging, they are both "novoalign", after merging, the duplicated ID was renamed as "novoalign.1" (if merging 3 files, you will have "novoalign.2" as well). When you check the individual read alignment, you will see the PG:Z field referenced the PG line. When you call variances later, the genotyper application will process merged file as two files which is not what you want.

To fix this:

1. Keep the first @PG line and delete all other @PG lines
2. Replace all "novoalign.X" to "novoalign".

Or you can use samtools.

>java -jar SortSam.jar INPUT=1.sam OUTPUT=1.bam SO=unsorted
>java -jar SortSam.jar INPUT=2.sam OUTPUT=2.bam SO=unsorted
>samtools merge 3.bam 1.bam 2.bam
>samtools view -h -o 3.sam 3.bam
>less 3.sam

Samtools only use the first SAM's header in merged SAM, so it is OK in this case.

Tuesday, July 12, 2011

mouse dbSNP VCF

OK I spend few hours on searching data and writing a script to make a dbSNP VCF for mouse. The steps are simple and dirty. Test of this VCF file has not been done yet. Put it here for reference only.


#1. download dbSNP mouse
wget --continue --no-host-directories --no-directories --preserve-permissions*

#2. unzip
gunzip *.gz

#3. exclude all SNPs that are not mapped to mm9 (MGSCv37) genome.
for i in `ls`; do grep 'MGSCv37' "$i" > mm9."$i" ; done

#However, you can NOT find REF/ALT field in above files. So we must get it from other files.
#4. download dbSNP mouse with genotype
wget --continue --no-host-directories --no-directories --preserve-permissions*

#5. unzip
gunzip *.gz

#6. check out these files, should end with ".xml"

Black6GtyFromContig gt_chr13.xml gt_chr17.xml gt_chr2.xml gt_chr6.xml gt_chrAltOnly.xml gt_chrUn.xml
gt_chr10.xml gt_chr14.xml gt_chr18.xml gt_chr3.xml gt_chr7.xml gt_chrMT.xml gt_chrX.xml
gt_chr11.xml gt_chr15.xml gt_chr19.xml gt_chr4.xml gt_chr8.xml gt_chrMulti.xml gt_chrY.xml
gt_chr12.xml gt_chr16.xml gt_chr1.xml gt_chr5.xml gt_chr9.xml gt_chrNotOn.xml README

#7. make a 3 column file: "id REF ALT"
grep '$//g' | sed 's/rsId=//g' | sed 's/observed=//g' | awk -F '/' '{print $1, $2}' > id.allele.txt

#8. write a simple python script that mapping id from mm9.chr_N.txt to make the final VCF file - "mm9.dbsnp.vcf"

def mm9():
m = {}
for line in file('id.allele.txt'):
line = line.strip()
toks=line.split(' ')
if len(toks)==3:
m[toks[0]] = (toks[1],toks[2])

#chr1 - chr19, chr X,Y and MT only.
nfs = [str(i) for i in range(1,20)]

fo = file('mm9.dbsnp.vcf','w')
print >>fo,'##fileformat=VCFv4.0'
print >>fo,'##fileDate=20110712'
print >>fo,'##source=dbSNP'
print >>fo,'##dbSNP_BUILD_ID=132'
print >>fo,'##reference=mm9'

total = 0
for n in nfs:
sm = {}
fn = os.path.join('mm9.chr_%s.txt'%n)
for line in file(fn):
line = line.strip()
rs_id = toks[0]
mapped = toks[1]
withdraw =toks[2]
chr =toks[6]
position =toks[11]
if m.has_key(rs_id) and mapped=='2' and withdraw=='0':
ref,alt = m[rs_id]
#use an arbitratry "ms"+ID as prefix refID (instead of "rs")
sm[int(position)] = [chr,position,'ms'+rs_id,ref,alt,'.','PASS','.']
except Exception,e:
sk = sorted(sm.keys())
for k in sk:
total += 1
print >>fo,'\t'.join(sm[k])
print total

#9. make sure no duplicated IDs
awk '{print $3}' mm9.dbsnp.vcf | sort| uniq -d


Monday, June 20, 2011

Indel filter

java -Xmx32g -jar $HOME/tool/gatk/GenomeAnalysisTK.jar -T VariantFiltration -R $HOME/data/hg19.fasta \
-o x1.indel.filtered.vcf \
-B:variant,VCF x1.indel.vcf \
-B:mask,VCF ../SNP/x1.snp.vcf \
--maskName SNP \
--filterExpression "QUAL < 30 || SB > -1.0 || QD < 2.0" \
--filterName "QUAL<30;SB>-1.0;QD<2"

SNP Filter

#uses SB >= 0.10 filter
#uses QD < 5.0 filter
#uses HRun >= 4 filter
#uses clustered SNPs (>= 3 SNPs within 10 bp) filter
#masks out SNPs that overlap the filtered indels

java -Xmx32g -jar $HOME/tool/gatk/GenomeAnalysisTK.jar -T VariantFiltration -R $HOME/data/hg19.fasta \
-o x1.snp.filtered.vcf \
-B:variant,VCF x1.snp.vcf \
-B:mask,VCF ../INDEL/x1.indel.vcf \
--maskName InDel \
--clusterSize 3 \
--clusterWindowSize 10 \
--filterExpression "SB >= 0.10 || QD < 5.0 || HRun >= 4" \
--filterName "SB>=0.1;QD<5.0;HRun>=4;Cluster_3in10;MaskIndel"

#copy header and "PASS" items to "x1.snp.passed.vcf"

awk '/^#/ { print $0;next} $7 == "PASS" {print $0} ' x1.snp.filtered.vcf > x1.snp.passed.vcf

Friday, June 17, 2011

Build Genome Index

novoindex -k 14 -s 1 hg19.nov.normal.nix hg19.fasta

#solid color space
novoindex -k 14 -s 1 -c hg19.nov.color.nix hg19.fasta

novoindex -k 14 -s 1 -b hg19.nov.bis.nix hg19.fasta

bwa index -a bwtsw -p hg19.bwa.normal hg19.fasta

bwa index -a bwtsw -c -p hg19.bwa.color hg19.fasta


novoindex -k 14 -s 1 mm9.nov.normal.nix genomes/mm9/fasta/mm9.fasta
novoindex -k 14 -s 1 -c mm9.nov.color.nix genomes/mm9/fasta/mm9.fasta
novoindex -k 14 -s 1 -b mm9.nov.bis.nix genomes/mm9/fasta/mm9.fasta

./bwa index -a bwtsw -p mm9.bwa.normal genomes/mm9/fasta/mm9.fasta
./bwa index -a bwtsw -c -p mm9.bwa.color genomes/mm9/fasta/mm9.fasta

Call SNP & Indel

java -Xmx32g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R hg19.fasta -B:dbsnp,VCF dbsnp132.human.vcf -glm BOTH -o X1.vcf -I X1.bam -l INFO -stand_call_conf 20 -stand_emit_conf 10




#download&install GNU Scientific Library
$sudo tar -zxvf gsl-1.15.tar.gz;cd gsl-1.15; ./configure; make; make install

#download&install gettext
$sudo tar -zxvf gettext-;cd gettext-;./configure; make; make install

#download GD Library
sudo tar -zxvf gd-2.0.35.tar.gz;cd gd/2.0.35/;./configure; make; make install

$sudo tar -zxvf libbios-1.0.0.tar.gz; cd libbios-1.0.0
$sudo export CPPFLAGS="-I$HOME/VAT/libbios-1.0.0/include"
$sudo export LDFLAGS="-L$HOME/VAT/libbios-1.0.0/lib"
$sudo ./configure;make;make install

#twoBitToFa is required for "interval2sequences"
$chmod +x twoBitToFa
$cp twoBitToFa $HOME/tool/

#download&install VAT package
$tar -zxvf vat-1.0.0.tar.gz; cd vat-1.0.0
$mkdir tmp
$./configure --prefix=$HOME/VAT/vat-1.0.0/tmp;make;make install
$mv tmp/* $HOME/VAT/tool

gunzip gencode.v3c.annotation.GRCh37.gtf.gz



cat x2.snp.vcf | snpMapper gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > x2.snp.annotated.vcf

cat x2.indel.vcf | indelMapper gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > x2.indel.annotated.vcf

Thursday, June 16, 2011

An python script to seperate SNP and Indel calls in one VCF

def split_snp_indel(fvcf):
basename = fvcf.rstrip('.vcf')
fsnp = file(basename+'.snp.vcf','w')
findel = file(basename+'.indel.vcf','w')
for line in file(fvcf):
line = line.strip()
if line.startswith('#'):
print >>fsnp, line
print >> findel, line
toks = line.split('\t')
ref = toks[3]
alt = toks[4]
if len(ref)> 1 or len(alt)> 1:
print >> findel, line
print >>fsnp, line

if __name__=='__main__':
import sys

Wednesday, June 15, 2011

GATK resource bundle

Monday, June 13, 2011

update GATK and PICARD

rm -fr app/gatk/*
svn co GATK
cd GATK; ant; cd ..; cp GATK/dist/* app/gatk/; rm -fr GATK
rm -fr app/picard/*
svn co PICARD
cd PICARD; ant; cd ..; cp PICRAD/dist/* app/picard/; rm -fr PICARD

Monday, June 6, 2011

Hide a control in Flex

btn_hello.visible = false; 
btn_hello.includeInLayout = false;

Wednesday, June 1, 2011

Ambiguous Code in SOLiD Color Space

An Example from a SOLiD Color Space aligned SAM file:


Here "CS:Z" field gives the color read sequence "T3.33". The first character after the "CS:Z:" is always "T", after that, all valid code should be range from 0 to 3.

0 -this base is same as previous base
1 - this base is the transversion of previous base
2 - this base is the transition of previous base
3 - this base is the complement of previous base

However sometimes we can observe ambiguous code like "." or "4". You may notice the 3rd base is not in 0~3, it is a "." - an ambiguous code indicates an ambiguous base, could be any base of A, C, G and T.

This ambiguous code will fail the GATK's "CountCovariates" tool. It will complain

Error Caused by: org.broadinstitute.sting.utils.exceptions.UserException$MalformedBAM: SAM/BAM file SAMFileReader{/home/u0592675/A232/441_10xF3.mate.dup.sort.realign.bam}
is malformed: Unrecognized color space in SOLID read, color = . Unfortunately this bam file can not be recalibrated without full color space information because of potential reference bias.

In a simple test, around 1% (58297 out of 5275136) alignments has at least one ambiguous code. So here we have two options to eliminate this problem:

1. make a script to change all ambiguous codes into random valid codes. It looks like:

m = re.compile("\.")

where CS_Z_STRING is like above example.

However, this may bring a serious problem because all bases execpt for the 1st base are depending on previous base. If we choose a wrong code which probability is 3/4, then all the following bases will be wrong.

2. delete the affected alignments

Considering the small percentage (1%) of ambiguous code affected alignments, it may be wise to delete those troublesome lines.

m1 = re.compile("CS:Z:T[0-3]*?[^0-3\s]")
def delete_ambiguous_line(fin,fout):
fo = file(fout,'w')
i = j = 0
for line in file(fin):
line = line.strip()
if line.startswith('@'):
nline = line
j += 1
i += 1
nline = line
print >>fo,nline
print "Fix ambiguous code - Total: %d, Deleted: %d, Percent of Deleted: %d%%" % (i+j,j,j*100/(i+j))

Friday, April 8, 2011

merge sam files by "cat"

tail -n +29 7550X2_1.sam > 1.sam
tail -n +29 7550X2_2.sam > 2.sam
tail -n +29 7550X2_3.sam > 3.sam
tail -n +29 7550X2_4.sam > 4.sam
tail -n +29 7550X2_5.sam > 5.sam
head -n 28 1.sam > header.txt
cat header.txt 1.sam 2.sam 3.sam 4.sam 5.sam > hello.sam
picard-tools-1.42/SortSam.jar INPUT=hello.sam OUTPUT=hello.sorted.sam CREATE_INDEX=false SO=coordinate COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=7500000 TMP_DIR=.
picard-tools-1.42/SortSam.jar INPUT=hello.sorted.sam OUTPUT=hello.sorted.bam CREATE_INDEX=true SO=coordinate COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=7500000 TMP_DIR=.

Wednesday, April 6, 2011

novoalign without "-k"(left) and with "-k" (right)

# Hash length: 13 # Hash length: 13
# Step size: 1 # Step size: 1
# Paired Reads: 96409627 # Paired Reads: 96409627
# Pairs Aligned: 62936012 # Pairs Aligned: 62971552
# Read Sequences: 192819254 # Read Sequences: 192819254
# Aligned: 128634931 # Aligned: 128720259
# Unique Alignment: 126260501 # Unique Alignment: 126343058
# Gapped Alignment: 2033737 # Gapped Alignment: 2039288
# Quality Filter: 1530071 # Quality Filter: 1530624
# Homopolymer Filter: 48808 # Homopolymer Filter: 48964
# Elapsed Time: 9737.411 (sec.) # Elapsed Time: 10373.065 (sec.)
# CPU Time: 3654.4 (min.) # CPU Time: 3922.4 (min.)
# Fragment Length Distribution # Fragment Length Distribution

Tuesday, April 5, 2011

A ZS field in Sam File from novoalign

ZS:Z: Novoalign alignment status. Not present for unique alignments.
ZS:Z:R Multiple alignments with similar score were found.
ZS:Z:QC The read was not aligned as it bases qualities were too low or it was a homopolymer read.
ZS:Z:NM No alignment was found.
ZS:Z:QL An alignment was found but it was below the quality threshold.

ZN:i: Number of alignments to read. Only present if there was more than one alignment.
ZO:Z: Indictaes long or short insert fragment for mate pair alignments when short insert has been enabled.
'+' Indicates pair was aligned as a short insert fragment.
'+' Pair was aligned as a long insert fragment.
This tag is only present for Illumina mate pairs when a short fragment length size has been specified with the i option and reads are aligned as a proper pair .
ZH:i: Hairpin score for miRNA a
ZL:i: In miRNA mode this is the alignment location for adjacent opposite strand alignment.


ZS:Z:R ZN:i:2 Indicates a read aligned to two locations.
ZS:Z:NM No alignment was found.
ZS:Z:QC The read failed quality checks and was not aligned.

Friday, April 1, 2011


sed -i 's/YING\../YING/g' 7550x1.merged.sam
sed -i 's/novoalign\../novoalign/g' 7550x1.merged.sam
time java -jar -Xmx30g ~/tool/picard-tools-1.41/SortSam.jar INPUT=7550x1.merged.sam OUTPUT=7550X1.merged.bam CREATE_INDEX=true SO=coordinate COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=7500000 TMP_DIR=.
time java -jar -Xmx30g ~/tool/gatk/dist/GenomeAnalysisTK.jar -T UnifiedGenotyper -R chr20.fa -B:dbsnp,VCF dbsnp132.human.vcf -o 7550X1.snps.raw.vcf -I 7550X1.merged.bam -stand_call_conf 50.0 -stand_emit_conf 20.0 -dcov 300
time java -jar -Xmx32g ~/tool/gatk/dist/GenomeAnalysisTK.jar -T VariantFiltration -R chr20.fa -B:variant,VCF 7550X1.snps.raw.moab.vcf -o 7550X1.snps.raw.moab.homosnp.vcf --maskName InDel --clusterWindowSize 10 --filterExpression "AF<1.0 || DP<10 || DP>100" --filterName "AF<1.0 or DP<10"
grep -c "PASS" 7550X1.snps.raw.moab.homosnp.vcf


#see all unique IP addresses connected to the server
netstat -nat | awk '{ print $5}' | cut -d: -f1 | sed -e '/^$/d' | uniq

#if your server is under a DOS attack
netstat -anp |grep 'tcp\|udp' | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -n

Thursday, March 31, 2011

phased genotype

A phased genotype indicates that for heterozygous loci it is known which chromosome each variant
sequence belongs to. That is, if one variant has a Variant_seq=A,C attribute and another variant on the same landmark
feature (chromosome) has a Variant_seq=T,G attribute then the A and T occur together on one copy of the chromosome and the C and G
occur together on the other copy of that chromosome.

Thursday, March 24, 2011

VCF manipulation (to be edited)

#how many raw snps?
$awk '!/#/ {print $0}' 7550X1.snps.raw.vcf | wc -l


#filter dbSNP member and retain PASSed SNPs.
$awk '!/#/ {print $0}' 7550X1.snps.raw.vcf | awk '$3 !~ "rs" {print $0}' | awk '$7 ~ "PASS" {print $0}' > 7550x1.nodbsnp.pass.vcf

#how many novel SNPs?
$wc -l 7550x1.nodbsnp.pass.vcf


$vcf-validator 7550x1.nodbsnp.pass.vcf

#error , no header

$head -n 22 7550X1.snps.raw.vcf > head.txt
$cat head.txt 7550x1.nodbsnp.pass.vcf > 7550x1.vcf

#have to build index!

$bgzip 7550x1.vcf
$tabix -p vcf 7550x.vcf.gz

#the commom snps between x1 and x4 (affected)
$vcf-isec -n =2 7550x1.vcf.gz 7550x4.vcf.gz > x1x4.vcf

$wc -l x1x4.vcf

130480 x1x4.vcf

Tuesday, March 22, 2011

Append .jar to USeq applications

for f in $FILES
echo "Processing $f file..."
# take action on each file. $f store current file name
#cat $f
mv $f $f'.jar'

Monday, March 14, 2011

GNUMAP and novoalign benchmark

7550X1_1_1.txt.gz and 7550X1_1_2.txt.gz, pair-end, 25000 reads, 101bp per read.

chr20.fa, human genome chromosome 20, build hg19.

$gnumap-plain -g chr20.fa -o example.output -a .9 -v 1 --illumina 7550X1_1_1.txt.gz 7550X1_1_2.txt.gz


This is GNUMAP, Version 2.2.0, for public and private use.

Command Line Arguments: gnumap-plain -g chr20.fa -o example.output -a .9 -v 1 --illumina 7550X1_1_1.txt,7550X1_1_2.txt
Verbose: 1
Genome file(s): chr20.fa
Output file: example.output
Sequence file(s): 7550X1_1_1.txt,7550X1_1_2.txt
Align score: 0.9
Number of threads: 1
Mer size: 10
Using jump size of 5
Using Default Alignment Scores
Gap score: -4
Maximum Gaps: 3

Hashing the genome.
[0/1] gen_size=0, my_start=0, my_end=0
Hashing Genome...
Reading: chr20
Size of genome: 63025520
[0/-] Converting to Vector...
[0/-] Trying to create hash of size 1048576
[0/-] Finished create hash.
[0/-] Stats: Total hashes is 59499396, Longest hash is 69516, shortest is 1, and average is 56.743046
[0/-] Trying to create a new genome with a size of 63025520...Success!
[0/-] Trying to malloc 7878190 elements for positions array...Success!
[0/-] Finished Vector Conversion

Time to hash: 12 seconds
Matching 2 file(s):

[-/0] Matching 25000 sequences of: 7550X1_1_1.txt
Reads per processor: 128
[0/0] 33% reads complete
[0/0] 66% reads complete
[0/0] 98% reads complete

[-/0] Matching 25000 sequences of: 7550X1_1_2.txt
[0/0] 0% reads complete
[0/0] 33% reads complete
[0/0] 66% reads complete
[0/0] 98% reads complete
[0/0] 98% reads complete

[0/-] Time since start: 4890.28

[0/-] Printing output.
Finished printing to example.output.sgr

# Total Time: 4890.38 seconds.
# Found 49152 sequences.
# Sequences matched: 13921
# Sequences not matched: 35231
# Output written to example.output
Total wait time is 0.000000

The same data using novoalign(for fairness, remove license file to avoid threading and using unzipped fastq files):

$novoalign -o SAM $'@RG\tID:YING\tPL:ILLUMINA\tLB:LB_TEST\tSM:7851\tCN:HCI' -F ILMFQ -d chr20.nix -f 7550X1_1_1.txt 7550X1_1_2.txt >7550x1.sam

# novoalign (2.07.05 - Nov 29 2010 @ 13:34:51) - A short read aligner with qualities.
# (C) 2008 NovoCraft
# Licensed for evaluation, educational, and not-for-profit use only.
# novoalign -o SAM @RG ID:YING PL:ILLUMINA LB:LB_TEST SM:7851 CN:HCI -F ILMFQ -d chr20.nix -f 7550X1_1_1.txt 7550X1_1_2.txt
# Interpreting input files as Illumina FASTQ, Cassava Pipeline 1.3.
# Index Build Version: 2.7
# Hash length: 14
# Step size: 1
# Paired Reads: 25000
# Pairs Aligned: 368
# Read Sequences: 50000
# Aligned: 767
# Unique Alignment: 749
# Gapped Alignment: 18
# Quality Filter: 490
# Homopolymer Filter: 32
# Elapsed Time: 59.363 (sec.)
# CPU Time: 0.9 (min.)

Monday, March 7, 2011

Loss of function annotation

From 1000 genome project:

Functional annotation of SNPs, short indels and large structural variants was determined
with reference to the GENCODE v3b annotation release (Harrow, Denoeud et al. 2006).
Coding SNPs were mapped on to transcripts annotated as “protein_coding” and
containing an annotated START codon, and classified as synonymous, non-synonymous,
nonsense (stop codon-introducing), stop codon-disrupting or splice site-disrupting
(canonical splice sites). Transcripts labeled as NMD (predicted to be subject to nonsensemediated
decay) were not used. Small deletions predicted to cause a frame-shift and
large deletions predicted to disrupt gene function were also analysed.

Nonsense and splice-disrupting SNPs were flagged as likely representing reference error
or annotation artefacts if the inferred loss-of-function (LOF) allele was also the ancestral
state, or if the reference (non-LOF) allele was not observed in any individual in that
population, and were excluded from the per-individual counts in Table 2. Splice-disrupting
SNPs in non-canonical splice sites were also discarded. We did not consider the frameshift
status of splice-disrupting SNPs due to the challenges of inferring the effects of
removal of splice-donor and acceptor sites on exon structure, but rather treated all such
SNPs as likely to affect gene function.

We classified large deletions as gene-disrupting if they fulfilled the following criteria:

1. Removal of >50% of the coding sequence; or
2. Removal of the gene’s transcriptional start site or start codon; or
3. Removal of an odd number of internal splice sites; or
4. Removal of one or more internal coding exons that would be predicted to generate
a frameshift.

For large deletions with imprecise breakpoints, we conservatively required that the
deletions defined by both the inner and outer confidence intervals would have the same
predicted effect on gene function. For cases with microhomology at the break-point we
treated the breakpoint as falling to the right-hand side of the region of microhomology.
We did not perform functional annotation for large duplications due to the challenges of
inferring functional consequences. The numbers stated in the text should thus be
regarded as a lower bound for the number of observed loss-of-function variants per
individual genome.

However, it should also be noted that the proportion of false positive calls in the LOF class
due to sequencing and annotation errors is expected to be substantially higher than the
genome-wide average. This effect is expected as LOF sites show a low level of true
polymorphism due to selective constraint, meaning that a uniform error rate across the
genome will result in a higher proportion of false positive calls compared to other (more
variable) sites.
Enrichment of false positive calls in LOF variants is most evident in the CHB+JPT
samples, which showed a higher per-individual number of LOF SNPs than other
populations despite a comparable number of synonymous variants (Supplementary Table
11), as well as an unusual peak in the derived allele frequency spectrum (Supplementary
Figure 13). This is likely due to a mild elevation in genome-wide false positive rates for
SNPs in this population compared to other samples, which is then highly enriched at
functionally constrained sites.
To lower the number of false positive indel calls we applied more stringent filters to the
subset of indels that were called in the genome-wide set and were predicted to fall into the
LOF class. The stringent filter requires that the range of positions where an indel would
yield the same alternative haplotype sequence as the original called indel (for instance, in
a repeat, the deletion of any repeat unit would give the same alternative haplotype), plus 4
bases of reference sequence on both sides of this region, was covered by at least one
read on the forward strand, and at least one read on the reverse strand, with at most one
mismatch between the read and the alternative haplotype sequence resulting from the
indel (regardless of base-qualities). This filter removed an excess of 1 bp frameshift
insertions seen in CHB+JPT with respect to CEU in the less stringently filtered genomewide
indel call set, although it is expected to remove a significant number of true positive
calls as well. The indels that pass these stringent filters have been annotated in the
project’s VCF files.
Experimental validation and manual reannotation of identified LOF variants is currently
ongoing (manuscript in preparation).
For extrapolating the functional variants identified per individual in the exon project to the
whole genome (Table 2) we used the ratio of the total coding sequence and splice sites in
the exon-capture target regions (1,423,559 bp and 7,513, respectively) to the
corresponding numbers for the GENCODE v3b annotation set as a whole (35,676,620 bp
and 384,439, respectively).

The coordinates and predicted functional consequences of all of the LOF variants
identified in the project are available on the 1000 Genomes FTP site.

Wednesday, February 23, 2011

Tomato tutorial

Suppose you have a folder named "test" which looks like:



-rw-r--r-- 1 hiseq hiseq 2145917 Feb 23 11:05 7550X1_1_1.txt.gz
-rw-r--r-- 1 hiseq hiseq 2212841 Feb 23 11:05 7550X1_1_2.txt.gz
-rw-r--r-- 1 hiseq hiseq 2114459 Feb 23 11:05 7550X1_2_1.txt.gz
-rw-r--r-- 1 hiseq hiseq 2135003 Feb 23 11:05 7550X1_2_2.txt.gz
-rw-r--r-- 1 hiseq hiseq 2129933 Feb 23 11:05 7550X1_3_1.txt.gz
-rw-r--r-- 1 hiseq hiseq 2168390 Feb 23 11:05 7550X1_3_2.txt.gz
-rw-r--r-- 1 hiseq hiseq 2147386 Feb 23 11:08 7550X2_5_1.txt.gz

Now you want to do a alignment against "humen genome build 19" using these fastq files.
Following two steps to do this:

1. Create a "cmd.txt" file in this folder with two lines:
$cat > cmd.txt

echo 'hello,world'

The first line starts with '#e' gives your emails address if you want to be notified once the job is done. It is optional.

The second line starts with '@' gives a keywords based commands. It will be translated into:
"I want to do a alignment against hg19.nix(a novoalign indexed human genome) using all fastq files in this folder, the output should be in .bam format."
If you want the output in .sam format, use "@hg19.nix,sam".
All the fastq files are automatically paired. For above example, if will be:

7550X1_1_1.txt.gz and 7550X1_1_2.txt.gz

7550X1_2_1.txt.gz and 7550X1_2_2.txt.gz

7550X1_3_1.txt.gz and 7550X1_3_2.txt.gz


so the pseudo commands to be executed on the cluster machines will be like:

novoalign -o SAM -d hg19.nix -f 7550X1_1_1.txt.gz 7550X1_1_2.txt.gz > 7550X1_1.sam 2>7550X1_1.err
novoalign -o SAM -d hg19.nix -f 7550X1_2_1.txt.gz 7550X1_2_2.txt.gz > 7550X1_2.sam 2>7550X1_2.err
novoalign -o SAM -d hg19.nix -f 7550X1_3_1.txt.gz 7550X1_3_2.txt.gz > 7550X1_3.sam 2>7550X1_3.err
novoalign -o SAM -d hg19.nix -f 7550X2_5_1.txt.gz > 7550X2_5.sam 2>7550X2_5.err
echo "hello,world"
*All lines that do not start with '#' or '@' are just normal Linux commands (in this case, echo 'hello,world' and date ) you want to execute.

2. Create an empty "b" file to tell the tomato this job is ready to "begin".
>touch b

'''Further Reading'''

The "keyword based commands" was designed to provide an easy way to do simple jobs which share standard procedures.
At now, the keywords only recognize "bam","sam" and any words with ".nix". Next, "snp" and "indel" are under development and test.
For example, a command like:

'''@hg19.nix, snp'''

will be translated into a standard procedure which comprise of many steps (commands) to call snp variants:

remove duplicates...
re-align around indels...
calibrate quality score...
merge lane samples....
call snp...
filter snp calls....

However, you can always use your own commands in "cmd.txt" if you want (and you know how):

#i 7550X1_1_1.txt.gz , 7550X1_1_1.txt.gz
novoalign -o SAM $'@RG\tID:USER\tPL:ILLUMINA\tLB:LIBRARY\tSM:7550X1\tCN:HCI' -d hg19.nix -f 7550X1_1_1.txt.gz 7550X1_1_1.txt.gz > 7550X1_1.sam 2>7550X1_1.err
SortSam.jar INPUT=7550X1_1.sam OUTPUT=7550X1_1.bam CREATE_INDEX=true SO=coordinate
GenomeAnalysisTK.jar -T RealignerTargetCreator -R hg19.fasta -I 7550X1_1.bam -B:indels,VCF dbsnp132.human.vcf -o 7550X1_1.intervals

*The second line starts with '#i' gives the input files(seperated by comma ",") to be used in your commands. In the case, the tomato will download these specified files only.
If '#i' was not given, then all files under this folder will be downloaded into cluster machine as possible input files.

[to be continued]

Monday, February 7, 2011

Craete GATK compatible dbSNP file

$ lftp
$ gunzip 00-All.vcf.gz
$ grep -c '#' 00-All.vcf

$ tail -n+58 00-All.vcf | awk '$0 !~ "_random" {print}' | awk '$0 !~ "_hap" {print}'| awk '$0 !~ "PAR" {print}' | awk '{sub(/MT/, "M", $0); print}' | awk '{print "chr" $0}'> dbsnp132.human.body.vcf

$ awk '{print $1}' dbsnp132.human.body.vcf | sort -u


$ head -n 57 00-All.vcf > dbsnp132.human.head.vcf

$ cat dbsnp132.human.head.vcf dbsnp132.human.body.vcf > dbsnp132.human.vcf

$ rm dbsnp132.human.body.vcf dbsnp132.human.head.vcf

Friday, February 4, 2011


svn checkout tropico

Tuesday, February 1, 2011


gunzip *.gz
cat `ls *.txt | sort -g` > seq.txt
gzip -c seq.txt > 7550X1_B20AAHABXX_1_1.txt.gz

Tuesday, January 11, 2011

Humen Genome hg19 size

3095693983 bp

If we need a coverage of 30x on a 100-bp reads, then we need

3095693983*30/100 = 928708195 reads