Friday, December 16, 2011

Hemizygous

Describes an individual who has only one member of a chromosome pair or chromosome segment rather than the usual two;
refers in particular to X-linked genes in males, who under usual circumstances have only one X chromosome.

Friday, December 2, 2011

Minus and plus strand

ACTGGCGGCCATGGAACTCACCG

X - 15353623 15353645

So this sequence mapped to the minus strand.

To match the plus strand, first reverse the sequence to:

GCCACTCAAGGTACCGGCGGTCA

Then complement the reversed sequence to:

CGGTGAGTTCCATGGCCGCCAGT

Then it will be mapped to:

X + 15353623 15353645
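
A minimal Python sketch of this reverse-then-complement step (my own illustration, using the example sequence above):

COMP = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def revcomp(seq):
    #reverse the sequence, then complement each base
    return ''.join(COMP[b] for b in reversed(seq))

print revcomp('ACTGGCGGCCATGGAACTCACCG')   #-> CGGTGAGTTCCATGGCCGCCAGT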

Monday, November 14, 2011

Trac and Mercurial

#mkdir tool; cd tool

#Python-2.7.2
wget http://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz
tar -zxvf Python-2.7.2.tgz
cd Python-2.7.2
./configure --prefix=/home/u0592675/tool/python-2.7.2
make
make install


#apache
wget http://www.eng.lsu.edu/mirrors/apache//httpd/httpd-2.2.21.tar.gz
tar -zxvf httpd-2.2.21.tar.gz
cd httpd-2.2.21
./configure --prefix=/home/u0592675/tool/apache-2.2.21
make
make install

#mod_wsgi
wget http://modwsgi.googlecode.com/files/mod_wsgi-3.3.tar.gz
tar -zxvf mod_wsgi-3.3.tar.gz
cd mod_wsgi-3.3
./configure --with-apxs=/home/u0592675/tool/apache-2.2.21/bin/apxs --with-python=/home/u0592675/tool/python-2.7.2/bin/python
make
make install


#Trac
wget http://ftp.edgewall.com/pub/trac/Trac-0.12.2.tar.gz
wget http://peak.telecommunity.com/dist/ez_setup.py
~/tool/python-2.7.2/bin/python ez_setup.py



#initialize and deploy a Trac environment
trac-admin /home/ying/mercurial_database/tomato initenv
trac-admin /home/ying/mercurial_database/tomato deploy /home/ying/web/tomato_trac


The WSGI environ dict dumped while debugging the trac.wsgi deployment (the object values, e.g. 'wsgi.input', were lost when pasting):

{'mod_wsgi.listener_port': '80',
 'HTTP_COOKIE': 'trac_form_token=0b8daa13a5ec1c5f4819083c; trac_session=80547f6793f2d13cbfea5b1c; trac_form_token=a30953232455e34b94f82aa5; trac_session=d5bdc2b3269ef9a4001807dd',
 'mod_wsgi.listener_host': '',
 'SERVER_SOFTWARE': 'Apache/2.2.21 (Unix) mod_wsgi/3.3 Python/2.7.2+',
 'SCRIPT_NAME': '/tomato',
 'mod_wsgi.handler_script': '',
 'SERVER_SIGNATURE': '',
 'REQUEST_METHOD': 'GET',
 'PATH_INFO': '/',
 'SERVER_PROTOCOL': 'HTTP/1.1',
 'QUERY_STRING': '',
 'HTTP_ACCEPT_CHARSET': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
 'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.106 Safari/535.2',
 'HTTP_CONNECTION': 'keep-alive',
 'SERVER_NAME': '155.100.213.215',
 'REMOTE_ADDR': '155.100.213.215',
 'mod_wsgi.request_handler': 'wsgi-script',
 'wsgi.url_scheme': 'http',
 'PATH_TRANSLATED': '/home/ying/tool/apache-2.2.21/htdocs/index.html',
 'SERVER_PORT': '80',
 'wsgi.multiprocess': True,
 'mod_wsgi.input_chunked': '0',
 'SERVER_ADDR': '192.168.247.129',
 'DOCUMENT_ROOT': '/home/ying/tool/apache-2.2.21/htdocs',
 'mod_wsgi.process_group': '',
 'SCRIPT_FILENAME': '/home/ying/trac/cgi-bin/trac.wsgi',
 'SERVER_ADMIN': 'you@example.com',
 'wsgi.input': ,
 'HTTP_HOST': '155.100.213.215',
 'wsgi.multithread': False,
 'mod_wsgi.callable_object': 'application',
 'REQUEST_URI': '/tomato/',
 'HTTP_ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 'wsgi.version': (1, 1),
 'GATEWAY_INTERFACE': 'CGI/1.1',
 'wsgi.run_once': False,
 'wsgi.errors': ,
 'REMOTE_PORT': '6064',
 'HTTP_ACCEPT_LANGUAGE': 'en-US,en;q=0.8',
 'mod_wsgi.version': (3, 3),
 'mod_wsgi.application_group': '',
 'mod_wsgi.script_reloading': '1',
 'wsgi.file_wrapper': ,
 'HTTP_ACCEPT_ENCODING': 'gzip,deflate,sdch'}

[Tue Nov 08 11:15:24 2011] [error]

prepare GATK resource for pipeline

Three VCF Files for known SNPs from GATK resource

1000G_omni2.5.hg19.sites.vcf -> hg19.1000G_omni2.5.vcf
hapmap_3.3.hg19.sites.vcf -> hg19.hapmap_3.3.vcf
dbsnp_132.hg19.vcf -> hg19.dbsnp_132.vcf

Also copy corresponding ".idx" files to DATA_PATH


mv 1000G_omni2.5.hg19.sites.vcf hg19.1000G_omni2.5.vcf
mv hapmap_3.3.hg19.sites.vcf hg19.hapmap_3.3.vcf
mv dbsnp_132.hg19.vcf hg19.dbsnp_132.vcf

##### ERROR MESSAGE: Input files /scratch/local/4/hg19.dbsnp_132.vcf and reference have incompatible contigs: Order of contigs differences, which is unsafe.
##### ERROR /scratch/local/4/hg19.dbsnp_132.vcf contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]
##### ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]


1. get "dbsnp_132.hg19.vcf.gz" from "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/1.2/hg19/"

2. gunzip dbsnp_132.hg19.vcf.gz

3. awk '/^chrM/ {print $0}' dbsnp_132.hg19.vcf > m.vcf

4. awk '/^chrM/ {next} {print $0}' dbsnp_132.hg19.vcf > wom.vcf

5. cat m.vcf >> wom.vcf

6. mv wom.vcf hg19.dbsnp_132.vcf

7. vcf-validator hg19.dbsnp_132.vcf
The header tag contig already exists, ignoring.
...same as above...
Warning: Duplicate entries, for example chr1:120995

8. vcf-validator dbsnp_132.hg19.vcf
The header tag contig already exists, ignoring.
...same as above...
Warning: Duplicate entries, for example chrM:16189


9. /home/u0592675/vcftools_0.1.7/bin/vcf-validator hg19.dbsnp_132.vcf

10. /home/u0592675/vcftools_0.1.7/bin/vcf-validator dbsnp_132.hg19.vcf
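
The awk shuffle in steps 3-6 just moves the chrM records from the front of the body to the end. A more general Python sketch (my own, not part of GATK or vcftools) that reorders a VCF body to match the reference contig order shown in the error message:

#reorder VCF records so chrM comes last, matching the reference .dict
ORDER = ['chr%s' % c for c in range(1, 23) + ['X', 'Y', 'M']]

def reorder_vcf(fin, fout):
    header, body = [], dict((c, []) for c in ORDER)
    for line in file(fin):
        if line.startswith('#'):
            header.append(line)
        else:
            #assumes every record's contig appears in ORDER
            body[line.split('\t')[0]].append(line)
    fo = file(fout, 'w')
    for line in header + [l for c in ORDER for l in body[c]]:
        fo.write(line)
    fo.close()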



awk '/^chr/ {print $1}' /tomato/data/hg19.dbsnp_132.vcf | sort |uniq
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY
chrM

awk '/^chr/ {print $1}' hg19.1000G_omni2.5.vcf | sort |uniq
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY


awk '/^chr/ {print $1}' hg19.hapmap_3.3.vcf | sort |uniq
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX


1.
awk '/^chrM/ {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > m.vcf
awk '/^chrM/ {next} {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > wom.vcf
cat m.vcf >> wom.vcf
rm m.vcf
mv wom.vcf hg19.dbsnp_132.vcf

2.
vim hg19.dbsnp_132.vcf
delete all ##contig header lines for contigs that are not present in the body

(Apply the same header fix to hg19.hapmap_3.3.vcf.)


hg push ssh://hiseq@hci-bio2.hci.utah.edu:/home/hiseq/tomato/
hg clone http://hci-bio2.hci.utah.edu:8011 tomato

Thursday, November 3, 2011

Got new genome fasta file?

#1. hg19.fasta.fai
samtools faidx hg19.fasta

#2. hg19.dict
java -Xmx2g -jar ~/tool/picard/CreateSequenceDictionary.jar R=hg19.fasta O=hg19.dict

Tuesday, November 1, 2011

Mercurial

hg push ssh://hiseq@bio2:/home/hiseq/tomato/
hg clone http://bio2:1234 tomato

prepare dbsnp132

1.
awk '/^chrM/ {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > m.vcf
awk '/^chrM/ {next} {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > wom.vcf
cat m.vcf >> wom.vcf
rm m.vcf
mv wom.vcf hg19.dbsnp_132.vcf

2.
vim hg19.dbsnp_132.vcf
delete all ##contig header lines for contigs that are not present in the body

Monday, October 31, 2011

Resource Bundle

ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/1.2

Thursday, October 27, 2011

Known SNPs

Three VCF Files for known SNPs from GATK resource

1000G_omni2.5.hg19.sites.vcf -> hg19.1000G_omni2.5.vcf
hapmap_3.3.hg19.sites.vcf -> hg19.hapmap_3.3.vcf
dbsnp_132.hg19.vcf -> hg19.dbsnp_132.vcf

Also copy corresponding ".idx" files to DATA_PATH

Friday, October 14, 2011

I am running out of disk space

again again again again again again again again again again again again again again again again ...

Multiple hits? All or None?

#-r All
57G Oct 14 09:44 1.sam
65G Oct 14 10:06 2.sam
65G Oct 14 10:26 3.sam
45G Oct 14 10:46 4.sam
56G Oct 14 11:09 5.sam


#-r None
56G Oct 11 02:10 1.sam
64G Oct 11 02:32 2.sam
63G Oct 11 02:53 3.sam
44G Oct 11 03:07 4.sam
54G Oct 11 03:25 5.sam

Thursday, October 13, 2011

pre-processing of masked FASTA for novoalign

1. Convert lower case to upper case
sed -i 's/\(.*\)/\U\1/' hg19.fasta

> chr1
NNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNN
ggggggttttTGaaaaaaaCCC

Will be


> CHR1
NNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNN
GGGGGGTTTTTGAAAAAAACCC

2. convert '>CHR' to '>chr'
sed -i -e 's/>CHR/>chr/g' hg19.fasta

3. convert N to n
sed -i -e 's/N/n/g' hg19.masked.fasta

4. build the index with masking (-m)
novoindex -m hg19.nix hg19.fasta

Monday, September 26, 2011

Keyless ssh connection between A and B

A:
ssh-keygen -t rsa
...
...

Then scp the "id_rsa.pub" to B


B:
cat id_rsa.pub >> .ssh/authorized_keys
chmod 644 .ssh/authorized_keys


Now connection from A to B is keyless.

Friday, September 16, 2011

ACL Setup

On the job side, I need a root-like user who has write permission on all sub-folders under PATH_JOB, while each individual user has only 744-like permissions on his/her own job folders. However, standard Unix/Linux permissions have no fine-grained per-user control over a file or directory, so you cannot assign different permissions to different users. ACLs (Access Control Lists) can be used to solve this problem.

Here is a simple test on my VM_Ubuntu_10.10:

1. List the partitions; /dev/sda1 holds the root directory "/"
# blkid
/dev/sda1: UUID="xxxx" TYPE="ext4"
/dev/sda5: UUID="yyyy" TYPE="swap"

2. Modify the /etc/fstab to support ACL by inserting "acl" into options.
#vim /etc/fstab
UUID=xxxx / ext4 errors=remount-ro,acl 0 1

3. Remount the root directory
#mount / -o remount

4. Make a test
#setfacl -m u:abc:rw hello.txt
#getfacl hello.txt

# file: hello.txt
# owner: root
# group: root
user::rw-
user:abc:rw-
group::rw-
mask::rw-
other::---

5. Now user "abc" can read and write "hello.txt"!

#ls -l
-rw-rw----+ 1 root root 0 2011-09-16 11:41 hello.txt

Note the new "+" character at the end of the permission field. It shows that this file/folder carries an ACL.

Friday, August 19, 2011

Merge the SAM/BAM file

Suppose I have two alignments in SAM format (1.sam and 2.sam). Since they are two lanes from the same sample, I should merge them into one.


>wc -l 1.sam
10000

>wc -l 2.sam
10000

So both SAM files have 10000 lines (including the header).

First using picard's "MergeSamFiles.jar"

>java -jar MergeSamFiles.jar I=1.sam I=2.sam O=12.sam
>wc -l 12.sam
19973

The missing 27 lines are duplicated header lines, which Picard removed automatically. However, there is a subtle bug here.

>less 12.sam

@HD VN:1.0 GO:none SO:coordinate
@SQ SN:chr1 LN:249250621 AS:hg19.nov.illumina.nix
@SQ SN:chr2 LN:243199373 AS:hg19.nov.illumina.nix
@SQ SN:chr3 LN:198022430 AS:hg19.nov.illumina.nix
@SQ SN:chr4 LN:191154276 AS:hg19.nov.illumina.nix
@SQ SN:chr5 LN:180915260 AS:hg19.nov.illumina.nix
@SQ SN:chr6 LN:171115067 AS:hg19.nov.illumina.nix
@SQ SN:chr7 LN:159138663 AS:hg19.nov.illumina.nix
@SQ SN:chr8 LN:146364022 AS:hg19.nov.illumina.nix
@SQ SN:chr9 LN:141213431 AS:hg19.nov.illumina.nix
@SQ SN:chr10 LN:135534747 AS:hg19.nov.illumina.nix
@SQ SN:chr11 LN:135006516 AS:hg19.nov.illumina.nix
@SQ SN:chr12 LN:133851895 AS:hg19.nov.illumina.nix
@SQ SN:chr13 LN:115169878 AS:hg19.nov.illumina.nix
@SQ SN:chr14 LN:107349540 AS:hg19.nov.illumina.nix
@SQ SN:chr15 LN:102531392 AS:hg19.nov.illumina.nix
@SQ SN:chr16 LN:90354753 AS:hg19.nov.illumina.nix
@SQ SN:chr17 LN:81195210 AS:hg19.nov.illumina.nix
@SQ SN:chr18 LN:78077248 AS:hg19.nov.illumina.nix
@SQ SN:chr19 LN:59128983 AS:hg19.nov.illumina.nix
@SQ SN:chr20 LN:63025520 AS:hg19.nov.illumina.nix
@SQ SN:chr21 LN:48129895 AS:hg19.nov.illumina.nix
@SQ SN:chr22 LN:51304566 AS:hg19.nov.illumina.nix
@SQ SN:chrX LN:155270560 AS:hg19.nov.illumina.nix
@SQ SN:chrY LN:59373566 AS:hg19.nov.illumina.nix
@SQ SN:chrM LN:16571 AS:hg19.nov.illumina.nix
@RG ID:TOMATO PL:ILLUMINA LB:LIBTMP SM:SAMPLE
@PG ID:novoalign VN:V2.07.10 CL:novoalign
@PG ID:novoalign.1 VN:V2.07.10 CL:novoalign



The last two lines are both @PG lines that differ only in ID. Before merging, both IDs were "novoalign"; after merging, the duplicated ID was renamed to "novoalign.1" (if merging 3 files, you will have "novoalign.2" as well). If you check an individual read alignment, you will see its PG:Z field referencing one of these @PG lines. When you call variants later, the genotyper will process the merged file as two separate files, which is not what you want.


To fix this:

1. Keep the first @PG line and delete all other @PG lines (a sketch follows below).
2. Replace all "novoalign.X" with "novoalign".
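
A rough Python sketch of this two-step fix (my own; it assumes the merged header looks like the example above, with IDs "novoalign", "novoalign.1", ...):

import re

def fix_pg_lines(fin, fout):
    #keep the first @PG header line, drop the renamed duplicates,
    #and point every PG:Z:novoalign.N tag back at PG:Z:novoalign
    fo = file(fout, 'w')
    seen_pg = False
    for line in file(fin):
        if line.startswith('@PG'):
            if seen_pg:
                continue
            seen_pg = True
        else:
            line = re.sub(r'PG:Z:novoalign\.\d+', 'PG:Z:novoalign', line)
        fo.write(line)
    fo.close()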

Or you can use samtools.

>java -jar SortSam.jar INPUT=1.sam OUTPUT=1.bam SO=unsorted
>java -jar SortSam.jar INPUT=2.sam OUTPUT=2.bam SO=unsorted
>samtools merge 3.bam 1.bam 2.bam
>samtools view -h -o 3.sam 3.bam
>less 3.sam

Samtools uses only the first input's header in the merged file, so it is OK in this case.








Tuesday, July 12, 2011

mouse dbSNP VCF

OK, I spent a few hours searching for data and writing a script to make a dbSNP VCF for mouse. The steps are simple and dirty, and this VCF file has not been tested yet. It is put here for reference only.

******************************************************************************

#1. download dbSNP mouse
wget --continue --no-host-directories --no-directories --preserve-permissions ftp://ftp.ncbi.nih.gov/snp/organisms/mouse_10090/chr_rpts/*

#2. unzip
gunzip *.gz

#3. exclude all SNPs that are not mapped to mm9 (MGSCv37) genome.
for i in `ls`; do grep 'MGSCv37' "$i" > mm9."$i" ; done


#However, you can NOT find the REF/ALT fields in the above files, so we must get them from other files.
#4. download dbSNP mouse with genotype
wget --continue --no-host-directories --no-directories --preserve-permissions ftp://ftp.ncbi.nih.gov/snp/organisms/mouse_10090/genotype/*

#5. unzip
gunzip *.gz

#6. check out these files; they should end with ".xml"
ls

Black6GtyFromContig gt_chr13.xml gt_chr17.xml gt_chr2.xml gt_chr6.xml gt_chrAltOnly.xml gt_chrUn.xml
gt_chr10.xml gt_chr14.xml gt_chr18.xml gt_chr3.xml gt_chr7.xml gt_chrMT.xml gt_chrX.xml
gt_chr11.xml gt_chr15.xml gt_chr19.xml gt_chr4.xml gt_chr8.xml gt_chrMulti.xml gt_chrY.xml
gt_chr12.xml gt_chr16.xml gt_chr1.xml gt_chr5.xml gt_chr9.xml gt_chrNotOn.xml README

#7. make a 3-column file: "id REF ALT"
#   (the original grep pattern was garbled when pasted; something like the
#   following pulls the rsId= and observed= attributes out of the XML)
grep -oh 'rsId=[0-9]* observed=[^" ]*' gt_chr*.xml | sed 's/rsId=//g' | sed 's/observed=//g' | awk -F '/' '{print $1, $2}' > id.allele.txt



#8. write a simple python script that mapping id from mm9.chr_N.txt to make the final VCF file - "mm9.dbsnp.vcf"

import os

def mm9():
    #rsId -> (REF, ALT) map built from step 7
    m = {}
    for line in file('id.allele.txt'):
        line = line.strip()
        toks = line.split(' ')
        if len(toks) == 3:
            m[toks[0]] = (toks[1], toks[2])

    #chr1 - chr19, chr X, Y and MT only.
    nfs = [str(i) for i in range(1, 20)]
    nfs.append('X')
    nfs.append('Y')
    nfs.append('MT')

    fo = file('mm9.dbsnp.vcf', 'w')
    print >>fo, '##fileformat=VCFv4.0'
    print >>fo, '##fileDate=20110712'
    print >>fo, '##source=dbSNP'
    print >>fo, '##dbSNP_BUILD_ID=132'
    print >>fo, '##reference=mm9'
    print >>fo, '#CHROM POS ID REF ALT QUAL FILTER INFO'

    total = 0
    for n in nfs:
        sm = {}
        fn = os.path.join('mm9.chr_%s.txt' % n)
        for line in file(fn):
            line = line.strip()
            toks = line.split('\t')
            try:
                rs_id = toks[0]
                mapped = toks[1]
                withdraw = toks[2]
                chr = toks[6]
                position = toks[11]
                if m.has_key(rs_id) and mapped == '2' and withdraw == '0':
                    ref, alt = m[rs_id]
                    #use an arbitrary "ms"+ID as the ID (instead of "rs")
                    sm[int(position)] = [chr, position, 'ms'+rs_id, ref, alt, '.', 'PASS', '.']
            except Exception:
                pass   #skip malformed lines
        sk = sorted(sm.keys())
        for k in sk:
            total += 1
            print >>fo, '\t'.join(sm[k])
    fo.close()
    print total

mm9()



#9. make sure no duplicated IDs
awk '{print $3}' mm9.dbsnp.vcf | sort| uniq -d



13745646
13501994

Monday, June 20, 2011

Indel filter

java -Xmx32g -jar $HOME/tool/gatk/GenomeAnalysisTK.jar -T VariantFiltration -R $HOME/data/hg19.fasta \
-o x1.indel.filtered.vcf \
-B:variant,VCF x1.indel.vcf \
-B:mask,VCF ../SNP/x1.snp.vcf \
--maskName SNP \
--filterExpression "QUAL < 30 || SB > -1.0 || QD < 2.0" \
--filterName "QUAL<30;SB>-1.0;QD<2"

SNP Filter

#uses SB >= 0.10 filter
#uses QD < 5.0 filter
#uses HRun >= 4 filter
#uses clustered SNPs (>= 3 SNPs within 10 bp) filter
#masks out SNPs that overlap the filtered indels

java -Xmx32g -jar $HOME/tool/gatk/GenomeAnalysisTK.jar -T VariantFiltration -R $HOME/data/hg19.fasta \
-o x1.snp.filtered.vcf \
-B:variant,VCF x1.snp.vcf \
-B:mask,VCF ../INDEL/x1.indel.vcf \
--maskName InDel \
--clusterSize 3 \
--clusterWindowSize 10 \
--filterExpression "SB >= 0.10 || QD < 5.0 || HRun >= 4" \
--filterName "SB>=0.1;QD<5.0;HRun>=4;Cluster_3in10;MaskIndel"

#copy header and "PASS" items to "x1.snp.passed.vcf"

awk '/^#/ { print $0;next} $7 == "PASS" {print $0} ' x1.snp.filtered.vcf > x1.snp.passed.vcf



Friday, June 17, 2011

Build Genome Index

#novoalign
#normal
novoindex -k 14 -s 1 hg19.nov.normal.nix hg19.fasta

#solid color space
novoindex -k 14 -s 1 -c hg19.nov.color.nix hg19.fasta

#bisulphite
novoindex -k 14 -s 1 -b hg19.nov.bis.nix hg19.fasta

#normal
bwa index -a bwtsw -p hg19.bwa.normal hg19.fasta

#color
bwa index -a bwtsw -c -p hg19.bwa.color hg19.fasta

wget http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/chr*.fa.gz

novoindex -k 14 -s 1 mm9.nov.normal.nix genomes/mm9/fasta/mm9.fasta
novoindex -k 14 -s 1 -c mm9.nov.color.nix genomes/mm9/fasta/mm9.fasta
novoindex -k 14 -s 1 -b mm9.nov.bis.nix genomes/mm9/fasta/mm9.fasta

#normal
./bwa index -a bwtsw -p mm9.bwa.normal genomes/mm9/fasta/mm9.fasta
./bwa index -a bwtsw -c -p mm9.bwa.color genomes/mm9/fasta/mm9.fasta

Call SNP & Indel

java -Xmx32g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R hg19.fasta -B:dbsnp,VCF dbsnp132.human.vcf -glm BOTH -o X1.vcf -I X1.bam -l INFO -stand_call_conf 20 -stand_emit_conf 10


VAT

http://archive.gersteinlab.org/proj/VAT/


$pwd
$HOME/VAT

#download&install GNU Scientific Library
$wget http://mirror.clarkson.edu/gnu/gsl/gsl-1.15.tar.gz
$sudo tar -zxvf gsl-1.15.tar.gz;cd gsl-1.15; ./configure; make; make install

#download&install gettext
$wget http://ftp.gnu.org/pub/gnu/gettext/gettext-0.18.1.1.tar.gz
$sudo tar -zxvf gettext-0.18.1.1.tar.gz;cd gettext-0.18.1.1;./configure; make; make install

#download GD Library
$wget http://google-desktop-for-linux-mirror.googlecode.com/files/gd-2.0.35.tar.gz
$sudo tar -zxvf gd-2.0.35.tar.gz;cd gd-2.0.35;./configure; make; make install


#libbios
$wget http://homes.gersteinlab.org/people/lh372/VAT/libbios-1.0.0.tar.gz
$sudo tar -zxvf libbios-1.0.0.tar.gz; cd libbios-1.0.0
#note: plain export, not "sudo export" (sudo cannot set your shell's variables)
$export CPPFLAGS="-I$HOME/VAT/libbios-1.0.0/include"
$export LDFLAGS="-L$HOME/VAT/libbios-1.0.0/lib"
$sudo ./configure;make;make install


#twoBitToFa is required for "interval2sequences"
$wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
$chmod +x twoBitToFa
$cp twoBitToFa $HOME/tool/


#download&install VAT package
$wget http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0.tar.gz
$tar -zxvf vat-1.0.0.tar.gz; cd vat-1.0.0
$mkdir tmp
$./configure --prefix=$HOME/VAT/vat-1.0.0/tmp;make;make install
$mv tmp/* $HOME/VAT/tool


wget ftp://ftp.sanger.ac.uk/pub/gencode/release_3c/gencode.v3c.annotation.GRCh37.gtf.gz
gunzip gencode.v3c.annotation.GRCh37.gtf.gz


wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

export LD_LIBRARY_PATH=$HOME/VAT/lib:$LD_LIBRARY_PATH

cat x2.snp.vcf | snpMapper gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > x2.snp.annotated.vcf

cat x2.indel.vcf | indelMapper gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > x2.indel.annotated.vcf

Thursday, June 16, 2011

A Python script to separate SNP and indel calls in one VCF


#!/usr/bin/python
import os

def split_snp_indel(fvcf):
    #note: fvcf.rstrip('.vcf') strips characters, not the suffix
    #(e.g. "xcv.vcf" -> "x"), so use splitext instead
    basename = os.path.splitext(fvcf)[0]
    fsnp = file(basename + '.snp.vcf', 'w')
    findel = file(basename + '.indel.vcf', 'w')
    for line in file(fvcf):
        line = line.strip()
        if line.startswith('#'):
            print >>fsnp, line
            print >>findel, line
        else:
            toks = line.split('\t')
            ref = toks[3]
            alt = toks[4]
            if len(ref) > 1 or len(alt) > 1:
                print >>findel, line
            else:
                print >>fsnp, line

if __name__ == '__main__':
    import sys
    split_snp_indel(sys.argv[1])
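
Usage: save the script as, say, split_snp_indel.py; then "python split_snp_indel.py x1.vcf" writes x1.snp.vcf and x1.indel.vcf.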


Wednesday, June 15, 2011

GATK resource bundle

ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/5974/hg19

Monday, June 13, 2011

update GATK and PICARD

#!/bin/bash
rm -fr app/gatk/*
svn co https://svn.broadinstitute.org/Sting/trunk GATK
cd GATK; ant; cd ..; cp GATK/dist/* app/gatk/; rm -fr GATK
rm -fr app/picard/*
svn co https://picard.svn.sourceforge.net/svnroot/picard/trunk PICARD
cd PICARD; ant; cd ..; cp PICARD/dist/* app/picard/; rm -fr PICARD

Monday, June 6, 2011

Hide a control in Flex

btn_hello.visible = false; 
btn_hello.includeInLayout = false;

Wednesday, June 1, 2011

Ambiguous Code in SOLiD Color Space

An Example from a SOLiD Color Space aligned SAM file:

CS:Z:T3.3300220113.2323222031301323303202033330300301210


Here "CS:Z" field gives the color read sequence "T3.33". The first character after the "CS:Z:" is always "T", after that, all valid code should be range from 0 to 3.

0 - this base is the same as the previous base
1 - this base is a transversion of the previous base
2 - this base is a transition of the previous base
3 - this base is the complement of the previous base

However, sometimes we can observe an ambiguous code like "." or "4". In the example above, the 3rd code is not in 0-3: it is a ".", an ambiguous code indicating an ambiguous base that could be any of A, C, G and T.
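
To make the encoding concrete, here is a small Python sketch of color-space decoding under the four rules above (the transformation table is my own transcription of those rules). It also shows why a single wrong code corrupts every base after it:

#base transformations for each color code, keyed by the previous base
NEXT = {'0': {'A':'A','C':'C','G':'G','T':'T'},   #same
        '1': {'A':'C','C':'A','G':'T','T':'G'},   #transversion
        '2': {'A':'G','G':'A','C':'T','T':'C'},   #transition
        '3': {'A':'T','T':'A','C':'G','G':'C'}}   #complement

def decode_cs(cs):
    #cs is like "T3230...": a primer base followed by color codes
    prev, out = cs[0], []
    for code in cs[1:]:
        prev = NEXT[code][prev]   #each base depends on the one before it
        out.append(prev)
    return ''.join(out)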

This ambiguous code will make GATK's "CountCovariates" tool fail. It will complain:


Error Caused by: org.broadinstitute.sting.utils.exceptions.UserException$MalformedBAM: SAM/BAM file SAMFileReader{/home/u0592675/A232/441_10xF3.mate.dup.sort.realign.bam}
is malformed: Unrecognized color space in SOLID read, color = . Unfortunately this bam file can not be recalibrated without full color space information because of potential reference bias.



In a simple test, around 1% (58297 out of 5275136) of alignments had at least one ambiguous code. So here we have two options to eliminate this problem:

1. make a script to change all ambiguous codes into random valid codes. It looks like:

import re, random
m = re.compile(r"\.")
#note: pass a function so each "." gets its own random code; passing
#random.choice(...) directly would reuse one code for every "."
new_cs = m.sub(lambda x: random.choice(('0', '1', '2', '3')), CS_Z_STRING)

where CS_Z_STRING is a string like the example above.

However, this may bring a serious problem, because every base except the first depends on the previous base. If we choose a wrong code (the probability is 3/4), all of the following bases will be decoded wrongly.

2. delete the affected alignments

Considering the small percentage (~1%) of alignments affected by ambiguous codes, it may be wiser to simply delete those troublesome lines.

import re

m1 = re.compile("CS:Z:T[0-3]*?[^0-3\s]")

def delete_ambiguous_line(fin, fout):
    fo = file(fout, 'w')
    i = j = 0
    for line in file(fin):
        line = line.strip()
        if line.startswith('@'):
            print >>fo, line          #header lines pass through
        elif m1.search(line):
            j += 1                    #drop alignments with an ambiguous code
        else:
            i += 1
            print >>fo, line
    print "Fix ambiguous code - Total: %d, Deleted: %d, Percent of Deleted: %d%%" % (i+j, j, j*100/(i+j))
    fo.close()



Friday, April 8, 2011

merge sam files by "cat"

tail -n +29 7550X2_1.sam > 1.sam
tail -n +29 7550X2_2.sam > 2.sam
tail -n +29 7550X2_3.sam > 3.sam
tail -n +29 7550X2_4.sam > 4.sam
tail -n +29 7550X2_5.sam > 5.sam
head -n 28 1.sam > header.txt
cat header.txt 1.sam 2.sam 3.sam 4.sam 5.sam > hello.sam
java -jar picard-tools-1.42/SortSam.jar INPUT=hello.sam OUTPUT=hello.sorted.sam CREATE_INDEX=false SO=coordinate COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=7500000 TMP_DIR=.
java -jar picard-tools-1.42/SortSam.jar INPUT=hello.sorted.sam OUTPUT=hello.sorted.bam CREATE_INDEX=true SO=coordinate COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=7500000 TMP_DIR=.

Wednesday, April 6, 2011

novoalign without "-k" and with "-k"

                        without -k       with -k
# Hash length:          13               13
# Step size:            1                1
# Paired Reads:         96409627         96409627
# Pairs Aligned:        62936012         62971552
# Read Sequences:       192819254        192819254
# Aligned:              128634931        128720259
# Unique Alignment:     126260501        126343058
# Gapped Alignment:     2033737          2039288
# Quality Filter:       1530071          1530624
# Homopolymer Filter:   48808            48964
# Elapsed Time (sec.):  9737.411         10373.065
# CPU Time (min.):      3654.4           3922.4
# Fragment Length Distribution (reported for both runs)

Tuesday, April 5, 2011

A ZS field in Sam File from novoalign

ZS:Z: Novoalign alignment status. Not present for unique alignments.
ZS:Z:R Multiple alignments with similar score were found.
ZS:Z:QC The read was not aligned as its base qualities were too low or it was a homopolymer read.
ZS:Z:NM No alignment was found.
ZS:Z:QL An alignment was found but it was below the quality threshold.

ZN:i: Number of alignments to the read. Only present if there was more than one alignment.
ZO:Z: Indicates a long or short insert fragment for mate-pair alignments when short insert has been enabled.
'+' Indicates the pair was aligned as a short insert fragment.
'-' Indicates the pair was aligned as a long insert fragment.
This tag is only present for Illumina mate pairs when a short fragment length has been specified with the -i option and the reads are aligned as a proper pair.
ZH:i: Hairpin score in miRNA mode.
ZL:i: In miRNA mode this is the alignment location for the adjacent opposite-strand alignment.


Examples:

ZS:Z:R ZN:i:2 Indicates a read aligned to two locations.
ZS:Z:NM No alignment was found.
ZS:Z:QC The read failed quality checks and was not aligned.


Friday, April 1, 2011

merge

sed -i 's/YING\../YING/g' 7550x1.merged.sam
sed -i 's/novoalign\../novoalign/g' 7550x1.merged.sam
time java -jar -Xmx30g ~/tool/picard-tools-1.41/SortSam.jar INPUT=7550x1.merged.sam OUTPUT=7550X1.merged.bam CREATE_INDEX=true SO=coordinate COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=7500000 TMP_DIR=.
time java -jar -Xmx30g ~/tool/gatk/dist/GenomeAnalysisTK.jar -T UnifiedGenotyper -R chr20.fa -B:dbsnp,VCF dbsnp132.human.vcf -o 7550X1.snps.raw.vcf -I 7550X1.merged.bam -stand_call_conf 50.0 -stand_emit_conf 20.0 -dcov 300
time java -jar -Xmx32g ~/tool/gatk/dist/GenomeAnalysisTK.jar -T VariantFiltration -R chr20.fa -B:variant,VCF 7550X1.snps.raw.moab.vcf -o 7550X1.snps.raw.moab.homosnp.vcf --maskName InDel --clusterWindowSize 10 --filterExpression "AF<1.0 || DP<10 || DP>100" --filterName "AF<1.0 or DP<10"
grep -c "PASS" 7550X1.snps.raw.moab.homosnp.vcf

netstat

#see all unique IP addresses connected to the server (sort before uniq so duplicates are adjacent)
netstat -nat | awk '{ print $5}' | cut -d: -f1 | sed -e '/^$/d' | sort | uniq


#if your server is under a DoS attack
netstat -anp |grep 'tcp\|udp' | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -n

Thursday, March 31, 2011

phased genotype

A phased genotype indicates that for heterozygous loci it is known which chromosome each variant
sequence belongs to. That is, if one variant has a Variant_seq=A,C attribute and another variant on the same landmark
feature (chromosome) has a Variant_seq=T,G attribute then the A and T occur together on one copy of the chromosome and the C and G
occur together on the other copy of that chromosome.

Thursday, March 24, 2011

VCF manipulation (to be edited)

#how many raw snps?
$awk '!/#/ {print $0}' 7550X1.snps.raw.vcf | wc -l

3492007

#filter out dbSNP members and retain PASSed SNPs.
$awk '!/#/ {print $0}' 7550X1.snps.raw.vcf | awk '$3 !~ "rs" {print $0}' | awk '$7 ~ "PASS" {print $0}' > 7550x1.nodbsnp.pass.vcf


#how many novel SNPs?
$wc -l 7550x1.nodbsnp.pass.vcf

312776


$vcf-validator 7550x1.nodbsnp.pass.vcf

#error , no header

$head -n 22 7550X1.snps.raw.vcf > head.txt
$cat head.txt 7550x1.nodbsnp.pass.vcf > 7550x1.vcf

#have to build index!

$bgzip 7550x1.vcf
$tabix -p vcf 7550x1.vcf.gz


#the common SNPs between x1 and x4 (affected)
$vcf-isec -n =2 7550x1.vcf.gz 7550x4.vcf.gz > x1x4.vcf

$wc -l x1x4.vcf

130480 x1x4.vcf

Tuesday, March 22, 2011

Append .jar to USeq applications

FILES=USeq_7.6.2/Apps/*
for f in $FILES
do
echo "Processing $f file..."
# take action on each file. $f store current file name
#cat $f
mv $f $f'.jar'
done

Monday, March 14, 2011

GNUMAP and novoalign benchmark

Data:
7550X1_1_1.txt.gz and 7550X1_1_2.txt.gz, paired-end, 25000 reads, 101 bp per read.

chr20.fa, human genome chromosome 20, build hg19.


Command:
$gnumap-plain -g chr20.fa -o example.output -a .9 -v 1 --illumina 7550X1_1_1.txt.gz 7550X1_1_2.txt.gz



Output:

This is GNUMAP, Version 2.2.0, for public and private use.

Command Line Arguments: gnumap-plain -g chr20.fa -o example.output -a .9 -v 1 --illumina 7550X1_1_1.txt,7550X1_1_2.txt
Parameters:
Verbose: 1
Genome file(s): chr20.fa
Output file: example.output
Sequence file(s): 7550X1_1_1.txt,7550X1_1_2.txt
Align score: 0.9
Number of threads: 1
Mer size: 10
Using jump size of 5
Using Default Alignment Scores
Gap score: -4
Maximum Gaps: 3

Hashing the genome.
[0/1] gen_size=0, my_start=0, my_end=0
chr20.fa
Hashing Genome...
Reading: chr20
..............................................
Size of genome: 63025520
[0/-] Converting to Vector...
[0/-] Trying to create hash of size 1048576
[0/-] Finished create hash.
[0/-] Stats: Total hashes is 59499396, Longest hash is 69516, shortest is 1, and average is 56.743046
[0/-] Trying to create a new genome with a size of 63025520...Success!
[0/-] Trying to malloc 7878190 elements for positions array...Success!
[0/-] Finished Vector Conversion

Time to hash: 12 seconds
Matching 2 file(s):

[-/0] Matching 25000 sequences of: 7550X1_1_1.txt
Reads per processor: 128
[0/0] 33% reads complete
[0/0] 66% reads complete
[0/0] 98% reads complete

[-/0] Matching 25000 sequences of: 7550X1_1_2.txt
[0/0] 0% reads complete
[0/0] 33% reads complete
[0/0] 66% reads complete
[0/0] 98% reads complete
[0/0] 98% reads complete


[0/-] Time since start: 4890.28

[0/-] Printing output.
Finished printing to example.output.sgr

#Finished.
# Total Time: 4890.38 seconds.
# Found 49152 sequences.
# Sequences matched: 13921
# Sequences not matched: 35231
# Output written to example.output
Total wait time is 0.000000



The same data using novoalign (for fairness, the license file was removed to avoid multithreading, and unzipped fastq files were used):

$novoalign -o SAM $'@RG\tID:YING\tPL:ILLUMINA\tLB:LB_TEST\tSM:7851\tCN:HCI' -F ILMFQ -d chr20.nix -f 7550X1_1_1.txt 7550X1_1_2.txt >7550x1.sam

# novoalign (2.07.05 - Nov 29 2010 @ 13:34:51) - A short read aligner with qualities.
# (C) 2008 NovoCraft
# Licensed for evaluation, educational, and not-for-profit use only.
# novoalign -o SAM @RG ID:YING PL:ILLUMINA LB:LB_TEST SM:7851 CN:HCI -F ILMFQ -d chr20.nix -f 7550X1_1_1.txt 7550X1_1_2.txt
# Interpreting input files as Illumina FASTQ, Cassava Pipeline 1.3.
# Index Build Version: 2.7
# Hash length: 14
# Step size: 1
# Paired Reads: 25000
# Pairs Aligned: 368
# Read Sequences: 50000
# Aligned: 767
# Unique Alignment: 749
# Gapped Alignment: 18
# Quality Filter: 490
# Homopolymer Filter: 32
# Elapsed Time: 59.363 (sec.)
# CPU Time: 0.9 (min.)

Monday, March 7, 2011

Loss of function annotation

From 1000 genome project:

Functional annotation of SNPs, short indels and large structural variants was determined
with reference to the GENCODE v3b annotation release (Harrow, Denoeud et al. 2006).
Coding SNPs were mapped on to transcripts annotated as “protein_coding” and
containing an annotated START codon, and classified as synonymous, non-synonymous,
nonsense (stop codon-introducing), stop codon-disrupting or splice site-disrupting
(canonical splice sites). Transcripts labeled as NMD (predicted to be subject to nonsense-mediated
decay) were not used. Small deletions predicted to cause a frame-shift and
large deletions predicted to disrupt gene function were also analysed.

Nonsense and splice-disrupting SNPs were flagged as likely representing reference error
or annotation artefacts if the inferred loss-of-function (LOF) allele was also the ancestral
state, or if the reference (non-LOF) allele was not observed in any individual in that
population, and were excluded from the per-individual counts in Table 2. Splice-disrupting
SNPs in non-canonical splice sites were also discarded. We did not consider the frameshift
status of splice-disrupting SNPs due to the challenges of inferring the effects of
removal of splice-donor and acceptor sites on exon structure, but rather treated all such
SNPs as likely to affect gene function.

We classified large deletions as gene-disrupting if they fulfilled the following criteria:

1. Removal of >50% of the coding sequence; or
2. Removal of the gene’s transcriptional start site or start codon; or
3. Removal of an odd number of internal splice sites; or
4. Removal of one or more internal coding exons that would be predicted to generate
a frameshift.

For large deletions with imprecise breakpoints, we conservatively required that the
deletions defined by both the inner and outer confidence intervals would have the same
predicted effect on gene function. For cases with microhomology at the break-point we
treated the breakpoint as falling to the right-hand side of the region of microhomology.
We did not perform functional annotation for large duplications due to the challenges of
inferring functional consequences. The numbers stated in the text should thus be
regarded as a lower bound for the number of observed loss-of-function variants per
individual genome.

However, it should also be noted that the proportion of false positive calls in the LOF class
due to sequencing and annotation errors is expected to be substantially higher than the
genome-wide average. This effect is expected as LOF sites show a low level of true
polymorphism due to selective constraint, meaning that a uniform error rate across the
genome will result in a higher proportion of false positive calls compared to other (more
variable) sites.

Enrichment of false positive calls in LOF variants is most evident in the CHB+JPT
samples, which showed a higher per-individual number of LOF SNPs than other
populations despite a comparable number of synonymous variants (Supplementary Table
11), as well as an unusual peak in the derived allele frequency spectrum (Supplementary
Figure 13). This is likely due to a mild elevation in genome-wide false positive rates for
SNPs in this population compared to other samples, which is then highly enriched at
functionally constrained sites.

To lower the number of false positive indel calls we applied more stringent filters to the
subset of indels that were called in the genome-wide set and were predicted to fall into the
LOF class. The stringent filter requires that the range of positions where an indel would
yield the same alternative haplotype sequence as the original called indel (for instance, in
a repeat, the deletion of any repeat unit would give the same alternative haplotype), plus 4
bases of reference sequence on both sides of this region, was covered by at least one
read on the forward strand, and at least one read on the reverse strand, with at most one
mismatch between the read and the alternative haplotype sequence resulting from the
indel (regardless of base-qualities). This filter removed an excess of 1 bp frameshift
insertions seen in CHB+JPT with respect to CEU in the less stringently filtered
genome-wide indel call set, although it is expected to remove a significant number of true
positive calls as well. The indels that pass these stringent filters have been annotated in the
project’s VCF files.

Experimental validation and manual reannotation of identified LOF variants is currently
ongoing (manuscript in preparation).

For extrapolating the functional variants identified per individual in the exon project to the
whole genome (Table 2) we used the ratio of the total coding sequence and splice sites in
the exon-capture target regions (1,423,559 bp and 7,513, respectively) to the
corresponding numbers for the GENCODE v3b annotation set as a whole (35,676,620 bp
and 384,439, respectively).

The coordinates and predicted functional consequences of all of the LOF variants
identified in the project are available on the 1000 Genomes FTP site.


Wednesday, February 23, 2011

Tomato tutorial

Suppose you have a folder named "test" which looks like:

$pwd

/home/u12345/test

$ll
-rw-r--r-- 1 hiseq hiseq 2145917 Feb 23 11:05 7550X1_1_1.txt.gz
-rw-r--r-- 1 hiseq hiseq 2212841 Feb 23 11:05 7550X1_1_2.txt.gz
-rw-r--r-- 1 hiseq hiseq 2114459 Feb 23 11:05 7550X1_2_1.txt.gz
-rw-r--r-- 1 hiseq hiseq 2135003 Feb 23 11:05 7550X1_2_2.txt.gz
-rw-r--r-- 1 hiseq hiseq 2129933 Feb 23 11:05 7550X1_3_1.txt.gz
-rw-r--r-- 1 hiseq hiseq 2168390 Feb 23 11:05 7550X1_3_2.txt.gz
-rw-r--r-- 1 hiseq hiseq 2147386 Feb 23 11:08 7550X2_5_1.txt.gz

Now you want to do an alignment against "human genome build 19" using these fastq files.
Follow these two steps:

1. Create a "cmd.txt" file in this folder like this:
$cat > cmd.txt

#e your.email@gmail.com
@hg19.nix,bam
echo 'hello,world'
date


The first line, which starts with '#e', gives your email address if you want to be notified once the job is done. It is optional.

The second line, which starts with '@', gives a keyword-based command. It will be translated into:
"I want to do an alignment against hg19.nix (a novoalign-indexed human genome) using all fastq files in this folder; the output should be in .bam format."
If you want the output in .sam format, use "@hg19.nix,sam".
All the fastq files are automatically paired (a sketch of the pairing logic appears after the pseudo commands below). For the above example, the pairs will be:

7550X1_1_1.txt.gz and 7550X1_1_2.txt.gz

7550X1_2_1.txt.gz and 7550X1_2_2.txt.gz

7550X1_3_1.txt.gz and 7550X1_3_2.txt.gz

7550X2_5_1.txt.gz


so the pseudo commands to be executed on the cluster machines will look like:

novoalign -o SAM -d hg19.nix -f 7550X1_1_1.txt.gz 7550X1_1_2.txt.gz > 7550X1_1.sam 2>7550X1_1.err
novoalign -o SAM -d hg19.nix -f 7550X1_2_1.txt.gz 7550X1_2_2.txt.gz > 7550X1_2.sam 2>7550X1_2.err
novoalign -o SAM -d hg19.nix -f 7550X1_3_1.txt.gz 7550X1_3_2.txt.gz > 7550X1_3.sam 2>7550X1_3.err
novoalign -o SAM -d hg19.nix -f 7550X2_5_1.txt.gz > 7550X2_5.sam 2>7550X2_5.err
echo "hello,world"
date
*All lines that do not start with '#' or '@' are just normal Linux commands (in this case, echo 'hello,world' and date) that you want to execute.
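
The automatic pairing described above might be implemented roughly like this (a guess at the logic, not tomato's actual code): files that differ only in the trailing _1/_2 read-number field are paired, and anything without a partner runs single-end.

import os, re

def pair_fastqs(folder):
    #group files by their name with the trailing read number (_1/_2) removed
    groups = {}
    for f in sorted(os.listdir(folder)):
        m = re.match(r'(.+)_([12])\.txt\.gz$', f)
        if m:
            groups.setdefault(m.group(1), []).append(f)
    return groups.values()   #each entry is [read1] or [read1, read2]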

2. Create an empty "b" file to tell tomato this job is ready to "begin".
>touch b

'''Further Reading'''

The "keyword based commands" was designed to provide an easy way to do simple jobs which share standard procedures.
At now, the keywords only recognize "bam","sam" and any words with ".nix". Next, "snp" and "indel" are under development and test.
For example, a command like:

'''@hg19.nix, snp'''

will be translated into a standard procedure comprising many steps (commands) to call SNP variants:

'''
align...
sort...
remove duplicates...
re-align around indels...
calibrate quality score...
merge lane samples....
sort...
call snp...
filter snp calls....
'''


However, you can always use your own commands in "cmd.txt" if you want (and you know how):

'''
#e your.email@gmail.com
#i 7550X1_1_1.txt.gz , 7550X1_1_2.txt.gz
novoalign -o SAM $'@RG\tID:USER\tPL:ILLUMINA\tLB:LIBRARY\tSM:7550X1\tCN:HCI' -d hg19.nix -f 7550X1_1_1.txt.gz 7550X1_1_2.txt.gz > 7550X1_1.sam 2>7550X1_1.err
SortSam.jar INPUT=7550X1_1.sam OUTPUT=7550X1_1.bam CREATE_INDEX=true SO=coordinate
GenomeAnalysisTK.jar -T RealignerTargetCreator -R hg19.fasta -I 7550X1_1.bam -B:indels,VCF dbsnp132.human.vcf -o 7550X1_1.intervals
...
'''

*The line starting with '#i' gives the input files (separated by commas ",") to be used in your commands. In this case, tomato will download only the specified files.
If '#i' is not given, then all files under this folder will be downloaded onto the cluster machine as possible input files.


[to be continued]

Monday, February 7, 2011

Create a GATK-compatible dbSNP file

$ lftp ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/ByChromosomeNoGeno/00-All.vcf.gz
$ gunzip 00-All.vcf.gz
$ grep -c '#' 00-All.vcf
57

$ tail -n+58 00-All.vcf | awk '$0 !~ "_random" {print}' | awk '$0 !~ "_hap" {print}'| awk '$0 !~ "PAR" {print}' | awk '{sub(/MT/, "M", $0); print}' | awk '{print "chr" $0}'> dbsnp132.human.body.vcf

$ awk '{print $1}' dbsnp132.human.body.vcf | sort -u

chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr2
chr20
chr21
chr22
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrM
chrX
chrY


$ head -n 57 00-All.vcf > dbsnp132.human.head.vcf

$ cat dbsnp132.human.head.vcf dbsnp132.human.body.vcf > dbsnp132.human.vcf

$ rm dbsnp132.human.body.vcf dbsnp132.human.head.vcf

Friday, February 4, 2011

tropico

svn checkout http://tropico.googlecode.com/svn/ tropico

Tuesday, February 1, 2011

7550x.tar

gunzip *.gz
cat `ls *.txt | sort -g` > seq.txt
gzip -c seq.txt > 7550X1_B20AAHABXX_1_1.txt.gz

Tuesday, January 11, 2011

Human Genome hg19 size

3095693983 bp

If we need 30x coverage with 100-bp reads, then we need

3095693983*30/100 = 928708195 reads
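
The same arithmetic as a quick Python check:

genome_bp, coverage, read_len = 3095693983, 30, 100
print int(round(genome_bp * coverage / float(read_len)))   #-> 928708195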