Describes an individual who has only one member of a chromosome pair or chromosome segment rather than the usual two;
refers in particular to X-linked genes in males who under usual circumstances have only one X chromosome.
Friday, December 16, 2011
Friday, December 2, 2011
Minus and plus strand
ACTGGCGGCCATGGAACTCACCG
X - 15353623 15353645
SO this mapped to the minus strand
To match the plus strand, first reverse the sequence to:
GCCACTCAAGGTACCGGCGGTCA
Then compliment the reversed sequence to:
CGGTGAGTTCCATGGCCGCCAGT
Then it will be mapped to:
X + 15353623 15353645
X - 15353623 15353645
SO this mapped to the minus strand
To match the plus strand, first reverse the sequence to:
GCCACTCAAGGTACCGGCGGTCA
Then compliment the reversed sequence to:
CGGTGAGTTCCATGGCCGCCAGT
Then it will be mapped to:
X + 15353623 15353645
Monday, November 14, 2011
Trac and Mercurial
#mkdir tool; cd tool
#Python-2.7.2
wget http://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz
tar -zxvf Python-2.7.2.tgz
cd Python-2.7.2
./configure --prefix=/home/u0592675/tool/python-2.7.2
make
make install
#apache
wget http://www.eng.lsu.edu/mirrors/apache//httpd/httpd-2.2.21.tar.gz
tar -zxvf httpd-2.2.21.tar.gz
cd httpd-2.2.21
./configure --prefix=/home/u0592675/tool/apache-2.2.21
make
make install
#mod_wsgi
wget http://modwsgi.googlecode.com/files/mod_wsgi-3.3.tar.gz
tar -zxvf mod_wsgi-3.3.tar.gz
cd mod_wsgi-3.3
./configure --with-apxs=/home/u0592675/tool/apache-2.2.21/bin/apxs --with-python=/home/u0592675/tool/python_2.7.2/bin/python
make
make install
#Trac
wget http://ftp.edgewall.com/pub/trac/Trac-0.12.2.tar.gz
wget http://peak.telecommunity.com/dist/ez_setup.py
~/tool/python-2.7.2/bin/python ez_setup.py
#
trac-admin /home/ying/mercurial_database/tomato initenv
trac-admin /home/ying/mercurial_database/tomato deploy /home/ying/web/tomato_trac
{'mod_wsgi.listener_port': '80',
'HTTP_COOKIE': 'trac_form_token=0b8daa13a5ec1c5f4819083c; trac_session=80547f6793f2d13cbfea5b1c; trac_form_token=a30953232455e34b94f82aa5; trac_session=d5bdc2b3269ef9a4001807dd',
'mod_wsgi.listener_host': '',
'SERVER_SOFTWARE': 'Apache/2.2.21 (Unix) mod_wsgi/3.3 Python/2.7.2+',
'SCRIPT_NAME': '/tomato',
'mod_wsgi.handler_script': '',
'SERVER_SIGNATURE': '',
'REQUEST_METHOD': 'GET',
'PATH_INFO': '/',
'SERVER_PROTOCOL': 'HTTP/1.1',
'QUERY_STRING': '',
'HTTP_ACCEPT_CHARSET': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.106 Safari/535.2', 'HTTP_CONNECTION': 'keep-alive', 'SERVER_NAME': '155.100.213.215', 'REMOTE_ADDR': '155.100.213.215', 'mod_wsgi.request_handler': 'wsgi-script', 'wsgi.url_scheme': 'http', 'PATH_TRANSLATED': '/home/ying/tool/apache-2.2.21/htdocs/index.html', 'SERVER_PORT': '80', 'wsgi.multiprocess': True, 'mod_wsgi.input_chunked': '0', 'SERVER_ADDR': '192.168.247.129', 'DOCUMENT_ROOT': '/home/ying/tool/apache-2.2.21/htdocs', 'mod_wsgi.process_group': '', 'SCRIPT_FILENAME': '/home/ying/trac/cgi-bin/trac.wsgi', 'SERVER_ADMIN': 'you@example.com', 'wsgi.input':, 'HTTP_HOST': '155.100.213.215', 'wsgi.multithread': False, 'mod_wsgi.callable_object': 'application', 'REQUEST_URI': '/tomato/', 'HTTP_ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'wsgi.version': (1, 1), 'GATEWAY_INTERFACE': 'CGI/1.1', 'wsgi.run_once': False, 'wsgi.errors': , 'REMOTE_PORT': '6064', 'HTTP_ACCEPT_LANGUAGE': 'en-US,en;q=0.8', 'mod_wsgi.version': (3, 3), 'mod_wsgi.application_group': '',
'mod_wsgi.script_reloading': '1',
'wsgi.file_wrapper':,
'HTTP_ACCEPT_ENCODING': 'gzip,deflate,sdch'}
[Tue Nov 08 11:15:24 2011] [error]
#Python-2.7.2
wget http://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz
tar -zxvf Python-2.7.2.tgz
cd Python-2.7.2
./configure --prefix=/home/u0592675/tool/python-2.7.2
make
make install
#apache
wget http://www.eng.lsu.edu/mirrors/apache//httpd/httpd-2.2.21.tar.gz
tar -zxvf httpd-2.2.21.tar.gz
cd httpd-2.2.21
./configure --prefix=/home/u0592675/tool/apache-2.2.21
make
make install
#mod_wsgi
wget http://modwsgi.googlecode.com/files/mod_wsgi-3.3.tar.gz
tar -zxvf mod_wsgi-3.3.tar.gz
cd mod_wsgi-3.3
./configure --with-apxs=/home/u0592675/tool/apache-2.2.21/bin/apxs --with-python=/home/u0592675/tool/python_2.7.2/bin/python
make
make install
#Trac
wget http://ftp.edgewall.com/pub/trac/Trac-0.12.2.tar.gz
wget http://peak.telecommunity.com/dist/ez_setup.py
~/tool/python-2.7.2/bin/python ez_setup.py
#
trac-admin /home/ying/mercurial_database/tomato initenv
trac-admin /home/ying/mercurial_database/tomato deploy /home/ying/web/tomato_trac
{'mod_wsgi.listener_port': '80',
'HTTP_COOKIE': 'trac_form_token=0b8daa13a5ec1c5f4819083c; trac_session=80547f6793f2d13cbfea5b1c; trac_form_token=a30953232455e34b94f82aa5; trac_session=d5bdc2b3269ef9a4001807dd',
'mod_wsgi.listener_host': '',
'SERVER_SOFTWARE': 'Apache/2.2.21 (Unix) mod_wsgi/3.3 Python/2.7.2+',
'SCRIPT_NAME': '/tomato',
'mod_wsgi.handler_script': '',
'SERVER_SIGNATURE': '',
'REQUEST_METHOD': 'GET',
'PATH_INFO': '/',
'SERVER_PROTOCOL': 'HTTP/1.1',
'QUERY_STRING': '',
'HTTP_ACCEPT_CHARSET': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.106 Safari/535.2', 'HTTP_CONNECTION': 'keep-alive', 'SERVER_NAME': '155.100.213.215', 'REMOTE_ADDR': '155.100.213.215', 'mod_wsgi.request_handler': 'wsgi-script', 'wsgi.url_scheme': 'http', 'PATH_TRANSLATED': '/home/ying/tool/apache-2.2.21/htdocs/index.html', 'SERVER_PORT': '80', 'wsgi.multiprocess': True, 'mod_wsgi.input_chunked': '0', 'SERVER_ADDR': '192.168.247.129', 'DOCUMENT_ROOT': '/home/ying/tool/apache-2.2.21/htdocs', 'mod_wsgi.process_group': '', 'SCRIPT_FILENAME': '/home/ying/trac/cgi-bin/trac.wsgi', 'SERVER_ADMIN': 'you@example.com', 'wsgi.input':
'mod_wsgi.script_reloading': '1',
'wsgi.file_wrapper':
'HTTP_ACCEPT_ENCODING': 'gzip,deflate,sdch'}
[Tue Nov 08 11:15:24 2011] [error]
prepare GATK resource for pipeline
Three VCF Files for known SNPs from GATK resource
1000G_omni2.5.hg19.sites.vcf -> hg19.1000G_omni2.5.vcf
hapmap_3.3.hg19.sites.vcf -> hg19.hapmap_3.3.vcf
dbsnp_132.hg19.vcf -> hg19.dbsnp_132.vcf
Also copy corresponding ".idx" files to DATA_PATH
mv 1000G_omni2.5.hg19.sites.vcf hg19.1000G_omni2.5.vcf
mv hapmap_3.3.hg19.sites.vcf hg19.hapmap_3.3.vcf
mv dbsnp_132.hg19.vcf hg19.dbsnp_132.vcf
##### ERROR MESSAGE: Input files /scratch/local/4/hg19.dbsnp_132.vcf and reference have incompatible contigs: Order of contigs differences, which is unsafe.
##### ERROR /scratch/local/4/hg19.dbsnp_132.vcf contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]
##### ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
1. get "dbsnp_132.hg19.vcf.gz" from "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/1.2/hg19/"
2. gunzip dbsnp_132.hg19.vcf.gz
3. awk '/^chrM/ {print $0}' dbsnp_132.hg19.vcf > m.vcf
4. awk '/^chrM/ {next} {print $0}' dbsnp_132.hg19.vcf > wom.vcf
5. cat m.vcf >> wom.vcf
6. mv wom.vcf hg19.dbsnp_132.vcf
7. vcf-validator hg19.dbsnp_132.vcf
The header tag contig already exists, ignoring.
...same as above...
Warning: Duplicate entries, for example chr1:120995
8. vcf-validator dbsnp_132.hg19.vcf
The header tag contig already exists, ignoring.
...same as above...
Warning: Duplicate entries, for example chrM:16189
8. /home/u0592675/vcftools_0.1.7/bin/vcf-validator hg19.dbsnp_132.vcf
10 ./home/u0592675/vcftools_0.1.7/bin/vcf-validator dbsnp_132.hg19.vcf
awk '/^chr/ {print $1}' /tomato/data/hg19.dbsnp_132.vcf | sort |uniq
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY
chrM
awk '/^chr/ {print $1}' hg19.1000G_omni2.5.vcf | sort |uniq
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY
awk '/^chr/ {print $1}' hg19.hapmap_3.3.vcf | sort |uniq
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
1.
awk '/^chrM/ {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > m.vcf
awk '/^chrM/ {next} {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > wom.vcf
cat m.vcf >> wom.vcf
rm m.vcf
mv wom.vcf hg19.dbsnp_132.vcf
2.
vim hg19.dbsnp_132.vcf
delete all non-mapped contigs in header section ##contig
hg19.hapmap_3.3.vcf
hg push ssh://hiseq@hci-bio2.hci.utah.edu:/home/hiseq/tomato/
hg clone http://hci-bio2.hci.utah.edu:8011 tomato
1000G_omni2.5.hg19.sites.vcf -> hg19.1000G_omni2.5.vcf
hapmap_3.3.hg19.sites.vcf -> hg19.hapmap_3.3.vcf
dbsnp_132.hg19.vcf -> hg19.dbsnp_132.vcf
Also copy corresponding ".idx" files to DATA_PATH
mv 1000G_omni2.5.hg19.sites.vcf hg19.1000G_omni2.5.vcf
mv hapmap_3.3.hg19.sites.vcf hg19.hapmap_3.3.vcf
mv dbsnp_132.hg19.vcf hg19.dbsnp_132.vcf
##### ERROR MESSAGE: Input files /scratch/local/4/hg19.dbsnp_132.vcf and reference have incompatible contigs: Order of contigs differences, which is unsafe.
##### ERROR /scratch/local/4/hg19.dbsnp_132.vcf contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]
##### ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
1. get "dbsnp_132.hg19.vcf.gz" from "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/1.2/hg19/"
2. gunzip dbsnp_132.hg19.vcf.gz
3. awk '/^chrM/ {print $0}' dbsnp_132.hg19.vcf > m.vcf
4. awk '/^chrM/ {next} {print $0}' dbsnp_132.hg19.vcf > wom.vcf
5. cat m.vcf >> wom.vcf
6. mv wom.vcf hg19.dbsnp_132.vcf
7. vcf-validator hg19.dbsnp_132.vcf
The header tag contig already exists, ignoring.
...same as above...
Warning: Duplicate entries, for example chr1:120995
8. vcf-validator dbsnp_132.hg19.vcf
The header tag contig already exists, ignoring.
...same as above...
Warning: Duplicate entries, for example chrM:16189
8. /home/u0592675/vcftools_0.1.7/bin/vcf-validator hg19.dbsnp_132.vcf
10 ./home/u0592675/vcftools_0.1.7/bin/vcf-validator dbsnp_132.hg19.vcf
awk '/^chr/ {print $1}' /tomato/data/hg19.dbsnp_132.vcf | sort |uniq
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY
chrM
awk '/^chr/ {print $1}' hg19.1000G_omni2.5.vcf | sort |uniq
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY
awk '/^chr/ {print $1}' hg19.hapmap_3.3.vcf | sort |uniq
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
1.
awk '/^chrM/ {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > m.vcf
awk '/^chrM/ {next} {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > wom.vcf
cat m.vcf >> wom.vcf
rm m.vcf
mv wom.vcf hg19.dbsnp_132.vcf
2.
vim hg19.dbsnp_132.vcf
delete all non-mapped contigs in header section ##contig
hg19.hapmap_3.3.vcf
hg push ssh://hiseq@hci-bio2.hci.utah.edu:/home/hiseq/tomato/
hg clone http://hci-bio2.hci.utah.edu:8011 tomato
Thursday, November 3, 2011
Got new genome fasta file?
#1. hg19.fasta.idx
samtools faidx hg19.fasta
#2. hg19.dict
java -Xmx2g -jar ~/tool/picard/CreateSequenceDictionary.jar R=hg19.fasta O=hg19.dict
samtools faidx hg19.fasta
#2. hg19.dict
java -Xmx2g -jar ~/tool/picard/CreateSequenceDictionary.jar R=hg19.fasta O=hg19.dict
Tuesday, November 1, 2011
prepare dbsnp132
1.
awk '/^chrM/ {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > m.vcf
awk '/^chrM/ {next} {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > wom.vcf
cat m.vcf >> wom.vcf
rm m.vcf
mv wom.vcf hg19.dbsnp_132.vcf
2.
vim hg19.dbsnp_132.vcf
delete all non-mapped contigs in header section ##contig
awk '/^chrM/ {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > m.vcf
awk '/^chrM/ {next} {print $0}' ../res/hg19/dbsnp_132.hg19.vcf > wom.vcf
cat m.vcf >> wom.vcf
rm m.vcf
mv wom.vcf hg19.dbsnp_132.vcf
2.
vim hg19.dbsnp_132.vcf
delete all non-mapped contigs in header section ##contig
Monday, October 31, 2011
Thursday, October 27, 2011
Known SNPs
Three VCF Files for known SNPs from GATK resource
1000G_omni2.5.hg19.sites.vcf -> hg19.1000G_omni2.5.vcf
hapmap_3.3.hg19.sites.vcf -> hg19.hapmap_3.3.vcf
dbsnp_132.hg19.vcf -> hg19.dbsnp_132.vcf
Also copy corresponding ".idx" files to DATA_PATH
1000G_omni2.5.hg19.sites.vcf -> hg19.1000G_omni2.5.vcf
hapmap_3.3.hg19.sites.vcf -> hg19.hapmap_3.3.vcf
dbsnp_132.hg19.vcf -> hg19.dbsnp_132.vcf
Also copy corresponding ".idx" files to DATA_PATH
Friday, October 14, 2011
I am running out of disk space
again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again again
Multiple hits? All or None?
#-r All
57G Oct 14 09:44 1.sam
65G Oct 14 10:06 2.sam
65G Oct 14 10:26 3.sam
45G Oct 14 10:46 4.sam
56G Oct 14 11:09 5.sam
#-r None
56G Oct 11 02:10 1.sam
64G Oct 11 02:32 2.sam
63G Oct 11 02:53 3.sam
44G Oct 11 03:07 4.sam
54G Oct 11 03:25 5.sam
57G Oct 14 09:44 1.sam
65G Oct 14 10:06 2.sam
65G Oct 14 10:26 3.sam
45G Oct 14 10:46 4.sam
56G Oct 14 11:09 5.sam
#-r None
56G Oct 11 02:10 1.sam
64G Oct 11 02:32 2.sam
63G Oct 11 02:53 3.sam
44G Oct 11 03:07 4.sam
54G Oct 11 03:25 5.sam
Thursday, October 13, 2011
pre-processing of masked FASTA for novoalign
1. Convert lower case to upper case
sed -i 's/\(.*\)/\U\1/' hg19.fasta
> chr1
NNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNN
ggggggttttTGaaaaaaaCCC
Will be
> CHR1
NNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNN
GGGGGGGTTTTTTGAAAAACCC
2. covert '>CHR' to '>chr'
sed -i -e 's/>CHR/>chr/g' hg19.fasta
3. convert N to n
sed -i -e 's/N/n/g' hg19.masked.fasta
4. mask these UTR
novoindex -m hg19.nix hg19.fasta
sed -i 's/\(.*\)/\U\1/' hg19.fasta
> chr1
NNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNN
ggggggttttTGaaaaaaaCCC
Will be
> CHR1
NNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNN
GGGGGGGTTTTTTGAAAAACCC
2. covert '>CHR' to '>chr'
sed -i -e 's/>CHR/>chr/g' hg19.fasta
3. convert N to n
sed -i -e 's/N/n/g' hg19.masked.fasta
4. mask these UTR
novoindex -m hg19.nix hg19.fasta
Monday, September 26, 2011
Keyless ssh connection between A and B
A:
ssh-keygen -t rsa
...
...
Then scp the "id_rsa.pub" to B
B:
cat id_rsa.pub >> .ssh/authorized_keys
chmod 644 .ssh/authorized_keys
Now connection from A to B is keyless.
ssh-keygen -t rsa
...
...
Then scp the "id_rsa.pub" to B
B:
cat id_rsa.pub >> .ssh/authorized_keys
chmod 644 .ssh/authorized_keys
Now connection from A to B is keyless.
Friday, September 16, 2011
ACL Setup
On the job side, I need a root-like user who has write permission on any sub-folders under PATH_JOB while the individual user has only 744 like permissions on his/her own job folders. However, Unix/Linux does not have fine control over file/directory by default, so you can not assign permissions to different users. ACL (Access Control Lists) can be used to solve this problem.
Here is a simple test on my VM_Ubuntu_10.10:
1. List partitions, the /dev/sda1 is the root directory "/"
# blkid
/dev/sda1: UUID="xxxx" TYPE="ext4"
/dev/sda5: UUID="yyyy" TYPE="swap"
2. Modify the /etc/fstab to support ACL by inserting "acl" into options.
#vim /etc/fstab
UUID=xxxx / ext4 errors=remount-ro,acl 0 1
3. Remount the root directory
#mount / -o remount
4. Make a test
#setfacl -m u:abc:rw hello
#getfacl hello
# file: hello.txt
# owner: root
# group: root
user::rw-
user:abc:rw-
group::rw-
mask::rw-
other::---
5. Now user "abc" can read and write "hello"!
#ls
rw-rw----+ 1 root root 0 2011-09-16 11:41 hello.txt
Note the new "+" character appears the end of permission field. This shows this file/folder supports ACL
Here is a simple test on my VM_Ubuntu_10.10:
1. List partitions, the /dev/sda1 is the root directory "/"
# blkid
/dev/sda1: UUID="xxxx" TYPE="ext4"
/dev/sda5: UUID="yyyy" TYPE="swap"
2. Modify the /etc/fstab to support ACL by inserting "acl" into options.
#vim /etc/fstab
UUID=xxxx / ext4 errors=remount-ro,acl 0 1
3. Remount the root directory
#mount / -o remount
4. Make a test
#setfacl -m u:abc:rw hello
#getfacl hello
# file: hello.txt
# owner: root
# group: root
user::rw-
user:abc:rw-
group::rw-
mask::rw-
other::---
5. Now user "abc" can read and write "hello"!
#ls
rw-rw----+ 1 root root 0 2011-09-16 11:41 hello.txt
Note the new "+" character appears the end of permission field. This shows this file/folder supports ACL
Friday, August 19, 2011
Merge the SAM/BAM file
Suppose I have two alignments in SAM (1.sam and 2.sam). Since these two alignments are two lanes from the same sample, I should merge these two alignments into one.
>wc -l 1.sam
10000
>wc -l 2.sam
10000
So both the SAM files has 10000 lines (include the header)
First using picard's "MergeSamFiles.jar"
>java -jar MergeSamFiles.jar I=1.sam I=2.sam O=12.sam
>wc -l 12.sam
19973
The missing 27 lines are actually duplicated header lines and they were removed by PICARD automatically. However, there is a subtle bug here.
>less 12.sam
@HD VN:1.0 GO:none SO:coordinate
@SQ SN:chr1 LN:249250621 AS:hg19.nov.illumina.nix
@SQ SN:chr2 LN:243199373 AS:hg19.nov.illumina.nix
@SQ SN:chr3 LN:198022430 AS:hg19.nov.illumina.nix
@SQ SN:chr4 LN:191154276 AS:hg19.nov.illumina.nix
@SQ SN:chr5 LN:180915260 AS:hg19.nov.illumina.nix
@SQ SN:chr6 LN:171115067 AS:hg19.nov.illumina.nix
@SQ SN:chr7 LN:159138663 AS:hg19.nov.illumina.nix
@SQ SN:chr8 LN:146364022 AS:hg19.nov.illumina.nix
@SQ SN:chr9 LN:141213431 AS:hg19.nov.illumina.nix
@SQ SN:chr10 LN:135534747 AS:hg19.nov.illumina.nix
@SQ SN:chr11 LN:135006516 AS:hg19.nov.illumina.nix
@SQ SN:chr12 LN:133851895 AS:hg19.nov.illumina.nix
@SQ SN:chr13 LN:115169878 AS:hg19.nov.illumina.nix
@SQ SN:chr14 LN:107349540 AS:hg19.nov.illumina.nix
@SQ SN:chr15 LN:102531392 AS:hg19.nov.illumina.nix
@SQ SN:chr16 LN:90354753 AS:hg19.nov.illumina.nix
@SQ SN:chr17 LN:81195210 AS:hg19.nov.illumina.nix
@SQ SN:chr18 LN:78077248 AS:hg19.nov.illumina.nix
@SQ SN:chr19 LN:59128983 AS:hg19.nov.illumina.nix
@SQ SN:chr20 LN:63025520 AS:hg19.nov.illumina.nix
@SQ SN:chr21 LN:48129895 AS:hg19.nov.illumina.nix
@SQ SN:chr22 LN:51304566 AS:hg19.nov.illumina.nix
@SQ SN:chrX LN:155270560 AS:hg19.nov.illumina.nix
@SQ SN:chrY LN:59373566 AS:hg19.nov.illumina.nix
@SQ SN:chrM LN:16571 AS:hg19.nov.illumina.nix
@RG ID:TOMATO PL:ILLUMINA LB:LIBTMP SM:SAMPLE
@PG ID:novoalign VN:V2.07.10 CL:novoalign
@PG ID:novoalign.1 VN:V2.07.10 CL:novoalign
The last two lines are both @PG which differ the ID. Before merging, they are both "novoalign", after merging, the duplicated ID was renamed as "novoalign.1" (if merging 3 files, you will have "novoalign.2" as well). When you check the individual read alignment, you will see the PG:Z field referenced the PG line. When you call variances later, the genotyper application will process merged file as two files which is not what you want.
To fix this:
1. Keep the first @PG line and delete all other @PG lines
2. Replace all "novoalign.X" to "novoalign".
Or you can use samtools.
>java -jar SortSam.jar INPUT=1.sam OUTPUT=1.bam SO=unsorted
>java -jar SortSam.jar INPUT=2.sam OUTPUT=2.bam SO=unsorted
>samtools merge 3.bam 1.bam 2.bam
>samtools view -h -o 3.sam 3.bam
>less 3.sam
Samtools only use the first SAM's header in merged SAM, so it is OK in this case.
>wc -l 1.sam
10000
>wc -l 2.sam
10000
So both the SAM files has 10000 lines (include the header)
First using picard's "MergeSamFiles.jar"
>java -jar MergeSamFiles.jar I=1.sam I=2.sam O=12.sam
>wc -l 12.sam
19973
The missing 27 lines are actually duplicated header lines and they were removed by PICARD automatically. However, there is a subtle bug here.
>less 12.sam
@HD VN:1.0 GO:none SO:coordinate
@SQ SN:chr1 LN:249250621 AS:hg19.nov.illumina.nix
@SQ SN:chr2 LN:243199373 AS:hg19.nov.illumina.nix
@SQ SN:chr3 LN:198022430 AS:hg19.nov.illumina.nix
@SQ SN:chr4 LN:191154276 AS:hg19.nov.illumina.nix
@SQ SN:chr5 LN:180915260 AS:hg19.nov.illumina.nix
@SQ SN:chr6 LN:171115067 AS:hg19.nov.illumina.nix
@SQ SN:chr7 LN:159138663 AS:hg19.nov.illumina.nix
@SQ SN:chr8 LN:146364022 AS:hg19.nov.illumina.nix
@SQ SN:chr9 LN:141213431 AS:hg19.nov.illumina.nix
@SQ SN:chr10 LN:135534747 AS:hg19.nov.illumina.nix
@SQ SN:chr11 LN:135006516 AS:hg19.nov.illumina.nix
@SQ SN:chr12 LN:133851895 AS:hg19.nov.illumina.nix
@SQ SN:chr13 LN:115169878 AS:hg19.nov.illumina.nix
@SQ SN:chr14 LN:107349540 AS:hg19.nov.illumina.nix
@SQ SN:chr15 LN:102531392 AS:hg19.nov.illumina.nix
@SQ SN:chr16 LN:90354753 AS:hg19.nov.illumina.nix
@SQ SN:chr17 LN:81195210 AS:hg19.nov.illumina.nix
@SQ SN:chr18 LN:78077248 AS:hg19.nov.illumina.nix
@SQ SN:chr19 LN:59128983 AS:hg19.nov.illumina.nix
@SQ SN:chr20 LN:63025520 AS:hg19.nov.illumina.nix
@SQ SN:chr21 LN:48129895 AS:hg19.nov.illumina.nix
@SQ SN:chr22 LN:51304566 AS:hg19.nov.illumina.nix
@SQ SN:chrX LN:155270560 AS:hg19.nov.illumina.nix
@SQ SN:chrY LN:59373566 AS:hg19.nov.illumina.nix
@SQ SN:chrM LN:16571 AS:hg19.nov.illumina.nix
@RG ID:TOMATO PL:ILLUMINA LB:LIBTMP SM:SAMPLE
@PG ID:novoalign VN:V2.07.10 CL:novoalign
@PG ID:novoalign.1 VN:V2.07.10 CL:novoalign
The last two lines are both @PG which differ the ID. Before merging, they are both "novoalign", after merging, the duplicated ID was renamed as "novoalign.1" (if merging 3 files, you will have "novoalign.2" as well). When you check the individual read alignment, you will see the PG:Z field referenced the PG line. When you call variances later, the genotyper application will process merged file as two files which is not what you want.
To fix this:
1. Keep the first @PG line and delete all other @PG lines
2. Replace all "novoalign.X" to "novoalign".
Or you can use samtools.
>java -jar SortSam.jar INPUT=1.sam OUTPUT=1.bam SO=unsorted
>java -jar SortSam.jar INPUT=2.sam OUTPUT=2.bam SO=unsorted
>samtools merge 3.bam 1.bam 2.bam
>samtools view -h -o 3.sam 3.bam
>less 3.sam
Samtools only use the first SAM's header in merged SAM, so it is OK in this case.
Tuesday, July 12, 2011
mouse dbSNP VCF
OK I spend few hours on searching data and writing a script to make a dbSNP VCF for mouse. The steps are simple and dirty. Test of this VCF file has not been done yet. Put it here for reference only.
******************************************************************************
#1. download dbSNP mouse
wget --continue --no-host-directories --no-directories --preserve-permissions ftp://ftp.ncbi.nih.gov/snp/organisms/mouse_10090/chr_rpts/*
#2. unzip
gunzip *.gz
#3. exclude all SNPs that are not mapped to mm9 (MGSCv37) genome.
for i in `ls`; do grep 'MGSCv37' "$i" > mm9."$i" ; done
#However, you can NOT find REF/ALT field in above files. So we must get it from other files.
#4. download dbSNP mouse with genotype
wget --continue --no-host-directories --no-directories --preserve-permissions ftp://ftp.ncbi.nih.gov/snp/organisms/mouse_10090/genotype/*
#5. unzip
gunzip *.gz
#6. check out these files, should end with ".xml"
ls
Black6GtyFromContig gt_chr13.xml gt_chr17.xml gt_chr2.xml gt_chr6.xml gt_chrAltOnly.xml gt_chrUn.xml
gt_chr10.xml gt_chr14.xml gt_chr18.xml gt_chr3.xml gt_chr7.xml gt_chrMT.xml gt_chrX.xml
gt_chr11.xml gt_chr15.xml gt_chr19.xml gt_chr4.xml gt_chr8.xml gt_chrMulti.xml gt_chrY.xml
gt_chr12.xml gt_chr16.xml gt_chr1.xml gt_chr5.xml gt_chr9.xml gt_chrNotOn.xml README
#7. make a 3 column file: "id REF ALT"
grep ' $//g' | sed 's/rsId=//g' | sed 's/observed=//g' | awk -F '/' '{print $1, $2}' > id.allele.txt
#8. write a simple python script that mapping id from mm9.chr_N.txt to make the final VCF file - "mm9.dbsnp.vcf"
#9. make sure no duplicated IDs
awk '{print $3}' mm9.dbsnp.vcf | sort| uniq -d
13745646
13501994
******************************************************************************
#1. download dbSNP mouse
wget --continue --no-host-directories --no-directories --preserve-permissions ftp://ftp.ncbi.nih.gov/snp/organisms/mouse_10090/chr_rpts/*
#2. unzip
gunzip *.gz
#3. exclude all SNPs that are not mapped to mm9 (MGSCv37) genome.
for i in `ls`; do grep 'MGSCv37' "$i" > mm9."$i" ; done
#However, you can NOT find REF/ALT field in above files. So we must get it from other files.
#4. download dbSNP mouse with genotype
wget --continue --no-host-directories --no-directories --preserve-permissions ftp://ftp.ncbi.nih.gov/snp/organisms/mouse_10090/genotype/*
#5. unzip
gunzip *.gz
#6. check out these files, should end with ".xml"
ls
Black6GtyFromContig gt_chr13.xml gt_chr17.xml gt_chr2.xml gt_chr6.xml gt_chrAltOnly.xml gt_chrUn.xml
gt_chr10.xml gt_chr14.xml gt_chr18.xml gt_chr3.xml gt_chr7.xml gt_chrMT.xml gt_chrX.xml
gt_chr11.xml gt_chr15.xml gt_chr19.xml gt_chr4.xml gt_chr8.xml gt_chrMulti.xml gt_chrY.xml
gt_chr12.xml gt_chr16.xml gt_chr1.xml gt_chr5.xml gt_chr9.xml gt_chrNotOn.xml README
#7. make a 3 column file: "id REF ALT"
grep '
#8. write a simple python script that mapping id from mm9.chr_N.txt to make the final VCF file - "mm9.dbsnp.vcf"
def mm9():
m = {}
for line in file('id.allele.txt'):
line = line.strip()
toks=line.split(' ')
if len(toks)==3:
m[toks[0]] = (toks[1],toks[2])
#chr1 - chr19, chr X,Y and MT only.
nfs = [str(i) for i in range(1,20)]
nfs.append('X')
nfs.append('Y')
nfs.append('MT')
fo = file('mm9.dbsnp.vcf','w')
print >>fo,'##fileformat=VCFv4.0'
print >>fo,'##fileDate=20110712'
print >>fo,'##source=dbSNP'
print >>fo,'##dbSNP_BUILD_ID=132'
print >>fo,'##reference=mm9'
print >>fo,'#CHROM POS ID REF ALT QUAL FILTER INFO'
total = 0
for n in nfs:
sm = {}
fn = os.path.join('mm9.chr_%s.txt'%n)
for line in file(fn):
line = line.strip()
toks=line.split('\t')
try:
rs_id = toks[0]
mapped = toks[1]
withdraw =toks[2]
chr =toks[6]
position =toks[11]
if m.has_key(rs_id) and mapped=='2' and withdraw=='0':
ref,alt = m[rs_id]
#use an arbitratry "ms"+ID as prefix refID (instead of "rs")
sm[int(position)] = [chr,position,'ms'+rs_id,ref,alt,'.','PASS','.']
except Exception,e:
pass
sk = sorted(sm.keys())
for k in sk:
total += 1
print >>fo,'\t'.join(sm[k])
fo.close()
print total
mm9()
#9. make sure no duplicated IDs
awk '{print $3}' mm9.dbsnp.vcf | sort| uniq -d
13745646
13501994
Monday, June 20, 2011
Indel filter
java -Xmx32g -jar $HOME/tool/gatk/GenomeAnalysisTK.jar -T VariantFiltration -R $HOME/data/hg19.fasta \
-o x1.indel.filtered.vcf \
-B:variant,VCF x1.indel.vcf \
-B:mask,VCF ../SNP/x1.snp.vcf \
--maskName SNP \
--filterExpression "QUAL < 30 || SB > -1.0 || QD < 2.0" \
--filterName "QUAL<30;SB>-1.0;QD<2"
SNP Filter
#uses SB >= 0.10 filter
#uses QD < 5.0 filter
#uses HRun >= 4 filter
#uses clustered SNPs (>= 3 SNPs within 10 bp) filter
#masks out SNPs that overlap the filtered indels
java -Xmx32g -jar $HOME/tool/gatk/GenomeAnalysisTK.jar -T VariantFiltration -R $HOME/data/hg19.fasta \
-o x1.snp.filtered.vcf \
-B:variant,VCF x1.snp.vcf \
-B:mask,VCF ../INDEL/x1.indel.vcf \
--maskName InDel \
--clusterSize 3 \
--clusterWindowSize 10 \
--filterExpression "SB >= 0.10 || QD < 5.0 || HRun >= 4" \
--filterName "SB>=0.1;QD<5.0;HRun>=4;Cluster_3in10;MaskIndel"
#copy header and "PASS" items to "x1.snp.passed.vcf"
awk '/^#/ { print $0;next} $7 == "PASS" {print $0} ' x1.snp.filtered.vcf > x1.snp.passed.vcf
Friday, June 17, 2011
Build Genome Index
#novoalign
#normal
novoindex -k 14 -s 1 hg19.nov.normal.nix hg19.fasta
#solid color space
novoindex -k 14 -s 1 -c hg19.nov.color.nix hg19.fasta
#bisulphite
novoindex -k 14 -s 1 -b hg19.nov.bis.nix hg19.fasta
#normal
bwa index -a bwtsw -p hg19.bwa.normal hg19.fasta
#color
bwa index -a bwtsw -c -p hg19.bwa.color hg19.fasta
wget http://hgdownload.cse.ucsc.edu/goldenPath/mm9/chromosomes/chr*.fa.gz
novoindex -k 14 -s 1 mm9.nov.normal.nix genomes/mm9/fasta/mm9.fasta
novoindex -k 14 -s 1 -c mm9.nov.color.nix genomes/mm9/fasta/mm9.fasta
novoindex -k 14 -s 1 -b mm9.nov.bis.nix genomes/mm9/fasta/mm9.fasta
#normal
./bwa index -a bwtsw -p mm9.bwa.normal genomes/mm9/fasta/mm9.fasta
./bwa index -a bwtsw -c -p mm9.bwa.color genomes/mm9/fasta/mm9.fasta
Call SNP & Indel
java -Xmx32g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R hg19.fasta -B:dbsnp,VCF dbsnp132.human.vcf -glm BOTH -o X1.vcf -I X1.bam -l INFO -stand_call_conf 20 -stand_emit_conf 10
VAT
VAT
http://archive.gersteinlab.org/proj/VAT/
$pwd
$HOME/VAT
#download&install GNU Scientific Library
$wget http://mirror.clarkson.edu/gnu/gsl/gsl-1.15.tar.gz
$sudo tar -zxvf gsl-1.15.tar.gz;cd gsl-1.15; ./configure; make; make install
#download&install gettext
$wget http://ftp.gnu.org/pub/gnu/gettext/gettext-0.18.1.1.tar.gz
$sudo tar -zxvf gettext-0.18.1.1.tar.gz;cd gettext-0.18.1.1;./configure; make; make install
#download GD Library
http://google-desktop-for-linux-mirror.googlecode.com/files/gd-2.0.35.tar.gz
sudo tar -zxvf gd-2.0.35.tar.gz;cd gd/2.0.35/;./configure; make; make install
#libbios
$wget http://homes.gersteinlab.org/people/lh372/VAT/libbios-1.0.0.tar.gz
$sudo tar -zxvf libbios-1.0.0.tar.gz; cd libbios-1.0.0
$sudo export CPPFLAGS="-I$HOME/VAT/libbios-1.0.0/include"
$sudo export LDFLAGS="-L$HOME/VAT/libbios-1.0.0/lib"
$sudo ./configure;make;make install
#twoBitToFa is required for "interval2sequences"
$wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
$chmod +x twoBitToFa
$cp twoBitToFa $HOME/tool/
#download&install VAT package
$wget http://homes.gersteinlab.org/people/lh372/VAT/vat-1.0.0.tar.gz
$tar -zxvf vat-1.0.0.tar.gz; cd vat-1.0.0
$mkdir tmp
$./configure --prefix=$HOME/VAT/vat-1.0.0/tmp;make;make install
$mv tmp/* $HOME/VAT/tool
wget ftp://ftp.sanger.ac.uk/pub/gencode/release_3c/gencode.v3c.annotation.GRCh37.gtf.gz
gunzip gencode.v3c.annotation.GRCh37.gtf.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit
export LD_LIBRARY_PATH=$HOME/VAT/lib:$LD_LIBRARY_PATH
cat x2.snp.vcf | snpMapper gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > x2.snp.annotated.vcf
cat x2.indel.vcf | indelMapper gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.interval gencode.v3c.annotation.GRCh37.cds.gtpc.ttpc.fa > x2.indel.annotated.vcf
Thursday, June 16, 2011
An python script to seperate SNP and Indel calls in one VCF
#!/usr/bin/python
def split_snp_indel(fvcf):
basename = fvcf.rstrip('.vcf')
fsnp = file(basename+'.snp.vcf','w')
findel = file(basename+'.indel.vcf','w')
for line in file(fvcf):
line = line.strip()
if line.startswith('#'):
print >>fsnp, line
print >> findel, line
else:
toks = line.split('\t')
ref = toks[3]
alt = toks[4]
if len(ref)> 1 or len(alt)> 1:
print >> findel, line
else:
print >>fsnp, line
if __name__=='__main__':
import sys
split_snp_indel(sys.argv[1])
Wednesday, June 15, 2011
Monday, June 13, 2011
update GATK and PICARD
#!/bin/bash
rm -fr app/gatk/*
svn co https://svn.broadinstitute.org/Sting/trunk GATK
cd GATK; ant; cd ..; cp GATK/dist/* app/gatk/; rm -fr GATK
rm -fr app/picard/*
svn co https://picard.svn.sourceforge.net/svnroot/picard/trunk PICARD
cd PICARD; ant; cd ..; cp PICRAD/dist/* app/picard/; rm -fr PICARD
Monday, June 6, 2011
Wednesday, June 1, 2011
Ambiguous Code in SOLiD Color Space
An Example from a SOLiD Color Space aligned SAM file:
CS:Z:T3.3300220113.2323222031301323303202033330300301210
Here "CS:Z" field gives the color read sequence "T3.33". The first character after the "CS:Z:" is always "T", after that, all valid code should be range from 0 to 3.
0 -this base is same as previous base
1 - this base is the transversion of previous base
2 - this base is the transition of previous base
3 - this base is the complement of previous base
However sometimes we can observe ambiguous code like "." or "4". You may notice the 3rd base is not in 0~3, it is a "." - an ambiguous code indicates an ambiguous base, could be any base of A, C, G and T.
This ambiguous code will fail the GATK's "CountCovariates" tool. It will complain
Error Caused by: org.broadinstitute.sting.utils.exceptions.UserException$MalformedBAM: SAM/BAM file SAMFileReader{/home/u0592675/A232/441_10xF3.mate.dup.sort.realign.bam}
is malformed: Unrecognized color space in SOLID read, color = . Unfortunately this bam file can not be recalibrated without full color space information because of potential reference bias.
In a simple test, around 1% (58297 out of 5275136) alignments has at least one ambiguous code. So here we have two options to eliminate this problem:
1. make a script to change all ambiguous codes into random valid codes. It looks like:
m = re.compile("\.")
m.sub(random.choice(('0','1','2','3')),CS_Z_STRING)
where CS_Z_STRING is like above example.
However, this may bring a serious problem because all bases execpt for the 1st base are depending on previous base. If we choose a wrong code which probability is 3/4, then all the following bases will be wrong.
2. delete the affected alignments
Considering the small percentage (1%) of ambiguous code affected alignments, it may be wise to delete those troublesome lines.
m1 = re.compile("CS:Z:T[0-3]*?[^0-3\s]")
def delete_ambiguous_line(fin,fout):
fo = file(fout,'w')
i = j = 0
for line in file(fin):
line = line.strip()
if line.startswith('@'):
nline = line
else:
if m1.search(line):
j += 1
else:
i += 1
nline = line
print >>fo,nline
print "Fix ambiguous code - Total: %d, Deleted: %d, Percent of Deleted: %d%%" % (i+j,j,j*100/(i+j))
fo.close()
Friday, April 8, 2011
merge sam files by "cat"
tail -n +29 7550X2_1.sam > 1.sam
tail -n +29 7550X2_2.sam > 2.sam
tail -n +29 7550X2_3.sam > 3.sam
tail -n +29 7550X2_4.sam > 4.sam
tail -n +29 7550X2_5.sam > 5.sam
head -n 28 1.sam > header.txt
cat header.txt 1.sam 2.sam 3.sam 4.sam 5.sam > hello.sam
picard-tools-1.42/SortSam.jar INPUT=hello.sam OUTPUT=hello.sorted.sam CREATE_INDEX=false SO=coordinate COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=7500000 TMP_DIR=.
picard-tools-1.42/SortSam.jar INPUT=hello.sorted.sam OUTPUT=hello.sorted.bam CREATE_INDEX=true SO=coordinate COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=7500000 TMP_DIR=.
Wednesday, April 6, 2011
novoalign without "-k"(left) and with "-k" (right)
# Hash length: 13 # Hash length: 13
# Step size: 1 # Step size: 1
# Paired Reads: 96409627 # Paired Reads: 96409627
# Pairs Aligned: 62936012 # Pairs Aligned: 62971552
# Read Sequences: 192819254 # Read Sequences: 192819254
# Aligned: 128634931 # Aligned: 128720259
# Unique Alignment: 126260501 # Unique Alignment: 126343058
# Gapped Alignment: 2033737 # Gapped Alignment: 2039288
# Quality Filter: 1530071 # Quality Filter: 1530624
# Homopolymer Filter: 48808 # Homopolymer Filter: 48964
# Elapsed Time: 9737.411 (sec.) # Elapsed Time: 10373.065 (sec.)
# CPU Time: 3654.4 (min.) # CPU Time: 3922.4 (min.)
# Fragment Length Distribution # Fragment Length Distribution
Tuesday, April 5, 2011
A ZS field in Sam File from novoalign
ZS:Z: Novoalign alignment status. Not present for unique alignments.
ZS:Z:R Multiple alignments with similar score were found.
ZS:Z:QC The read was not aligned as it bases qualities were too low or it was a homopolymer read.
ZS:Z:NM No alignment was found.
ZS:Z:QL An alignment was found but it was below the quality threshold.
ZN:i: Number of alignments to read. Only present if there was more than one alignment.
ZO:Z: Indictaes long or short insert fragment for mate pair alignments when short insert has been enabled.
'+' Indicates pair was aligned as a short insert fragment.
'+' Pair was aligned as a long insert fragment.
This tag is only present for Illumina mate pairs when a short fragment length size has been specified with the i option and reads are aligned as a proper pair .
ZH:i: Hairpin score for miRNA a
ZL:i: In miRNA mode this is the alignment location for adjacent opposite strand alignment.
Examples:
ZS:Z:R ZN:i:2 Indicates a read aligned to two locations.
ZS:Z:NM No alignment was found.
ZS:Z:QC The read failed quality checks and was not aligned.
Friday, April 1, 2011
merge
sed -i 's/YING\../YING/g' 7550x1.merged.sam
sed -i 's/novoalign\../novoalign/g' 7550x1.merged.sam
time java -jar -Xmx30g ~/tool/picard-tools-1.41/SortSam.jar INPUT=7550x1.merged.sam OUTPUT=7550X1.merged.bam CREATE_INDEX=true SO=coordinate COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=7500000 TMP_DIR=.
time java -jar -Xmx30g ~/tool/gatk/dist/GenomeAnalysisTK.jar -T UnifiedGenotyper -R chr20.fa -B:dbsnp,VCF dbsnp132.human.vcf -o 7550X1.snps.raw.vcf -I 7550X1.merged.bam -stand_call_conf 50.0 -stand_emit_conf 20.0 -dcov 300
time java -jar -Xmx32g ~/tool/gatk/dist/GenomeAnalysisTK.jar -T VariantFiltration -R chr20.fa -B:variant,VCF 7550X1.snps.raw.moab.vcf -o 7550X1.snps.raw.moab.homosnp.vcf --maskName InDel --clusterWindowSize 10 --filterExpression "AF<1.0 || DP<10 || DP>100" --filterName "AF<1.0 or DP<10"
grep -c "PASS" 7550X1.snps.raw.moab.homosnp.vcf
netstat
#see all unique IP addresses connected to the server
netstat -nat | awk '{ print $5}' | cut -d: -f1 | sed -e '/^$/d' | uniq
#if your server is under a DOS attack
netstat -anp |grep 'tcp\|udp' | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -n
Thursday, March 31, 2011
phased genotype
A phased genotype indicates that for heterozygous loci it is known which chromosome each variant
sequence belongs to. That is, if one variant has a Variant_seq=A,C attribute and another variant on the same landmark
feature (chromosome) has a Variant_seq=T,G attribute then the A and T occur together on one copy of the chromosome and the C and G
occur together on the other copy of that chromosome.
Thursday, March 24, 2011
VCF manipulation (to be edited)
#how many raw snps?
$awk '!/#/ {print $0}' 7550X1.snps.raw.vcf | wc -l
3492007
#filter dbSNP member and retain PASSed SNPs.
$awk '!/#/ {print $0}' 7550X1.snps.raw.vcf | awk '$3 !~ "rs" {print $0}' | awk '$7 ~ "PASS" {print $0}' > 7550x1.nodbsnp.pass.vcf
#how many novel SNPs?
$wc -l 7550x1.nodbsnp.pass.vcf
312776
$vcf-validator 7550x1.nodbsnp.pass.vcf
#error , no header
$head -n 22 7550X1.snps.raw.vcf > head.txt
$cat head.txt 7550x1.nodbsnp.pass.vcf > 7550x1.vcf
#have to build index!
$bgzip 7550x1.vcf
$tabix -p vcf 7550x.vcf.gz
#the commom snps between x1 and x4 (affected)
$vcf-isec -n =2 7550x1.vcf.gz 7550x4.vcf.gz > x1x4.vcf
$wc -l x1x4.vcf
130480 x1x4.vcf
Tuesday, March 22, 2011
Append .jar to USeq applications
FILES=USeq_7.6.2/Apps/*
for f in $FILES
do
echo "Processing $f file..."
# take action on each file. $f store current file name
#cat $f
mv $f $f'.jar'
done
Monday, March 14, 2011
GNUMAP and novoalign benchmark
Data:
7550X1_1_1.txt.gz and 7550X1_1_2.txt.gz, pair-end, 25000 reads, 101bp per read.
chr20.fa, human genome chromosome 20, build hg19.
Command:
$gnumap-plain -g chr20.fa -o example.output -a .9 -v 1 --illumina 7550X1_1_1.txt.gz 7550X1_1_2.txt.gz
Output:
This is GNUMAP, Version 2.2.0, for public and private use.
Command Line Arguments: gnumap-plain -g chr20.fa -o example.output -a .9 -v 1 --illumina 7550X1_1_1.txt,7550X1_1_2.txt
Parameters:
Verbose: 1
Genome file(s): chr20.fa
Output file: example.output
Sequence file(s): 7550X1_1_1.txt,7550X1_1_2.txt
Align score: 0.9
Number of threads: 1
Mer size: 10
Using jump size of 5
Using Default Alignment Scores
Gap score: -4
Maximum Gaps: 3
Hashing the genome.
[0/1] gen_size=0, my_start=0, my_end=0
chr20.fa
Hashing Genome...
Reading: chr20
..............................................
Size of genome: 63025520
[0/-] Converting to Vector...
[0/-] Trying to create hash of size 1048576
[0/-] Finished create hash.
[0/-] Stats: Total hashes is 59499396, Longest hash is 69516, shortest is 1, and average is 56.743046
[0/-] Trying to create a new genome with a size of 63025520...Success!
[0/-] Trying to malloc 7878190 elements for positions array...Success!
[0/-] Finished Vector Conversion
Time to hash: 12 seconds
Matching 2 file(s):
[-/0] Matching 25000 sequences of: 7550X1_1_1.txt
Reads per processor: 128
[0/0] 33% reads complete
[0/0] 66% reads complete
[0/0] 98% reads complete
[-/0] Matching 25000 sequences of: 7550X1_1_2.txt
[0/0] 0% reads complete
[0/0] 33% reads complete
[0/0] 66% reads complete
[0/0] 98% reads complete
[0/0] 98% reads complete
[0/-] Time since start: 4890.28
[0/-] Printing output.
Finished printing to example.output.sgr
#Finished.
# Total Time: 4890.38 seconds.
# Found 49152 sequences.
# Sequences matched: 13921
# Sequences not matched: 35231
# Output written to example.output
Total wait time is 0.000000
The same data using novoalign(for fairness, remove license file to avoid threading and using unzipped fastq files):
$novoalign -o SAM $'@RG\tID:YING\tPL:ILLUMINA\tLB:LB_TEST\tSM:7851\tCN:HCI' -F ILMFQ -d chr20.nix -f 7550X1_1_1.txt 7550X1_1_2.txt >7550x1.sam
# novoalign (2.07.05 - Nov 29 2010 @ 13:34:51) - A short read aligner with qualities.
# (C) 2008 NovoCraft
# Licensed for evaluation, educational, and not-for-profit use only.
# novoalign -o SAM @RG ID:YING PL:ILLUMINA LB:LB_TEST SM:7851 CN:HCI -F ILMFQ -d chr20.nix -f 7550X1_1_1.txt 7550X1_1_2.txt
# Interpreting input files as Illumina FASTQ, Cassava Pipeline 1.3.
# Index Build Version: 2.7
# Hash length: 14
# Step size: 1
# Paired Reads: 25000
# Pairs Aligned: 368
# Read Sequences: 50000
# Aligned: 767
# Unique Alignment: 749
# Gapped Alignment: 18
# Quality Filter: 490
# Homopolymer Filter: 32
# Elapsed Time: 59.363 (sec.)
# CPU Time: 0.9 (min.)
Monday, March 7, 2011
Loss of function annotation
From 1000 genome project:
Functional annotation of SNPs, short indels and large structural variants was determined
with reference to the GENCODE v3b annotation release (Harrow, Denoeud et al. 2006).
Coding SNPs were mapped on to transcripts annotated as “protein_coding” and
containing an annotated START codon, and classified as synonymous, non-synonymous,
nonsense (stop codon-introducing), stop codon-disrupting or splice site-disrupting
(canonical splice sites). Transcripts labeled as NMD (predicted to be subject to nonsensemediated
decay) were not used. Small deletions predicted to cause a frame-shift and
large deletions predicted to disrupt gene function were also analysed.
Nonsense and splice-disrupting SNPs were flagged as likely representing reference error
or annotation artefacts if the inferred loss-of-function (LOF) allele was also the ancestral
state, or if the reference (non-LOF) allele was not observed in any individual in that
population, and were excluded from the per-individual counts in Table 2. Splice-disrupting
SNPs in non-canonical splice sites were also discarded. We did not consider the frameshift
status of splice-disrupting SNPs due to the challenges of inferring the effects of
removal of splice-donor and acceptor sites on exon structure, but rather treated all such
SNPs as likely to affect gene function.
We classified large deletions as gene-disrupting if they fulfilled the following criteria:
1. Removal of >50% of the coding sequence; or
2. Removal of the gene’s transcriptional start site or start codon; or
3. Removal of an odd number of internal splice sites; or
4. Removal of one or more internal coding exons that would be predicted to generate
a frameshift.
For large deletions with imprecise breakpoints, we conservatively required that the
deletions defined by both the inner and outer confidence intervals would have the same
predicted effect on gene function. For cases with microhomology at the break-point we
treated the breakpoint as falling to the right-hand side of the region of microhomology.
We did not perform functional annotation for large duplications due to the challenges of
inferring functional consequences. The numbers stated in the text should thus be
regarded as a lower bound for the number of observed loss-of-function variants per
individual genome.
However, it should also be noted that the proportion of false positive calls in the LOF class
due to sequencing and annotation errors is expected to be substantially higher than the
genome-wide average. This effect is expected as LOF sites show a low level of true
polymorphism due to selective constraint, meaning that a uniform error rate across the
genome will result in a higher proportion of false positive calls compared to other (more
variable) sites.
Enrichment of false positive calls in LOF variants is most evident in the CHB+JPT
samples, which showed a higher per-individual number of LOF SNPs than other
populations despite a comparable number of synonymous variants (Supplementary Table
11), as well as an unusual peak in the derived allele frequency spectrum (Supplementary
Figure 13). This is likely due to a mild elevation in genome-wide false positive rates for
SNPs in this population compared to other samples, which is then highly enriched at
functionally constrained sites.
To lower the number of false positive indel calls we applied more stringent filters to the
subset of indels that were called in the genome-wide set and were predicted to fall into the
LOF class. The stringent filter requires that the range of positions where an indel would
yield the same alternative haplotype sequence as the original called indel (for instance, in
a repeat, the deletion of any repeat unit would give the same alternative haplotype), plus 4
bases of reference sequence on both sides of this region, was covered by at least one
read on the forward strand, and at least one read on the reverse strand, with at most one
mismatch between the read and the alternative haplotype sequence resulting from the
indel (regardless of base-qualities). This filter removed an excess of 1 bp frameshift
insertions seen in CHB+JPT with respect to CEU in the less stringently filtered genomewide
indel call set, although it is expected to remove a significant number of true positive
calls as well. The indels that pass these stringent filters have been annotated in the
project’s VCF files.
Experimental validation and manual reannotation of identified LOF variants is currently
ongoing (manuscript in preparation).
For extrapolating the functional variants identified per individual in the exon project to the
whole genome (Table 2) we used the ratio of the total coding sequence and splice sites in
the exon-capture target regions (1,423,559 bp and 7,513, respectively) to the
corresponding numbers for the GENCODE v3b annotation set as a whole (35,676,620 bp
and 384,439, respectively).
The coordinates and predicted functional consequences of all of the LOF variants
identified in the project are available on the 1000 Genomes FTP site.
Wednesday, February 23, 2011
Tomato tutorial
Suppose you have a folder named "test" which looks like:
$pwd
/home/u12345/test
$ll
-rw-r--r-- 1 hiseq hiseq 2145917 Feb 23 11:05 7550X1_1_1.txt.gz
-rw-r--r-- 1 hiseq hiseq 2212841 Feb 23 11:05 7550X1_1_2.txt.gz
-rw-r--r-- 1 hiseq hiseq 2114459 Feb 23 11:05 7550X1_2_1.txt.gz
-rw-r--r-- 1 hiseq hiseq 2135003 Feb 23 11:05 7550X1_2_2.txt.gz
-rw-r--r-- 1 hiseq hiseq 2129933 Feb 23 11:05 7550X1_3_1.txt.gz
-rw-r--r-- 1 hiseq hiseq 2168390 Feb 23 11:05 7550X1_3_2.txt.gz
-rw-r--r-- 1 hiseq hiseq 2147386 Feb 23 11:08 7550X2_5_1.txt.gz
Now you want to do a alignment against "humen genome build 19" using these fastq files.
Following two steps to do this:
1. Create a "cmd.txt" file in this folder with two lines:
$cat > cmd.txt
#e your.email@gmail.com
@hg19.nix,bam
echo 'hello,world'
date
The first line starts with '#e' gives your emails address if you want to be notified once the job is done. It is optional.
The second line starts with '@' gives a keywords based commands. It will be translated into:
"I want to do a alignment against hg19.nix(a novoalign indexed human genome) using all fastq files in this folder, the output should be in .bam format."
If you want the output in .sam format, use "@hg19.nix,sam".
All the fastq files are automatically paired. For above example, if will be:
7550X1_1_1.txt.gz and 7550X1_1_2.txt.gz
7550X1_2_1.txt.gz and 7550X1_2_2.txt.gz
7550X1_3_1.txt.gz and 7550X1_3_2.txt.gz
7550X2_5_1.txt.gz
so the pseudo commands to be executed on the cluster machines will be like:
novoalign -o SAM -d hg19.nix -f 7550X1_1_1.txt.gz 7550X1_1_2.txt.gz > 7550X1_1.sam 2>7550X1_1.err
novoalign -o SAM -d hg19.nix -f 7550X1_2_1.txt.gz 7550X1_2_2.txt.gz > 7550X1_2.sam 2>7550X1_2.err
novoalign -o SAM -d hg19.nix -f 7550X1_3_1.txt.gz 7550X1_3_2.txt.gz > 7550X1_3.sam 2>7550X1_3.err
novoalign -o SAM -d hg19.nix -f 7550X2_5_1.txt.gz > 7550X2_5.sam 2>7550X2_5.err
echo "hello,world"
date
*All lines that do not start with '#' or '@' are just normal Linux commands (in this case, echo 'hello,world' and date ) you want to execute.
2. Create an empty "b" file to tell the tomato this job is ready to "begin".
>touch b
'''Further Reading'''
The "keyword based commands" was designed to provide an easy way to do simple jobs which share standard procedures.
At now, the keywords only recognize "bam","sam" and any words with ".nix". Next, "snp" and "indel" are under development and test.
For example, a command like:
'''@hg19.nix, snp'''
will be translated into a standard procedure which comprise of many steps (commands) to call snp variants:
'''
align...
sort...
remove duplicates...
re-align around indels...
calibrate quality score...
merge lane samples....
sort...
call snp...
filter snp calls....
'''
However, you can always use your own commands in "cmd.txt" if you want (and you know how):
'''
#e your.email@gmail.com
#i 7550X1_1_1.txt.gz , 7550X1_1_1.txt.gz
novoalign -o SAM $'@RG\tID:USER\tPL:ILLUMINA\tLB:LIBRARY\tSM:7550X1\tCN:HCI' -d hg19.nix -f 7550X1_1_1.txt.gz 7550X1_1_1.txt.gz > 7550X1_1.sam 2>7550X1_1.err
SortSam.jar INPUT=7550X1_1.sam OUTPUT=7550X1_1.bam CREATE_INDEX=true SO=coordinate
GenomeAnalysisTK.jar -T RealignerTargetCreator -R hg19.fasta -I 7550X1_1.bam -B:indels,VCF dbsnp132.human.vcf -o 7550X1_1.intervals
...
'''
*The second line starts with '#i' gives the input files(seperated by comma ",") to be used in your commands. In the case, the tomato will download these specified files only.
If '#i' was not given, then all files under this folder will be downloaded into cluster machine as possible input files.
[to be continued]
Monday, February 7, 2011
Craete GATK compatible dbSNP file
$ lftp ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/ByChromosomeNoGeno/00-All.vcf.gz
$ gunzip 00-All.vcf.gz
$ grep -c '#' 00-All.vcf
57
$ tail -n+58 00-All.vcf | awk '$0 !~ "_random" {print}' | awk '$0 !~ "_hap" {print}'| awk '$0 !~ "PAR" {print}' | awk '{sub(/MT/, "M", $0); print}' | awk '{print "chr" $0}'> dbsnp132.human.body.vcf
$ awk '{print $1}' dbsnp132.human.body.vcf | sort -u
chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr2
chr20
chr21
chr22
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrM
chrX
chrY
$ head -n 57 00-All.vcf > dbsnp132.human.head.vcf
$ cat dbsnp132.human.head.vcf dbsnp132.human.body.vcf > dbsnp132.human.vcf
$ rm dbsnp132.human.body.vcf dbsnp132.human.head.vcf
Friday, February 4, 2011
Tuesday, February 1, 2011
7550x.tar
gunzip *.gz
cat `ls *.txt | sort -g` > seq.txt
gzip -c seq.txt > 7550X1_B20AAHABXX_1_1.txt.gz
Tuesday, January 11, 2011
Humen Genome hg19 size
3095693983 bp
If we need a coverage of 30x on a 100-bp reads, then we need
3095693983*30/100 = 928708195 reads
Tuesday, January 4, 2011
Subscribe to:
Posts (Atom)