Build a pipeline that:
1. Uses novoalign to align some data to HG19.
2. Parses (transforms) the novoalign output to identify genomic variants:
2.1 SNP calling (using MAQ or SOAP)
2.2 Indels
2.3 Inversions, duplications, etc.
3. Investigates the effects of genomic variants on genes (synonymous vs. non-synonymous mutations), using USeq, etc.
4. Visualizes these effects using IGB, UCSC, etc.
***************************************************************************
#download the human genome build and build an index for novoalign
>pwd #the path for genome build
/home/galaxy/tools/novoalign-2.06.09/genome/HG19
>wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
>tar -zxvf chromFa.tar.gz
#now we have a list of .fa files for chromosomes.
>rm chromFa.tar.gz
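#optional sanity check before indexing: count the sequences in each extracted .fa file.
#This is only a sketch; count_fasta_records is a made-up helper name, not part of novoalign,
#and it assumes plain FASTA where every record starts with a '>' header line.

```shell
# count_fasta_records (hypothetical helper): print "<file> <number of sequences>"
# for every FASTA file passed as an argument.
count_fasta_records() {
    for fa in "$@"; do
        # grep -c prints the number of header lines; "|| true" keeps the
        # exit status clean when a file contains no headers at all
        n=$(grep -c '^>' "$fa" || true)
        echo "$fa $n"
    done
}

# usage, from the directory holding the extracted chromosomes:
#   count_fasta_records *.fa
```

#a file that reports 0 sequences is probably truncated or not FASTA, and is worth re-downloading before running novoindex.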
#use novoindex to build an index
>novoindex hg19.novoindex *.fa
#size is ~6.5G
>ll hg19.novoindex
-rwxr--r-- 1 root root 6505422914 2010-08-13 10:50 hg19.novoindex*
#run a test with 25 reads
>time novoalign -d genome/HG19/hg19.novoindex -f read/test.txt >hello.txt
real 0m47.846s
user 0m0.180s
sys 0m5.220s
#check out the result
>less hello.txt
# novoalign (2.06.09MT - Jun 16 2010 @ 12:36:05) - A short read aligner with qualities.
# (C) 2008 NovoCraft
# Licensed for evaluation, educational, and not-for-profit use only.
# novoalign -d genome/HG19/hg19.novoindex -f read/test.txt
# Interpreting input files as Illumina FASTQ, Cassava Pipeline 1.3.
# Index Build Version: 2.6
# Hash length: 14
# Step size: 3
....
# Read Sequences: 25
# Aligned: 22
# Unique Alignment: 22
# Gapped Alignment: 1
# Quality Filter: 1
# Homopolymer Filter: 0
# Elapsed Time: 0,035s
# Done.
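#the summary at the end of the report can be turned into an alignment rate.
#A minimal sketch, assuming the summary lines keep the "Read Sequences:" / "Aligned:" form shown above;
#aligned_percent is a name made up for illustration.

```shell
# aligned_percent (hypothetical helper): read a novoalign report and print
# the percentage of reads that were aligned, or "NA" if no read count is found.
aligned_percent() {
    awk -F: '
        # summary lines look like "# Read Sequences: 25"
        /^#[ \t]*Read Sequences:/ { total = $2 + 0 }
        # matches "# Aligned:" only, not "# Unique Alignment:" or "# Gapped Alignment:"
        /^#[ \t]*Aligned:/        { aligned = $2 + 0 }
        END {
            if (total > 0) printf "%.1f\n", 100 * aligned / total
            else print "NA"
        }
    ' "$1"
}

# usage:
#   aligned_percent hello.txt
```

#for the test run above (22 of 25 reads aligned) this works out to 88.0.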
***************************************************************************
#Install MAQ
>pwd
/home/galaxy/tmp
>wget -O maq-0.7.1.tar.bz2 http://sourceforge.net/projects/maq/files/maq/0.7.1/maq-0.7.1.tar.bz2/download
>tar -jxvf maq-0.7.1.tar.bz2
>cd maq-0.7.1
>./configure --prefix=/home/galaxy/tools/maq-0.7.1 && make && make install
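#after installing, make sure the shell can find the new binary.
#A sketch only: it assumes "make install" placed maq under <prefix>/bin (the usual autotools layout),
#and require_tool is a made-up helper name.

```shell
# Assumption: the maq executable ended up in <prefix>/bin; adjust if your
# install put it somewhere else.
PATH=/home/galaxy/tools/maq-0.7.1/bin:$PATH
export PATH

# require_tool (hypothetical helper): succeed only if the command is on PATH,
# otherwise print a message to stderr and return 1.
require_tool() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "not on PATH: $1" >&2
        return 1
    }
}

# usage:
#   require_tool maq && require_tool novoalign
```

#the same check can be reused for every tool the pipeline depends on, so a missing install fails early instead of halfway through a run.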
***************************************************************************
#Install SOAPsnp
#boost (http://sourceforge.net/projects/boost/) is required to build SOAPsnp
>wget -O boost_1_43_0.tar.gz http://sourceforge.net/projects/boost/files/boost/1.43.0/boost_1_43_0.tar.gz/download
>tar -zxvf boost_1_43_0.tar.gz
>cd boost_1_43_0
#move the boost header directory to the system include path so that gcc can find it.
>mv boost /usr/local/include/
#download SOAPsnp
>wget http://soap.genomics.org.cn/down/SOAPsnp-v1.03.tar.gz
>tar -zxvf SOAPsnp-v1.03.tar.gz
>cd SOAPsnpZ
>make all #this builds a single executable file, soapsnp
>ll soapsnp
-rwxr-xr-x 1 root root 123425 2010-08-13 12:04 soapsnp*
#move it to a directory on $PATH
>mv soapsnp /home/galaxy/tools
>soapsnp
SoapSNP
Compulsory Parameters:
-i Input SORTED Soap Result
-d Reference Sequence in fasta format
-o Output consensus file
......
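#a run with only the three compulsory parameters could be wrapped like this.
#A hedged sketch: the file names in the usage line are hypothetical, run_soapsnp is a made-up wrapper,
#and only -i, -d and -o are taken from the usage text above. The input must already be a sorted SOAP result.

```shell
# run_soapsnp (hypothetical wrapper): check that both inputs are readable,
# then call soapsnp with the three compulsory parameters.
#   $1  sorted SOAP alignment result
#   $2  reference sequence in FASTA format
#   $3  output consensus file
run_soapsnp() {
    sorted_soap=$1
    ref_fa=$2
    out_cns=$3
    for f in "$sorted_soap" "$ref_fa"; do
        [ -r "$f" ] || { echo "missing input: $f" >&2; return 1; }
    done
    soapsnp -i "$sorted_soap" -d "$ref_fa" -o "$out_cns"
}

# usage (hypothetical file names):
#   run_soapsnp reads.sorted.soap chr1.fa chr1.cns
```

#checking the inputs first gives a clearer error than letting soapsnp fail on a missing file.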
***************************************************************************
#download the sequencing data
>lftp -u USERNAME,PASSWORD ftp://cdts.genomics.org.cn
#not sure yet which data should be downloaded, so check GenomEx first.