Friday, August 13, 2010

Pipeline for Genomic Variants Identification - Part 1

Build a pipeline that:

1. Uses novoalign to align some data to HG19.

2. Parses (transforms) the novoalign output to identify genomic variants:
2.1 SNP calls (using MAQ or SOAPsnp)
2.2 Indels
2.3 Inversions, duplications, etc.

3. Investigates the effects of genomic variants on genes (synonymous vs. non-synonymous mutations) using USeq, etc.

4. Visualizes these effects using IGB, UCSC, etc.
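The four steps above could be chained by a small driver script. The sketch below is only a skeleton: `run_stage` and `run_pipeline` are hypothetical helper names, and each `true` stands in for a real command that is not worked out yet.

```shell
#!/bin/sh
# Skeleton of a stage runner; stage names and commands are placeholders.

# run_stage NAME CMD [ARGS...]: log the stage, run the command,
# and return its exit status so the caller can stop on failure.
run_stage() {
    name=$1
    shift
    echo "[stage] $name: $*" >&2
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "[stage] $name failed with exit status $status" >&2
    fi
    return "$status"
}

# Hypothetical driver: every `true` stands in for a real command
# (the aligner, a SNP caller, an annotation tool, a viewer export).
run_pipeline() {
    run_stage "1-align"     true || return 1
    run_stage "2-variants"  true || return 1
    run_stage "3-annotate"  true || return 1
    run_stage "4-visualize" true || return 1
    echo "pipeline finished"
}

run_pipeline
```

Stopping at the first failed stage keeps a bad alignment from silently feeding the variant callers.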

***************************************************************************
#download the human genome build and build an index for novoalign
>pwd #the path for genome build
/home/galaxy/tools/novoalign-2.06.09/genome/HG19

>wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
>tar -zxvf chromFa.tar.gz

#now we have a list of .fa files for chromosomes.
>rm chromFa.tar.gz

#use novoindex to build an index
>novoindex hg19.novoindex *.fa
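Before spending time on the index build, it may be worth checking that the archive unpacked cleanly. A minimal sketch, assuming each .fa file holds one sequence, so the number of FASTA headers should match the number of files; `count_fasta_records` is a hypothetical helper, and the demo uses tiny made-up files rather than the real chromosomes.

```shell
#!/bin/sh
# count_fasta_records FILE...: print the number of FASTA header lines
# (lines starting with '>') across the given files.
count_fasta_records() {
    cat "$@" | grep -c '^>'
}

# Demo on two tiny made-up FASTA files in a temporary directory.
demo_dir=$(mktemp -d)
printf '>chr1\nACGT\nACGT\n' > "$demo_dir/chr1.fa"
printf '>chr2\nGGCC\n' > "$demo_dir/chr2.fa"
count_fasta_records "$demo_dir"/*.fa   # prints 2
rm -r "$demo_dir"
```

In the HG19 directory the same call would be `count_fasta_records *.fa`, to be compared against `ls *.fa | wc -l`.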

#the index is ~6.5 GB
>ll hg19.novoindex
-rwxr--r-- 1 root root 6505422914 2010-08-13 10:50 hg19.novoindex*

#run a test with 25 reads
>time novoalign -d genome/HG19/hg19.novoindex -f read/test.txt >hello.txt
real 0m47.846s
user 0m0.180s
sys 0m5.220s

#check out the result
>less hello.txt
# novoalign (2.06.09MT - Jun 16 2010 @ 12:36:05) - A short read aligner with qualities.
# (C) 2008 NovoCraft
# Licensed for evaluation, educational, and not-for-profit use only.
# novoalign -d genome/HG19/hg19.novoindex -f read/test.txt
# Interpreting input files as Illumina FASTQ, Cassava Pipeline 1.3.
# Index Build Version: 2.6
# Hash length: 14
# Step size: 3
....
# Read Sequences: 25
# Aligned: 22
# Unique Alignment: 22
# Gapped Alignment: 1
# Quality Filter: 1
# Homopolymer Filter: 0
# Elapsed Time: 0,035s
# Done.
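The summary lines in the report are enough to compute an alignment rate. A small sketch, assuming the report keeps the "Read Sequences" and "Aligned" labels shown above; `alignment_rate` is a hypothetical helper, and the demo feeds it a copy of the counts from this run instead of the real hello.txt.

```shell
#!/bin/sh
# alignment_rate FILE: read a novoalign report and print the percentage
# of reads aligned, taken from the "Read Sequences" and "Aligned" lines.
alignment_rate() {
    awk -F':' '
        /^#[ \t]*Read Sequences:/ { total = $2 + 0 }
        /^#[ \t]*Aligned:/        { aligned = $2 + 0 }
        END {
            if (total == 0) exit 1    # no usable summary found
            printf "%.1f\n", 100 * aligned / total
        }
    ' "$1"
}

# Demo with the counts from the test run above.
report=$(mktemp)
cat > "$report" <<'EOF'
# Read Sequences: 25
# Aligned: 22
# Unique Alignment: 22
# Gapped Alignment: 1
EOF
alignment_rate "$report"   # prints 88.0
rm "$report"
```

On the real output the call would be `alignment_rate hello.txt`.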

***************************************************************************
#Install MAQ

>pwd
/home/galaxy/tmp

#-O names the saved file; otherwise wget may save it as "download"
>wget -O maq-0.7.1.tar.bz2 http://sourceforge.net/projects/maq/files/maq/0.7.1/maq-0.7.1.tar.bz2/download

>tar -jxvf maq-0.7.1.tar.bz2

>cd maq-0.7.1

>./configure --prefix=/home/galaxy/tools/maq-0.7.1 && make && make install
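After the install, the shell still needs to find the new binary. A minimal sketch, assuming `make install` puts the executable in the `bin` directory under the prefix (the usual behaviour, not verified here); `require_tool` and `MAQ_HOME` are hypothetical names.

```shell
#!/bin/sh
# Assumption: `make install` with the --prefix above places maq in <prefix>/bin.
MAQ_HOME=/home/galaxy/tools/maq-0.7.1
PATH="$MAQ_HOME/bin:$PATH"
export PATH

# require_tool NAME: succeed if NAME resolves to a command on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

require_tool maq || echo "maq is not on PATH yet; check $MAQ_HOME/bin"
```

The same check works for novoalign and soapsnp once their directories are on PATH.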



***************************************************************************
#Install SOAPsnp

#Boost (http://sourceforge.net/projects/boost/) is required to build SOAPsnp

#-O names the saved file; otherwise wget may save it as "download"
>wget -O boost_1_43_0.tar.gz http://sourceforge.net/projects/boost/files/boost/1.43.0/boost_1_43_0.tar.gz/download
>tar -zxvf boost_1_43_0.tar.gz
>cd boost_1_43_0

#move the boost header directory to the system include path so that gcc can find it
>mv boost /usr/local/include/


#download SOAPsnp

>wget http://soap.genomics.org.cn/down/SOAPsnp-v1.03.tar.gz
>tar -zxvf SOAPsnp-v1.03.tar.gz

>cd SOAPsnpZ

>make all #this will make a single executable file, soapsnp

>ll soapsnp
-rwxr-xr-x 1 root root 123425 2010-08-13 12:04 soapsnp*

#move it to a directory on $PATH
>mv soapsnp /home/galaxy/tools

>soapsnp
SoapSNP
Compulsory Parameters:
-i Input SORTED Soap Result
-d Reference Sequence in fasta format
-o Output consensus file
......
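The three compulsory parameters in the usage text map onto a command line like the one sketched below. This is a dry run only: `soapsnp_cmd` is a hypothetical helper that prints the command instead of running it, the file names are made up, and the novoalign output would first have to be converted into a sorted SOAP result, which is not covered here.

```shell
#!/bin/sh
# soapsnp_cmd SORTED_SOAP REF_FASTA OUT: print the soapsnp command line
# built from the three compulsory parameters (-i, -d, -o) in the usage text.
soapsnp_cmd() {
    if [ "$#" -ne 3 ]; then
        echo "usage: soapsnp_cmd SORTED_SOAP REF_FASTA OUT" >&2
        return 1
    fi
    printf 'soapsnp -i %s -d %s -o %s\n' "$1" "$2" "$3"
}

# Hypothetical file names, for illustration only.
soapsnp_cmd chr21.sorted.soap chr21.fa chr21.consensus
```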

***************************************************************************
#download the sequencing data

>lftp -u USERNAME,PASSWORD ftp://cdts.genomics.org.cn

#not sure yet which data should be downloaded, so check GenomEx first.

