Next Generation Sequencing and Data Analysis: Pipeline for Genomic Variants Identification

Build a pipeline that:

1. Uses novoalign to align some data to HG19.

2. Parses (transforms) the novoalign output for genomic variant identification:

2.1 SNP calling (using MAQ or SOAP)

2.2 Indels

2.3 Inversions, duplications, etc.

3. Investigates the effects of genomic variants on genes (synonymous or non-synonymous mutations?) using USeq, etc.

4. Visualizes these effects using IGB, UCSC, etc.

***************************************************************************

#download the human genome build and build an index for novoalign

>pwd #the path for genome build

/home/galaxy/tools/novoalign-2.06.09/genome/HG19

>wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz

>tar -zxvf chromFa.tar.gz

#now we have a list of .fa files for chromosomes.

>rm chromFa.tar.gz
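Before indexing, it can help to confirm that the chromosome files unpacked correctly. The following is a minimal sketch, not part of the original session: the helper name count_fasta_records is made up for illustration, and it simply counts FASTA header lines (lines starting with '>') across the files it is given.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): count FASTA records,
# i.e. header lines that start with '>', across all files given as arguments.
count_fasta_records() {
    cat "$@" | grep -c '^>'
}

# Usage, assuming the chr*.fa files are in the current directory:
#   count_fasta_records *.fa
```

Each per-chromosome file is expected to hold one record, so the count should match the number of .fa files.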

#use novoindex to build an index

>novoindex hg19.novoindex *.fa

#size is ~6.5G

>ll hg19.novoindex

-rwxr--r-- 1 root root 6505422914 2010-08-13 10:50 hg19.novoindex*
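The size check above can also be scripted, for example to fail early when the index is missing or empty. This is only a sketch: the function names bytes_to_gb and index_size_gb are invented here, and sizes are reported in decimal gigabytes (10^9 bytes), which is the unit that matches the ~6.5G figure.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original post).

# Convert a byte count to decimal gigabytes with one digit after the point.
bytes_to_gb() {
    LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1e9 }'
}

# Print the size of an index file in GB; fail if it is missing or empty.
index_size_gb() {
    [ -s "$1" ] || { echo "missing or empty index: $1" >&2; return 1; }
    bytes_to_gb "$(wc -c < "$1")"
}

# Usage:
#   index_size_gb hg19.novoindex
```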

#run a test with 25 reads

>time novoalign -d genome/HG19/hg19.novoindex -f read/test.txt >hello.txt

real 0m47.846s

user 0m0.180s

sys 0m5.220s

#check out the result

>less hello.txt

# novoalign (2.06.09MT - Jun 16 2010 @ 12:36:05) - A short read aligner with qualities.

# Licensed for evaluation, educational, and not-for-profit use only.

# novoalign -d genome/HG19/hg19.novoindex -f read/test.txt

# Interpreting input files as Illumina FASTQ, Cassava Pipeline 1.3.

# Index Build Version: 2.6

# Hash length: 14

# Step size: 3

....

# Read Sequences: 25

# Aligned: 22

# Unique Alignment: 22

# Gapped Alignment: 1

# Quality Filter: 1

# Homopolymer Filter: 0

# Elapsed Time: 0,035s

# Done.
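The summary lines at the end of the report are easy to turn into an alignment rate. A minimal sketch follows, assuming the report keeps the "# Read Sequences:" and "# Aligned:" labels shown above; the function name alignment_rate is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): read a novoalign report
# from the given files (or stdin) and print the fraction of aligned reads.
alignment_rate() {
    LC_ALL=C awk '
        /^#[[:space:]]*Read Sequences:/ { total = $NF }    # total reads
        /^#[[:space:]]*Aligned:/        { aligned = $NF }  # aligned reads
        END {
            if (total > 0)
                printf "%d/%d reads aligned (%.1f%%)\n", aligned, total, 100 * aligned / total
            else
                exit 1   # no read count found in the input
        }' "$@"
}

# Usage:
#   alignment_rate hello.txt
```

For the test run above (25 reads, 22 aligned) this would report 88.0%.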

***************************************************************************

#Install MAQ

>pwd

/home/galaxy/tmp

>wget -O maq-0.7.1.tar.bz2 http://sourceforge.net/projects/maq/files/maq/0.7.1/maq-0.7.1.tar.bz2/download

>tar -jxvf maq-0.7.1.tar.bz2

>cd maq-0.7.1

>./configure --prefix=/home/galaxy/tools/maq-0.7.1 && make && make install
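With a custom --prefix, the maq binary is not on $PATH automatically. The sketch below assumes the usual autotools layout, where executables end up in the bin directory under the prefix; the function name prepend_path is invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): put a directory at the
# front of $PATH unless it is already listed there.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Usage, assuming binaries were installed under <prefix>/bin:
#   prepend_path /home/galaxy/tools/maq-0.7.1/bin
#   command -v maq    # should print the full path to maq
```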

***************************************************************************

#Install SOAPsnp

#boost(http://sourceforge.net/projects/boost/) is required to build SOAPsnp

>wget -O boost_1_43_0.tar.gz http://sourceforge.net/projects/boost/files/boost/1.43.0/boost_1_43_0.tar.gz/download

>tar -zxvf boost_1_43_0.tar.gz

>cd boost_1_43_0

#move the boost header directory into the system include path so that gcc can find it.

>mv boost /usr/local/include/
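A quick way to confirm that the headers landed where the compiler will look is to read the version string out of boost/version.hpp. This is a sketch under the assumption that the header defines BOOST_LIB_VERSION as a quoted string such as "1_43"; the function name boost_lib_version is made up here.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): print the Boost library
# version found under an include directory (default: /usr/local/include).
boost_lib_version() {
    sed -n 's/^#define BOOST_LIB_VERSION "\([^"]*\)".*/\1/p' \
        "${1:-/usr/local/include}/boost/version.hpp"
}

# Usage:
#   boost_lib_version                      # checks /usr/local/include
#   boost_lib_version /some/other/include  # checks a different location
```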

#download SOAPsnp

>wget http://soap.genomics.org.cn/down/SOAPsnp-v1.03.tar.gz

>tar -zxvf SOAPsnp-v1.03.tar.gz

>cd SOAPsnpZ

>make all #this will make a single executable file, soapsnp

>ll soapsnp

-rwxr-xr-x 1 root root 123425 2010-08-13 12:04 soapsnp*

#move it to a directory that is on $PATH

>mv soapsnp /home/galaxy/tools

>soapsnp

SoapSNP

Compulsory Parameters:

-i Input SORTED Soap Result

-d Reference Sequence in fasta format

-o Output consensus file

......
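The three compulsory parameters listed above can be wrapped in a small function that checks its inputs before calling soapsnp. This is only a sketch: the function name run_soapsnp, the DRY_RUN variable and the file names in the usage line are all hypothetical, and only the -i, -d and -o options shown in the usage message are used.

```shell
#!/bin/sh
# Hypothetical wrapper (not from the original post).
# Arguments: <sorted SOAP result> <reference FASTA> <output consensus file>
# Set DRY_RUN=1 to print the command instead of running it.
run_soapsnp() {
    sorted=$1
    ref=$2
    out=$3
    [ -s "$sorted" ] || { echo "missing input: $sorted" >&2; return 1; }
    [ -s "$ref" ] || { echo "missing reference: $ref" >&2; return 1; }
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "soapsnp -i $sorted -d $ref -o $out"
    else
        soapsnp -i "$sorted" -d "$ref" -o "$out"
    fi
}

# Usage (file names are placeholders):
#   DRY_RUN=1 run_soapsnp reads.sorted.soap chr1.fa chr1.consensus
```

Note that the usage message asks for a SORTED SOAP result, so the input has to be sorted before this step.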

***************************************************************************

#download the sequencing data

>lftp -u USERNAME,PASSWORD ftp://cdts.genomics.org.cn

#it is not yet clear which data set should be downloaded, so check GenomEx first.

Friday, August 13, 2010

Pipeline for Genomic Variants Identification - Part 1
