Revising bacterial genome sequences
ReviSeq
ReviSeq
|
|
IntroductionReviSeq is a free tool to generate and to revise a bacterial genome sequence from short sequence reads and a reference sequence. Next-generation sequencing technologies produce a vast amount of sequence data in a very short time. However, data analysis remains limiting and problematic, especially for low complexity repeat sequences and transposon elements due to inherent sequencing errors and short sequence read lengths. To overcome this problem, we have developed ReviSeq using a hybrid method comprised of iterative remapping and local assembly upon a bacterial sequence backbone. License: GPL V3 (http://www.gnu.org/licenses/gpl-3.0.html)
How to use ReviSeq
Table of Contents
Methods used in ReviSeqThe iterative remapping and local assembly approach used by ReviSeq is illustrated in the Figure.
Sequence reads are mapped to the reference genome, S0, by BWA and a new consensus sequence, S1, is generated from the read mapping results. And then, the new consensus sequence, S1, and contig sequences, which enable detection of long insertion and deletion events, are aligned to the reference sequence, S0, by BWASW. A new consensus sequence, S2, is generated using the simple majority voting method from the mapping results of the de novo assembly contigs and S1. For this step, S1 has higher priority than contigs. As a result, if S1 and contig sequences are different at a locus, S1 was chosen unless two de novo assembly contig sequences were consistent. This step allows for detection of large INDEL variants that are not correctly detected by a simple mapping method. Then, the sequencing reads are mapped to the new consensus sequence, S2, by BWA to generate another consensus sequence, S3. If consensus sequences S2 and S3 are different, the reads are mapped to S3 to test whether all reads are correctly aligned to the same positions and a new consensus, S4, is generated from the newly aligned reads. The mapping/comparing is iterated until Si-1 and Si converge or the number of iteration reaches to a given value. Next, partially mapped reads (clipped reads) to Si are counted to search for long INDELs which have not been detected in prior alignment steps. When the number of partially mapped reads is more than a given number at a position in Si, the partially mapped reads are locally assembled by searching for exact matches only and their contig is subsequently aligned to Si. If a long INDEL is detected, Si is modified and used as a new reference sequence and resubmitted to the iterative remapping process until there is no change.
Getting startedReviSeq is available at http://reviseq.sourceforge.net. It requires BWA and SAMtools installed. For SAMTools, 'pileup' sub module should be available. 'pileup' is used to generate variants from the BWASW. 'mpileup' was reported to have a memory problem in generating the pileup format from the alignments of long sequences such as contig or genome sequences. InstallationThe current release runs in the LINUX environment.
$ tar xzf reviseq.[VERSION].tar.gz $ cd reviseq.[VERSION]
Run the ReviSeq pipelineUsers can run the ReviSeq pipeline or individual PERL script. The ReviSeq pipeline contains all basic processes to generate and to revise the bacterial genome sequence of a target sample.
$ [ReviSeq_path]/reviseq_pipeline.pl -o <header of
output> -r <reference FASTA file>
ex) $ [ReviSeq_path]/reviseq_pipeline.pl -o myHeader -r ref.fa -c contigs.fa -lib1 'lib1_1.fastq lib1_2.fastq' or $ [ReviSeq_path]/reviseq_pipeline.pl -o myHeader -r ref.fa -c contigs.fa -lib1 lib1.fq -lib2 'lib2_1.fq lib2_2.fq'
1. InputsReviSeq takes an output header, a reference file, contig sequence files and FASTQ files containing read sequences.
2. OutputsEach step of the pipeline generates several output files whose names start with the output header. "*.align.coverage" files : show the coverage information of mapped reads to the reference. "*.align.gap" files : show regions not covered by any read. "*.align.var" files : show the variants. "*.align.seq" files : contain consensus sequences in which the reference bases have been replaced by the variants.
The final output files are "[output header].final.align.*"
Run each individual PERL scriptUsers can run each individual PERL script to check the output of each step.
1. run_bwa.plIt takes an output header, a reference file and FASTQ files containing read sequences as inputs, and runs BWA. And it generates "[header].sam", "[header].bam", "[header].pileup", "[header].align.coverage", "[header].align.gap", "[header].align.var" and "[header].align.seq" files.
$ [ReviSeq_path]/run_bwa.pl -o <header of
output> -r <reference FASTA file>
ex) $ [ReviSeq_path]/run_bwa.pl -o myHeader -r ref.fa -lib1 'lib1_1.fastq lib1_2.fastq'
2. bwasw_contigs.plIt takes an output SAM file name, a reference file, a consensus sequence file and contig sequence files as inputs, and runs BWASW to align the consensus and contig sequences to the reference sequence. And it generates the output SAM file, "[header].bam", "[header].pileup", "[header].align.coverage", "[header].align.gap", "[header].align.var" and "[header].align.seq" files.
$ [ReviSeq_path]/bwasw_contigs.pl -o <output SAM file>
-r <reference FASTA file>
ex) $ [ReviSeq_path]/bwasw_contigs.pl -o output.sam -r ref.fa -m p1.align.seq -c contigs.fa
3. localAssembly.plIt takes an output file name, a reference file and a SAM file containing read alignments as inputs, searches assembly points and does local assembly for each point. It generates a new sequence file in a FASTA format.
$ [ReviSeq_path]/localAssembly.pl -o <output file name>
-r <reference FASTA file> -s <input SAM file>
ex) $ [ReviSeq_path]/localAssembly.pl -o ref.revised.fa -r ref.fa -s input.sam
4. run_iteration.plIt takes an output header, a reference file, FASTQ files containing read sequences, an iteration starting index i and a number of iteration n as inputs, and runs BWA 'n' times. Each iteration step has an index, "i + [iteration count]" and generates "[header].it_[index].sam", "[header].it_[index].bam", "[header].it_[index].pileup", "[header].it_[index].align.coverage", "[header].it_[index].align.gap", "[header].it_[index].align.var" and "[header].it_[index].align.seq" files.
$ [ReviSeq_path]/run_iteration.pl -o <header of
output> -r <reference FASTA file>
ex) $ [ReviSeq_path]/run_iteration.pl -o myHeader -r ref.fa -i 1 -n 5 -lib1 'lib1_1.fq lib1_2.fq'
5. getProblematicReads.shIt takes a reference file and a SAM file as inputs, extracts abnormal reads (clipped reads, long distance pair reads, matching pair unmapped) to separated SAM files, and generates WIG format files.
$ [ReviSeq_path]/getProblematicReads.sh <reference file> <sam file>
6. twoSeq_comparison.plIt takes an output header, a reference file and a revised genome sequence as inputs, and runs BWASW to align the revised genome sequence to the reference sequence. And it generates the "[header].sam", "[header].bam", "[header].pileup", "[header].align.coverage", "[header].align.gap", "[header].align.var" and "[header].align.seq" files.
$ [ReviSeq_path]/twoSeq_comparison.pl -o <header of output> -r <reference FASTA file> -i <input sequence file>
ex) $ [ReviSeq_path]/twoSeq_comparison.pl -o new.final -r ref.fa -i new.genome.fa |