Introduction

ReviSeq is a free tool to generate and to revise a bacterial genome sequence from short sequence reads and a reference sequence.

Next-generation sequencing technologies produce a vast amount of sequence data in a very short time. However, data analysis remains limiting and problematic, especially for low complexity repeat sequences and transposon elements due to inherent sequencing errors and short sequence read lengths. To overcome this problem, we have developed ReviSeq using a hybrid method comprised of iterative remapping and local assembly upon a bacterial sequence backbone.

License: GPL V3 (http://www.gnu.org/licenses/gpl-3.0.html)

How to use ReviSeq

Table of Contents

Introduction. 1

Methods used in ReviSeq. 1

Getting started. 3

Installation. 3

Run the ReviSeq pipeline. 3

1. Inputs. 3

2. Outputs. 4

Run individual PERL script 4

1. run_bwa.pl 4

2. bwasw_contigs.pl 4

3. localAssembly.pl 5

4. run_iteration.pl 5

5. getProblematicReads.sh. 5

6. twoSeq_comparison.pl 5

Methods used in ReviSeq

The iterative remapping and local assembly approach used by ReviSeq is illustrated in the Figure.

Sequence reads are mapped to the reference genome, S₀, by BWA and a new consensus sequence, S₁, is generated from the read mapping results. And then, the new consensus sequence, S_1, and contig sequences, which enable detection of long insertion and deletion events, are aligned to the reference sequence, S_0, by BWASW. A new consensus sequence, S_2, is generated using the simple majority voting method from the mapping results of the de novo assembly contigs and S₁. For this step, S₁ has higher priority than contigs. As a result, if S₁ and contig sequences are different at a locus, S₁ was chosen unless two de novo assembly contig sequences were consistent. This step allows for detection of large INDEL variants that are not correctly detected by a simple mapping method.

Then, the sequencing reads are mapped to the new consensus sequence, S_2, by BWA to generate another consensus sequence, S₃. If consensus sequences S₂ and S₃ are different, the reads are mapped to S₃ to test whether all reads are correctly aligned to the same positions and a new consensus, S_4, is generated from the newly aligned reads. The mapping/comparing is iterated until S_i_-1 and S_i converge or the number of iteration reaches to a given value.

Next, partially mapped reads (clipped reads) to S_i are counted to search for long INDELs which have not been detected in prior alignment steps. When the number of partially mapped reads is more than a given number at a position in S_i, the partially mapped reads are locally assembled by searching for exact matches only and their contig is subsequently aligned to S_i. If a long INDEL is detected, S_i is modified and used as a new reference sequence and resubmitted to the iterative remapping process until there is no change.

Getting started

ReviSeq is available at http://reviseq.sourceforge.net.

It requires BWA and SAMtools installed. For SAMTools, 'pileup' sub module should be available. 'pileup' is used to generate variants from the BWASW. 'mpileup' was reported to have a memory problem in generating the pileup format from the alignments of long sequences such as contig or genome sequences.

Installation

The current release runs in the LINUX environment.

$ tar xzf reviseq.[VERSION].tar.gz

$ cd reviseq.[VERSION]

Run the ReviSeq pipeline

Users can run the ReviSeq pipeline or individual PERL script. The ReviSeq pipeline contains all basic processes to generate and to revise the bacterial genome sequence of a target sample.

$ [ReviSeq_path]/reviseq_pipeline.pl -o <header of output> -r <reference FASTA file>
-c <contig FASTA file> [-c2 <second contig FASTA file>]
-lib1 <FASTQ files> [-lib2 <FASTQ files>]

ex)

$ [ReviSeq_path]/reviseq_pipeline.pl -o myHeader -r ref.fa -c contigs.fa -lib1 'lib1_1.fastq lib1_2.fastq'

$ [ReviSeq_path]/reviseq_pipeline.pl -o myHeader -r ref.fa -c contigs.fa -lib1 lib1.fq -lib2 'lib2_1.fq lib2_2.fq'

1. Inputs

ReviSeq takes an output header, a reference file, contig sequence files and FASTQ files containing read sequences.

2. Outputs

Each step of the pipeline generates several output files whose names start with the output header.

"*.align.coverage" files : show the coverage information of mapped reads to the reference.

"*.align.gap" files : show regions not covered by any read.

"*.align.var" files : show the variants.

"*.align.seq" files : contain consensus sequences in which the reference bases have been replaced by the variants.

The final output files are "[output header].final.align.*"

Run each individual PERL script

Users can run each individual PERL script to check the output of each step.

1. run_bwa.pl

It takes an output header, a reference file and FASTQ files containing read sequences as inputs, and runs BWA. And it generates "[header].sam", "[header].bam", "[header].pileup", "[header].align.coverage", "[header].align.gap", "[header].align.var" and "[header].align.seq" files.

$ [ReviSeq_path]/run_bwa.pl -o <header of output> -r <reference FASTA file>
-lib1 <FASTQ files> [-lib2 <FASTQ files>]

ex)

$ [ReviSeq_path]/run_bwa.pl -o myHeader -r ref.fa -lib1 'lib1_1.fastq lib1_2.fastq'

2. bwasw_contigs.pl

It takes an output SAM file name, a reference file, a consensus sequence file and contig sequence files as inputs, and runs BWASW to align the consensus and contig sequences to the reference sequence. And it generates the output SAM file, "[header].bam", "[header].pileup", "[header].align.coverage", "[header].align.gap", "[header].align.var" and "[header].align.seq" files.

$ [ReviSeq_path]/bwasw_contigs.pl -o <output SAM file> -r <reference FASTA file>
-m <consensus file from mapping>
-c <contig file from assembly> [-c2 <second contig FASTA file>]

ex)

$ [ReviSeq_path]/bwasw_contigs.pl -o output.sam -r ref.fa -m p1.align.seq -c contigs.fa

3. localAssembly.pl

It takes an output file name, a reference file and a SAM file containing read alignments as inputs, searches assembly points and does local assembly for each point. It generates a new sequence file in a FASTA format.

$ [ReviSeq_path]/localAssembly.pl -o <output file name> -r <reference FASTA file> -s <input SAM file>
[-n number of clipped reads to trigger assembly (default : 5)]

ex)

$ [ReviSeq_path]/localAssembly.pl -o ref.revised.fa -r ref.fa -s input.sam

4. run_iteration.pl

It takes an output header, a reference file, FASTQ files containing read sequences, an iteration starting index i and a number of iteration n as inputs, and runs BWA 'n' times. Each iteration step has an index, "i + [iteration count]" and generates "[header].it_[index].sam", "[header].it_[index].bam", "[header].it_[index].pileup", "[header].it_[index].align.coverage", "[header].it_[index].align.gap", "[header].it_[index].align.var" and "[header].it_[index].align.seq" files.

$ [ReviSeq_path]/run_iteration.pl -o <header of output> -r <reference FASTA file>
-i <iteration starting index> -n <number of iteration>
-lib1 <FASTQ files> [-lib2 <FASTQ files>]

ex)

$ [ReviSeq_path]/run_iteration.pl -o myHeader -r ref.fa -i 1 -n 5 -lib1 'lib1_1.fq lib1_2.fq'

5. getProblematicReads.sh

It takes a reference file and a SAM file as inputs, extracts abnormal reads (clipped reads, long distance pair reads, matching pair unmapped) to separated SAM files, and generates WIG format files.

$ [ReviSeq_path]/getProblematicReads.sh <reference file> <sam file>

6. twoSeq_comparison.pl

It takes an output header, a reference file and a revised genome sequence as inputs, and runs BWASW to align the revised genome sequence to the reference sequence. And it generates the "[header].sam", "[header].bam", "[header].pileup", "[header].align.coverage", "[header].align.gap", "[header].align.var" and "[header].align.seq" files.

$ [ReviSeq_path]/twoSeq_comparison.pl -o <header of output> -r <reference FASTA file> -i <input sequence file>

ex)

$ [ReviSeq_path]/twoSeq_comparison.pl -o new.final -r ref.fa -i new.genome.fa

Revising bacterial genome sequences Revising bacterial genome sequences ReviSeq ReviSeq