circular genome assembly

Long, PCR-free nanopore sequencing reads enable the assembly of complete, reference-quality microbial genome sequences. For the genomes with a complete reference sequence (simulation data and PacBio sequencing data), we applied QUAST (v4.3) [33] to calculate the assembly statistics for all the tested algorithms, including number of contigs, maximum contig length, genome fraction, GC content, number of misassemblies, number of local misassemblies, duplication ratio, number of mismatches per 100 kbp, and number of indels per 100 kbp. Illumina 2 × 250 bp paired-end MiSeq reads covering the genome at ~250X were simulated using ART [31], with a mean insert size of 300 bp and a standard deviation of 10 bp. According to ONT's announcement, the company is targeting a consensus Q-score of 50 and then 60 (one error per megabase) through a variety of methods, including a new nanopore design (R10) and a new basecaller (Guppy). In this study, we used the 1D ligation sequencing kit (SQK-LSK108) to produce more than 6 Gbp of nanopore sequencing reads for a total of 12 samples and successfully assembled them into circular genomes. In addition, Flye and HINGE were used to assemble all reads, and they produced only 31 and 27 circular sequences, respectively. All-vs-all alignment of allcir.fa was performed using Nucmer to filter pair alignments between circular contigs with an alignment rate of 0.2, an aligned length of 2500 bp, and an identity of >0.98. Therefore, it accepts PacBio, ONT, Illumina data, or a combination of them. For successful bridging, several high-quality, bona fide alignments are required that cover the unresolved repeats as well as a large portion of their flanking regions. HINGE produces an assembly along a graph, from which a circular path can be observed for a circular sequence.
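The Q-score targets mentioned above can be related to error rates through the standard Phred definition, Q = -10 log10(p). The following is a minimal sketch of that arithmetic only; the function names are illustrative and do not come from any of the tools discussed here.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert a per-base error probability to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def bases_per_error(q):
    """Expected number of bases per single error at Phred score q."""
    return 1 / phred_to_error_probability(q)
```

Under this definition, Q50 corresponds to roughly one error per 100 kbp and Q60 to one error per megabase, which matches the figure quoted for the Q60 target.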
However, in the long-read-only mode, Unicycler uses minimap and miniasm [21] to assemble the long reads, which generates contigs with an error rate similar to that of the raw long reads, and the contigs that it produces tend to collapse repeats or segmental duplications. Although the genome fraction (percentage of aligned bases in the reference genome) of B-assembler ranked in the middle (98.446% vs. 99.839% for Unicycler's long-read mode), its duplication ratio (the total number of aligned bases in the assembly divided by the total number of aligned bases in the reference genome) is 1. In this situation, trimming one duplicated sequence is not sufficient to fix the contig because multiple copies of the true sequence remain. Nevertheless, one may argue the absence of true benchmarks. Y-CL and H-WC implemented the pipeline. Circlator attempts to circularize each contig in turn. Therefore, we expect that our CCBGpipe will help bacteriologists to produce highly accurate, complete, finished genomes from ONT-only long-read sets. If the file size of the representative contig is smaller than that of the miniasm assembly, then the miniasm assembly is polished by Racon (Vaser et al., 2017) and the Racon-polished assembly is then split into 500-kbp-long synthetic reads with an overlap of 10 kbp. QUAST results on the PacBio bacterial samples Table S4. For the total aligned PCR sequences, B-assembler had the minimum number of mismatches and indels. The polishing procedure is repeated three times to obtain a whole-genome sequence with minimal errors. B-assembler performs several additional steps instead of directly merging two rounds of assemblies to achieve a circular genome.
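The splitting of a polished assembly into 500-kbp synthetic reads with a 10-kbp overlap can be sketched as a sliding window. This is an illustration under stated assumptions, not the pipeline's actual code: the function name is hypothetical, "overlap" is taken to mean that consecutive reads share their last and first 10 kbp, and the final read is allowed to be shorter than 500 kbp.

```python
def split_into_synthetic_reads(sequence, read_len=500_000, overlap=10_000):
    """Cut a sequence into windows of read_len that overlap by `overlap` bases.

    Assumption: the last window may be shorter than read_len.
    """
    if read_len <= 0 or not 0 <= overlap < read_len:
        raise ValueError("need read_len > 0 and 0 <= overlap < read_len")
    if not sequence:
        return []
    step = read_len - overlap  # distance between consecutive window starts
    reads = []
    start = 0
    while True:
        reads.append(sequence[start:start + read_len])
        if start + read_len >= len(sequence):
            break  # this window already reaches the end of the sequence
        start += step
    return reads
```

With the default parameters, each window starts 490 kbp after the previous one, so every position in the assembly is covered by at least one synthetic read.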
The remaining reads, which are either mapped to short contigs (<100,000 bp) or are unmapped, are all retained. First, Canu and Nanopolish both required extensive computation, and approximately 1 day was required on a 16-core server with 96 GB of memory to complete one barcoded bacterial genome. Note that when assembling only long reads, Unicycler uses miniasm for assembly, coupled with multiple rounds of Racon for polishing. Using the ligation methodology, a nanopore sequencing library was constructed using the ligation sequencing kit 1D (SQK-LSK108) and the native barcoding kit (EXP-NBD103) for 12 samples. Finally, when the sequence being assembled is shorter than the length of some reads, the assembly may contain significant misassemblies in the form of multiple duplications of the entire circular molecule. We also calculated the differences (indels and mismatches) between the PCR amplicons and the contigs. The base accuracy was evaluated using the number of mismatches and the number of indels per 100 kbp. All the ONT reads were sequenced on a MinION sequencer. These 48 sequences were all manually examined with Tablet (Milne et al., 2013) to confirm the uniformity and continuity of sequencing coverage.
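The read-retention rule stated at the start of this passage (keep reads that are unmapped or that map to contigs shorter than 100,000 bp) can be expressed as a small filter. This is a sketch only; the input structures and the function name are hypothetical, and a real pipeline would read this information from an alignment file.

```python
def select_remaining_reads(read_to_contig, contig_lengths, short_contig_max=100_000):
    """Return IDs of reads that are unmapped or mapped to a short contig.

    read_to_contig: dict mapping read ID -> contig name, or None if unmapped
    contig_lengths: dict mapping contig name -> length in bp
    """
    kept = []
    for read_id, contig in read_to_contig.items():
        if contig is None:
            kept.append(read_id)  # unmapped reads are retained
        elif contig_lengths[contig] < short_contig_max:
            kept.append(read_id)  # reads on short contigs are retained
    return kept
```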
The sequence identity to the final release was 99.4, 89.0, 98.0, and 98.0% for Canu, miniasm, Flye, and HINGE, respectively. In almost all cases, short-read de novo assembly does not generate a fully finished, complete genome. Sequencing reads are de novo assembled several times by using a sampling strategy to produce circular contigs that have a sequence in common between their start and end. Number of circular contigs produced by miniasm and Canu with different subsets of reads. However, assembly of such genomes can be difficult in the absence of a reference genome, as most de novo assemblers do not account for circularity and produce linear sequences with an arbitrarily defined start and end. False circularization is of particular concern because one cause of assembly fragmentation is the presence of large repeat sequences in the assembled DNA. Reads with high Phred Q-scores can be used to correct some of the lower-quality long-read base pairs and to perform some gap filling. For PacBio sequencing data, we downloaded 14 bacterial strains [32], which included both Gram-positive and Gram-negative species, from the National Collection of Type Cultures (NCTC) 3000 project on the basis that high-quality reference genome sequences of the same strains were available for comparison (see Additional file 1 Table S5 for species and reference genome accession numbers). A 2018 review analyzed state-of-the-art tools associated with the genome assembly pipeline. The ONT data used in this study were downloaded from http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Assembling_a_MinION_dataset.
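The idea of a circular contig that has "a sequence in common between its start and end" can be illustrated with an exact-match check. This is a deliberately simplified sketch: the tools discussed here rely on alignment that tolerates sequencing errors, whereas this code requires the prefix and suffix to be identical. The function names and the `min_overlap` parameter are hypothetical.

```python
def find_end_overlap(contig, min_overlap=1000):
    """Return the length of the longest prefix that equals the suffix.

    Only overlaps of at least min_overlap and at most half the contig
    length are considered; returns 0 if none is found.
    """
    for k in range(len(contig) // 2, min_overlap - 1, -1):
        if contig[:k] == contig[-k:]:
            return k
    return 0


def trim_redundant_end(contig, min_overlap=1000):
    """Drop the duplicated end so the circle is represented once, linearly."""
    k = find_end_overlap(contig, min_overlap)
    return contig[:-k] if k else contig
```

For example, a contig built from a 13-base unit followed by the first 5 bases of the same unit is reported as having an overlap of 5, and trimming returns the original unit.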
These data can be found at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA459525. It then uses Racon to gap-fill the assembled reads into one or several consensus sequences. This can lead to the production of contigs containing the entire sequence of the plasmid two or more times. After merging, Circlator reduced the total number of contigs in the assemblies to 52, while Minimus2 made more merges, reducing the number of contigs to 48. For the real nanopore data, we inspected read alignments with respect to the contigs to assess the assemblies' structural accuracy. For general audiences who would like to obtain complete bacterial genomes using MinION but are not familiar with bioinformatics, we have created a GitHub page describing how to install and run our pipeline. It is evident that Unicycler's hybrid-read mode is far from optimal in this case. Vaser et al. (2017) reported that Racon coupled with miniasm enables accurate genome completion and is an order of magnitude faster. If there are small gaps remaining in the assembled genome, the pipeline moves to Step 4. An individual barcode was added to dA-tailed DNA by using the NEB Blunt/TA Ligase Master Mix (New England BioLabs). Schematic relationships between assemblies and final release assemblies for (A) barcode01 and (B) barcode10.
Determining the genome sequences of bacteria is critical for conducting human microbiome-associated health studies. To remove false positives, the reads that mapped were then mapped to the entire GRCh38 genome using BWA MEM with the same settings. A 2018 study obtained the complete sequences of 20 multidrug resistance-encoding plasmids from a barcoded MinION flow cell and demonstrated that long-read assembled plasmids possess high-quality skeletons with correct arrangements of various mobile elements. Each barcode folder contains a fastq file (joinedreads.fastq), an assembly file (assembly.fa), and subfolders containing 4000 fast5 files. A schematic workflow of our proposed pipeline, CCBGpipe, is shown in Figure 1. The assembly file was produced from the fastq file by minimap2 (Li, 2018) and miniasm (Li, 2016). We explored the possibility of circularizing a nanopore assembly, which is more challenging than PacBio because of the higher error rate in the reads and assembly contigs, using Escherichia coli ONT MinION data [15]. Matches between the reference and assembly are shown in light blue. It was successfully applied to a wide range of species and different technologies and outperformed existing semi-automatic methods. The fact that Flye and Unicycler generated more than one contig, as shown in Additional file 1 Fig S4, also supports this.
In the 1940s, a Dutch mathematician called Nicolaas de Bruijn became interested in finding the shortest circular string of characters that contains all possible substrings, each of the same length, in a given alphabet. Circos is a very powerful tool, but it requires installation and runs from a command-line interface, whereas ClicO-FS is a web service, developed on top of Circos, with a user interface. Although many algorithms and tools have been developed for base calling, read mapping, de novo assembly, and polishing, an automated pipeline is not available for one-stop analysis for circular bacterial genome reconstruction. However, most current TGS assemblers were specifically designed for human or other species that do not have a circular genome. If no such matches are found, the sequence will keep the default starting point. To the best of our knowledge, this is the first study to complete 12 multiplexed bacterial genomes in a single MinION flow cell run without incorporating any complementary short-read sequencing data. However, the linear contigs produced by assembly programs to represent circular DNA structures can contain errors. Predicted CDSs transcribed in the clockwise direction. Finally, the redundant ends of consensus sequences were trimmed, and circular sequences were rearranged by finalize.py to begin at dnaA/repA or at a position with the minimum value of the GC skew. Nucmer matches between the genomes are shown as blue (hits in the same orientation) and pink (hits in opposing orientations). Although Canu outputs suggestCircular=yes in the header line for circular sequences, we examined circularity ourselves.
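Rearranging a circular sequence to begin at the position of minimum GC skew can be sketched as follows. This is not the code of finalize.py; it assumes the cumulative form of GC skew (a running count of G minus C along the sequence), and the function names are illustrative.

```python
def min_cumulative_gc_skew_position(sequence):
    """Return the 0-based rotation point just after the minimum cumulative GC skew.

    Cumulative skew is the running sum of +1 for each G and -1 for each C.
    Returns 0 if the skew never drops below its starting value.
    """
    skew = 0
    best_skew = 0
    best_pos = 0
    for i, base in enumerate(sequence.upper(), start=1):
        if base == "G":
            skew += 1
        elif base == "C":
            skew -= 1
        if skew < best_skew:
            best_skew = skew
            best_pos = i
    return best_pos


def rotate_circular(sequence, start):
    """Rotate a circular sequence so that it begins at index `start`."""
    return sequence[start:] + sequence[:start]
```

Because the molecule is circular, rotation changes only the arbitrary linear start point, not the sequence content.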
Although the sampling strategy applied to miniasm could provide 42 circular sequences at the most (Supplementary Figure 1), six sequences were missing as compared with the 48 sequences in Table 1, namely two chromosomal sequences (more than 4 Mbp), two large plasmids (92 and 158 kbp), and one small plasmid (3.8 kbp) in barcode07 and barcode08 and one small plasmid (3 kbp) in barcode11. For two Canu assembly directories (canu.A and canu.B), corrected and trimmed reads in each Canu directory were used for sampling. We ran a de novo assembly of the reads using PBcR to generate the contigs and corrected reads required as input to Circlator, using the data and instructions at [22] and checkout revision 4642 of the source code (see Additional file 1 for more details). We included the total number of supplementary alignments and supplementary clusters as metrics for the benchmarked tools tested on the real ONT data. In total, the reference genomes of the sequenced strains comprised 14 chromosomes and 14 plasmids. Figure 2 shows a comparison of the HGAP assembly, which comprises many copies of the apicoplast sequence, against the output of Circlator. The circular contigs are polished by using raw signals and sequencing reads; then, duplicated sequences are removed to form a linear representation of circular sequences. In all cases, Circlator was compared against the BLAST and Minimus2 circularization methods described in the Methods section.
In addition, B-assembler had a more uniform depth distribution when the raw reads were aligned back to the assembled genome sequence, whereas the depth for wtdbg2, Canu, and Unicycler dropped at the two ends and was not consistent with the middle regions (Additional file 1 Fig S3). It uses Flye [19] as the core assembler for both assembly modes. After demultiplexing and base calling using Albacore 2.1.7, a total of 991,300 reads (more than 6 Gbp) were obtained for the 12 barcoded samples. The results were recorded for all the benchmarked algorithms (Table 1). In summary, we presented a pipeline through which an initial assembly can be rapidly obtained using miniasm and a complete final assembly can be produced using Canu coupled with a sampling strategy. It has low memory usage and a short run time (see Supplementary text, Additional file 2: Table S7, Additional file 1: Table S8, and Additional file 1: Figure S35). When MinION started to generate fast5 reads, the files were transferred to a separate Linux server, where extract.py was run to perform demultiplexing and base calling with Albacore 2.1.7. The method (Fig. 1; Additional file 1: Figure S1) consists of iteratively merging contigs and then running local assemblies of corrected reads that align to contig ends. The arrival of high-yielding, short-read sequencing technologies drastically reduced the time and cost required to generate high-depth whole-genome sequencing data ideal for identification of population variation. B-assembler has two modes: long-read-only assembly and hybrid-read assembly.
In addition, 25 µL of lysostaphin (5 mg/mL) and 2 µL of RNase A (1 mg/mL) were added to 180 µL of an enzymatic lysis buffer for the extraction of DNA from S. aureus. Among the 30 assemblies of sampled reads, complete circular sequences were found in barcode01–barcode06 and barcode12, but small plasmids (3 kbp) were often missing in barcode09–barcode12 (Table 3 and Supplementary Figure 1). Therefore, L2 goes through two correction processes with mapped reads using Racon. All the tools, namely minimap2, GraphMap, Canu, miniasm, Nanopolish, and Racon, were utilized in our pipeline, except for Scrappie. Whole-genome sequencing is already providing improved resolution in bacterial epidemiology [6] and allowing in silico prediction of antimicrobial resistance [7]. Using sequencing data produced from a single MinION run, we obtained 48 circular sequences, comprising 12 chromosomes and 36 plasmids of 12 bacteria, including Acinetobacter nosocomialis, Acinetobacter pittii, and Staphylococcus aureus. S-CK and T-LL provided the strains. After two iterations of Nanopolish, we ran Racon twice, followed by a final polishing step using Nanopolish. The difference between the long-read-only and hybrid modes is that, because Illumina reads have higher accuracy, B-assembler in hybrid mode uses short reads instead of long reads for polishing and can therefore achieve more accurate assembly results.
In addition to inputting 40× reads into Canu using default settings, we assembled A + B reads (80×) with corOutCoverage = 1000, which increased the number of circular sequences to 32. We treat each contig separately because Minimus2 can incorrectly merge parts of different contigs when it is run on all the contig halves pooled together. Since Unicycler and B-assembler are the only circular-aware bacterial assemblers, it is not surprising that only these two tools correctly identify the starting position of the assembled genome (Additional file 1 Fig S2). Finally, we evaluated Circlator on an assembly of the bacterium Escherichia coli based on Oxford Nanopore data. Considering all the evaluated factors, B-assembler surpassed the other benchmarked tools on the simulated long-read dataset and constructed the most accurate genome sequence. The number of circular contigs for each assembler is summarized in Table 2. Illumina sequencing is routinely used to study microbial genomics because of its low cost and the high accuracy of the sequence reads generated. However, the de novo assembly of short reads (100–300 base pairs) results in fragmented assemblies because repetitive sequences in bacterial genomes are invariably longer than the length of a short read and the span of paired-end reads. Circular Genome Map is a function that compares and draws, on concentric circles, features and contents from arbitrary genomic chromosomes, or from genomic chromosomes of closely related species, together with arbitrary numerical data tied to genome position.
Long-read assemblers, including Canu (Koren et al., 2017), Flye (Kolmogorov et al., 2018), HINGE (Kamath et al., 2017), and miniasm (Li, 2016), have been developed for de novo assembly and can produce a single contig per DNA molecule for some bacterial genomes. Only reads with a primary match to the mitochondrion were retained, and they were assembled with PBcR version 8.3rc2 using the options -maxCoverage 1000 -length 500 -partitions 200 genomeSize=16569. We therefore introduce a sampling strategy of reads on miniasm and Canu to see how they work. First, Minimus2 is run on the input assembly to merge any overlapping contigs (this is optional, and not part of the original protocol). In bioinformatics, genome assembly is the process of putting a large number of short DNA sequences back together to recreate the original chromosomes from which the DNA originated. Circlator is easy to use, with a single call required to run the whole pipeline, and is also modular, so that any stage of the pipeline can be run in isolation. Results of testing user-defined parameters for all NCTC data sets Table S5. In addition, only the end-reads that have high mapping quality (threshold 20) are used for reassembly. Therefore, it is hard, if not impossible, to resolve large repeats.