BacAnt: A Combination Annotation Server for Bacterial DNA Sequences to Identify Antibiotic Resistance Genes, Integrons, and Transposable Elements

Whole genome sequencing (WGS) of bacteria has become a routine method in diagnostic laboratories. One of the clinically most useful advantages of WGS is the ability to predict antimicrobial resistance genes (ARGs) and mobile genetic elements (MGEs) in bacterial sequences. This allows comprehensive investigations of such genetic features but can also be used for epidemiological studies. A plethora of software programs have been developed for the detailed annotation of bacterial DNA sequences, such as rapid annotation using subsystem technology (RAST), Resfinder, ISfinder, INTEGRALL and The Transposon Registry. Unfortunately, to this day, a reliable annotation tool of the combination of ARGs and MGEs is not available, and the generation of genbank files requires much manual input. Here, we present a new webserver which allows the annotation of ARGs, integrons and transposable elements at the same time. The pipeline generates genbank files automatically, which are compatible with Easyfig for comparative genomic analysis. Our BacAnt code and standalone software package are available at https://github.com/xthua/bacant with an accompanying web application at http://bacant.net.


INTRODUCTION
The era of next-generation sequencing (NGS) took off in 2005 with the commercial release of massively parallel pyrosequencing (Margulies et al., 2005). The NGS technology developed rapidly in the past years and has made substantial improvements in terms of quality and yield. With the rapid decrease of sequencing costs, falling by as much as 80% year over year, whole genome GRAPHICAL ABSTRACT | Overall BacAnt's service workflow. The figure is adapted with permissions from IntegronFinder (Cury et al., 2016).
sequencing (WGS) of bacteria has become a routine method in diagnostic laboratories (Didelot et al., 2012). NGS applications include WGS, targeted NGS and metagenomic NGS. Among them, the most common use of WGS is for simultaneous identification, typing, and/or antimicrobial susceptibility prediction of pathogens (Mitchell and Simner, 2019). One of the most exciting advantages of NGS is the ability to predict antimicrobial resistance genes (ARGs) and mobile genetic elements (MGEs) in bacteria, which allows the investigation of both, the organization and structure of such genetic features, and the epidemiology for the distribution of bacterial strains or virulence genes, including the spread and distribution of antibiotic-resistant bacteria as part of surveillance programs (Zhou et al., 2015;Mitchell and Simner, 2019). Every day, a massive number of bacterial genomes is being sequenced using NGS technology in laboratories across the globe, with genomes released at remarkable rates. With this huge amount of data available, it is important to extract project-relevant information easily. However, in publicly available databases, most of bacterial genomes are available as contigs which have been constructed employing auto-annotation algorithms. Over the years, highly efficient methods for bacterial genome annotation have been developed that do not require much user input.
Rapid Annotation using Subsystem Technology (RAST) is a widely used webserver for genome annotations of microbial species (Aziz et al., 2008). Although the performance using RAST-based annotation is very useful, several important limitations remain. For example, RAST will label many Open Reading Frames (ORFs) as "hypothetical proteins, " and the performance to identify ARGs and label them as such, is fairly limited as the algorithm is not tailored toward this purpose. Based on the RAST system, the Pathosystems Resource Integration Center (PATRIC) improved the data collection of ARGs, and provided users a more powerful analysis for both genomes and individual genes (Wattam et al., 2018). Another available annotation server is Resfinder which is managed by the Center for Genomic Epidemiology; it provides a convenient way of identifying acquired ARGs in sequenced bacterial isolates (Zankari et al., 2012). In addition to annotations of ARGs, some databases specifically designed to annotate MGEs such as insertion sequences (ISfinder), integrons (INTEGRALL) and transposable elements (The Transposon Registry) have been created (Siguier et al., 2006;Moura et al., 2009;Tansirichaiya et al., 2019). ISs are abundant mobile elements in bacteria, which are responsible for the mobilization of many genes, including those mediating ARG (Razavi et al., 2020). Such ARGs are often found in the genetic context of specific ISs, while ISs flanking regions are diverse (Razavi et al., 2020). For example, a clear association of ARGs with class 1 integrons can be observed (Partridge et al., 2018). The analysis of which ISs are associated with ARG genes would help to discover novel AMGs (Razavi et al., 2020). In addition, there is a major interest to explore how ARGs spread via MGEs (Che et al., 2021). The early identification of ARGs in bacteria would facilitate surveillance and molecular diagnostics (Razavi et al., 2020). Also, inter/intra-species genetic transfer events of MGEs are responsible for the emergence and rapid spread of resistance (Subedi et al., 2018). Therefore, the knowledge of MGE-associated drug resistance is crucial for the monitoring of resistance in microbial species. Unfortunately, up to now, a rapid annotation tool of the combination of ARGs and MGEs is not available, and the generation of genbank files has to be done manually. Therefore, we created a new program/pipeline called BacAnt, which rapidly and efficiently annotates ARGs, integrons, and transposable elements in a single step and generates a genbank file automatically which is compatible with the Easyfig program for comparative genomic analysis.

Program for Identifying Antibiotic Resistance Genes and Mobile Elements
We created a python program (BacAnt) to identify ARGs and MEGs for bacteria nucleotide sequences with BLAST analysis. We first used the BLASTN program comparing input sequences with the reference database with an e-value 10 −5 . For the detection of integrons, we used Integron_Finder to predict possible integrons and used the BLASTN program comparing the integron sequence with integronDB database for the best match sequence (Cury et al., 2016). We then filtered the raw results by identities and coverage (blast align match length/subject length). All results that pass the identity and coverage filter are retained for further analysis. The default threshold was set to 90% for 2 http://integrall.bio.ua.pt 3 https://transposon.lstmed.ac.uk identities and to 60% for coverage. Finally, we display the filtered results in text and genbank format, while also providing a visual output. Three types of annotations for the sequences are displayed in the same figure to guide the analysis of the genomic sequence.

Website for BacAnt
For the analysis to be performed online, we developed a website which we call http://bacant.net. The pipeline running on the server is based on Python/Django, which allows the user to upload sequence files for the rapid identification of ARGs and MGEs. The output format allows the display of graphic FIGURE 2 | The relationship between the number of drug-resistance genes (ARGs) and insertion sequences (IS) from four species: (A) A. baumannii, (B) E. coli, (C) S. enterica, and (D) S. aureus. The number of ARGs and ISs of each sample were extracted from the results of the BacAnt analysis, and the number was plotted as the horizontal and vertical coordinates, respectively. The scatter plot was created using the ggplot2 package in R (V3.6.2), and a trend line generated.
Frontiers in Microbiology | www.frontiersin.org representations of the results. A demo report can be seen here: http://bacant.net/BacAnt/demo.

Examples Analysis Using BacAnt
The genome sequences of four species downloaded from NCBI, including A. baumannii (2019.11.17), E. coli (2019.5.7), S. enterica (2019.6.4), and S. aureus (2019.6.4) were used to illustrate the capabilities of our program. The accession number of the genomes used in this study were listed in Supplementary Tables 2,3. The raw sequence data were downloaded from the European Nucleotide Archive 4 . Sequence quality was assessed via FastQC v0.11.5 5 , and low-quality sequence data and the adapter sequences were removed with Trimmomatic v0.36 (Bolger et al., 2014). The SPAdes software tool v3.11.0 was used to generate assembled genome with default parameter (Bankevich et al., 2012). The number of ARGs and MGEs of each sample were identified in the BacAnt analysis and plotted as horizontal and vertical coordinates, respectively. Scatter plots were created using the ggplot2 package in R (V3.6.2), together with the trend line (Wickham, 2016).
We created a network diagram by Cytoscape (v3.7.2) using the analysis results of BacAnt from four species genome sequences from NCBI (Shannon et al., 2003). Only ARG pairs with a distance of less than 10 kb and frequency of occurrence no less than 10 are extracted from the data to construct the network map. In addition, we also created a map from ARGs and insertion sequence pairs with a distance of less than 10 kb and the frequency of occurrence no less than 10.

RESULTS AND DISCUSSION
BacAnt is a browser-based platform to annotate DNA sequences, and to visualize the annotation results. When using the web interface, the user first has the choice to upload DNA sequences as Fasta, Seq or GenBank files (Figure 1). The user then has the option to select one or multiple databases, which include ResDB, IntegronDB or TransponsonDB for sequence annotation. After the DNA sequence is submitted, the pythonbased program BacAnt will identify ARGs and MGEs in the bacterial nucleotide sequence.
The output of BacAnt commences with a summary of the annotation, followed by up to three tables including the annotation result from each database; should they have been selected in the first step. The final part of the BacAnt output visualizes an annotation result which is combined from all three databases. All annotation results obtained by running BacAnt, including figures and genbank files, can then be downloaded. The genebank files generated by BacAnt are compatible with Easyfig (Sullivan et al., 2011). As an example we used A. baumannii MDR-ZJ06 (NC_017171.2) to display the result of the annotation by BacAnt (Figure 1). The annotation output of the MDR-ZJ06 strain shows that the isolate harbors 19 ARGs, 124 integrons and 17 transposons.
BacAnt was validated with NCBI AMRFinder using 1,100 selected genomes. The file output "AMR.possible.tsv" in the BacAnt result was used for analysis in NCBI AMRFinder to test which of the programs is able to identify a larger number of ARGs with high accuracy. Both programs reported similar results regarding the number of resistance genes (Supplementary Table 4). However, the number of ARGS in BacAnt is slightly larger than that of NCBI AMRFinder. Some resistance genes were absent in the output of NCBI AMRFinder: aac(6 )-Iaa (NC_003197, aminoglycoside N-acetyltransferase) in S. enterica, bla EC−15 (NG_049081,class C extended-spectrum beta-lactamase EC-15) in E. coli, BcII (NG_056058, BcII family subclass B1 metallo-beta-lactamase in B. cereus. To investigate whether a relationship between ARGs and ISs exists, we extracted the number of ARGs and ISs from the results of the BacAnt analysis of four species genome sequences from NCBI. The results show that the number of ARGs does not correlate with the number of ISs in the four species (Pearson's R 2 < 0.8, Figure 2). Previously, it was reported that at least eight MGEs were detected together with ARGs in A. baumannii (Leal et al., 2020). The plasmids were grouped into three categories based on the DNA transfer machinery: ARGs and ISs pairs with a distance of less than 10 kb and frequency of occurrence no less than 10 were screened to construct the network. Diamond symbols are used to represent ARGs, circles for ISs. The solid line is the gene pair whose frequency is not less than 80% of the total number and labeled as orange, the dotted line represents the gene pair whose frequency is less than 80% of the total number and labeled as sky blue, and the width of the line indicates the frequency. Different colors are assigned to the antibiotics corresponding to the drug-resistant genes, the insertion sequences are displayed in gray.
conjugative, mobilizable and non-mobilizable (Smillie et al., 2010). Che et al. (2021) showed that most ARG genes are located in conjugative plasmids, which -together with ISs-play the most important role in mediating the horizontal transfer of ARGs. When the relationship between ARG and ISs were investigated, we did not analyze the location of ARGs which might explain why no significant correlation between the two was observed in our study.
We also calculated the pairwise association between ISs and ARGs. The results were subjected to a permutation test to differentiate between statistically significant associations and random chance (Razavi et al., 2020). Only statistically significant associations (P < 0.001) of ISs and ARGs were analyzed (Supplementary Table 5). We identified commonly found ARG pairs that were in close proximity to ISs (<10 kb apart), which allows the detection of gene cassettes that may play an important role in evolution, regulation and ARG exchange. ARG cassettes are generally small (2-7) and specific to the species we investigated (Figure 3). The ARG cassettes for A. baumannii and E. coli were larger and more stable than the cassettes in S. aureus, which is consistent with a previously published observation (Chng et al., 2020). In the case of A. baumannii, we identified two ARG cassettes, with one containing the genes mph(E), msr(E), armA, aadA1, sul1, aac(6 )-Ib, and catB8. The ARG cassette which contained mph(E), msr(E), and armA was described previously (Chng et al., 2020). For E. coli, the program identified a stable small cassette that shows overlap with that of A. baumannii, including sul2, aph(3 )-Ib and aph(6)-Id. When genomes of S. aureus were analyzed, the program BacAnt found two ARG cassettes; the first one contained the genes blaPC1 and blaR1, while the second one encoded for mecR1 and erm(A). We also identified commonly found ARG-IS pairs that were in close proximity to ISs (<10 kb apart). For the ARG-IS pairs, ISVsa3 containing the genes aph(6)-Id, aph(3 )-Ib and tet(B) comprised the top three ARG-IS pairs in A. baumannii ( Figure 4A). For ARGs number, IS26, ISAba1, and ISVsa3 were the top three active ISs ( Figure 5A). IS26 was the most abundant IS in E. coli and S. enterica (Figures 4B,C, 5B,C). In S. aureus, mecA with diverse IS (including IS257-3, IS431mec, IS257R1, IS257-1, IS257R2, IS431L, and IS431R) are the top seven ARG-IS pairs (Figures 4D, 5D). The result of this study confirmed that IS6 family elements IS26 and IS257 play an important role in the dissemination of ARGs in A. baumannii, E. coli, S. enterica, and S. aureus (Partridge et al., 2018). As previously reported, we also observed notable differences between important MGEs in A. baumannii, E. coli, S. enterica, and S. aureus (Partridge et al., 2018).
We also analyzed the physical organization of the ARGs-IS pairs. Although the ARGs are not part of the IS, we observed a correlation of the distances between both elements in the analyzed bacterial genomes. ARGs-IS pairs occupied specific distances in A. baumannii, with the exception of sul1 in IS26 which displayed several, and more broader distance distributions ( Figure 6A). In contrast, the sul1 gene embedded in ISEc29 displayed distances that were more specific. In E. coli, only mph(A) showed a narrow distance distribution, while the other ARG-IS pairs show less correlation as the distances between the elements are less defined (Figure 6B). In S. enterica, only aph(3 )-Ib and ISVsa3| floR exhibited clear positions with narrow distributions. Interestingly, the distances between IS26 and folR were more widely distributed (Figure 6C). In S. aureus, mecA, mecR1 and mecI_of_mecA exhibit specific distance distributions ( Figure 6D). Our observations appear to indicate that the distances of ARG-IS pairs appear to often be specific, and this correlation does not appear to be defined by either the IS or the ARG alone.
Using BacAnt, we also explored the relationship between ARGs and transposons. Tn6292 was the most commonly  observed transposon containing ARGs in A. baumannii, E. coli and S. enterica (Figures 7A-C). In S. aureus the most prevalent transposon with ARGs was identified to be Tn552 (Figure 7D). Tn6292 belongs to the Tn3-family and harbored an IS26 at the right end (Chen et al., 2020). Tn6292 also contained a quinolone resistance region qnrS1 (Feng et al., 2016). Multidrug-resistance bacteria containing Tn6292 are commonly observed in China (Li et al., 2018), and possibly accelerate the emergence and spread of multidrug-resistant pathogens. Tn552 belonged to Tn7 family, comprised of BlaZ, BlaR1, and BlaI proteins. BlaR1 is the sensor protein for the extracellular β-lactam antibiotics. The overproduction of the beta-lactamase BlaZ were responsible of β-lactam resistance. Tn552-like element was thought as the origin of the all β-lactamase genes in staphylococci (Gregory et al., 1997).
In order to be able to extract the maximum amount of information from whole genome sequence data, we need the improve annotation and analysis methods for MGEs (Partridge et al., 2018). In this work, we created a webserver that is easy to use and allows the annotation of ARGs, integron, and transposable elements at the same time. The pipeline generates genbank files automatically, which are compatible with easyfig for comparative genomic analysis,which will accelerate the bioinformatics analysis of ARG-related sequences.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/ Supplementary Material, further inquiries can be directed to the corresponding author/s.