Utilizing Metagenomic Data and Bioinformatic Tools for Elucidating Antibiotic Resistance Genes in Environment

Antibiotic resistance genes (ARGs) have spread mainly because of the extensive use and abuse of antibiotics and have become a global public health concern. Owing to the development of high-throughput sequencing, metagenomic sequencing has been widely applied to profile the composition of ARGs, investigate their distribution patterns, and track their sources in diverse environments. However, the lack of a detailed transmission mechanism of ARGs limits the management of ARG pollution. Hence, it is essential to show how metagenomic data can be used to obtain an in-depth understanding of the distribution pattern and transmission of ARGs. This review provides an assessment of metagenomic data utilization in ARG studies and summarizes current bioinformatic tools and databases, including ARGs-OAP, ARG analyzer, DeepARG, CARD, and SARG, for profiling the composition of ARGs and tracking the source of ARGs. Several bioinformatic tools and databases were then benchmarked. Our results showed that although SARG is a good database, the application of two or more bioinformatic tools and databases could provide a comprehensive view of ARG profiles in diverse environmental samples. Finally, several perspectives were proposed for future studies to obtain an in-depth understanding of ARGs based on metagenomic data. Our review of the utilization of metagenomic data together with bioinformatic tools and databases in ARG studies could provide insights into the profiles and transmission mechanisms of ARGs in different environments, which may help mitigate the spread of ARGs and manage ARG pollution.


INTRODUCTION
The discovery of penicillin opened the modern era of antibiotic innovation, development, and application in human society. At present, antibiotics are used as medicine for humans and animals and are widely applied in animal husbandry, agriculture, and aquaculture (Manage, 2018). However, with the intense use and abuse of antibiotics for human and agricultural purposes, antibiotics are continuously discharged into different environments, particularly those with limited sewage treatment capacity, resulting in a substantial increase of antibiotic residues in different environmental niches (Carvalho and Santos, 2016; Qiao et al., 2018). These residual antibiotics increase the risk of antibiotic resistance and promote the emergence of antibiotic resistance genes (ARGs) that can be transferred to various microorganisms. This phenomenon is not new and has attracted global concern, particularly regarding the spread and transmission mechanism of ARGs (Holmes et al., 2016). To date, antibiotics and their effects on different environmental niches, for example, the emergence and spread of ARGs, have become an urgent and growing global public health threat and a major concern in environmental science (Sanderson et al., 2016; Yang et al., 2018; Iwu et al., 2020). Hence, many researchers have turned their attention to the distribution and transmission of ARGs.
As investigations of ARGs have progressed, researchers have identified the composition of ARGs and explored their distribution in different environmental niches. For example, a total of 139, 442, and 491 ARG subtypes were identified in sediments from the Yamuna River, sediments from an urban river in Beijing (Chaobai River), and activated sludge reactors, respectively (Chen et al., 2019a; Zhao et al., 2019; Das et al., 2020). Based on these published studies, we found that many studies have focused on the composition of ARGs and their dynamics, whereas only a few have investigated the transmission of ARGs; consequently, the transmission route of ARGs is poorly characterized (Zhou et al., 2018; Chao et al., 2019; Vrancianu et al., 2020). A comparison of the occurrence and abundance of ARGs and microbiota in healthy humans and sewage treatment systems in a Chinese village identified 53 ARGs and 28 bacterial genera in all samples; this result supports the idea that bacteria could carry and transfer several ARGs to humans and the environment (Zhou et al., 2018). Furthermore, different mechanisms of horizontal gene transfer, including conjugation, transduction, and transformation, were also found to contribute to the accumulation and transmission of ARGs in bacteria (Chen et al., 2019b; Li et al., 2019; Vrancianu et al., 2020). Given the scarcity of studies exploring the transmission of ARGs, their detailed transmission mechanism remains elusive.
Recently, many sequencing techniques have been developed and applied in ARG studies. Owing to its advantages, high-throughput sequencing has been widely applied in microbiome studies to detect ARGs and is expected to help resolve the transmission and proliferation mechanisms of ARGs in different environments. With the growing number of microbiome studies focusing on ARGs, many metagenomic datasets, bioinformatic tools, and associated ARG databases have been generated for ARG analysis. By using these tools and databases, researchers have profiled ARG composition in different environments and deepened the understanding of ARGs. However, some urgent scientific questions remain unanswered, such as which bioinformatic tools and ARG databases are suitable for detecting potential ARGs. In addition, the various equations used to calculate ARG abundance make it impossible to directly compare the results of different ARG studies. Moreover, details on the transmission and management of ARGs remain elusive. Given that antimicrobial resistance is still a crucial and urgent threat to human health and the environment, a summary of the methods and prospects for ARG studies is essential. Therefore, this review first summarizes the popular and latest bioinformatic methods for analyzing the metagenomic data generated by next-generation sequencing and third-generation sequencing, including bioinformatic tools, ARG databases, and MGE databases. Several bioinformatic tools and databases were then benchmarked to evaluate their benefits and drawbacks. Finally, several critical comments and perspectives are proposed for future ARG studies to obtain an in-depth understanding of ARGs based on metagenomic data.

METAGENOMIC DATA IN ANTIBIOTICS RESISTANCE GENE STUDIES
In past decades, researchers have shown that microbiota play an important role in maintaining human health (Marchesi et al., 2016; Valdes et al., 2018) and participate in biogeochemical cycling (Carnevali et al., 2021). Owing to its advantages, high-throughput sequencing has been widely applied in ARG studies. With the successful investigation of microbial communities in diverse environments, massive metagenomic datasets have been produced to investigate the taxonomic and functional compositions of microbial communities, obtain an in-depth understanding of functional traits, such as the nitrogen cycle (Jansson and Hofmockel, 2018; Miao and Liu, 2018) and ARGs (Stalder et al., 2019; Xiang et al., 2020), and explore the driving factors for the dynamic changes of functional traits (Pan et al., 2020). For example, based on metagenomic datasets, ARG profiles have been investigated in different environments, such as activated sludge under high selective pressure with different antibiotics (Zhao et al., 2019), seed activated sludge collected from a municipal wastewater treatment plant together with five experimental groups exposed to different antibiotics (Zhao et al., 2020), and a deep subtropical lake (Carnevali et al., 2021). These studies revealed that metagenomic sequencing creates an opportunity for capturing the majority of ARGs and their potential hosts. In addition, metagenomic analysis can reveal the transmission of ARGs and the risk of the resistome (including ARGs) (Manaia, 2017; Yin et al., 2019; Qian et al., 2021a). In summary, proper utilization of metagenomic data can effectively provide an in-depth understanding of ARGs in the environment, particularly their transmission and risks.

BIOINFORMATIC TOOLS USED FOR DETECTING POTENTIAL ANTIBIOTICS RESISTANCE GENES BASED ON METAGENOMIC DATA
With the increasing number of metagenomic datasets from next-generation sequencing and third-generation sequencing, many bioinformatic tools have been developed to conduct analyses from different perspectives. In general, the methodological approaches for analyzing a whole metagenomic dataset can be divided into two types, namely, assembly-based and read-based (non-assembly, Figure 1A) (Boolchandani et al., 2019; Harris et al., 2019). With these strategies, several bioinformatic tools, including online tools, have been developed for identifying known ARGs and detecting new ARGs (Figure 1). Several assembly-based tools, including HMD-ARG (Li et al., 2021a), were developed and have been widely applied to detect potential ARGs from the gene datasets predicted from metagenomic contigs (Figure 1B). Together with the ARG database, ARGs-OAP was designed as an online pipeline to rapidly annotate and classify ARG-like sequences from metagenomic data (Yang et al., 2016). Compared with version 1.0 of ARGs-OAP, the latest version adds a hidden Markov model algorithm to enhance the characterization and quantification of ARGs in metagenomic datasets based on the 16S rRNA gene and the average coverage of essential single-copy marker genes. Similarly, to address the most challenging topics and provide a guide for diverse research in ARG studies, including the risk, evolution, and emergence of ARGs, a comprehensive profile of the distribution of ARGs on an ARGs online searching platform (ARGs-OSP) was constructed based on the distribution of potential ARGs in 55,000 bacterial genomes, 16,000 bacterial plasmid sequences, 3,000 bacterial integron sequences, and 850 metagenomes. Furthermore, PathoFact was designed and developed to address the virulence factors (VFs) and ARGs of pathogenic microorganisms; this easy-to-use, modular, and reproducible tool can predict VFs, bacterial toxins, and ARGs from metagenomic data with high accuracy (de Nies et al., 2021).
Moreover, on the basis of an updated database, ARGA was developed to assess ARG primers and to identify and annotate ARGs from environmental metagenomes (Wei et al., 2019). It should be noted that the identification of potential ARGs usually depends on the results of a similarity search against a reference database. The usual selection strategy is to choose the best hit among the search results; however, this strategy can produce a high rate of false negatives. As a solution, DeepARG with two deep learning models (Arango-Argoty et al., 2018) and HMD-ARG (Li et al., 2021a) were constructed for ARG detection.
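The best-hit selection strategy described above can be sketched as follows. This is a minimal illustration rather than the procedure of any specific tool: it assumes that search results are available in the 12-column BLAST tabular layout, and the identity and e-value cutoffs are placeholder values chosen only for the example.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    query: str       # predicted gene or read ID
    subject: str     # reference ARG sequence ID
    identity: float  # percent identity
    evalue: float
    bitscore: float


def parse_tabular(lines: Iterable[str]) -> Iterator[Hit]:
    """Parse hits laid out in the 12-column BLAST tabular order."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        yield Hit(f[0], f[1], float(f[2]), float(f[10]), float(f[11]))


def best_hits(hits: Iterable[Hit],
              min_identity: float = 80.0,   # assumed cutoff
              max_evalue: float = 1e-7      # assumed cutoff
              ) -> Dict[str, Hit]:
    """Keep, for each query, the passing hit with the highest bit score."""
    best: Dict[str, Hit] = {}
    for h in hits:
        if h.identity < min_identity or h.evalue > max_evalue:
            continue
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return best


# Made-up search results for two predicted genes.
example = [
    "gene1\tARG_A\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-50\t380",
    "gene1\tARG_B\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-30\t250",
    "gene2\tARG_C\t60.0\t150\t60\t0\t1\t150\t1\t150\t1e-10\t90",
]
selected = best_hits(parse_tabular(example))
for query, hit in selected.items():
    print(query, hit.subject)
```

In this toy input, gene2 is discarded because its identity falls below the cutoff, which illustrates how strict similarity thresholds can miss divergent ARGs and produce false negatives.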

Frontiers in Environmental Science | www.frontiersin.org October 2021 | Volume 9 | Article 757365
In contrast, only a few read-based bioinformatic tools have been developed for ARG detection. For example, one deep learning model of DeepARG, namely, DeepARG-SS, was designed to analyze the short-read sequences in metagenomes (Arango-Argoty et al., 2018). Moreover, it is well known that Oxford Nanopore sequencing can produce ultra-long reads; however, the ARGs they carry have to be analyzed at the read level. As a solution, ARGpore and NanoARG (Arango-Argoty et al., 2019) were constructed (Figure 1B). Specifically, ARGpore was designed to detect ARGs and their hosts by utilizing BLAST, HMMER, and UBLAST. NanoARG was constructed as a web service to identify ARGs from the long reads generated by Oxford Nanopore sequencing and to identify metal resistance genes, mobile genetic elements (MGEs), and sequences with high similarity to known pathogens (Arango-Argoty et al., 2019).
In summary, diverse bioinformatic tools have been constructed and developed with different strategies. These bioinformatic tools can efficiently detect ARGs in different environments to meet the requirements of ARG analysis. With the use of ARG profiles in environmental metagenomes, downstream analyses on co-occurrence patterns among ARGs, the arrangement of ARGs and MGEs, and host identification of ARGs can be performed to enhance the understanding of ARGs in diverse environments.

BIOINFORMATIC DATABASES USED FOR IDENTIFYING ANTIBIOTICS RESISTANCE GENES AND MGES
The identification of potential ARGs depends on the search results against the database. Therefore, ARG databases are very important because they determine the accuracy and completeness of the ARGs detected in environmental metagenomes. To date, several ARG databases have been constructed for ARG detection (Figure 1C). Among the earliest was ARDB (Liu and Pop, 2009). This database was widely applied to detect potential ARGs but is now abandoned because of the lack of updates. As a solution, CARD was rigorously constructed and developed in 2013. This database integrates disparate molecular and sequence data, provides a unique organizing principle (antibiotic resistance ontology and antimicrobial resistance gene detection models), and can quickly and effectively detect putative ARGs (McArthur et al., 2013). CARD is currently a bioinformatic database and a comprehensive platform for identifying resistance genes, including their products and associated phenotypes (https://card.mcmaster.ca/). In 2016, SARG was constructed with a hierarchical structure (type-subtype-reference sequence) by integrating the two most commonly used ARG databases, ARDB and CARD, removing their redundant sequences, and re-selecting the query sequences based on sequence similarity; this database can identify ARG sequences through similarity search (Yang et al., 2016) and has been widely used in ARG studies (Zhao et al., 2019; Zhao et al., 2020). The latest version of SARG (v2.0) contains three times as many sequences as the first version, improves the coverage of ARG detection, and annotates high-throughput raw reads by using a similarity search strategy in diverse environmental metagenomes (Zhao et al., 2020). Based on ARDB, an updated SDARG, including 1,260,069 protein sequences and 1,164,479 nucleotide sequences from 448 types of ARGs belonging to 18 categories of antibiotics, was constructed and used in ARGA (Wei et al., 2019).
Moreover, as a companion database to DeepARG, DeepARG-DB was designed to improve the quality of the model (Arango-Argoty et al., 2018). These ARG databases provide choices for researchers to comprehensively detect ARGs in environmental metagenomes.
Together, these diverse ARG and MGE databases provide a powerful resource for identifying ARGs and MGEs, exploring the distribution of ARGs, investigating the relationship between ARGs and MGEs, and obtaining an in-depth understanding of ARG transmission that benefits their management.
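The hierarchical structure described for SARG (type-subtype-reference sequence) lends itself to a simple roll-up of hit counts. The sketch below is only illustrative: the mapping, reference IDs, and function name are hypothetical and do not reproduce the structure files distributed with any real database.

```python
from collections import Counter

# Hypothetical mapping: reference sequence ID -> (type, subtype).
# These entries are made up for illustration.
HIERARCHY = {
    "ref_001": ("tetracycline", "tetM"),
    "ref_002": ("tetracycline", "tetM"),
    "ref_003": ("tetracycline", "tetW"),
    "ref_004": ("beta-lactam", "TEM-1"),
}


def summarize(hit_refs):
    """Roll per-reference hits up to the subtype and type levels."""
    by_subtype, by_type = Counter(), Counter()
    for ref in hit_refs:
        if ref not in HIERARCHY:
            continue  # skip hits to sequences missing from the mapping
        arg_type, subtype = HIERARCHY[ref]
        by_subtype[(arg_type, subtype)] += 1
        by_type[arg_type] += 1
    return by_subtype, by_type


# One entry per annotated read or gene; "ref_999" is deliberately unknown.
subtypes, types = summarize(
    ["ref_001", "ref_002", "ref_003", "ref_004", "ref_999"]
)
print(dict(types))
```

Keeping the reference-to-subtype-to-type mapping separate from the counting step means the same hits can be reported at whichever level a study needs.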
Comparison of ARG profiles in various environmental niches identified with different bioinformatic tools and databases revealed inconsistency in the kinds and total number of ARGs (Figure 2). For the lake sediment and activated sludge samples, the number of ARG types identified with ARG-ANNOT, CARD, and ResFinder was higher than that with DeepARG-DB but lower than that with SARG (v2.0, Figure 2A). Similarly, the total number of ARGs identified with SARG (v2.0) was the highest among all databases (Figure 2B). Further comparison of ARG profiles in sample SRR14610228 revealed differences in the intersections of ARG profiles detected with two, three, four, and five bioinformatic tools and databases (Figure 2C). All these results suggest that although SARG (v2.0) is a good database for identifying potential ARGs, the application of two or more bioinformatic tools and databases could provide more comprehensive ARG profiles in different environmental samples.
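The kind of intersection analysis summarized above can be sketched with plain set operations. The ARG sets below are invented for illustration and are not the benchmark results; the sketch also assumes that gene names have already been harmonized across databases, which in practice requires a separate mapping step.

```python
from itertools import combinations

# Made-up ARG sets for one sample; real profiles come from each tool's output.
profiles = {
    "SARG": {"tetM", "sul1", "ermB", "blaTEM"},
    "CARD": {"tetM", "sul1", "blaTEM"},
    "DeepARG-DB": {"tetM", "sul1", "mexF"},
}


def shared_args(profiles, k):
    """Return the ARGs common to every combination of k databases."""
    result = {}
    for names in combinations(sorted(profiles), k):
        result[names] = set.intersection(*(profiles[n] for n in names))
    return result


pairwise = shared_args(profiles, 2)       # overlaps between two databases
core = shared_args(profiles, 3)           # ARGs found by all three
union = set.union(*profiles.values())     # ARGs found by at least one

for names, genes in pairwise.items():
    print(" & ".join(names), sorted(genes))
```

The gap between the union and the core set gives a quick sense of how much a single database would miss in a given sample.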

BIOINFORMATIC TOOLS FOR TRACKING THE ANTIBIOTICS RESISTANCE GENE SOURCE
Considering the tight linkage between ARGs in the environment, the source of ARGs is important to their transmission and management. Therefore, an ARG source tracking platform is urgently needed. In the past two decades, many researchers have realized the importance of tracking the source of ARGs and thus developed many bioinformatic tools or frameworks. Among these, a series of bioinformatic tools or frameworks were developed and proposed for tracking ARG pollution from different sources, such as SourceTracker (Knights et al., 2011) and its application in metagenomic datasets (Meta-SourceTracker) (McGhee et al., 2020), Microbial Source Tracking (MST), Meta-Prism, and FEAST (Shenhav et al., 2019). However, only a few tools and applications have been used to track the genetic location of the host of ARGs (host-tracking of ARGs), such as PlasFlow (Krawczyk et al., 2018). Specifically, among these source-tracking tools, some, including SourceTracker and MST, can be used to precisely track ARG pollution from different sources. For example, based on deep-sequencing marker genes, such as 16S rRNA, SourceTracker was designed and constructed with a Bayesian classification model; this tool uses Gibbs sampling to estimate the probable sources of samples (Knights et al., 2011) and has been widely applied to determine the source of ARG pollution in diverse environmental samples (Hu et al., 2020; Chen et al., 2019c). Moreover, on the basis of a machine-learning classification strategy with ARG abundance profiles, MST was developed and constructed as a source-tracking platform that can precisely track ARG pollution from different sources, such as feces of humans and animals, wastewater treatment plants, and other natural environments (Li et al., 2020); it is available at https://smile.hku.hk/SARGs/.
Additionally, based on the genome signatures of sequences from 9,565 bacterial plasmids and chromosomes, PlasFlow, which uses a deep neural network model, was developed to classify metagenomic contigs as plasmid or chromosomal sequences with high accuracy (Krawczyk et al., 2018) and thus assist in tracking the genetic location and taxonomy of ARG hosts. To date, the accurate host-tracking of ARGs remains a challenge in ARG studies. Nevertheless, numerous studies using these bioinformatic tools have been conducted to determine the source of ARG pollution, track the hosts of ARGs, and explore the distribution pattern of ARGs in diverse environmental samples (Ma et al., 2017; Chen et al., 2019c; Dang et al., 2020; Raza et al., 2021). Undoubtedly, tracking the source of ARGs, including the source of ARG pollution and the host-tracking of ARGs, is important to obtain an in-depth understanding of ARG transmission and provide suggestions for managing ARGs in natural environments.

FUTURE PERSPECTIVES IN ANTIBIOTICS RESISTANCE GENE STUDIES
ARG pollution caused by the overuse of antibiotics has increased in diverse natural environments and has become a global human health concern. At present, metagenomic datasets produced by high-throughput techniques are widely used in ARG studies. Numerous investigations have been conducted to explore the distribution, transmission, and source of ARGs and the key factors driving them (Li et al., 2015; Zhao et al., 2020; Li et al., 2021b). Although an increasing number of studies have been conducted, the lack of in-depth understanding of ARGs limits the management and elimination of ARG pollution. Hence, we provide several critical perspectives on research methods and data analysis in ARG studies to deepen the knowledge of ARGs. First, the deep mining of metagenomic datasets is essential, especially in studying the transmission and host-tracking of ARGs. Current metagenomic analyses are mainly focused on the detection of ARGs, the co-occurrence network among ARGs, and the relationship between ARGs and MGEs. However, these lines of analysis are close to exhausted, and metagenomic datasets are still not fully utilized. Hence, a comprehensive analysis of metagenomic data is essential to expand the understanding of ARGs. For example, investigating the arrangement of ARGs and their relationship with adjacent genes and MGEs is a potential approach to reveal the transmission of ARGs. Moreover, on the basis of metagenome binning results, the taxonomy of ARG-carrying contigs can be identified more accurately, which addresses the key challenge of annotating the taxonomic source of ARGs and thereby benefits the host-tracking of ARGs.
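Investigating the arrangement of ARGs relative to adjacent MGEs, as suggested above, can start from a simple co-location check on annotated contigs. The following is a minimal sketch under stated assumptions: the feature records, gene names, and the 5,000 bp distance threshold are hypothetical and chosen only for the example.

```python
from typing import List, NamedTuple, Tuple


class Feature(NamedTuple):
    contig: str
    start: int  # 1-based, inclusive
    end: int
    name: str


def gap(a: Feature, b: Feature) -> int:
    """Distance in bp between two features (0 if they overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def args_near_mges(args: List[Feature],
                   mges: List[Feature],
                   max_distance: int = 5000  # assumed threshold
                   ) -> List[Tuple[str, str, int]]:
    """List (ARG, MGE, distance) for pairs on the same contig within range."""
    pairs = []
    for a in args:
        for m in mges:
            if a.contig == m.contig and gap(a, m) <= max_distance:
                pairs.append((a.name, m.name, gap(a, m)))
    return pairs


# Made-up annotations on two contigs.
arg_features = [
    Feature("contig_1", 1000, 2000, "sul1"),
    Feature("contig_2", 500, 1500, "tetM"),
]
mge_features = [
    Feature("contig_1", 3500, 4200, "intI1"),
    Feature("contig_2", 20000, 21000, "IS26"),
]
neighbours = args_near_mges(arg_features, mge_features)
print(neighbours)
```

Pairs that pass such a filter are only candidates for mobility; confirming transfer potential still requires inspecting the gene arrangement itself.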
Second, the formula or standard for calculating ARG abundance should be unified. Current calculation methods are diverse and include transcripts per million (TPM) (Jing and Yan, 2020), reads per kilobase of exon per million mapped reads (RPKM) (Sekizuka et al., 2020), ARG reads per one million reads (parts per million, ppm) (Zhang et al., 2015), copies of ARG per copy of 16S rRNA gene (Li et al., 2015), and coverage-based abundance (×/Gb) (Zhao et al., 2020). This diversity prevents the direct comparison of the profiles and risks of ARGs across environments. Agreeing on a unified formula for calculating ARG abundance is necessary to estimate ARG pollution and its risks in different environments.
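To make the comparability problem concrete, the sketch below computes four of the listed units from the same made-up counts. The formulas follow common definitions of these units and may differ in detail from those used in the cited studies; all input values, including the 16S rRNA gene length, are assumptions for illustration.

```python
def ppm(arg_reads, total_reads):
    """ARG-like reads per one million sequenced reads."""
    return arg_reads / total_reads * 1e6


def rpkm(arg_reads, gene_length_bp, total_reads):
    """Reads per kilobase of gene per million reads."""
    return arg_reads * 1e9 / (gene_length_bp * total_reads)


def tpm(read_counts, gene_lengths_bp):
    """Length-normalized rate of each gene, scaled so all genes sum to 1e6."""
    rates = {g: read_counts[g] / gene_lengths_bp[g] for g in read_counts}
    total = sum(rates.values())
    return {g: r / total * 1e6 for g, r in rates.items()}


def copies_per_16s(arg_reads, arg_length_bp, n16s_reads, len16s_bp):
    """Length-normalized ARG reads divided by length-normalized 16S reads.

    The read length appears in both terms and cancels, so it is omitted.
    """
    return (arg_reads / arg_length_bp) / (n16s_reads / len16s_bp)


# Made-up sample: 200 reads hit a 2,000 bp ARG among 10 million reads.
a = ppm(200, 10_000_000)
b = rpkm(200, 2000, 10_000_000)
c = copies_per_16s(200, 2000, n16s_reads=5000, len16s_bp=1500)
d = tpm({"argA": 200, "argB": 100}, {"argA": 2000, "argB": 1000})
print(round(a, 3), round(b, 3), round(c, 3), round(d["argA"], 1))
```

The same gene comes out at roughly 20 ppm, 10 RPKM, and 0.03 copies per 16S rRNA gene copy, so values reported in different units cannot be compared without access to the underlying counts and lengths.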
Third, the application of third-generation sequencing in ARG studies can expand the understanding of ARGs, especially their genetic location and hosts. Third-generation sequencing techniques, such as Pacific Biosciences (PacBio) and Nanopore sequencing, should be applied to profile ARG composition in diverse environments, explore the occurrence pattern of ARGs, and track their source. These techniques generate long reads that can span most repetitive sequences, which benefits the assembly of large genomes and the taxonomic identification of ARG hosts (Ye et al., 2016; Qian et al., 2021b). For example, the profile, genetic location, and hosts of ARGs, particularly the potential ARG-carrying pathogens, were investigated throughout the wastewater treatment process by using the combination of Nanopore and Illumina sequencing; this work established a baseline analysis framework to explore ARGs in environmental niches and expanded the knowledge of the resistome in wastewater treatment plants (Che et al., 2019). However, several shortcomings, such as the cost of sequencing and the difficulty of extracting high-quality DNA, limit the use of third-generation sequencing in current ARG studies.
Finally, the findings should be verified in a wet laboratory. Current ARG studies mainly collect samples from natural environments, and their results or conclusions remain untested and unverified in a wet laboratory. Hence, proper experiments should be designed and conducted to simulate the natural environment in the laboratory and verify the pattern of ARGs under these conditions. The results will have substantial implications for estimating ARG pollution and managing the related risks.

CONCLUSION
This review summarized current bioinformatic approaches and databases for identifying potential ARGs in metagenomic data. In particular, several bioinformatic tools and databases were benchmarked to estimate their advantages in detecting ARGs in different environmental niches. Several suggestions were also proposed to expand the analysis content of ARG studies. Together, by compiling current bioinformatic tools for analyzing metagenomic datasets, ARG and MGE databases, and source-tracking tools for ARGs, and by providing perspectives for future ARG studies, this review provides a holistic assessment of the application of metagenomic data in ARG studies. The findings provide insights into the transmission of ARGs and pave the way for establishing priorities in managing ARG pollution.

AUTHOR CONTRIBUTIONS
MH and ZW designed the study. ZP, YM, MH, and ZW wrote the initial draft of the manuscript. All authors read, modified, and approved the final manuscript.