A Comparison of Microbial Genome Web Portals

Microbial genome web portals have a broad range of capabilities that address a number of information-finding and analysis needs for scientists. This article compares the capabilities of the major microbial genome web portals to aid researchers in determining which portal(s) are best suited to their needs. We assessed both the bioinformatics tools and the data content of BioCyc, KEGG, Ensembl Bacteria, KBase, IMG, and PATRIC. For each portal, our assessment compared and tallied the available capabilities. The strengths of BioCyc include its genomic and metabolic tools, multi-search capabilities, table-based analysis tools, regulatory network tools and data, omics data analysis tools, breadth of data content, and large amount of curated data. The strengths of KEGG include its genomic and metabolic tools. The strengths of Ensembl Bacteria include its genomic tools and large number of genomes. The strengths of KBase include its genomic tools and metabolic models. The strengths of IMG include its genomic tools, multi-search capabilities, large number of genomes, table-based analysis tools, and breadth of data content. The strengths of PATRIC include its large number of genomes, table-based analysis tools, metabolic models, and breadth of data content.


INTRODUCTION
A number of web portals provide the scientific community with access to the thousands of microbial genomes that have been sequenced to date. This article compares the capabilities of the major microbial genome web portals to aid researchers in determining which portal(s) best serve their information-finding and analytical needs.
The power that a genome web portal provides to its users is a function of what data the portal contains, and of the types of software tools the portal provides to users for querying, visualizing, and analyzing the data. Query tools enable researchers to find what they are looking for. Visualization tools speed the understanding of the information that is found. Analysis tools enable extraction of new relationships from the data.
We assess the data content of each portal both according to the types of data it provides (e.g., does it provide regulatory network information, protein localization data, or Gene Ontology annotations?), and according to the number of genomes it provides. We assess the software tools provided by each portal in several major areas: genomics tools, metabolic tools, advanced search and analysis tools, web services, table-based analysis, and user accounts. Omics data analysis capabilities are also assessed, but are distributed among the preceding areas. In each area, we enumerate multiple software capabilities, such as the ability to paint omics data onto pathway diagrams. We must emphasize that many of the portals include a significant number of other capabilities that we consider to be outside the scope of a microbial-genome web portal, and that are therefore not within the purview of this study. The Results section examines the comparison criteria in detail; for a higher-level summary of the results, see the Discussion section.
Search tools are a particularly important part of a portal because they determine the user's ability to find information of interest; therefore, we provide detailed comparisons of the search tools that each portal provides for finding genes, proteins, DNA and RNA sites, metabolites, and pathways. We call these multisearch tools because they enable the user to search multiple database (DB) fields in combination.
Although user friendliness is a critical aspect of any website, it is extremely difficult to assess objectively. We have assessed a small number of relatively objective user friendliness criteria, such as the types of user documentation available, the presence of explanatory tooltips (small information windows that appear when the user hovers over regions of the screen), and the speed of the site's gene page.
We included portals that have a perceived high level of usage, a large number of genomes, and a relatively rich collection of tools, and that are actively maintained and developed. The portals we compare are BioCyc (Caspi et al., 2018), KEGG, Ensembl Bacteria, KBase, IMG, and PATRIC. Related portals that are not included in this comparison are Entrez Genomes (whose capabilities are similar to Ensembl Bacteria), MicroScope (Vallenet et al., 2017) (which uses Pathway Tools for its metabolic component and therefore has the same metabolic functionality as BioCyc), ModelSEED (Henry et al., 2010) (which is a metabolic model portal, not a genome portal), the SEED (Overbeek et al., 2014) (which has been inactive for a number of years and was subsumed by the PATRIC project), MicrobesOnline (Dehal et al., 2010), iMicrobe (https://www.imicrobe.us/; a portal for metagenomes and transcriptomes, not for single genomes), and Microme (http://www.microme.eu/; the Microme website largely shut down as of January 2018).

Summary of the Portals
Here we introduce each portal. Note that some portals have some capabilities that are not covered in this comparison. For each portal we provide a hyperlink to a sample gene page.

BioCyc
BioCyc (Caspi et al., 2016; Karp et al., 2017) is a microbial genome web portal that integrates sequenced genomes with curated information from the biological literature, with information imported from other biological DBs, and with computational inferences. BioCyc data include metabolic pathways, regulatory networks, and gene essentiality data. BioCyc provides extensive query and visualization tools, as well as tools for omics data analysis, for metabolic path searching, and for running metabolic models. We omit discussion of many BioCyc comparative genomics and metabolic operations under its Analysis → Comparative Analysis menu. Scientists can use the Pathway Tools software associated with BioCyc to perform metabolic reconstructions and create BioCyc-like DBs for in-house genome data.
BioCyc contains information curated from 89,500 publications. The curated information includes experimentally determined gene functions and Gene Ontology terms, experimentally studied metabolic pathways, and experimentally determined parameters such as enzyme kinetics data and enzyme activators and inhibitors. Curated information also includes textual mini-reviews that summarize information about genes, pathways, and regulation, with citations to the primary literature. The large amount of curated information within BioCyc is unique with respect to other genome portals.

KBase
KBase is an environment for systems biology research that provides more than 160 applications to support user-driven analysis of a variety of data ranging from raw reads to fully assembled and annotated genomes, and metabolic models. In addition to its genome-portal capabilities, KBase (Arkin et al., 2016) enables users to assemble and annotate genomes, to analyze transcriptomics data, and to create metabolic models for organisms with sequenced genomes. Once a model is created, it can be analyzed using phylogenetic, expression analysis, and comparative tools. KBase also allows users to integrate custom code into their analysis pipeline and enables addition of external applications by their developers using a software development kit (SDK). Its other major aim is to support reproducible computational experiments on models that can be published and shared with other users.
Home page: https://kbase.us/
Sample gene page (full): https://narrative.kbase.us/#dataview/35926/2/1?sub=Feature&subid=b2699
Sample gene page (short): https://tinyurl.com/y8twmntz
Bulk download site: The KBase website says that a bulk download site is coming soon.

IMG
The Integrated Microbial Genomes (IMG) system is a resource for annotation and analysis of sequence data, integrated with environmental and other metadata to support genome and microbiome comparisons. In addition to being the vehicle for release of the data generated by the DOE Joint Genome Institute, it provides a suite of analytical and visualization tools to explore and mine the data for biological inference. Custom data marts dedicated to specific research topics, such as synthesis of secondary metabolites (IMG-ABC) or viral eco-genomics (IMG/VR), are also included. Users can submit their own data and metadata for integration in the system.

PATRIC
PATRIC is designed to support the biomedical research community's work on bacterial infectious diseases via integration of vital pathogen information with data and analysis tools. Data is integrated across sources, data types, molecular entities, and organisms. Data types include genomics, transcriptomics, protein-protein interactions, 3D protein structures, sequence typing data, and metadata. It supports both genome assembly and annotation (RAST), and RNA-seq data analysis via a job submission system.

RESULTS
We assessed the software and data content capabilities of each portal according to a number of topic areas, such as genomics-related tools and metabolism-related tools. We chose topic areas that we considered to be core elements of a microbial genome information portal, that is, a web site that counts among its primary missions providing users with data and knowledge regarding sequenced microbial genomes. A number of the portals contain functionality outside of that mission; for example, some portals contain software tools for annotating microbial genomes (e.g., performing assembly and gene-function prediction). We did not include such functionality because we considered it outside the scope of a microbial genome information portal. In many cases, we added new criteria within a topic area (meaning rows within our comparison tables) as we learned about each portal, such as adding the ability of Ensembl Bacteria to predict the effects of sequence variants. Our choice of criteria is validated by the fact that many of the criteria are shared among some or many of the portals.
For several of the topic areas, we provide multiple tables to assess software capabilities, with one or two tables focusing on DB search capabilities and another table focusing on other capabilities in that area. For example, Tables 2 and 3 describe genomics multi-search tools, and Table 1 describes other genomics software tools.

Genomics Tools
Genomics tools enable researchers to query, analyze, and compare genome-related information within an organism DB. Table 1 assesses most genomics tools; Tables 2 and 3 describe genomics multi-search tools.
An explanation of the rows within Table 1 is as follows.
• Genome Browser: Can a user browse a chromosome at different zoom levels to see the genomic features present?
- Are operons, promoters, and transcription-factor binding sites depicted in the genome browser?
- Is the nucleotide sequence depicted in the genome browser?
- Customizable Tracks: Can a user add additional tracks to the genome browser, which show user-supplied data?
- Comparative, by Orthologs: Can a user compare chromosome regions from several genomes side-by-side, with orthologous genes indicated?
- Genome Poster: Can the portal generate a printable, detailed, wall-sized poster of the entire genome, e.g., one that depicts every gene in the genome?
• Retrieve Gene Sequence: Can a user retrieve the nucleotide sequence of a gene?
• Retrieve Replicon Sequence: Can a user retrieve the nucleotide sequence of a specified region of a replicon?
• Retrieve Protein Sequence: Can a user retrieve the amino-acid sequence of a protein?
• Nucleotide Sequence Alignment Viewer: Can a user compare the nucleotide sequence of a gene with orthologs from other organisms?
• Protein Sequence Alignment Viewer: Can a user compare the amino-acid sequence of a protein with orthologs from other organisms?
• Protein Phylogenetic Tree Analysis: Can a user construct a phylogenetic tree from a set of protein sequences?
• Sequence Searching by BLAST: Is searching for a sequence in a genome by BLAST supported?
• Sequence Pattern Search: Is sequence searching by short sequence patterns supported?
• Sequence Cassette Search: Is sequence searching by protein family recognition patterns supported?
Table notes: "Partial" means that the tool provides some but not all of the indicated functionality. (a) KEGG does have a rudimentary tool for this purpose, but it is not based on a zoomable genome browser. (b) PATRIC supports construction of trees from an arbitrary set of in-group and out-group genomes.
Table notes: Does the portal support multi-searches for genes and gene products based on the data fields or criteria listed? "Publication" means the ability to search for a gene based on a publication cited in the pathway entry. "Scaffold Length" means the ability to search for a gene based on the length of the scaffold it resides on. "Protein Family Assignment" means the ability to search for a gene based on what protein families it is assigned to (e.g., Pfam or TIGRFAM family). "Is Partial" means search for partial (truncated) proteins.

Metabolic Tools
Metabolic tools enable researchers to query, analyze, and compare information about metabolic pathways and reactions within an organism DB, to run metabolic models, and to analyze high-throughput data in the context of metabolic networks.
• Automated Metabolic Reconstruction: Starting from a functionally annotated genome, can the metabolic reaction network (and pathways) be inferred in an automated fashion?
• [...] knock-outs perturb the network, and to predict gene essentiality?
• Chokepoint Analysis: Can the site compute chokepoint reactions (possible drug targets) in the full metabolic reaction network? A chokepoint reaction is a reaction that either uniquely consumes a specific reactant or uniquely produces a specific product in the metabolic network.
• [...] can the site compute an optimal series of known reactions (routes) that converts the starting metabolite to the ending metabolite?
• Path Prediction Tool: Given a starting chemical compound, can the site predict a series of previously unknown enzyme-catalyzed reactions that will act upon the input compound and the products of previous reactions?
• Assign EC Number: Can the portal compute an appropriate Enzyme Commission number for a user-provided reaction?
Table notes: Does the portal support multi-searches for chemical compounds based on the data fields or criteria listed? "Ontology" means the ability to search for compounds based on a chemical ontology (classification). (a) This search will find pages of antimicrobial compounds.
Table notes: Does the portal support multi-searches for pathways based on the data fields or criteria listed? "Ontology" means the ability to search for pathways based on a pathway ontology (classification).
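The chokepoint definition given above can be illustrated with a minimal Python sketch. The three-reaction network, the reaction and metabolite names, and the function `find_chokepoints` are all invented for illustration; this is not the implementation used by any of the portals, and it ignores refinements such as reaction reversibility.

```python
from collections import defaultdict

# Toy network, invented for illustration: reaction -> (reactants, products).
REACTIONS = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"A"}, {"B"}),  # alternative route from A to B
    "R3": ({"B"}, {"C"}),
}


def find_chokepoints(reactions):
    """Return reactions that are the sole consumer or sole producer of a metabolite."""
    consumers = defaultdict(set)  # metabolite -> reactions that consume it
    producers = defaultdict(set)  # metabolite -> reactions that produce it
    for rxn, (reactants, products) in reactions.items():
        for metabolite in reactants:
            consumers[metabolite].add(rxn)
        for metabolite in products:
            producers[metabolite].add(rxn)

    chokepoints = set()
    for index in (consumers, producers):
        for rxns in index.values():
            if len(rxns) == 1:  # exactly one reaction uses or makes this metabolite
                chokepoints |= rxns
    return chokepoints


print(sorted(find_chokepoints(REACTIONS)))  # ['R3']
```

In this toy network, R1 and R2 are interchangeable and so neither is a chokepoint, whereas R3 is the only reaction that consumes B and the only one that produces C.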

Regulation Tools
BioCyc has a number of regulatory informatics tools that are not provided by any of the other portals. We list those tools here rather than providing a table.
• BioCyc includes a regulatory-network browser that depicts the full transcriptional regulatory network of the organism. The network diagram can be queried interactively and painted with transcriptomics data.
• The BioCyc transcription-unit page depicts operon structure including promoters, transcription factor binding sites, and terminators, the evidence for each, and describes regulatory interactions between these sites and associated transcription factors and small RNA regulators.
• BioCyc generates diagrams that summarize all regulatory influences on a gene, including regulation of transcription, translation, and of the gene product.
• BioCyc depicts transcription-factor regulons as diagrams of all operons regulated by a transcription factor.
• BioCyc can depict regulatory influences on metabolism by highlighting the regulon of a transcription factor on the BioCyc metabolic map diagram.
• BioCyc SmartTables can list the regulators or regulatees of each gene within a SmartTable.
• BioCyc can generate a report comparing the regulatory networks of two or more organisms.

Advanced Search and Analysis
These tools (see Table 7) enable researchers to perform complex searches and analyses, to retrieve data via web services and bulk downloads, and to create and manipulate user accounts. An explanation of the rows within Table 7 is as follows.
• [...] exist, or for programmatically calling KBase native apps to automate large-scale analyses.
- PATRIC provides a downloadable command line interpreter application that allows interactive submission of DB queries using a query language.
• User Account: Are user accounts available for logging in, and for storing data and preferences? "Opt/Req" means accounts are optional for some operations and required for other operations.
• Custom Notifications: Does the portal enable the user to register to be notified of curation updates in biological areas of interest to the user?
• Bulk Download Formats: What formats are supported by the portal for large-scale data downloads? The websites for bulk downloads are provided in section 1.1.

Table-Based Analysis Tools
Table-based analysis tools enable users to define lists of genes, proteins, metabolites, or pathways that are stored within the portal, and can be displayed, analyzed, manipulated, and shared with other users. These tools are called SmartTables by BioCyc and are called Carts by IMG. A typical series of SmartTable operations is to define a SmartTable containing a list of genes (such as from a transcriptomics experiment); to configure which DB properties are displayed for each gene within the SmartTable (such as the gene name, accession number, product name, and genome map position); to perform a set operation on the SmartTable, such as taking the intersection with another gene SmartTable; and to transform the gene SmartTable into, say, a SmartTable of the metabolic pathways containing those genes, or of the transcriptional regulators of those genes. KBase does not have a tables mechanism, but it does have a data sharing mechanism called narratives, which is not table-based.
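The set operation and transformation steps described above can be sketched in a few lines of Python. This is only a conceptual illustration using plain sets and dictionaries; the gene lists, the gene-to-pathway mapping, and the function `transform_to_pathways` are toy assumptions and do not represent the SmartTables or Carts interfaces.

```python
# Two hypothetical gene tables, e.g., from two transcriptomics experiments.
table_a = {"trpA", "trpB", "lacZ", "araB"}
table_b = {"trpA", "araB", "ftsZ"}

# Hypothetical gene -> pathways mapping standing in for the portal's DB.
GENE_PATHWAYS = {
    "trpA": {"tryptophan biosynthesis"},
    "trpB": {"tryptophan biosynthesis"},
    "lacZ": {"lactose degradation"},
    "araB": {"arabinose degradation"},
}


def transform_to_pathways(genes, gene_pathways):
    """Map a table of genes to the set of pathways containing those genes."""
    pathways = set()
    for gene in genes:
        # Genes with no pathway assignment contribute nothing.
        pathways |= gene_pathways.get(gene, set())
    return pathways


# Set operation: genes present in both tables.
shared_genes = table_a & table_b

# Transformation: gene table -> pathway table.
shared_pathways = transform_to_pathways(shared_genes, GENE_PATHWAYS)

print(sorted(shared_genes))     # ['araB', 'trpA']
print(sorted(shared_pathways))  # ['arabinose degradation', 'tryptophan biosynthesis']
```

The value of the portal tools is that users can chain such steps interactively, without writing code like this themselves.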
Table-based capabilities are summarized within Table 8; an explanation of its rows is as follows.

Data Content Among the Portals
Table 9 describes the types and quantities of data present in each web portal. An explanation of the rows within Table 9 is as follows.
• Genomes (Bact./Arch.): How many bacterial and archaeal genomes (organisms) does the portal provide access to? Only bacteria and archaea are counted here, although some resources provide eukaryotic and viral genomes. BioCyc genomes are sourced from RefSeq, GenBank, and from the Human Microbiome Project. KEGG genomes are sourced from [...]
Table 11 summarizes the number of capabilities present in each portal. In each row of Table 11 we have summed the counts in the column for each portal from the specified tables, with each "YES" counted as 1, each "partial" counted as 1/2, and each "no" counted as 0. These data are also presented in Figure 1. BioCyc received the highest tally (88). IMG (54) and PATRIC (53.5) were essentially tied for second. KEGG, KBase, and Ensembl Bacteria ranked fourth, fifth, and sixth with tallies of 32, 29.5, and 16, respectively.
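The scoring rule used for Table 11 (1 for "YES", 1/2 for "partial", 0 for "no") can be written as a short Python sketch. The function name `tally` and the example column are hypothetical; the example values are not taken from the paper's tables.

```python
# Score assigned to each table cell, following the rule stated in the text.
SCORES = {"yes": 1.0, "partial": 0.5, "no": 0.0}


def tally(cells):
    """Sum the scores of one portal's column; matching is case-insensitive."""
    return sum(SCORES[cell.strip().lower()] for cell in cells)


# Hypothetical column of four capability rows for a single portal.
example_column = ["YES", "partial", "no", "YES"]
print(tally(example_column))  # 2.5
```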

DISCUSSION
BioCyc has the most extensive multi-search capabilities, with IMG in second place; these portals provide users with the most extensive capabilities for finding desired information.
IMG has the most genomics capabilities, with PATRIC and BioCyc second and third. Ensembl Bacteria has the fewest genomics capabilities. BioCyc and IMG have the most powerful gene/protein multi-search capabilities. BioCyc has the most extensive capabilities for DNA/RNA site multi-searches.
BioCyc has the most extensive metabolic capabilities. KEGG ranks second; it lacks metabolic modeling capabilities, and it lacks network analysis tools such as dead-end metabolite analysis and chokepoint analysis. BioCyc has the most extensive metabolic multi-search capabilities, with IMG second. Table-analysis tools make extensive data analysis capabilities available to users that in many cases would otherwise require assistance from a programmer. BioCyc has the most extensive [...]
Notes to Table 11: Row "Genome" summarizes the major capabilities for genomics tools present in Table 1. Row "Metabolic" summarizes the major capabilities for metabolic tools present in Table 4. Row "Regulatory" summarizes the regulatory capabilities discussed in section 2.3. Row "Advanced" summarizes the major capabilities for advanced tools present in Table 7. Row "Tables" summarizes table-driven analysis capabilities for each portal present in Table 8. Row "Multi-Search" summarizes the number of multi-search capabilities for each portal present in Tables 2, 3, 5, and 6. Row "Data Types" summarizes the number of datatypes provided by each portal present in Table 9, from row "Genome Metadata" downward. Row "Totals" sums each column but excludes the Multi-Search row because Multi-Search operations tend to be much smaller than operations in other categories.
FIGURE 1 | Spider plot of the data in Table 11, excluding the Multi-Search row to enhance resolution.
PATRIC has the largest number of genomes, with KBase and IMG ranked second and third, respectively; KEGG has the smallest number of genomes. Most of the PATRIC genomes were assembled from whole-genome shotgun data and thus are expected to be of lower quality; only 11,803 PATRIC bacterial genomes are complete genomes.
KEGG provides the fastest loading gene pages; BioCyc pages are the second fastest. Pages for KBase, Ensembl Bacteria, and IMG are significantly slower. PATRIC gene pages are the slowest, loading 13.96 times slower than KEGG gene pages.
BioCyc contains the most extensive analysis capabilities for metabolomics and transcriptomics data, including painting omics data onto individual pathways, multipathway diagrams, and zoomable metabolic maps; enrichment analysis for GO terms, regulation, and pathways; and an Omics Dashboard.
BioCyc contains extensive unique content not included in any of the other portals, including regulatory network data, data on growth under different nutrient conditions, experimental gene essentiality data, reaction atom mappings (also present in KEGG), and thousands of textbook page equivalents of mini-review summaries. KEGG in particular lacks a diverse range of datatypes; for example, it lacks protein features, localization information, GO terms, and evidence codes.

CONCLUSIONS
Microbial genome web portals have a broad range of capabilities, and are quite variable in terms of what capabilities they provide. We assessed the capabilities of BioCyc, KEGG, Ensembl Bacteria, KBase, IMG, and PATRIC. BioCyc provided the most capabilities overall in terms of bioinformatics tools and breadth of data content; it also provides a level of curated data content (curated from 89,000 publications) that far exceeds that within the other sites. IMG ranked second overall, second in bioinformatics tools, and second in number of genomes. KEGG ranked third overall, PATRIC ranked fourth, KBase ranked fifth, and Ensembl Bacteria ranked sixth. IMG provided the most extensive genome-related tools, with BioCyc a close second. BioCyc provided the most extensive metabolic tools, with KEGG ranked second. Ensembl Bacteria provided no metabolic tools. PATRIC provided the largest number of genomes. BioCyc provided extensive regulatory network tools (and data) that are not present in any of the other portals. BioCyc provided the most extensive SmartTable tools and the most extensive omics data analysis tools.

AUTHOR CONTRIBUTIONS
PK directed the project and wrote much of the manuscript. NI, MK, NK, ML, PM, WO, SP, and RS researched the portals and contributed to the manuscript.