PGAweb: A Web Server for Bacterial Pan-Genome Analysis

An astronomical increase in microbial genome data in recent years has led to strong demand for bioinformatic tools for pan-genome analysis within and across species. Here, we present PGAweb, a user-friendly, web-based tool for bacterial pan-genome analysis, which is composed of two main pan-genome analysis modules, PGAP and PGAP-X. PGAweb provides key interactive and customizable functions that include orthologous clustering, pan-genome profiling, sequence variation and evolution analysis, and functional classification. PGAweb presents features of genomic structural dynamics and sequence diversity with different visualization methods that are helpful for intuitively understanding the dynamics and evolution of bacterial genomes. PGAweb has an intuitive interface with one-click setting of parameters and is freely available at http://PGAweb.vlcc.cn/.


INTRODUCTION
The advancement of sequencing technology and assembly tools has facilitated fast acquisition of genome sequences of various species over the past two decades. Tettelin et al. (2005) coined the concept of the pan-genome in. The pan-genome encompasses the entire repertoire of genes accessible to a studied phylogenetic clade or a given species, which is divided into the core genome (genes shared by all strains), dispensable genes (genes shared only by a subset of the strains), and strain-specific (unique) genes. Pan-genomics has far-reaching significance (Vernikos et al., 2015;Xiao et al., 2015), especially in bacterial research, and has been applied to many fields, such as species evolution (Lefebure and Stanhope, 2007;Rasko et al., 2008), investigation of pathogenic and drug resistance mechanisms (D'Auria et al., 2010;Hu et al., 2011;Zhang et al., 2011), vaccine development (Maione et al., 2005;Serruto et al., 2009), host and environment adaptability (Saenz et al., 2007;Sugawara et al., 2013;Tian et al., 2016), and population structures (Donati et al., 2010).
To make bacterial pan-genome analysis convenient and efficient, several tools, online databases, and web servers have been developed over the past two decades (Vernikos et al., 2015;Xiao et al., 2015). PGAP (Zhao et al., 2012), GET_HOMOLOGUES (Contreras-Moreira and Vinuesa, 2013), ITEP (Benedict et al., 2014), PanGP (Zhao et al., 2014), Roary (Page et al., 2015), and panX (Ding et al., 2018) were introduced as pan-genome analysis tools for clustering orthologous genes, identifying single nucleotide polymorphisms (SNPs), constructing phylogenies, function-based searching or analysis, and gene curation. However, they all are command-line tools demanding basic programming skills and with lack of in-depth inference and visualization of results. PGAT (Brittnacher et al., 2011), as a web-based database, provides many useful functions, such as predicting protein families, identifying SNPs, and providing analysis tools for Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway and other genome annotations, but a limited number of species are integrated and users' own data cannot be uploaded and analyzed. PanWeb (Pantoja et al., 2017), PANNOTATOR (Santos et al., 2013), and PanCGHweb (Bayjanov et al., 2010) are web servers for pan-genome analysis, but they all suffer from some flaws, such as limited functions or working only with microarray data.
In order to perform bacterial pan-genome analysis more efficiently, we have developed a user-friendly online tool, PGAweb, based on our previously published software, PGAP (Zhao et al., 2012), PanGP (Zhao et al., 2014), and PGAP-X (Zhao et al., 2018). PGAP and PanGP have been reported as the wellused toolkits for pan-genome analysis (Vernikos et al., 2015;Xiao et al., 2015;Zekic et al., 2018). To enhance interpretation and visualization of genome dynamics, we have integrated our newly developed software PGAP-X into PGAweb (Zhao et al., 2018). Based on these three tools, PGAweb serves as a fast pan-genome analysis web server with some great improvements. In brief, PGAweb has a user-friendly, web-based interface, one-click input data submission, and smooth and efficient data analysis, not only providing diverse methods for identifying orthologous genes, but also presenting genomic diversity with different visualization methods and genomic dynamics information.

METHODS AND IMPLEMENTATION
PGAweb consists of two core modules, PGAP and PGAP-X. PGAP supports five main analytic functions including orthologous clustering, pan-genome profiling, sequence variation analysis, species phylogeny, and gene functional classification. PGAP-X performs four comparative genomics assignments: genome-wide sequence alignment, orthologous clustering, pangenome profiling, and sequence variation analysis (Figure 1). Compared with PGAP, PGAP-X introduces a novel algorithm for orthologous cluster identification, which can distinguish paralogs based on conserved genomic location from a genome alignment. PGAP-X also supports a visualization function for genome-wide sequence alignment, genome structure variation, and orthologous mapping based on sequence conservation. Users are able to choose one of the modules according to their research objectives.

PGAP Module
In this module, two different methods, MultiParanoid (MP) and GeneFamily (GF), are provided for orthologous clustering. In the MP method, first, orthologs between each pair of strains are identified by InParanoid (Remm et al., 2001;Ostlund et al., 2010); MP is then invoked to search for gene clusters among multiple strains based on the result of pairwise orthologous clusters from Inparanoid (Alexeyenko et al., 2006). With the GF method, all the protein sequences from the input data are first mixed together and an all vs. all sequence alignment is performed by the BLASTALL program; then, the MCL algorithm (Enright et al., 2002;Van Dongen, 2008) is directly invoked to cluster the filtered BLAST results. To detect the relationships between the numbers of genomes and pan-genome size or core gene clusters, pan-genome profiles are calculated based on orthologous clustering from all strains with the Heaps' law or exponential model (Tettelin et al., 2005;Rasko et al., 2008). Genetic variation analysis and gene functional classification are also performed in this module. For each gene cluster, mutations, and indels are detected by protein sequence alignment of each cluster using MAFFT (Katoh and Toh, 2010) and then the aligned amino acid sequences are converted into corresponding codonbased nucleotide sequences. Functional classification analysis of gene clusters includes description of each cluster's function and Clusters of Orthologous Groups (COG) classification (Tatusov et al., 2000) according to uploaded annotation files. The function description with highest frequency excluding "hypothetical protein" for all genes in each cluster is taken as the description of the cluster. For further elucidating genomic diversity, species evolution analysis is performed by two algorithms [Neighborjoining and Unweighted Pair Group Method with Arithmetic Mean (UPGMA)] using PHYLIP (Felsenstein, 1989) based on two types of data: gene gain/loss matrix or SNPs in core gene clusters. The maximum-likelihood algorithm is only used in the analysis based on SNPs in core gene clusters.

PGAP-X Module
In this module, a new in-house algorithm is used for orthologous gene clustering (Zhao et al., 2018). Whole-genome sequences or pseudo-chromosome sequences are aligned by progressiveMauve (Darling et al., 2010). In visualizing alignment results, colored blocks are used to display homologous DNA fragments across strains, which highlight conserved regions, strain-specific blocks, and recombination events including insertions, inversions, and deletions on a large scale. The distribution of orthologous genes in each conserved block is displayed in different colors that represent degrees of conservation for each gene. A pan-genome profile is calculated based on the same methods as in the PGAP module. Genetic variation analysis among pairwise or multiple genomes is presented. In this part, two vital parameters, variation frequency (f ), and variation number (n) in 1 kb regions, are used to filter out high-substitution regions. Regions with the following traits will be identified as high-substitution regions: (a) no fewer than n substitution sites in the region; (b) variation frequency in the region is more than f ; (c) the interval between any two substitution sites is less than 1/f. The high-substitution regions are marked on the genome structure. For pairwise variation analysis, a reference genome and a query genome should be selected.

Implementation
The web server is compatible with many platforms, including Safari, Firefox, Opera, Chrome, and Internet Explorer, by using Bootstrap Framework 1 as the front-end program. It adopts FIGURE 1 | Organization of PGAweb. All analysis steps are grouped into three parts: data input, data analysis, and output-visualization (shown with different color codes and separated by dashed lines). The input data of each module will be checked. The filtered data will be used in the subsequent analysis. Green boxes report the analysis functions of each module. For the output, PGAweb provides one text file corresponding to a graph (genetic variation analysis in the PGAP module only provides tabular text).
HTML5 and CSS3 protocols and uses D3.js 2 and Echarts 3 for result interaction. The server back-end adopts an Express Framework 4 for Node.js MongoDB 5 is used for the storage of related information. Docker 6 is used as a container for packaging codes and software, which improves compatibility, portability, and safety. This website is open to all users, and login is not required.

WORKFLOW OF PGAWEB
Step 1: Module Selection The general workflow of PGAweb is shown in Figure 2. The first step is module selection for pan-genome analysis according to data type and research objectives. Two different pan-genome analysis modules are integrated in PGAweb. The PGAP module supports both draft and complete genomes and provides five main results, including orthologous clustering, pan-genome profiling, sequence variation analysis, species phylogeny, and gene functional classification. Most results generated by the PGAP module are text files. Detailed information can be found in these text files for in-depth pan-genome analysis. The PGAP-X module only supports analysis of complete genomes. The PGAP-X module's capabilities include genome alignment, orthologous gene identification, genome variation detection, and pan-genome profile analysis. PGAP-X provides an in-house algorithm that can distinguish paralogous genes from orthologous genes based on conserved genomic location. The visualization function for genome structure conservation and gene distribution is integrated in the PGAP-X module.
Step 2: Input Data Upload and Parameter Setting This step contains four procedures (Figure 2). First, specific input data for pan-genome analysis should be selected and uploaded. The PGAP module requires three types of files for each strain: nucleotide and protein sequences as well as annotations. Four input formats are selectable: (1) NCBI new data format: .cds_from_genomic.fna.gz, protein.faa.gz, and .feature_table.txt.gz extensions. These files can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/all/. (2) NCBI old data format: .ffn, .faa, and .ptt extensions. These files can be downloaded at ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/ old_genbank/. (3) PGAP raw data format: .nuc, .pep, and .function extensions. These data may come from the user's own laboratory..nuc files store nucleotide sequences in FASTA format; .pep files store corresponding protein sequences in FASTA format; .function files store annotations with three columns including gene name, COG classification, and function description. (4) NCBI GenBank (full) format: .gb extension. In addition, each protein sequence should have its corresponding nucleotide sequence and annotation information, while the length of the nucleotide sequence minus 3 (excluded stop codon) must be equal to three times the corresponding protein length. All uploaded files are auto-transformed to corresponding PGAP raw data format and criterion checking is performed. The PGAP-X module needs a complete genome sequence and annotation information for each strain. The input files should have .fna and .ptt extensions. More detailed description is available at http://pgaweb.vlcc.cn/doc. The second procedure is the selection of the orthologous gene identification method and pan-genome analysis functions. In the PGAP module, either the GF method or the MP method can be chosen for orthologous gene identification. The GF method is fast, but less accurate compared with the MP method. The GF method is suitable choice for handling more than 50 genomes with consideration of calculation time. In the PGAP-X module, there is only one method for orthologous gene identification. After selection of gene identification method, users need to select the desired functions for pan-genome analysis. Homologous gene clustering for the PGAP module and genome alignment for the PGAP-X module is required. Third, parameters for selected functions need to be set. Finally, users may provide e-mail addresses for obtaining notification of finished results and click the submit button for running PGAweb.
Step 3: Output Description The output of the PGAP module is composed of five parts: orthologous clustering, pan-genome profile, evolutionary analysis, sequence variation analysis, and functional classification. For orthologous clustering, the results include an intuitive figure showing statistical gene distribution by conservation and two downloadable tabular files (1.Gene_Distribution_By_Conservation.txt and 1.Orthologs_Cluster.txt). For the pan-genome profile, the relationships between the numbers of genomes and pan-genome size or core gene clusters are calculated. Whether a pan-genome is open or closed can be easily inferred from visual results. Pan-genome profiles provide indirect indications of lifestyle and opportunity for multiple mechanisms of DNA exchange with the global microbial gene pool of the specific species. Multiple algorithms for bacterial evolutionary analysis are provided to improve the accuracy of phylogenetic inference, especially for closely related species. All phylogenetic trees calculated by PHYLIP are output as .tree files, and the user can use visualization software to view the phylogenetic trees. Files with .outfile extension give users a simple view of the phylogenetic trees.
For sequence variation analysis, information on positions with variation, amino acid types, nucleotide types, and variation types (indels, and synonymous and non-synonymous mutations) are presented in tabular file 3.CDS.variation.txt. All statistical results of functional classification of core genes, dispensable genes, and strain-specific genes are dynamically displayed and compared. This result is useful for speculation of strain-specific physiological function and metabolic mechanism. Moreover, the actual gene lists (5.Orthologs_Core_Cluster_COG_Distribution.list. txt, 5.Orthologs_Dispensable_Cluster_COG_Distribution.list.txt, and 5.Orthologs_specific_Cluster_COG_Distribution.list.txt) used in the functional classification can be downloaded for in-depth analyzes.
Four parts of the pan-genome analysis results are showed in the PGAP-X output. Complete genome or pseudo-chromosome sequences are aligned in PGAP-X. In visualizing the alignment result, colored blocks are used to display homologous DNA fragments across strains, which highlight conserved regions, strain-specific blocks, and recombination events including insertions, inversions, and deletions on a large scale. The orthologous gene distribution in each conserved block is displayed in different colors that represent degrees of conservation for each gene. Genomic diversity of the bacterial population and differences in gene content are well depicted, and even horizontal gene transfer (HGT) or homologous recombination events are easily identified. Meanwhile, multiplegenome substitution and pairwise substitution regions are analyzed in PGAP-X. The high-substitution regions are marked on the genome structure with blue bars. For pairwise substitution region analysis, a reference strain must be selected; all sites of variation among paired genomes can be detected and reported in output text files but only high-substitution regions of selected strains are displayed. The influence of selective pressure on each genomic region or gene can be intuitively showed. Some highly variable regions may be involved in pathogenicity islands (PAIs) or related to strain adaptability. This may be used for SNP-based barcode design and gene selection for multilocus sequence typing (MLST). For the pan-genome profile, the result is the same as shown in the PGAP module.

CASE STUDY
To evaluate the performance of PGAweb, we carried out a case study for Streptococcus suis. The two modules, PGAP and PGAP-X, were used with default parameters (GF method was selected to identify orthologous genes in PGAP). The PGAweb results are described as follows.
The analysis revealed that S. suis has an open pan-genome, whose pan-genome increases in number by 110 genes when a newly sequenced genome is added (Figures 3A,B). The phylogenetic results give us a clear idea of the phylogenetic relationship among different serotypes of the 13 S. suis strains. Taking one of the phylogenetic trees (UPGMA tree based on the presence or absence of orthologous genes) as an example (Figure 3C), the 13 strains are clearly grouped into two subgroups and all the serotype 2 strains are tightly grouped in the same clade. Serotype 1, 3, 7, and 9 strains could be assigned to a common clade. Serotype 14 and 1/2 strains were more closely related to the serotype 2 strains than to the other four serotypes. This result is consistent with earlier findings by Zhang et al. (2011). From genome structure conservation analysis (Figure 3D), the 89K PAI exists in three serotype 2 strains isolated in China (NC_009442, NC_009443, and NC_012924), while the other two Chinese serotype 2 strains (NC_017617 and NC_017622) harbored on different branches of the phylogenetic tree ( Figure 3C) do not contain this PAI. This result confirms an earlier observation that the 89K PAI is prevalent and specific for Chinese serotype 2 virulent strains (Chen et al., 2007). The genes on the PAI are dispensable genes with lower-grade conservation ( Figure 3E). This suggests that these genes located in a cluster might have been introduced through HGT or homologous recombination with other closely related species .

DISCUSSION
Since the term of microbial pan-genome was introduced in 2005, several stand-alone and web-based tools have been developed for pan-genome analyses (Supplementary Table S1). However, stand-alone tools always require basic programming skills for installation and operation. We developed PGAweb for in-depth inference and visualization in bacterial pan-genome analysis. PGAweb has a user-friendly, web-based interface, one-click input data submission, and smooth and efficient data analysis. No installation, command lines, or informatics skills are necessary for using PGAweb, making it very efficient for users, especially experimental scientists. PGAweb provides various interactive and customizable functions including orthologous clustering, pan-genome profiling, sequence variation, evolutionary analysis, and functional classification. Different visualization methods are used for enhanced genomic structural dynamics and sequence diversity display. These are helpful for intuitively understanding the dynamics and evolution of bacterial genomes. According to the pan-genome concept, three genomic data-sets is the minimum number for pan-genome analysis. In the case of technical and resource limitations, PGAweb is only suitable for performing analyzes on the pan-genome of a moderate number of strains (no more than 50 strains). The key issue for large scale pan-genome analysis is the input data uploading with limited network speed. We estimate that the input data size will be exceed 500 Mb for 50 strains pan-genome analysis, which will cost users a very long time for input data uploading. Furthermore, many factors may influence the runtime of PGAweb, such as the number input data sets, genome sizes, selected method, etc. The running time grows exponentially with the increase in number of input data sets due to the computational complexity (Supplementary Figure S1). The current version of PGAweb still has some shortcomings, which only supports the pan-genome analysis of bacterial and archaeal genomes so far. With the development of sequencing technology, the pan-genome concept has also been widely utilized to perform comparative genomic analysis in fungi (Dunn et al., 2012) and plants (Cao et al., 2011;Li et al., 2014;Golicz et al., 2016). In a future version, we plan to develop a new algorithm to support the analysis of eukaryotes, especially plants and animals, and integrate more efficient methods for large scale pan-genome analysis.

CONCLUSION
Here, we present a fast and freely available online server, PGAweb, with integration of our previously published software, PGAP, PanGP, and PGAP-X. In brief, PGAweb has a userfriendly web interface, one-click input data submission, and smooth and efficient data analysis, and provides not only diverse methods for identification of orthologous genes but also powerful tools to display genome dynamics and structure characteristics. Case study indicates that PGAweb can intuitively reflect the characteristics of bacterial pan-genomes such as an open or closed pan-genome. Orthologous clustering, gene distribution based on conservation, genome structure conservation, and genetic variation, as well as their visualization, are all helpful for biologists to study bacterial genome dynamics to identify pathogenic and drug-resistant genes and their sequence contexts.

AUTHOR CONTRIBUTIONS
JX, ZJ, and JAW conceived and designed the web server. XC performed the server front-end and back-end. YAZ carried out the case studies. YOZ, CS, MY, JNW, QL, BZ, and MC contributed to the web server testing. YAZ, ZZ, and JX wrote the paper. JY polished the language and gave many constructive suggestions.