Homology-based reconstruction of regulatory networks for bacterial and archaeal genomes

Gene regulation is a key process for all microorganisms, as it allows them to adapt to different environmental stimuli. However, despite the relevance of gene expression control, for only a handful of organisms is there related information about genome regulation. In this work, we inferred the gene regulatory networks (GRNs) of bacterial and archaeal genomes by comparisons with six organisms with well-known regulatory interactions. The references we used are: Escherichia coli K-12 MG1655, Bacillus subtilis 168, Mycobacterium tuberculosis, Pseudomonas aeruginosa PAO1, Salmonella enterica subsp. enterica serovar typhimurium LT2, and Staphylococcus aureus N315. To this end, the inferences were achieved in two steps. First, the six model organisms were contrasted in an all-vs-all comparison of known interactions based on Transcription Factor (TF)-Target Gene (TG) orthology relationships and Transcription Unit (TU) assignments. In the second step, we used a guilt-by-association approach to infer the GRNs for 12,230 bacterial and 649 archaeal genomes based on TF-TG orthology relationships of the six bacterial models determined in the first step. Finally, we discuss examples to show the most relevant results obtained from these inferences. A web server with all the predicted GRNs is available at https://regulatorynetworks.unam.mx/ or http://132.247.46.6/.


Introduction
Bacterial and archaeal organisms respond to diverse stimuli via the subtle mechanism of regulation of gene expression at the transcriptional level, and this involves DNA-binding proteins known as transcription factors (TFs). These proteins act by interacting with specific sites, usually upstream of the transcription start site, inducing or blocking access of the RNA polymerase to the promoter. In general, when a TF binds at a site that overlaps the promoter region of a gene, the system is repressed; when the binding site is upstream of the promoter, the system is activated (Browning and Busby, 2016). In addition, this regulatory system is coordinated with the sensing of endogenous or exogenous stimuli by these regulatory proteins, i.e., they have the ability to sense diverse conditions for the cell to contend against environmental changes. For instance, in the bacterium Escherichia coli K-12, approximately three-quarters of TFs respond directly to extracellular signals through phosphorylation and binding to small molecules, such as allolactose or maltose (Balderas-Martínez et al., 2013).
In this context, the regulatory system can be conceptualized as a circuit, where one TF can regulate multiple Target Genes (TGs) and multiple genes can be regulated by one or diverse TFs, all of them assembled into a gene regulatory network (GRN). In GRNs, nodes represent genes and the connections between them indicate that the TF-encoding gene regulates another gene; this type of network can be represented by directed graphs (Karlebach and Shamir, 2008).
To date, GRNs have been determined for only a few bacterial models from three different phyla: Proteobacteria, including Escherichia coli K-12, Salmonella enterica subsp. enterica serovar typhimurium LT2, and Pseudomonas aeruginosa PAO1; Firmicutes, including Bacillus subtilis 168 and Staphylococcus aureus N315; and Actinobacteria, including Mycobacterium tuberculosis. The lack of GRNs for most microorganisms is due to the fact that reconstruction depends largely on experimental data. Therefore, the inference or expansion of regulatory relationships between TFs and their TGs in organisms beyond the bacterial models will allow us to understand diverse biological processes, such as cell growth, response to environmental changes, or cell division, among others.
In this regard, various approaches have been explored to reconstruct regulatory networks in bacteria, such as RegPrecise (Novichkov et al., 2010), with a large amount of information available for regulons of diverse organisms, or the work of Castro-Melchor et al. (2010) based on the transcript and functional similarities to infer regulatory networks in Streptomyces coelicolor, among others. However, the main limitations of these reconstructions are associated with the experimental information data.
Hence, to determine the GRNs in bacterial and archaeal genomes with no information on their regulatory interactions, we mapped orthologous interactions among the six bacterial models to identify novel TF-TG interactions. Next, we used a guiltby-association approach to infer the GRNs for 12,230 bacterial and 649 archaeal genomes, based on TF-TG orthology relationships of six bacterial species with well-known regulatory interactions and Transcription Unit (TU) assignments (i.e., operonic organization). The "guilt-by-association" principle has been applied to deduce functional relationships (Oliver, 2000), and used to predict gene function in various types of biological networks, for example in virulence factors of the bacterial pathogen, Aeromonas veronii (Li et al., 2021). The reconstructed networks were evaluated in terms of their topological properties, identifying TFs as hubs, modules, and co-regulated genes. Thus, our approach allowed us to confer a degree of accuracy regarding the existence of each inferred interaction. Therefore, the predicted interactions must be considered as a starting point to further exploration, both in silico and experimentally. We suggest that posterior analysis must consider the identification of DNA-binding sites upstream the probable regulated gene or a functional analysis with Gene Ontology and global expression profiles, as it has been already suggested in other cellular systems beyond bacteria and archaea (Chen, 2017). Finally, a web server with all the predicted GRNs is available to the scientific community at https://regulatorynetworks. unam.mx/ or http://132.247.46.6/.

Data and methodology
Genomes used for reference The information for six bacterial genomes used in this work was downloaded from either the NCBI server or RegulonDB: E. coli K-12 MG1655 (NC_000913.3, GCF_000005845.2), B. subtilis 168 (GCF_000009045.1), P. aeruginosa PAO1 (GCF_000006765.1), S. typhimurium LT2 (GCF_000006945.2), S. aureus N315 (GCF_000009645.1), and M. tuberculosis (GCF_000195955.2). For each genome, the FASTA sequence was obtained from the "gbff/gbk" files parsed with an ad hoc program (Supplementary material, ParserGBK.py), to add the appropriate label in the header: NCBI gene ID, local gene ID, gene name, product description, and organism name. Sequences with missing information were annotated as "NODATA. " In addition, the 12,230 genomes of bacteria and 649 archaeal genomes were downloaded from the NCBI RefSeq genome database on May 18, 2021, to infer their GRNs.

Ortholog identification
The protein sequences from each model organism were used as reference to identify the orthologs in an all-vs-all genomes fashion using the program Proteinortho (Lechner et al., 2011) with the following parameters: E-value ≤10 5 , coverage ≥70%, and identity of ≥25%, as previously described for the identification of TFs (Flores-Bautista et al., 2020).

Transcription units
The predictions of Transcription Units (TUs) or operons were obtained using the method described by Moreno-Hagelsieb and Collado-Vides (2002). In brief, the predictions were based on the transcription direction and the intergenic distance (shorter intergenic distances and in the same direction for genes in the same TU).

Inference of GRNs
The reference genomes were used to scan the 12,230 bacterial and 649 archaeal genomes to identify their orthologs and map their interactions considering the following criteria: If the orthologs of the TF and its TG of the model organism were found in a new genome, the interaction was assigned using guilt by association. In a second step, predicted TUs were used to expand the TF-TG interactions as follows: If the first gene of the orthologous TG in an organism corresponded to the first gene in the TU, the other genes belonging to the TU were associated with the same TF. Finally, each network was integrated using all the ortholog assignments with the six reference GRNs. All the network interactions can be inferred by running the script pipeline.sh, provided as Supplementary material and Figure 1.

Regulatory modules
The GRNs were analyzed by using Cytoscape (Shannon et al., 2003;Otasek et al., 2019) to obtain their degree, clustering coefficient, and other centrality metrics. Hubs were obtained by using networkX from python (Hagberg et al., 2008). In addition, to identify transcriptional co-regulators and modules in a GRN, the CoReg software was used. In brief, CoReg calculates gene similarities based on the number of common neighbors of any two genes in the network (Song et al., 2017).

Web server
The GRNs inferred for all the bacterial and archaeal genomes are available through the web server at https://regulatorynetworks. unam.mx/, which is built on HTML5, JQuery, and Php languages, while the data are stored in a MySQL database. For the data display, we use the Cytoscape JS (Franz et al., 2016) framework due its capabilities to represent nodes and edges of the network with determined properties, allowing users to change forms, colors, and layer visualization of the network.

Method performance and statistical analysis
GRNs were compared using two different approaches to establish the reliability of the approach, one based on the ability to recover edges and the second focusing on the ability to recover network motifs (Milo et al., 2002) by comparing the six reference networks with networks for the same genomes generated using our approach.
First, based on the orthologous annotations made with Proteinortho, we created a GRN for each of the reference bacteria using a naming convention that ensures that genes that are orthologous among them share the same name in each of the six GRNs used as reference. Then, we created networks for each reference organism based on the regulations transferred from the other five GRNs, again using the consensus gene names. GRNs with consensus gene names were then compared, following two procedures implemented in LoTo (Martin et al., 2017). We employed binary classification metrics to evaluate the similarities between pairs of GRNs as follows: Edges present in both compared networks are considered true positives (TPs),  To establish a baseline and determine whether the results from our approach are significant versus what can be expected by chance, we also created a protocol to determine the expectancy of a transferred TF-gene regulation by chance. We randomized the names TFs for the whole inferred networks 10,000 times to calculate expected TP, FP, TN, and FN values by comparing these randomized networks against their reference counterparts. This protocol ensures comparisons of random networks with the same characteristics, e.g., edges, TFs, and genes, against their actual reference. We then employed a G-test as implemented in SciPy (Virtanen et al., 2020) to determine whether the observed number of edges considered TP, FP, TN, and FN can be from the same distribution as that observed for the predicted networks without randomization.

Identification of new interactions in bacterial models
In order to evaluate and expand the GRNs of the six model organisms, the number of TFs, TGs, and their interactions was determined. To do this, we downloaded six GRNs, and their interactions were displayed by using Cytoscape. In this work, we considered TFs as those proteins that activate or repress gene Flux diagram showing the inference of the GRNs. Six bacterial models were used to infer the GRNs in 12,230 bacterial and 649 archaeal genomes. If the pair A (TF) -B (TG) in a reference genome is identified (by orthology) in a new genome A′ (TF) -B′ (TG), the interaction is assigned. In addition, if the TG identified in the new genome is the first one in the TU, the interaction is extended to the other gene(s). One interaction in a new genome can be derived from one or more bacterial models. Finally, the reconstructed networks were evaluated in terms of their topological properties.
Frontiers in Microbiology 05 frontiersin.org expression but do not belong to the transcriptional basal machinery; therefore, sigma factors, antiterminators, terminators, and sensor proteins, among other proteins, were excluded from the resulting data set (Martínez-Núñez et al., 2013). i.e., 76% of the complete collection is concentrated in few organisms; in contrast, 24% of the collection is distributed among 78 different prokaryotes. This contrast in the information is also evident in more general databases, such as UniProtKB/Swiss-Prot, where E. coli K-12 is the bacterial organism with more proteins deposited and curated manually in the database. 4 To determine the number of interactions shared between the six model organisms, we first used the program Proteinortho to assign orthology relationships between all proteins in the proteome of each bacterium. Once orthologous proteins were determined, we inferred regulatory interactions between organisms based on the presence of an orthologous TF and an orthologous target of that TF in the model GRN. In the second step, the interactions were expanded by using the TU assignments, as described in Materials and Methods. This comparison showed that E. coli and S. typhimurium LT2 share a high number of interactions, because of their phylogenetic closeness. In contrast, the actinobacterium M. tuberculosis is the organism with the lowest number of shared interactions with the other bacterial models as a consequence of its phylogenetic distance; only 12% (in average) of its interactions are shared with other bacteria (see Supplementary material).

uniprot.org
In order to infer new interactions among the six bacterial genomes, they were compared and their interactions were assigned based on the presence of the TF-TG orthologous pairs. In this regard, Table 1 shows the number of new assignments and their proportion per organism. From this analysis, we found between 776 and 2,499 new interactions, with S. typhimurium LT2 and P. aeruginosa the organisms determined to have more new interactions inferred. These larger numbers for S. typhimurium LT2 and P. aeruginosa are probably a consequence of their phylogenetic closeness with E. coli K-12 (Fukushima et al., 2002) in comparison to the other organisms used as models. It is important that some regulatory interactions were found in more than one organism; therefore, the sum of the rows may not correspond to the total number of new interactions, as is the case for the regulator PhoB (NP_414933.1) of E. coli K-12, which regulates the cytochrome bd-I ubiquinol oxidase subunit (NP_415262.1), as inferred from the interactions previously described in the B. subtilis and M. tuberculosis networks.

Performance estimation of the approach
Regarding the reliability of interactions predicted by our approach, we compared networks with only TF-TG interactions derived from homology relationships for each of the six species with the respective reference GRNs. The comparisons were made by considering this to be a binary classification problem, and thus, edges (and graphlets) in both the reference network and the predicted GRN are TPs, edges only in the reference are FNs, and edges only in the predicted network are FPs. These results, shown in Table 2 for single edges and in Table 3 for graphlets, indicate a varying range of values depending on the compared bacteria. For recall (R), the rate of recovered TF-TG interactions ranged from 0.028 for M. tuberculosis to 0.52 for S. enterica, whereas precision (P), which indicates the likelihood that the existence of an edge is correctly predicted, ranged from 0.20 for P. aeruginosa to 0.69 for S. enterica. These results are significantly different from those expected by chance, as shown by the very low p-values obtained with the G-test. When the same metrics for the presence and absence of graphlets were used (Table 3), we found a similar trend for each model GRN but with lower values for each metric. Lower values for Precision (P), Recall (R), and F1 were calculated using the true positive (TP), false positive (FP), and false negative edges (FN). P-value of the G-test indicates the significance of the differences between the averaged counts of TP, FP, TN, and FN in the 10,000 randomizations of the inferred networks and the results shown in the table.
Frontiers in Microbiology 06 frontiersin.org Columns as are follows: Genome name; columns 1, 3, 5, and 7 indicate the Targets, TFs, nodes, and number of interactions identified in the original networks; columns 2, 4, 6, and 8 indicate the Targets, TFs, nodes, and number of interactions identified in the extended networks.
the metrics calculated with graphlets are expected, since a single edge that differs between two networks often affects various graphlets.

The expanded GRNs identified new TF-TG interactions
Based on the expanded networks, we identified new TF-TG interactions described in Table 4 that must be exhaustively analyzed. In this regard, we found an increase in the number of targets, TFs, nodes, and interactions for all the bacterial and archaeal extended networks. For instance, for M. tuberculosis H37Rv, there was an increase of 776 new interactions (305 new TGs and 31 new TFs), whereas for B. subtilis, 1821 new interactions (36 new TFs and 553 new TGs) were identified. Therefore, we performed a literature search to find evidence to support our predictions. Based on these searches, and considering the 1,676 new interactions for E. coli K-12 (56 new TFs and 570 new TGs), we identified that 179 of these interactions have been described in the literature (Supplementary material); however, they are not deposited in RegulonDB. In particular, we found that the interaction of SoxS and ompW in the GRN of E. coli and inferred from S. enterica has been experimentally described. In E. coli, ompW is regulated by three TFs, as described in RegulonDB; however, we found that it could be also regulated by SoxS (Zhang et al., 2020) in a negative fashion.
We also found a new interaction, where CpxR could be regulating tar gene. This TF together CsgD has been described in bacterial adhesion, and belongs to the stationary-phase response (Santos-Zavaleta et al., 2019b). Experimental evidence suggests that CpxR and CsgD repress the transcription of fliA, flgM, and tar (Dudin et al., 2014), in addition to bglg and bglb (Mattéotti et al., 2011), and PdhR and lipA (Kaleta et al., 2010). These regulatory interactions identified by our orthologs inferences have not been documented in RegulonDB.

Web server
The GRNs inferred in all the bacterial and archaeal genomes are available through a web server whose interface is shown in Figure 2. The GRN of user-selected organisms are shown in an embedded interactive display that through a very intuitive mousebased interface allows the user to select subnetworks and different types of regulatory interactions. Edge and node colors can also be redefined, as well as the layout used in the network visualization, depending on their properties. Additionally, the user can display and visualize information related to Genes (name, protein ID, initial and end coordinates, and strand), and edges among nodes representing genes, including information about whether this is a new or known edge and the organism from which it was derived. Additionally, if information is available, by clicking on the node name or protein identifier, you can access the NCBI/Uniprot page related to the gene of interest.
Entire GRNs or used defined subnetworks can be downloaded in standard format for further inspection with tools such as Cytoscape, that in addition, connect our tool to the whole array of apps already available for this visualization tool. For more information and a more detailed description of both the input and

Conclusion
In this work, we have expanded the GRNs for six model organisms, by considering orthologous inference and TU assignments. This inference is based on the assumption that orthologous TFs generally regulate the expression of orthologous TGs (Yu et al., 2004;Galán-Vásquez et al., 2011;Lenz et al., 2020;Soberanes-Gutiérrez et al., 2021). The inferred interactions were included in the GRN, and their topological properties were calculated. In a second step, we inferred the GRNs for 12, 879 genomes, based on TF-TG orthology relationships of six bacterial species with well-known regulatory interactions and TU assignments. We discuss some examples to show the most relevant results obtained from this inference, and topological metrics are calculated for these networks. Therefore, our approach to reconstruct regulatory networks is a valuable resource of regulatory interactions occurring within bacteria and archaea cellular domains, and it may integrate with global expression data available for these organisms in order to improve global interaction data models. From an evolutionary perspective, the dynamics to expand or modify the repertoire of cellular functions that transcription factors control involves: (a) transcriptional rewiring whereby the promoters of orthologous genes in related species differ in the presence or absence of a binding site(s) for a conserved transcription factor(s); (b) embedding horizontally acquired genes under regulation of an ancestral transcription factor; (c) restructuring of the promoters controlled by a transcription factor; and (d) modifications in the transcription factors themselves (Perez and Groisman, 2009;Pérez-Rueda et al., 2009). In this context, the inference of archaeal GRNs was based under the hypothesis that bacteria and archaea share a common ancestry in terms of their TFs, with posterior divergence (Bell and Jackson, 2001;Minezaki et al., 2005), whereas the origin of the ancestral basal transcriptional machinery cannot be ascertained, and it could have been bacterial or archaeal-eukaryal type. For instance, 53% of the total repertoire of archaeal TFs exhibit at least one homologue in bacterial genomes. In particular, archaea and clostridia share a common set of TFs classified in diverse evolutionary families (Kyrpides and Woese, 1998;Bell, 2005;Pérez-Rueda and Janga, 2010), different to the families shared with several Actinobacteria and some Gammaproteobacteria. This reinforces the notion that TFs of bacteria and archaea share a common ancestry and highlight a close relationship between the TFs from archaea and Firmicutes. In addition, bacteria and archaea share an operonic organization (Seitzer et al., 2020;Sueda et al., 2021). Thus, the Online interface of the "regulatory networks" server storing the publicly available database. Diverse options are available for the user: a description of the system, a page to download the raw data, and the core section of the web to filter a GRN (purple box). To visualize load a network, the user can select the Gene Regulatory Network of the organism of interest in the "load a network" panel. In the Select network box, the user can Start selecting the name of the organism and click on the Load button to visualize the network on the right window (red box). This action will load the graph (black box) and node/edges properties (cyan box). Diverse layouts can be applied to visualize the network and specific nodes/edges to generate a new subgraph (green box) can be selected. As the graph visualization could be modified, the user can center/fit the network (White buttons) or reset the current visualization (Yellow button). Finally, for displayed nodes and edges, the user can download this network (Green button). In the example, in the right panel, the network is associated with the transcription factor DnaA (diamond) and their Target Genes (circles). Edges represent the transcription direction (when it is available). In the low panel, the TGs under the regulation of the TF are shown: NCBI ID, gene name, protein ID, start and end position, and strands. For more details of the web application, please refer to the Supplementary material.
Frontiers in Microbiology 08 frontiersin.org experimental information concerning GRN in archaea is limited. For instance, the GRN of Pyrococcus furiosus shows seven regulons and 279 genes, which represent 13.5% (279 genes) of the total genes in this archaeon (Denis et al., 2018). Therefore, inferences of GRN are central to explore in detail the organisms included in this cellular domain. Finally, we have provided readers with a website where all the networks can be analyzed and downloaded.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary material, further inquiries can be directed to the corresponding authors.