Zetaproteobacteria Pan-Genome Reveals Candidate Gene Cluster for Twisted Stalk Biosynthesis and Export

Twisted stalks are morphologically unique bacterial extracellular organo-metallic structures containing Fe(III) oxyhydroxides that are produced by microaerophilic Fe(II)-oxidizers belonging to the Betaproteobacteria and Zetaproteobacteria. Understanding the underlying genetic and physiological mechanisms of stalk formation is of great interest based on their potential as novel biogenic nanomaterials and their relevance as putative biomarkers for microbial Fe(II) oxidation on ancient Earth. Despite the recognition of these special biominerals for over 150 years, the genetic foundation for the stalk phenotype has remained unresolved. Here we present a candidate gene cluster for the biosynthesis and secretion of the stalk organic matrix that we identified with a trait-based analyses of a pan-genome comprising 16 Zetaproteobacteria isolate genomes. The “stalk formation in Zetaproteobacteria” (sfz) cluster comprises six genes (sfz1-sfz6), of which sfz1 and sfz2 were predicted with functions in exopolysaccharide synthesis, regulation, and export, sfz4 and sfz6 with functions in cell wall synthesis manipulation and carbohydrate hydrolysis, and sfz3 and sfz5 with unknown functions. The stalk-forming Betaproteobacteria Ferriphaselus R-1 and OYT-1, as well as dread-forming Zetaproteobacteria Mariprofundus aestuarium CP-5 and Mariprofundus ferrinatatus CP-8 contain distant sfz gene homologs, whereas stalk-less Zetaproteobacteria and Betaproteobacteria lack the entire gene cluster. Our pan-genome analysis further revealed a significant enrichment of clusters of orthologous groups (COGs) across all Zetaproteobacteria isolate genomes that are associated with the regulation of a switch between sessile and motile growth controlled by the intracellular signaling molecule c-di-GMP. Potential interactions between stalk-former unique transcription factor genes, sfz genes, and c-di-GMP point toward a c-di-GMP regulated surface attachment function of stalks during sessile growth.


INTRODUCTION
Twisted stalks are organo-mineral composites that were first recognized in the early nineteenth century for their distinctive ribbon-shaped morphology (Ehrenberg, 1836). They are formed by neutrophilic microaerophilic chemolithoautotrophs that obtain energy from the enzymatic oxidation of Fe(II) to Fe(III) with oxygen (O 2 ) as the terminal electron acceptor (Emerson and Revsbech, 1994;Emerson and Moyer, 1997), and that classify into the Betaproteobacteria (Kato et al., 2013(Kato et al., , 2015 and Zetaproteobacteria . Only four isolates of the Gallionellaceae family within the ecologically and physiologically diverse Betaproteobacteria have been characterized as freshwater Fe(II)-oxidizers to date, with two of them lacking the stalk phenotype Kato et al., 2015). Contrastingly, all previously reported cultured Zetaproteobacteria are characterized as marine Fe(II)-oxidizers , with two isolates reported as being additionally capable of H 2 oxidation (Mori et al., 2017). The full metabolic potential of the Zetaproteobacteria is yet to be uncovered (Field et al., 2015;Brooks and Field, 2020) and so are the evolutionary grounds for the rise of the stalk formation phenotype in two distinct Proteobacteria classes.
Stalk formation has been recognized as a characteristic morphological trait of Fe(II)-oxidizing bacteria since the 1800s (Ehrenberg, 1836). It has only been relatively recently that successful isolation of Fe(II)-oxidizing Zeta-and Betaproteobacteria has enabled detailed morphological and chemical characterization of their twisted stalk products. For instance, time lapse microscopy of the Zetaproteobacterium Mariprofundus ferrooxydans PV-1 revealed single cells excreting multiple nanometer-thin fibers at the cell concavity that elongate at a rate of 2.2 µm h −1 as cells grow and oxidize Fe(II). A contemporaneous rotation of the cell was suggested to cause the helical morphology of stalks (Chan et al., 2011).
A combination of spectroscopy and microscopy approaches applied on stalks of isolate cultures and in environmental Fe-mat samples resolved the stalk nanofibrils to consist of acidic polysaccharides with carboxylic functional groups (Chan et al., 2009(Chan et al., , 2011. The organic matrix is proposed to adsorb Fe(III) originating from biotic Fe(II) oxidation (Chan et al., 2004), resulting in its encrustation with amorphous Fe(III) oxyhydroxide (FeOOH) during initial stalk growth. As the stalks age, amorphous FeOOH becomes metastable ferrihydrite which is overprinted with the FeOOH lepidocrocite (Chan et al., 2011). Based on these observations, stalks have been suggested to serve as biomarkers for primordial microaerophilic Fe(II) oxidation in the geologic record (Krepski et al., 2013;Picard et al., 2015;Chan et al., 2016;Dodd et al., 2017;Little et al., 2021), that may be of astrobiological interest as well.
Despite substantial progress on the chemical composition and nano-scale morphology of stalks, the underlying genetic machinery of this unique bacterial product remains unresolved. Certain Fe(II)-oxidizing Beta-and Zetaproteobacteria, i.e., Sideroxydans lithotrophicus ES-1, Gallionella capsiferriformans ES-2, and Ghiorsea bivora TAG-1 and SV-108, were reported not to form twisted stalks but to produce amorphous Fe(III) oxyhydroxides instead Mori et al., 2017). Other Zetaproteobacteria isolates, i.e., Mariprofundus aestuarium CP-5 and Mariprofundus ferrinatatus CP-8, were demonstrated to excrete shorter and thicker exopolymers encrusted in Fe(III) oxyhydroxides that differ morphologically from stalks, known as "dread"-biominerals due to their unique morphology (Chiu et al., 2017). Collectively, these observations imply that stalks are not essential for the survival of Fe(II)oxidizing bacteria. However, since stalk-biosynthesis must incur some bioenergetic cost, it is logical that they confer physiological and/or ecological advantages over Fe(II)-oxidizers that form amorphous Fe(III) oxides. Proposed functions include anchoring of the cell in environments with favorable gradients of Fe(II) and O 2 , and prevention of cell encrustation in Fe(III) through templating its sorption to an organic matrix (Chan et al., 2011).
Identifying the genes that encode stalk formation is a vital step in deciphering the evolutionary and physiological relevance of this unique bacterial trait. The organic nature of the stalk fibrils together with their secretion from the cell surface suggest an interaction of multiple genes with different functions, including carbohydrate synthesis, secretory pathways, and cellular export. Furthermore, we hypothesize that there may be a reciprocal regulation with Fe(II) oxidation pathways which requires special attention in the search for candidate genes involved in stalk biomineralization. Kato et al. (2015) previously reported a cluster of four genes with putative involvement in stalk formation that are shared between stalk-forming freshwater Fe(II)-oxidizers and the Zetaproteobacteria M. ferrooxydans PV-1, M34, and Mariprofundus sp. EKF-M39. Of these, three genes were identified to be similar to the exopolysaccharide synthesis encoding xagBCD genes of plant-pathogenic Xanthomonas species that attach to their targets with a xanthan-based biofilm (Tao et al., 2010;Zhang et al., 2013). The fourth gene was annotated as a BcsB-like cellulose synthase regulator protein (Kato et al., 2015). Due to the lack of a systematic comparison of all currently available Zetaproteobacteria isolate genomes, we chose to search for genes connected to stalk formation with a pan-genome approach and subsequently compared our results to those of Kato et al. (2015). Here, we describe the size and trajectory of a pan-genome that comprises 16 genomes of cultured Zetaproteobacteria isolates. Our data reveals substantial genetic diversity within the Zetaproteobacteria, and provides insight into gene clusters that are likely involved in stalk generation. These results on stalk formation align with, and significantly extend those from Kato et al. (2015), while also pointing toward a potential signaling cascade that may regulate a switch between sessile and motile growth in Zetaproteobacteria.

Calculation of Average Amino Acid Identities Among Zetaproteobacteria
We identified the degree of genomic similarity between all Zetaproteobacteria and Betaproteobacteria genomes included in our study by calculating their average amino acid identity (AAI) using CompareM 2 with default parameters of 30% sequence identity and 70% sequence alignment length. We computed Bray-Curtis dissimilarity and average linkage hierarchical clustering using the R package vegan (Oksanen et al., 2016), and visualized our results with the R libraries gplots (Warnes et al., 2016), Heatplus (Ploner, 2019), and RColorBrewer (Neuwirth, 2014).

Pan-Genome Computation
We used functionally annotated genome files (see section "Functional Annotation of Zetaproteobacteria Genomes") of 16 Zetaproteobacteria isolates ( Table 1) for computing the size and trajectory of 7 pan-genomes that differed in minimum blastp percentage identities between 30 and 90% (Altschul et al., 1990) using the Roary software (v. 3.11.2; Page et al., 2015). By comparing multiple pan-genomes computed with a range of different blastp cutoffs, we aimed at a subsequent unbiased screening for potential stalk formation genes (see section "Identification of Stalk-Associated Genes"), which involved the consideration that stalk formation genes may or may not be highly homologous. Our strategy also allowed us to search for potential stalk formation gene clusters composed of multiple genes with different conservation levels. Briefly, Roary converted coding regions identified by Prokka (Seemann, 2014) into protein sequences, removed partial sequences, and iteratively pre-clustered coding regions with CD-HIT (Fu et al., 2012). After pairwise sequence alignment with blastp (Altschul et al., 1990), Roary clustered sequences with the Markov cluster algorithm (MCL) (Enright et al., 2002) with an MCL inflation value of 1.5 and subsequently merged pre-clustering results from CD-HIT with those from MCL (Page et al., 2015). We visualized the calculated pangenome size and trajectory with R using the ggplot2 package (Wickham et al., 2016).

Functional Analyses of Pan-Genome Subdivisions and Individual Genomes
We analyzed the pan-genomes in three categories as defined by Koonin and Wolf (2008), i.e. (1) core (genes present in all 16 genomes), (2) accessory (genes present in 2-15 genomes), and (3) strain-specific genes (genes present in only 1 genome). After extracting core, shell, and unique genes from the pan-genome computed at 50% minimal blastp identity using the "query_pan_genome -a * .gff " script of Roary (Page et al., 2015), we analyzed the distribution and enrichment of clusters of orthologous groups (COGs) in each pangenome subdivision and in individual genomes by assigning COG categories to query protein sequences using cdd2cog (Leimbach, 2016).

Identification of Stalk-Associated Genes
Using the "query_pan_genome -a difference" function in Roary (Page et al., 2015), we extracted genes uniquely shared among all 11 stalk-former genomes that at the same time were absent in the genomes of stalk-less isolates from previously computed pan-genomes at blastp cutoffs between 30 and 90% ( Table 1). We manually checked the query coverage of stalk-former unique genes using blastp (>70% amino acid identity). Protein coding sequences that were initially annotated as hypothetical proteins by Prokka were further analyzed for functional information using EggNOG mapper (Huerta-Cepas et al., 2019) and InterProScan (Quevillon et al., 2005). For stalk-former unique genes annotated as hypothetical proteins, we aligned the homologous amino acid sequences from all 11 stalk-forming Zetaproteobacteria using Muscle (Edgar, 2004) and visualized our results with Jalview2 (Waterhouse et al., 2009) to double check the query coverage and to search for highly conserved regions and identical amino acid residues. Finally, we analyzed whether stalk-former unique genes clustered within same gene cassettes according to continuous gene ID's that Roary automatically assigned during pan-genome computation.
Using Gene Graphics (Harrison et al., 2018), we confirmed gene clustering and generated synteny plots of putative stalk formation genes in the genomes they were found in, given that they were available on the National Center for Biotechnology Information (NCBI) database.

SEM Sample Preparation and Imaging of Cell-Stalk-Aggregates
We obtained an active culture of Mariprofundus micogutta ET2 from the Japan Collection of Microbes (Ibaraki, Japan) that we cultivated on zero-valent iron plates with artificial seawater medium at 5% headspace oxygen after Laufer et al. (2016) for 5-8 days prior to analyses with an Olympus BX60 epifluorescence microscope (Olympus, Japan) and a Qicam 1394 (Qimaging, Canada). Similarly, we cultivated Mariprofundus sp. DIS-1 on zero-valent iron plates with artificial seawater medium after Laufer et al. (2016) for 5 days prior to harvesting 1 mL culture for sample preparation.
To ensure good preservation of biological and mineralogical structures, samples were split up for different treatments: either glutaraldehyde fixation followed by stepwise dehydration or airdrying. Biological samples were fixed in 2.5% glutaraldehyde on ice for 3 h. Samples were then washed three times with ultrapure water, pelleted at 2,800 g for 1 min, and mounted on poly-L-lysine (0.1% w/v aqueous solution) coated glass slides. Samples were then dehydrated sequentially with 30, 50, 70, and 95% ethanol for 5 min each, followed by two times 100% ethanol for 30 min each. Afterwards, samples were treated twice with hexamethyldisilazan for 30 s each. After final air drying, samples were stored in a dry chamber at room temperature.
Non-fixed samples for preservation of mineral structures were washed three times with ultrapure water, concentrated by centrifugation at 2,800 g for 1 min, mounted on poly-Llysine glass slides, and air-dried afterward. All sample-slides were applied onto carbon-taped aluminum stubs and sputter-coated for 120 s at 4 × 10 −2 bar to achieve a 12 nm platinum layer using a Bal-Tec SCD005 sputter coater (Baltic praeparation, Germany). Imaging was performed with a JSM-6500F field emission scanning electron microscope (JEOL, Germany) with a Schottky field emitter at 5kV acceleration voltage and a working distance of approximately 10 mm at the center for Light-Matter Interaction, Sensors and Analytics (LISA+, University of Tuebingen, Germany).

Data Availability
The Genbank accession ID's for sfz1-sfz6 in M. ferrooxydans PV-1 are EUA54616-EUA54621, which we updated with the product annotations "stalk formation protein in Zetaproteobacteria 1-6, " respectively. Amino acid sequences of Sfz1 PV −1 -Sfz6 PV −1 and Genbank accession ID's are available in the SI.

RESULTS AND DISCUSSION
The diverse range of geographically distinct Fe(II)-rich environments inhabited by Zetaproteobacteria, including terrestrial, freshwater, coastal, and deep-sea environments ( Table 1), hints toward their adaptive nature. Although the number of isolates is relatively limited across all these environments, there was no obvious differentiation of morphological type to a specific habitat-type (Table 1). We refer to the dread and amorphous oxide phenotypes collectively as "stalk-less" hereafter.

Pan-Genome Reveals High Genetic Diversity Among Zetaproteobacteria
The pan-genome is defined as the entire repertoire of genes accessible to a bacterial clade, subdivided into conserved "core" genes (shared between all n genomes), variable "accessory" genes (shared between two and n-1 genomes), and "strain-specific" genes (present in only one genome) (Tettelin et al., 2005;Koonin and Wolf, 2008;Vernikos et al., 2015). The relative proportions of these three subdivisions characterize a clade's genetic and functional richness and its capacity to acquire exogeneous DNA, and define the trajectory of its pan-genome (Rouli et al., 2015). Roary regression curve models predicted the Zetaproteobacteria pan-genome to be open with infinite growth in total gene numbers and decreasing core gene numbers as new genomes are added (Figure 1). Accessory and strain-specific genes made up the majority of the Zetaproteobacteria pan-genome, whereas the core genome was relatively small with only 0.04-10.3% of the total gene count (90-30% blastp identity, respectively, Supplementary Table 1).
Open pan-genomes generally reflect sympatric lifestyles, i.e., life in mixed communities with high rates of horizontal gene transfer (Rouli et al., 2015;Vernikos et al., 2015), and are associated with niche versatility (Tettelin et al., 2005). Our results are consistent with the previously reported high genomic diversity across Zetaproteobacteria based on metagenomic data (Fullerton et al., 2017;McAllister et al., 2019) and indicate that Zetaproteobacteria may undergo considerable genetic exchange with other community members. We could not identify a correlation between source environment and the proportion of strain-specific and shared genes in isolate genomes (Supplementary Table 2) and so cannot conclude that a specific habitat may favor higher gene acquisition rates in Zetaproteobacteria. The high genomic diversity among currently available Zetaproteobacteria isolates is further reflected in amino acid identities (Supplementary Figure 1) that classify only deep sea members M. ferrooxydans PV-1, JV-1, and M34 (98.1 ± 5.4% to 99.9 ± 1.0%), and coastal estuarine isolates M. erugo P3 and P7 (98.9 ± 5.0%; Supplementary Figure 1) at the species rank (AAI of > 95%; Medlar et al., 2018) as previously reported by Garrison et al. (2019) and McAllister et al. (2019McAllister et al. ( , 2020.

Candidate Gene Cluster for Stalk Synthesis and Export Is Shared With Stalk-Forming Betaproteobacteria
We used comparative genomics as an approach to identify the genetic basis for stalk formation where each Zetaproteobacteria isolate genome was assigned to either the stalk-former, dreadformer, or amorphous-oxide-former trait groups (Table 1). While the assignment was clear for most isolate genomes from the documentation in the literature (Chan et al., 2011;Field et al., 2015;Mumford et al., 2016;Chiu et al., 2017;Mori et al., 2017;Beam et al., 2018) or from microscopic observations in our laboratory (Figures 2A-F), the trait was more ambiguous for strain Mariprofundus micogutta ET2 (JCM 30585 T ), a member of the Zetaproteobacteria operational taxonomy unit 18 (ZOTU 18). Makita et al. (2017) reported strain ET2 to form thin Fe-and C-rich extracellular filaments that differ morphologically from stalks. Using light microscopy, we identified stalk production in an active culture of M. micogutta ET2 (Figures 2G,H), although the majority of cells did not appear associated with obvious stalk structures. Hence, we assigned ET2 to the "stalk-forming" trait group that overall comprised 11 genomes (PV-1, JV-1, M34, ET2, DIS-1, EKF-M39, EBB1, CSS1, SR1, P3, and P7), while we assigned 5 genomes to the "stalk-less" (CP5, CP8, TAG-1, SV-108, and ECHO1) trait group during our analyses.
We extracted all genes from the Zetaproteobacteria pangenome that are shared among stalk-former genomes and lacking in stalk-less Zetaproteobacteria for our analysis. Our search revealed 82 genes unique to stalk-forming Zetaproteobacteria at minimal blastp identities between 30 and 80% (Supplementary Table 3). We excluded genes that were identified above 90% minimal blastp identity from our analysis as they were only shared between 8 or less stalk-formers. Prokka annotated 34 stalk-former unique genes with functions in flagellar basal-body biosynthesis, motility, chemotaxis, and terminal electron transfer with cbb3-type cytochromes (Supplementary Table 3; see sections "Stalk-Former Unique Genes Emphasize a Functional Connection to c-di-GMP Signaling" and "Stalk-Former Unique Cytochromes Specific to Stalk-Forming Zetaproteobacteria" for further details). The other 48 genes were predicted as hypothetical proteins among which we found 6 conserved genes that cluster together with a shared synteny in all stalk-former genomes.
The high level of conservation above 50% amino acid identity ( Table 2) and its lack in stalk-less Zetaproteobacteria emphasize the potential for this gene cluster as a candidate for the biosynthesis and export of the stalk organic matrix, which we therefore refer to as the stalk formation in Zetaproteobacteria (sfz) cluster with genes sfz1-sfz6 hereafter. Protein motif analysis with InterProScan (Quevillon et al., 2005) predicted domains with functions in exopolysaccharide synthesis and export, cell wall synthesis inhibition, and carbohydrate hydrolysis in Sfz1, Sfz2, and Sfz4, respectively, that we describe in detail below (see sections "Sfz1 Is a Putative c-di-GMP-Dependent Regulator of Exopolysaccharide Synthesis" and "Sfz2 Putatively Catalyzes Exopolysaccharide Synthesis and Export"). We could not identify known domains and putative functions for Sfz3, Sfz5, and Sfz6 (see section "Sfz4 and Sfz6 Contain Domains for Carbohydrate Cleavage and Cell Wall Synthesis Inhibition").
The appearance of twisted stalks in Zeta-and Betaproteobacteria indicates that this trait evolved from shared ancestral genes encoding stalk biosynthesis and export. We elucidated the presence and degree of conservation of the sfz cluster in stalk-forming Betaproteobacteria using blastp (Altschul et al., 1990), and found homologs of sfz1-sfz4 with a shared synteny in the stalk-formers Ferriphaselus amnicola OYT-1 and Ferriphaselus sp. R-1 (Figure 3; blastp identities below 40%). We refer to sfz homologs in Betaproteobacteria as sfb1-sfb4 (stalk formation in Betaproteobacteria) hereafter. As with the sfz cluster in Zetaproteobacteria, we found no sfb genes in the stalk-less Fe(II)-oxidizing Betaproteobacteria Gallionella capsiferriformans ES-2 and Sideroxydans lithotrophicus ES-1. Our blastp search further revealed the dread-forming Zetaproteobacteria M. aestuarium CP-5 and M. ferrinatatus FIGURE 1 | Size and trajectory of the Zetaproteobacteria pan-genome with the number of total, strain-specific, and conserved genes as genomes were added up to a total of 16. Data is shown for a pan-genome computed at a minimal blastp identity of 80%.
CP-8 to comprise less conserved homologs of sfz1-sfz6 (blastp identities below 40%, Figure 3), which may point toward two distinct phenotypes among Zetaproteobacteria that are rooted in the same gene cluster. Our findings align with a previous report of the sfb and sfz genes by Kato et al. (2015) as the "xag-like" gene cassette (CDS9-CDS12), that they found to be shared between freshwater stalk-formers and marine M. ferrooxydans PV-1, M34, and Mariprofundus sp. EKF-M39. Based on this limited genomic data, the authors surmised that respective genes might be involved in stalk formation. Our results here significantly expand the number of genomes, especially among the Zetaproteobacteria, and provide additional comparative analysis between stalkforming and stalk-less strains. Thus, our present data support the proposal of Kato et al. and underline the strong potential of this gene cassette as a candidate gene cluster for stalk exopolysaccharide synthesis. Interestingly, Betaproteobacteria lack homologs for sfz5 and sfz6, indicating that only sfz1-sfz4 and sfb1-sfb4 may be essential for the stalk phenotype, whereas the roles for sfz5 and sfz6 in stalk-formation may be more ambiguous.
While other stalk-former unique genes with conservation levels above 80% blastp identity such as HP_80_1 to HP_80_3 (see Supplementary Table 3) generally seem possible candidates for stalk formation, we found no homologs of respective genes in stalk-forming Betaproteobacteria. Moreover, our analysis revealed said genes to cluster within gene cassettes of distinct functions such as flagellar or cytochrome biosynthesis, pointing rather toward related functions instead of stalk formation. In general, exopolysaccharide synthesis strategies can vary significantly among bacteria, and are encoded by different sets of genes that can be more or less conserved in sequence and operon structure across taxa (for a review see Schmid et al., 2015). Thus, it is possible that additional accessory genes play a structural role in stalk-formation in the strains they are found in, but that other strains may use different structural polysaccharides in stalk-formation.

Sfz1 Is a Putative c-di-GMP-Dependent Regulator of Exopolysaccharide Synthesis
Our protein motif analyses predicted the Sfz1 amino acid sequence (60% in stalk-formers) to contain two non-cytoplasmic BcsB-like domains flanked by an N-terminal transmembrane signal peptide domain and a C-terminal transmembrane domain.  We found the N-terminal part of Sfz1 to be less conserved than the C-terminal fraction that contains 76 identical residues in all stalk-formers (Supplementary Figure 2). M. ferrooxydans JV-1 only contains the second part of sfz1, which likely is an artifact as the gene is split over two contigs in its genome. BcsB is the periplasmic membrane-anchored regulatory subunit of the dimeric cellulose synthase enzyme in Acetobacter xylinum (Wong et al., 1990) that regulates the polymerization of uridine 5 -disphosphate glucose (UDPglucose) to cellulose, an exopolysaccharide that builds the structural basis of a variety of biofilms. The reaction is catalyzed by the family II glycosyltransferase BcsA upon activation by the BcsB subunit, which depends on the binding of the positive effector cyclic dimeric guanylate monophosphate (c-di-GMP). High intracellular c-di-GMP levels were shown to stimulate exopolysaccharide synthesis and biofilm formation as part of surface attachment and sessile growth in different bacteria, including Pseudomonas aeruginosa, Escherichia coli, Salmonella enterica serovar Typhimurium, and Shewanella oneidensis MR-1. In contrast, low c-di-GMP levels induce cell detachment and motility (Simm et al., 2004;Merritt et al., 2007;Kuchma et al., 2015).
Generally, c-di-GMP is known to be a central signaling molecule for a variety of signaling cascades, and it particularly controls the switch between motile and sessile growth in bacteria, where biofilm formation is up-or downregulated for surface attachment or detachment (for a review see D' Argenio and Miller, 2004;Valentini and Filloux, 2016). An interaction between Sfz1 and c-di-GMP could point toward a surface holdfast function of stalks as previously proposed by Singer et al. (2011), and our pan-genome data provides additional hints for such a functional relationship (see section "Stalk-Former Unique Genes Emphasize a Functional Connection to c-di-GMP Signaling"). Our detailed analysis of Sfz1 indicates it could encode the conserved c-di-GMP binding motifs RxxD (glutamine-x-x-aspartate; Supplementary Figure 2, residues 295-298 and 830-833; Chou and Galperin, 2016) and RxxxR (glutamine-x-x-x-glutamine; Supplementary Figure 2, residues 227-231 and 326-330, Morgan et al., 2014). Sfb1 similarly contained c-di-GMP binding sites with an additional RxxD motif compared to Sfz1 (Supplementary Figure 2, residues 270-273), whereas it lacked the C-terminal RxxD motif in Sfz1 (residues 830-833) and contained aspartic acid instead of arginine at the RxxxR site at residues 326-330.
Functional exopolysaccharide synthesis would require at least a second gene that encodes a glycosyltransferase function for polysaccharide polymerization similar to BcsA, which we identified to be the case for sfz2 (see section "Putatively Catalyzes Exopolysaccharide Synthesis and Export"). Interestingly, the BcsA and BcsB couple was previously proposed to carry out both the synthesis and translocation of exopolysaccharides, as BcsA can, in addition to its glycosyltransferase activity, also form a channel across the cell membrane that allows for the export of the synthesized polysaccharide chain (Morgan et al., 2013;Omadjela et al., 2013). Whether Sfz1 and Sfz2 can carry out a similar function in stalk-formers remains to be explored.
The N-terminal part of Sfz2 may involve a translocation function, as EpsE is a cytosolic, hexameric ATPase that causes asymmetric conformational rearrangements in the T2SS inner membrane platform through ATP hydrolysis, promoting the extension of secreted exopolymers (Korotkov et al., 2012). Guttenplan et al. (2010) proposed EpsE in B. subtilis to be a bifunctional enzyme where it acts either as a family II glycosyltransferase (Garinot-Schneider et al., 2000), or as a molecular clutch that controls the flagellar rotor and inhibits cellular motility (Blair et al., 2008). Due to its gene architecture Sfz2 may encode a similar function, although a translocation function seems questionable. Functional exopolymer secretion via T2SS was demonstrated to require at least 12 different proteins that are generally encoded on a single operon (Sandkvist, 2001) and no other eps homologs were identified in the sfz cluster or its gene neighborhood (see section "Sfz Gene Neighborhood").
Importantly, eps gene expression was demonstrated to be upregulated by c-di-GMP in Vibrio cholerae (Sloup et al., 2017). As in Sfz1, we found the c-di-GMP binding motifs RxxxR (Supplementary Figure 2, residues 522-526, Morgan et al., 2014) and RxxD (Supplementary Figure 2, residues 97-100 and 325-328) in Sfz2 and two additional RxxD motifs in Sfb2 (residues 282-285 and 297-300), which further emphasizes a possible connection between the sfz and sfb gene clusters and a c-di-GMP regulated switch into sessile or motile growth in stalkformers (see section "Stalk-Former Unique Genes Emphasize a Functional Connection to c-di-GMP Signaling" for further details). Together, the conservation within Sfz2 with these different motifs involved in exopolysaccharide production and environmental response is circumstantial evidence for similar functions in stalk formation; however experimental work will be necessary to confirm these functions.

Sfz4 and Sfz6 Contain Domains for Carbohydrate Cleavage and Cell Wall Synthesis Inhibition
The 323-345 aa long Sfz4 sequence (50% minimal blastp identity) is predicted with a function as a family 10 glycosyl hydrolase. Glycosyl hydrolases cleave glycosidic bonds between carbohydrates, and family 10 enzymes have been specifically reported with exopolysaccharide-degrading activities as for instance in xylanase, endo-1,3-beta-xylanase, and cellobiohydrolase enzymes (Henrissat, 1991;Do et al., 2013). Cell wall degrading glycosyl hydrolases were demonstrated with roles in the assembly of supramolecular membrane-spanning structures such as secretion systems, and phenotypes of cells with non-functional cell wall hydrolases included impaired biofilm formation, surface attachment, or chemotaxis (Nambu et al., 1999;Mercier et al., 2002;Vollmer et al., 2008;Vermassen et al., 2019). A possible role of Sfz4 in stalk formation could be in hydrolyzing the stalk exopolysaccharide for cell detachment as part of a transition from sessile to motile growth, given that stalks function as surface holdfasts.
Sfz6 forms an exception to the other sfz genes in that it is distantly located from the sfz cluster in Zetaproteobacteria sp. SR-1 and EKF-M39. Since stalk-forming Betaproteobacteria lack a homolog for sfz6, it is presumably not essential for stalk formation but rather functions as an accessory gene complementing essential stalk formation genes. The 1,415-2,270 aa long Sfz6 protein sequence (50% blastp identity) contains an N-terminal SH3 (sarcoma homology-3) domain and a C-terminal CotH domain. Among a variety of different cellular functions, SH3 domains were suggested to function as cell wall binding sites during cell wall hydrolysis (Vermassen et al., 2019). In contrast, the function of CotH is presumably determined by additional adjacent protein domains. For instance, cellulosebinding domains may indicate a role of CotH in the formation of the cellulose-degrading multi-enzyme cellulosome complex or generic interactions with the cell wall (Nguyen et al., 2016). Generally, a domain architecture with a combination of SH3 and CotH seems unique and due to the wide variety of functions encoded by SH3 domains, we cannot confer a potential function for sfz6.

Sfz Gene Neighborhood
A cassette of five genes that is directly adjacent to the sfz cluster is conserved across stalk-forming Zetaproteobacteria. Their annotated functions are an ATP-ADP antiporter, a phosphomannose-isomerase, a N-acetylmuramoyl-L-alanine amidase, a MsbA-like ABC-type protein exporter, and a protein of unknown function (Figure 3), respectively. Phosphomannose-isomerases interconvert mannose-6phosphate and fructose-6-phosphate, and were demonstrated to be involved in the synthesis of the bacterial exopolysaccharides xanthane and alginate (Shinabarger et al., 1991;Köplin et al., 1992), whereas N-acetylmuramoyl-L-alanine amidases cleave amide bonds in the net-like peptidoglycan structure and degrade it (Wilmes et al., 2017;Vermassen et al., 2019). None of the five genes are unique to stalk-formers, and Zetaproteobacteria sp. CSS1 and EBB1 contain only the phosphomannoseisomerase and the ATP-ADP antiporter genes in their sfz gene cluster neighborhood. Dread-formers contain four of the five conserved genes in the sfz gene neighborhood as well, whereas stalk-forming Betaproteobacteria lack the entire cassette. Altogether, their functional connection to carbohydrate metabolism and close association with the sfz cluster point toward a potential involvement in stalk formation, possibly as complementary genes.
Generally, our sfz/sfb gene neighborhood analysis revealed these gene cassettes could confer functions in carbohydrate metabolism, chemotaxis, extracellular export, electron transfer, and c-di-GMP related signaling. There is variation in the gene architecture of these cassettes across all stalk-formers (Supplementary Figure 4). It is possible that this diversity could reflect species-specific adjustments to stalk formation and functional roles that are niche-driven; however, the functional assignments are deduced from homologous genes in other bacteria, and the true phenotypic ramifications for different stalkformers will require further elucidation.

Stalk-Former Unique Genes Emphasize a Functional Connection to c-di-GMP Signaling
Zetaproteobacteria are under constant pressure to compete against abiotic Fe(II) oxidation by oxygen Druschel et al., 2008). Therefore, sensing and quickly responding to geochemical changes as well as sustaining stable growth in environments with optimal Fe(II) and O 2 gradients are essential attributes for their survival. Our pan-genome analysis revealed a general enrichment of COG categories associated with c-di-GMP signaling (CheY-like response regulator domains, GGDEF, AAAtype ATPase, HD-GYP, and PAS/PAC domains, Supplementary Table 4) across stalk-forming and stalk-less Zetaproteobacteria. The signaling molecule c-di-GMP upregulates biofilm formation and sessile growth at high intracellular concentrations, whereas declining c-di-GMP concentrations result in increasing surface detachment and motile growth (Figure 4). While these results point toward a pronounced importance of switching between motile and sessile growth in stalk-forming and stalk-less Zetaproteobacteria (Figure 4), our domain analysis of sfz1-6 and additional stalk-former unique genes highlights a particular emphasis of such a switching response in stalk-formers.
For instance, we found two highly conserved transcriptional regulators unique to stalk-formers that both neighbor additional stalk-former unique genes (Supplementary Figure 4) and that favor cellular motility, i.e., atoC and csrA (Supplementary Table 3; NCBI accession EAU54699.1 and EAU54987.1 in M. ferrooxydans PV-1, respectively). Both genes are located in the close neighborhood of c-di-GMP signaling genes with GGDEFand HD-GYP domains that are known to antagonistically regulate intracellular levels of c-di-GMP (Figure 4). GGDEFcontaining diguanylate cyclases catalyze c-di-GMP synthesis and hence an increase in its concentrations (Tal et al., 1998), while HD-GYP-containing phosphodiesterase enzymes catalyze c-di-GMP degradation (Ryan et al., 2009).
CsrA was reported to negatively regulate the expression of GGDEF-containing proteins that upregulate intracellular c-di-GMP levels (Jonas et al., 2008), thereby inhibiting the expression of the exopolysaccharide gene pgaA in Campylobacter jejuni (Huertas-Rosales et al., 2017). CsrA-mutant cells were demonstrated to show poor swarming ability and lowered resistance to oxidative stress (Fields and Thompson, 2008). AtoC is a transcriptional regulator that was demonstrated to induce the expression of the flagellar flhDC and fliAZY operons in E. coli. Cells deficient in atoC were reported to be unresponsive to chemoattractants and showed a general lack of cellular motility. AtoC-dependent regulation of motility and chemotaxis also required an upregulation of CheY (Theodorou et al., 2012), a cytoplasmic chemotaxis response regulator that receives environmental signals and controls cellular swarming trajectories upon phosphorylation. By binding to the flagellar motor protein FliM (Clegg and Koshland, 1984;Bray and Bourret, 1995), phosphorylated CheY induces a clockwise rotation of the flagellum and causes a decrease in torque and swarming speed. In contrast, de-phosphorylated CheY does not bind FliM, which maintains the cellular swarming direction and speed with a counterclockwise flagellar rotation (Figure 4; Scharf et al., 1998;Yuan et al., 2010;Nesper et al., 2017). We found that the stalkformer unique genes comprise the flagellar motor protein gene fliM (NCBI accession EAU54562.1 for M. ferrooxydans PV-1) within a gene cluster containing other stalk-former unique genes annotated with functions in flagellar motility (Supplementary Table 3 and Supplementary Figure 4), which may be related to a specialized switching response of the flagellum in stalkformers, and could indicate a functional connection between stalk formation and chemotaxis regulation.
Its gene expression was shown to be induced by high intracellular levels of c-di-GMP, although the effect of higher zraR transcription has not been resolved yet (Méndez-Ortiz et al., 2006).

Stalk-Former Unique Cytochromes Specific to Stalk-Forming Zetaproteobacteria
In addition to genes with potential functional links to c-di-GMP regulation, we found six stalk-former unique genes with annotations as cbb3-type cytochrome subunits and as fixGH genes (Supplementary Table 3; NCBI accession EAU53376-EAU53382 in M. ferrooxydans PV-1, respectively). The latter are putatively required for the assembly of cbb3-type cytochromes (Preisig et al., 1996) and cluster together in stalk-formers (Supplementary Figure 4).
Cbb3-type cytochromes are terminal oxidases for aerobic respiration with a high affinity for oxygen (Pitcher and Watmough, 2004) that were reported of being primarily expressed under O 2 -limited conditions (Mandon et al., 1994). They were found among the most highly expressed proteins in the proteome of M. ferrooxydans PV-1 ( Barco et al., 2015), which is reasonable with respect to the microaerophilic metabolism of Zetaproteobacteria (Field et al., 2015). All Zetaproteobacteria contain cbb3-type cytochrome c genes in their genomes, and the stalk-former unique subunits may be an adaptation required for sensing optimal oxygen levels as part of microaerobic growth during stalk formation.
Generally, Per-Arnt-Sim motifs (PAS) and PAS-associated C-terminal motifs (PAC) were prevalent domains among the strain-specific genes of stalk-forming and stalk-less Zetaproteobacteria (Supplementary Table 4). PAS/PAC domains are known to respond to environmental signals such as oxygen concentrations, light, and redox potential (Taylor and Zhulin, 1999). The PAC motif was demonstrated to control flagellar rotation during aerotaxis of E. coli upon sensing an emitted signal by the PAS domain depending on the surrounding redox conditions (Rebbapragada et al., 1997) with an involvement of CheY (Bibikov et al., 2000). Given their abundance, PAS/PAC domains may determine the swarming trajectories of Zetaproteobacteria during aero-and/or chemotaxis by a similar mechanism (Figure 4).

CONCLUSION
Our pan-genome analysis revealed a high genomic diversity among currently available Zetaproteobacteria isolates, and a highly conserved gene cluster in stalk-forming Zetaproteobacteria that is shared at lower conservation levels with stalk-forming Betaproteobacteria and dread-forming Zetaproteobacteria. The sfz and sfb clusters are absent in stalkless Zetaproteobacteria and Betaproteobacteria, and as previously postulated by Kato et al. (2015), they are likely candidates for encoding the twisted stalk trait. The presence of conserved protein domains for the binding of the signaling messenger c-di-GMP in sfz and sfb genes together with stalk-former unique genes that putatively encode transcription factors regulating a c-di-GMP dependent signaling cascade for a switch between sessile and motile growth point toward a functional connection between stalk formation and surface attachment. Singer et al. (2011) previously proposed a model according to which stalks serve as holdfasts to anchor cells to surfaces during sessile growth phases. In such a scenario, stalks could provide an advantage over amorphous biofilms as cells can grow within gradients while remaining surface-attached (Chan et al., 2011(Chan et al., , 2016. By comparison, stalk-less cells forming amorphous biofilms would have to frequently detach to swarm toward more favorable conditions. Generally, the fact that stalk-formers bear highly conserved unique genes that putatively control cellular motility and surface attachment is supportive of a holdfast function of stalks, though the actual involvement of respective genes in such processes and the extent to which they are linked to stalk formation requires detailed investigation. Our observations reported here require further investigation with lab-based experiments involving gene deletion/mutagenesis, differential transcriptomics and gene expression analyses to uncover whether the sfz and sfb gene clusters indeed encode stalk formation in Zetaproteobacteria and Betaproteobacteria, and to identify whether stalks are functionally linked to surface attachment. In particular, the potential for stalk-less Zetaproteobacteria to harbor alternative stalk formation genes that may be nonoperative requires investigation with laboratory experiments.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/ Supplementary Material.

AUTHOR CONTRIBUTIONS
EK, DE, OB, and CC contributed conception and design of the study. EK generated the data, performed data analyses, and wrote the first draft of the manuscript. TB conducted SEM sample analyses and supported EK in SEM sample preparation. OB supported EK in workflow design, script writing, and computational data analyses. All authors contributed to manuscript revision, and read and approved the submitted version.