Comparative Analysis of the IclR-Family of Bacterial Transcription Factors and Their DNA-Binding Motifs: Structure, Positioning, Co-Evolution, Regulon Content

The IclR-family is a large group of transcription factors (TFs) regulating various biological processes in diverse bacteria. Using comparative genomics techniques, we have identified binding motifs of IclR-family TFs, reconstructed regulons and analyzed their content, finding co-occurrences between the regulated COGs (clusters of orthologous genes), useful for future functional characterizations of TFs and their regulated genes. We describe two main types of IclR-family motifs, similar in sequence but different in the arrangement of the half-sites (boxes), with GKTYCRYW3–4RYGRAMC and TGRAACAN1–2TGTTYCA consensuses, and also predict that TFs in 32 orthologous groups have binding sites comprised of three boxes with alternating direction, which implies two possible alternative modes of dimerization of TFs. We identified trends in site positioning relative to the translational gene start, and show that TFs in 94 orthologous groups bind tandem sites with 18–22 nucleotides between their centers. We predict protein–DNA contacts via the correlation analysis of nucleotides in binding sites and amino acids of the DNA-binding domain of TFs, and show that the majority of interacting positions and predicted contacts are similar for both types of motifs and conform well both to available experimental data and to general protein–DNA interaction trends.


INTRODUCTION
Interactions between DNA and proteins are crucial for many important biological processes, including replication, reparation, and the main mechanism of transcription regulation, binding of transcription factors (TFs) to specific DNA sequences (Ofran et al., 2007). Genes encoding TFs comprise a large fraction of bacterial genomes (up to 10%) (Pérez-Rueda and Collado-Vides, 2000;Rodionov, 2007), and their structure and DNA-binding specificity are often unknown (Ofran et al., 2007). Therefore, uncovering the mechanisms of protein-DNA interaction is an important problem of molecular and computational biology.
Empirical rules of the protein-DNA recognition reflect chemical and physical properties of amino acid residues and base pairs, such as their partial charge interactions, amino acid side chain flexibility, etc. (Lustig and Jernigan, 1995). Specific interactions with DNA are mainly formed by amino-acid side-chain atoms (Morozov and Siggia, 2007), and important and favorable contacts are usually hydrogen bonds (due to their high specificity and directional character) and acid-base interactions, since there are relatively few non-polar atoms in the DNA grooves (Seeman et al., 1976;Lustig and Jernigan, 1995;Marabotti et al., 2008), and regions of protein-DNA contacts are rich in polar residues forming electrostatic and hydrogen bonds (Gromiha and Fukui, 2011). However, other types of contacts, e.g., hydrophobic interactions, may also be important (Mirny and Gelfand, 2002), and these interaction trends are not universal, mainly since protein-DNA contacts may depend on the structural context and, in particular, on the structural family of DNA-binding proteins (Morozov et al., 2005;Rohs et al., 2010). For example, contacts between the protein and the DNA sugar-phosphate backbone presumably play a minor role in determining the specificity, but may have an influence on the positioning and orientation of TF recognition elements, thus providing a structural framework for the proper interaction (Luscombe and Thornton, 2002;Rohs et al., 2010).
Conservation of base pairs in a motif is significantly correlated with the number of contacts they make with the TF (Mirny and Gelfand, 2002;Morozov and Siggia, 2007). Base pairs forming more contacts tend to be more conserved in evolution, because these amino acid-base pair interactions may stabilize the protein-DNA complex, which makes changes in these positions detrimental (Mirny and Gelfand, 2002). Calculation of the mutual information may be used for prediction of amino acidbase contacts for particular TF families, allowing one to make structural predictions given only the sequences; such predictions can be further verified experimentally (Mirny and Gelfand, 2002;Mahony et al., 2007;Desai et al., 2009;Huang et al., 2009;Camas et al., 2010;Ravcheev et al., 2012).

Structure of Binding Motifs
Different types of DNA-binding domains recognize distinct motifs, while DNA-binding proteins from the same family generally tend to recognize sites similar in length, symmetry, and specificity (Morozov and Siggia, 2007;Badis et al., 2009;Ravcheev et al., 2014;Korostelev et al., 2016;Suvorova and Gelfand, 2019). Within each family of TFs, the structure and fold of the DNA-binding domain and its mode of interaction with the binding motif are usually conserved, resulting in a certain common pattern of protein-DNA contacts (Morozov and Siggia, 2007). However, even proteins with very high (up to 60-70%) amino acid sequence identity might recognize different DNA motifs (Badis et al., 2009;Kazakov et al., 2013;Ravcheev et al., 2014).

Goals
We use the comparative genomics approach to identify binding motifs and reconstruct regulons (i.e., all genes/operons regulated by a TF in a given genome) and regulogs (combined regulons of a group of orthologous TFs in different genomes) for TFs from the IclR-family. Using these data, we attempt to further characterize functional roles of IclR-family TFs, reveal tendencies in their binding site structure and localization, and predict the most favorable protein-DNA contacts.

Main Tools and Resources
Genomes were obtained from GenBank (Benson et al., 1999). Homologs of TFs were identified by PSI-BLAST (E-value cutoff, 10 −20 ) (Altschul et al., 1997), and orthologs were identified by construction of phylogenetic trees for identified homologs and analysis of their genomic context (e.g., co-localization with genes of a certain metabolic pathway in most genomes). The genomic context was analyzed using MicrobesOnline (Dehal et al., 2010). Amino acid and nucleotide sequence alignments were performed using the MAFFT service (default parameters) (Katoh et al., 2019). Phylogenetic trees were built using PhyML (default parameters) (Dereeper et al., 2008) and visualized with Dendroscope (Huson et al., 2007).
Molmil was used for the visualization of PDB data (Bekker et al., 2016).

Phylogenetic Footprinting
Candidate binding sites were identified (or confirmed if they have been previously predicted) by phylogenetic footprinting (Rodionov, 2007). We manually analyzed alignments of upstream regions of orthologous genes presumably belonging to the respective regulon (genes encoding TFs, as they are often auto-regulated, and genes co-localized with them) (Gelfand et al., 2000;Tan et al., 2005;Martínez-Antonio et al., 2008) and identified consecutive conserved nucleotides, relying on the assumption that binding sites are more conserved than surrounding intergenic regions. These conserved regions, i.e., predicted binding sites, were then used as training sets for construction of nucleotide position weight matrices (PWMs) for each TF by the SignalX program as previously described (Gelfand et al., 2000). PWMs were then used for exhaustive scan of genomes possessing the corresponding TFs in search for additional candidate binding sites (and regulon members) as described further.
All identified binding sites are given in Supplementary Table 1, spreadsheets "group 1" and "group 2."

Reconstruction and Analysis of Regulons and Regulogs
Computational search for candidate binding sites in upstream gene regions [400 nucleotides (nt) upstream and 50 nt downstream relative to the annotated translational gene start] was performed using the built PWMs and the GenomeExplorer program package . Score thresholds for the identification of sites were selected so that candidate sites upstream of functionally relevant genes were accepted, while the fraction of genes preceded by candidate sites did not exceed 5% in studied genomes. Weaker sites (with scores 10% less than the threshold) were also accepted if their positions were similar to positions of stronger sites upstream of orthologous genes and there were no stronger competing sites in the intergenic region. New candidate members were assigned to a regulon if they were preceded by candidate binding sites in several genomes, the exact number of genomes depending on the number of sequenced genomes in a taxonomy unit. The reconstructed regulons were extended to include all genes in putative operons, the latter defined as the strings of genes transcribed in the same direction, with intergenic distances not exceeding 200 nt, when such organization persisted in several genomes. All genes comprising putative regulated operons were included into functional analysis of regulons.
To analyze positioning of sites relative to gene start, coordinates of site centers were calculated to account for differences in the motif lengths. In case of even-length sites, coordinates of site centers were rounded to the smaller whole number. The relevant data are given in Supplementary Table 2.
Content of the studied regulogs (combined regulons for each orthologous group of TFs) was analyzed using the BiBit algorithm for biclustering 1 of data reflecting regulatory interactions, in order to reveal frequently co-regulated genes and to identify orthologous groups of TFs most similar in the regulog composition. Only genes with unambiguously assigned COG (clusters of orthologous genes, Galperin et al., 2019) or PFAM IDs were considered in this analysis. The data are given in Supplementary Table 3.

Correlation Analysis
We restricted our study to IclR-family TFs (COG1414) from completely sequenced genomes present in the MicrobesOnline database (Supplementary Table 1, spreadsheet "list of genomes"). Only TFs predicted to have palindromic binding motifs satisfying either of two identified IclR-family consensuses were selected for the correlation analysis. Correlations were calculated between amino acid residues of DNA-binding HTH domains and nucleotides in binding sites, regions with gaps were cut out of the amino acid alignments (Supplementary Table 1, spreadsheets "HTH alignment group 1, " "HTH alignment group 2"), positions of amino acids were subsequently re-numbered starting from the beginning of the HTH domain, counting from zero.
Structural data of TtgV from P. putida in complex with DNA was taken as a reference model (Lu et al., 2010), supported by data on DNA-binding of wild-type and mutant TtgV and PobR (Kok et al., 1998;Fillet et al., 2009;Molina-Santiago et al., 2014), as well as structural data and DNA-binding modeling of TM0065 from T. maritima (Zhang et al., 2002).
Correlations were calculated using the Prot-DNA-Korr program package (Korostelev et al., 2016). The program calculates the correlation between each pair of columns, one from the amino acid alignment of the HTH domains, the other from the nucleotide alignment of the sites. Even-length and oddlength sites were aligned using central gap insertions, differences in sites length were compensated by introducing gaps on flanks. Datasets used in this work are given in Supplementary Table 1, spreadsheets "group 1" and "group 2." The mutual information was used as a measure of correlation. The statistical significance value of the mutual information was calculated as the Z-score. Correlated pairs of positions were displayed as a heatmap, with the color denoting statistical significance (significant correlations colored in the red-yellow palette), and as contingency tables (Supplementary Table 4) containing expected and observed counts of amino acid-nucleotide pairs, as well as χ 2 scores (scores higher than 30 were considered significant, scores higher than 50 were considered as strong preferences or avoidances, depending on the corresponding expected and observed values). For additional details about Prot-DNA-Korr see http://bioinf.fbb. msu.ru/Prot-DNA-Korr/main.html.

Binding Sites, Structure and General Statistics
Four thousand eight hundred and nine candidate binding sites have been predicted for 1340 IclR-family TFs constituting 181 orthologous groups in 320 bacterial genomes (Supplementary Table 1). Binding sites were identified via phylogenetic footprinting and further scanning of genomes with PWMs built based on footprinting results, as described in Materials and Methods. For verification of our results we used previously published experimental and comparative data (summarized in Table 1), as well as independently obtained data on candidate binding sites of many IclR-family TFs, available in the RegPrecise database 2 , and observe a complete agreement with it. Still, many TFs studied in this work are novel.
Moreover, 32 orthologous groups comprising 199 TFs have been predicted to bind three-box binding sites, with one pair of boxes corresponding to the first variant of the IclR-family motif consensus, and the other pair, to the second consensus variant (Figure 1). This feature may indicate a possibility of alternative dimerization modes of some IclR-family TFs, and agrees well with previous data on PcaU from the IclR-family, which has a threebox binding motif where the additional third box is also required for the binding of TF (Popp et al., 2002). As sites matching either the first or the second type of the consensus (group 1 and group 2) were analyzed separately, TFs of these 32 orthologous groups (cells marked orange in Supplementary Tables 1,3) were considered in both groups 1 and 2, excluding either the first or the third box.
There is no apparent correlation between the motif structure and phylogeny, all types of motifs (groups and subgroups) are scattered along the phylogenetic tree (data not shown).
IclR-family TFs are present predominantly in Proteobacteria and Actinobacteria, and we have observed some differences in the taxonomic distribution among the motif groups and subgroups ( Table 2). For example, in Firmicutes GTT-11-AAC type motifs are overrepresented and the TGT-11-ACA type absent, Proteobacteria have weaker representation of GTT-11-AAC motifs, compared to other subgroups, and only group 2 type motifs are identified in Thermotogales.

Binding Sites Positioning
We have analyzed not only structure, but also localization of identified sites relative to translational gene start in order to see whether there are any apparent tendencies in site positioning.
Sites centered at -400 to -300 nt and +40 to +50 regions are very rare. The most frequent position of site centers is -22 nt, a prominent peak is also observed at -3 to +1 nt. The majority of site centers are localized from -20 to -80 nt upstream of the gene start, gradually decreasing up to -300 nt. Similar trends in site localization were previously observed for other TF families, e.g., LacI (Ravcheev et al., 2014). Moreover, in this 60-nt zone we observe prominent oscillations in positioning of sites, with the distance between both pronounced peaks and minima approximately equal to one DNA turn (Figure 3 and Supplementary Table 2).
Notably, many studied TFs (from 94 orthologous groups, marked with italics in Supplementary Table 3) in both group 1 and group 2 are predicted to bind tandem palindromic sites with inter-site distance (between the site centers of symmetry) of 18-22 nt (mainly 19-21 nt, that is approximately two DNA turns). This observation agrees with the known fact that IclRfamily TFs can bind DNA as tetramers (Molina-Henares et al., 2006;Lu et al., 2010), with dimers facing the same side of DNA. The inter-site distances of approximately three and four DNA turns (possibly also allowing for the cooperative binding) are also overrepresented, but this trend is less pronounced (Figure 4 and Supplementary Table 2).

Regulog Content
Identification of candidate binding sites allows for reconstructing regulons and regulogs of corresponding TFs. We analyze content of IclR-family regulogs, speculating about functional characterizations of TFs and their regulated genes.
Regulogs of the studied IclR-family TFs vary in size, from 1 to 55 regulated COGs in a regulog. Most regulogs are rather small, more than half of them (98 out of 181) comprise ten or less regulated COGs.
The regulog content also differs widely between orthologous groups: out of 631 identified COGs, just 300 were present  in two or more regulogs, and only 48 COGs, in ten or more regulogs (Supplementary Table 3, spreadsheet "table COGs summary"). Out of these 48 "top" COGs, many are potentially involved in the metabolism and transport of aromatic compounds and sugars or sugar acids (Table 3). We observe difference in the distribution of these COGs in motif groups and subgroups. COGs involved in metabolism and transport of sugars and sugar acids are underrepresented in regulons of TGT-11-ACA and WTT-11-AAW motif types compared to other subgroups, while COGs involved in the metabolism and transport of aromatic compounds are overrepresented in regulons of TGT-11-ACA type. On the contrary, COGs involved in the metabolism and transport of sugars and sugar acids are frequently present in regulons of GTT-11-AAC and NGT-12-ACN motif types, while many COGs involved in aromatic metabolism are underrepresented in these subgroups. Moreover, COG1028 and COG1012 encoding dehydrogenases are especially overrepresented in NGT-12-ACN motif subgroup (Table 3).
We also attempted to study the COGs co-occurrence patterns in IclR-family regulogs to reveal possible functional connections between them, especially important for poorly characterized COGs, and it might help in understanding metabolic functions of IclR-family TFs.
As the composition of IclR-regulogs widely varies, we do not see co-occurences of COGs throughout the majority of regulogs, but observe a number of cases, where COGs are present in a small fraction of regulogs, but if present, are always or almost always found together. The most frequently associated pair is COG1788 (AtoD) and COG2057 (AtoA), each one of them is present only in 20 regulogs out of 181, but they co-occur in all of these twenty (for other examples, see Supplementary Table 3, spreadsheet "most frequent association").
In addition to obvious co-occurrence of transporter subunits (COGs LivKHMGH, DctPQM, DdpA-DppBCD, UgpABE, HisJM-GlnQ, TauABC, AraH-MglA-RbsB etc.) and enzyme subunits (e.g., AtoDA), we identified other frequently coregulated COGs in IclR-family regulogs (Supplementary Table 3). A large group of COGs, known or presumed to be involved in the metabolism of aromatic compounds, co-occurs in many regulogs in various combinations, with three subsets of the most frequently associated COGs (Supplementary Table 3, spreadsheet "sets of COGs") being: (i) COGs involved in the degradation of benzoate, catechol, muconate and their derivatives via the ortho-cleavage pathway, channeling them into the TCA cycle through β-ketoadipate, and    COGs that likely play a role in transport of aromatic compounds (Li et al., 2010;Suvorova and Gelfand, 2019); (ii) COGs forming the meta-cleavage degradation pathway of benzoate, catechol and their derivatives, COGs that may be involved in the degradation of aromatic compounds through CoA thioesters and forming the downstream part of the βketoadipate pathway, and COGs that may be involved in transport of aromatic compounds (Arai et al., 2000;Zaar et al., 2001;Gescher et al., 2002;Suvorova and Gelfand, 2019); (iii) COGs likely involved in the quinate/shikimate and 4-hydroxyphenylpyruvate metabolism, and transport of aromatic compounds.
One more set is comprised of COGs involved in the metabolism and transport of sugars and sugar acids and/or aromatic compounds (Hobbs et al., 2012(Hobbs et al., , 2013Maruyama et al., 2015;Suvorova and Gelfand, 2019;Watanabe et al., 2020), and includes two most frequently associated subsets (Supplementary Table 3, spreadsheet "sets of COGs").
Thus, IclR-family TFs indeed regulate mainly metabolism and transport of various aromatic compounds. Analysis of frequently associated COGs may be useful not only for functional characterization of unknown TFs, but also COGs with unknown or insufficiently studied functions; examples of such frequently associated COGs identified in this study are COG3254 and COG3618, as well as functional association of COG1545, COG2030, COG3181, and COG3333 with genes of metabolism and transport of aromatic compounds (Supplementary Table 3, spreadsheet "sets of COGs").

Protein-DNA Correlations
One of the goals of this work was to find correlations between nucleotides of identified binding sites and amino acid residues of DNA-binding HTH domains of the IclR-family TFs to predict potential protein-DNA contacts. Correlations, calculated based on mutual information, were found using the Prot-DNA-Korr program package as described in section "Materials and Methods."

Group 1
The correlation analysis for group 1 (Figure 5  preference for Pro, weakly prefer Lys, and strongly avoid Gln at position 1; weakly prefer Ala and Pro at position 2; strongly prefer Ser and, more weakly, Glu at position 30; and strongly prefer Ala, Asn, and weakly prefer Asp, Glu, Gly, His, Leu, and strongly avoid Arg at position 33 of the HTH domain. Amino acid residues at positions 27 and 30 of the HTH domain are correlated with nucleotides at positions 10/22 of the binding motif. Strong preference here is seen for His27 with the G/C pair and for Lys30 with A/T. Amino acid residues at position 28 correlate with nucleotides at positions 7/25 and 8/24. Lys28 strongly prefers G/C and avoids A/T at positions 7/25, and Leu28 is weakly correlated with C(7). Arg28 strongly prefers G/C and avoids A/T at positions 8/24. Amino acids at position 32 correlate with nucleotides at positions 8/24, 9/23, and 10. Similar to Arg28, Arg32 strongly prefers G/C and avoids A/T at positions 8/24 and, weaker, prefers G(10). Gln32 shows strong preference toward A/T (9/23), while His32 weakly prefers C(10).
Correlations of Asp28, Glu28, Gln32 with gaps at positions 8/24 are caused by flanking gaps inserted due to differences in sites lengths and are not considered further.
We also performed correlation analysis separately for each of four subgroups of group 1, to identify their contribution to the results of the whole group, and also for variants (i)-(iii) combined, to assess the differences between the odd-length and even-length motifs (results given in Supplementary Table 4 and Supplementary Figures 1-5).
The data on all groups and subgroups (see below) are summarized in Supplementary Table 4, spreadsheet "table of correlations, summary."

Group 2
Correlation analysis for group 2 of motifs (Figure 6 and Supplementary Table 4, spreadsheet "group 2") reveals multiple correlations of nucleotides at positions 13/25 with amino acids in positions 29, 30, 33, and 35. Here, Glu29 and Lys33 strongly prefer A/T, Leu35 and, more weakly, Val29 and His30 prefer G/C. Amino acid residues at position 31 are correlated with nucleotides at positions 16/22, although without significant preference for specific protein-DNA contacts.

Main Predicted Protein-DNA Contacts
The analyzed binding motifs have a symmetrical palindromic structure, hence, the obtained heat maps are also mainly symmetrical. If not, in most cases the correlation with the symmetrical base pair is only slightly lower than the significance threshold, although the same trend is still observed; however, in other cases there might indeed be asymmetry in the dimer/tetramer structure of a TF, as shown, e.g., for TtgV (Lu et al., 2010). Due to the symmetry, the observed correlations are by default shown for either a G/C or an A/T pair. Further differentiation between the contribution of G and C, or A and T, is not always possible, and may require additional information, e.g., donor-acceptor properties of the interacting amino acids FIGURE 5 | Heat map of correlations between amino acids and nucleotides for group 1 TFs and their binding sites. Sequence logos of HTH DNA-binding domains and their binding sites are shown on the top and to the left of the heat map, respectively. The total height of symbols at each position reflects the positional information content, whereas the height of individual symbols is proportional to the positional amino acid or nucleotide frequencies. The correlation scores are color-coded from yellow to red for amino acid-nucleotide pairs with statistical significance of correlation exceeding an automatically defined threshold (with red assigned for the most correlated pair). The violet-black palette is used for other pairs. Yellow lines denote positions where gaps have been removed from the amino acid alignment.
Taking into account only strong significant correlations, i.e., excluding weak ones, as well as amino acid positions associated with gaps (Supplementary Table 4, spreadsheet "table of correlations, summary"), chemical properties of amino acids residues and nucleotide bases suggest the following main predicted protein-DNA contacts for group1 and its subgroups: • Thr5 and Glu26 with A/T (10/22) for the WTT-11-AAW subgroup; • His27 with G/C (10/22) for group 1 with the contribution of all odd-length motif subgroups, mainly the GTT-11-AAC subgroup, the likely contact is His-G; • Lys28 with G/C (7/25) for group 1 with the contribution of all odd-length motif subgroups, mainly the TGT-11-ACA subgroup, the likely contact is Lys-G; • Arg28 and Arg32 with G/C (8/24) for group 1, the latter with the contribution of all odd-length motif subgroups, the likely contacts are Arg-G; • Arg28 with G/C (9/23) and Arg32 with G9 for the GTT-11-AAC subgroup, the likely contacts are Arg-G; • Ile28 with G9 for the GTT-11-AAC subgroup; • Lys30 with A/T (10/22) for group 1 with the main contribution of the WTT-11-AAW subgroup; • Gln32 with A/T (9/23) for group 1 with the contribution of all odd-length motif subgroups, mainly GTT-11-AAC and WTT-11-AAW, the likely contact is Gln-A; • Gly32 with G(10) for the NGT-12-ACN subgroup; • Gly33 and Glu33 with G/C (11/21), and Ala33 with T21 for all odd-length motif subgroups, with the main contribution of GTT-11-AAC subgroup, the likely contact is Glu-C; • Pro33 with G/C (11/21) for all odd-length motif subgroups, with the main contribution of WTT-11-AAW subgroup.

Comparison of the Group 1 and Group 2 Motifs
Previous comparison of binding motifs for some individual TFs revealed no common consensus for the IclR-family and no distinct similarity (Molina-Henares et al., 2006), especially between motifs that fall in group 1 and group 2 by our classification. Here, using a large collection of identified binding sites, we not only specify two main types of IclR-family motifs, but also find that they are similar in sequence but differ in the arrangement of half-sites of the palindrome. Joint LOGO diagrams of aligned binding sites from each group clearly show that the left half of the group 1 binding motifs corresponds to the right half of the group 2 binding motifs, and vice versa (Figure 7). It implies that there could be two different modes of dimerization of the IclR-family TFs. Moreover, since we have identified IclR-family binding motifs comprised of three boxes with alternating direction, where one pair of boxes corresponds to the group 1 consensus, and the other pair of boxes, to the group 2 consensus, we may assume that some IclR-family TFs are capable of alternative dimerization, possibly providing for more variable and precise regulation of transcription. It can be experimentally validated, for example, via DNA footprinting to precisely identify protected and sensitive regions upstream of the regulated genes and SPR to examine protein-DNA interaction. This observation is supported by previous studies of IclRfamily TFs PcaU, PobR, and HmgR, for which three-box binding motifs have been identified (Kok et al., 1998;Popp et al., 2002;Arias-Barrau et al., 2004;Molina-Henares et al., 2006;Jerg and Gerischer, 2008).
Despite differing organization of the boxes, their sequence similarity allows us to indirectly compare the predicted protein-DNA contacts for group 1 and group 2. Nucleotides 13/25 and 16/22 of the group 2 binding motifs, involved in the protein-DNA interaction according to the correlation analysis, correspond to nucleotides 10/22 and 7/25 of the group 1 motifs, respectively (Figure 7). Due to the differently removed gaps in the alignment, amino acid positions 29, 30, 31, 33, 35 of the aligned HTH domains of the group 2, which interact with DNA according to the correlation data, correspond to amino acid positions 26, 27, 28, 30, 32 of the group 1 HTH domains, respectively. Taking all this into account, we see quite congruent predictions for protein-DNA contacts in both groups (Table 4). Thus, despite the different structure of binding motifs, the interaction of TFs with DNA likely has similar features throughout the whole IclRfamily.

Protein-DNA Contacts
The main goal of this study was to predict protein-DNA contacts for IclR-family TFs via correlation analysis. In order to validate these predictions, we compared the observed correlations with known data on protein-DNA interactions of TFs from the IclRfamily (summarized in Table 5).
One 3D structure of the IclR-family TF in complex with DNA is currently available, namely, TtgV from P. putida (PDB:2XRO), a regulator of the RND-family efflux transporters (Lu et al., 2010). According to the crystal protein-DNA structure, residues Ser48, Thr49, Gln51, and Arg52 of the recognition helix of the HTH domain directly and specifically interact with the DNA major groove (Lu et al., 2010). These observations are supported by results of mutational analysis and data on other IclR-family TFs.
For example, experiments on wild-type and mutant TtgV show that residues from Arg47 to Ile54, Leu57, Glu60, and Phe61 are involved in binding DNA (Fillet et al., 2009;Molina-Santiago et al., 2014). Similarly, in mutant PobR from Acinetobacter baylyi  Arg-G, His-C,Gly-G/C b , Tyr-G/C b 35 13/25 Leu-G/C a Only in subgroup (iii). b Only in subgroup (iv). Bold font denotes strong preferences. Differences in numbering between group 1 and group 2 are due to removal of gaps from the alignment (see Supplementary  Table 1, spreadsheets "HTH alignment group 1," "HTH alignment group 2").
TtgV mutants at Arg19, Ser35, and Gly44 fail to bind DNA; moreover, residues equivalent to Pro46 are highly conserved on the multiple alignment of IclR-family TFs, and thus also are likely important for DNA binding (Lu et al., 2010;Molina-Santiago et al., 2014). Residues Arg19 and Ser35 lie across the minor grooves and interact with the DNA phosphate backbone, which is possible due to the strong bending of the operator sequence bound to TtgV, and residues Gly44 and Pro46 within the turn of the HTH domain are involved in this distortion, hence playing a Differences in numbering between group 1 and group 2 are due to removal of gaps from the alignment (see Supplementary Table 1, spreadsheets "HTH alignment group 1," "HTH alignment group 2").
role in the DNA binding (Lu et al., 2010;Molina-Santiago et al., 2014). Glycine residues may not interact directly with DNA, but provide flexibility to bind DNA targets with different half-site spacing (Zhang et al., 2002;Molina-Henares et al., 2006). The correlation analysis shows that most of the predicted contacts with DNA in all studied groups and subgroups are formed by the α3-helix of the HTH domain and, less frequently, the N-terminal part of the α1-helix, and the majority of amino acid positions significantly correlated with site positions and likely responsible for the binding specificity (nine out of 12 positions for group 1 and/or its subgroups, and four out of five positions, for group 2) correspond well to those previously identified for TtgV and PobR, and predicted for TM0065 (Figure 8, Table 5) are found, colored with red (even numbers) and brown (odd numbers).
"table of correlations, summary") (Kok et al., 1998;Zhang et al., 2002;Fillet et al., 2009;Lu et al., 2010;Molina-Santiago et al., 2014). Our results also agree with previous studies, where it has been demonstrated that residues of the recognition helix of the HTH domain form most of the protein-DNA contacts predicted via the correlation analysis (Korostelev et al., 2016;Suvorova and Gelfand, 2019).
It has been previously claimed that Arg, Asn, Lys, Gln, Thr, Ser, Asp, and Gly account for more than 70% of protein-DNA contacts, with Lys and Arg frequently dominating in the interactions (Molina-Henares et al., 2006), and Arg alone accounts for 23% of contacts (Marabotti et al., 2008). This trend has been observed in this study as well. The majority of predicted interactions involved these amino acids: one out of one of strong correlations in subgroups (i) and (iv), four out of eight in subgroup (ii), three out of five in subgroup (iii), five out of six if the entire group 1 is considered, and one out of three in group 2. Arg and Lys are among the most frequent ones: combined, they account for four out of six strong correlations in group 1, and one out of three in group 2 (Supplementary Table 4, spreadsheet "table of correlations, summary").
Arg-G, Asn-A, Asp-C, Gln-A, Glu-C, Lys-G, and, to a lesser extent, His-G and Ser-G, appear to be the most relevant, strongest and highly specific contacts (Lustig and Jernigan, 1995;Marabotti et al., 2008). Preferences are also known for Ala-C, Cys-G, Gly-G, Leu-A, Thr-G, and Trp-C (Marabotti et al., 2008). Many of protein-DNA contacts predicted for the IclR-family TFs conform well to these preferences, e.g., five out of six strong correlations in group 1 (Supplementary Table 4).

CONCLUSION
We have identified regulated genes and binding sites for 1340 IclR-family TFs from 181 orthologous groups in 320 bacterial genomes. Despite the prevalent opinion that IclRfamily motifs have no common consensus, here we describe two main types of IclR-family motifs, similar in sequence but different in the arrangement of the boxes. This, together with the prediction that many IclR-family TFs bind threebox motifs, where one pair of boxes corresponds to the first variant of the motif consensus, and the other pair, to the second variant, suggests that alternative dimerization is possible for IclR-family TFs. This hypothesis requires experimental validation.
We demonstrate that site positioning apparently follows the length of the DNA turn. The majority of site centers are positioned between -20 to -80 nt upstream of the gene start, and in this 60-nt zone the probability of site positioning distinctly oscillates, with the distance between the preferred positions approximately corresponding to one DNA turn. We also have observed that TFs from more than half of the studied orthologous groups bind tandem sites with 18-22 (mainly 19-21) nucleotides between their centers. This distance seems to be optimal for the tetramer binding of the IclR-family TFs, with dimers facing the same side of DNA.
We predict protein-DNA contacts by the analysis of correlations between amino acids of DNA-binding motifs of TFs and nucleotides of their binding sites. The correlation analysis shows that, despite differences in the motif structure, the majority of interacting positions and predicted protein-DNA contacts are similar in both studied groups and conform well to existing experimental data, as well as to previously described general protein-DNA interaction trends.
We have also reconstructed regulons for the IclR-family TFs and analyzed their content, identifying co-occurences between the regulated COGs. IclR-family regulogs vary in size and content, and those COGs that are most frequently present and associated with each other are involved in the metabolism and transport of aromatic compounds and sugars or sugar acids.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
IS and MG conceived the study, analyzed the results, wrote and revised the manuscript, and read and approved the submitted version. IS designed and performed the comparative analysis and visualized results. Both authors contributed to the article and approved the submitted version.

FUNDING
This study was supported by the Russian Science Foundation under grant 18-14-00358.

ACKNOWLEDGMENTS
We thank Yuriy Korostelev for designing the Prot-DNA-Korr software and Andrey Mironov for sharing GenomeExplorer and SignalX software. We also thank Maria Tutukina for the discussion of possible experimental validation of prediction.