The Electronic Behavior of Zinc-Finger Protein Binding Sites in the Context of the DNA Extended Ladder Model

The eukaryotic Cys2His2 zinc finger proteins bind to DNA ubiquitously at highly conserved domains, responsible for gene regulation and the spatial organization of DNA. To study and understand the zinc finger DNA-protein interaction, we use the extended ladder in the DNA model proposed by Zhu, Rasmussen, Balatsky \&Bishop (2007) \cite{Zhu-2007}. Considering one single spinless electron in each nucleotide $\pi$-orbital along a double DNA chain (dDNA), we find a typical pattern for the {\color{black} bottom of the occupied molecular orbital (BOMO)}, highest occupied molecular orbital (HOMO) and lowest unoccupied orbital (LUMO) along the binding sites. We specifically looked at two members of zinc finger protein family: specificity protein 1 (SP1) and early grown response 1 transcription factors (EGR1). When the valence band is filled, we find electrons in the purines along the nucleotide sequence, compatible with the electric charges of the binding amino acids in SP1 and EGR1 zinc finger.


INTRODUCTION
One of the major eukaryotic DNA-protein binding motifs are those related to zinc fingers (ZF), the key protein family for the chromatin condensation as well as the gene regulation.
There are around one thousand ZF encoding genes [2] and ten thousands highly conserved putative ZF binding sites along the human genome [3,4]. The majority of ZF proteins assist transcription factors, acting as repressors, activators and regulators [2,5]. They are responsible for the genome spacial structure in the DNA loops too, exposing or hiding the genes, and work as an insulator, avoiding the spread of heterochromatin [6]. These ZF proteins could mediate long-range chromosomal interactions in eukaryotic cells, > 100 thousand base pair (bp) [7][8][9][10]. However, the exact relation between the long-ranged correlation in genomic scale nucleotide sequences (>20 thousand bp) and the chromosomal three-dimensional organization is still not clear [10][11][12][13][14][15] and subject to intense research. Furthermore, since transcription factors spot specific sequences without the opening of the double helix, we expect some biological mechanism for probing nucleotide based on local properties [16]. To understand this there are two basic approaches: a polymeric description and by electric charges.
The most common polymeric description considers DNA as a single one-dimensional strand, explaining the DNA denaturation semi-analytically [17][18][19]. The literature also report the mechanical properties of chromosomal fibers and the long-range nucleotide interaction due to DNA loops, histones and zinc finger proteins, using Monte Carlo simulation [10,[20][21][22].
Since electrons play a crucial role in the DNA-protein interaction, we must consider the DNA from the electric charge distribution too. The electronic nature of DNA is still in debate, but the literature point us some cues about their organization. The double helices behave as isolants or conductor under silver deposition [23], material contaminants [24,25] and others environmental conditions [26]. However, when the conductivity is measures in atmosphere, low vacuum or Tris-HCl buffers, DNA has semiconductor features with the typical gap between the valence and conductor band in the electronic density of states (DOS) [26][27][28][29][30][31]. In order to describe this behavior, ionization models (also known as ballistic, polaron or wirelike charge transport) have been proposed [32][33][34][35][36]. The parameters in ionization models are easily measured, since one just needs to evaluate the loss of energy when we takes one electron in a neutral molecule. The lost electron is usually in the highest occupied molecular orbital valence band (HOMO) and it may easily jump to the lowest unoccupied molecular orbital in the conductor band (LUMO). But, the literature also suggests electronic affinity models, where the energy is described by the gain of electrons in neutral molecule [37][38][39][40][41].
These theoretical results usually combine density functional theory and molecular dynamic simulation.
In 2007 Zhu et al. joined both molecular ionization and affinity approaches [1]. This adaptation of the Peyra-Bishop DNA melting model [17] describes the nucleotide sequence from their semi-conductor features, avoiding the heavy computational cost of ab initio molecular dynamical simulations. Their work allowed to spot electronic local density of states (LDOS) in one viral P5 promoter sequence, connecting LDOS with one specific biological function [1,16]. Unfortunately, they did not observe in their model the gap between HOMO and LUMO in (C) n as one expects from the experimental data [30,42]. In our work we fix the problem of HOMO-LUMO gap, introducing the extended ladder in the model as suggested by [36,[43][44][45]. We also analyze systematically the DNA-protein binding sites for two transcription factor proteins: the human specificity protein transcription factor 1 (SP1) [46] and early grown response factor (EGR1, aka Zif268) [47] both localized in the promoter of a great variety of genes and characterized by a molecular structure called Cys 2 His 2 zinc finger (ZF). The descriptions of ZFs as well as SP1 and EGR1 are in the appendix. This paper is organized as follows. First, we describe the extended ladder model. Then, we test the model, studying the electronic behavior of (C) n and (T ) n sequences. We replace one nucleotide in order to understand the electronic interactions along the DNA chain. After this, we analyze systematically the SP1 and EGR1 binding sequences and report strand dependence and independence results.

THE MODEL
In this paper, we consider one double DNA chain with n base pairs, totaling 2n nucleotides, Fig. 1(d). In reality our model does not consider nucleotides, but nucleosides, i.e. the nucleotide with the phosphate group. However, we will call nucleosides nucleotides in this work in order to simplify the nomenclature. The electronic behavior of the spinless free electron of the π-orbital of the nucleotide is described using the same Hamiltonian as in [1], The first term in Eq. 1 is given by, where C † i and C i are the electron creation and annihilation operators at site i, ǫ i is the on-site ionization energy, n is the number of nucleotides and t ij is the electron hopping rate between nucleotides i and j, Here, we are using the extended ladder, where we duplicate the one dimensional lattice in [1] and include the interstrand hopping [36,43,45]. The structure of the ladder considers the long-distance charge and hole transport along dDNA [43,[48][49][50] The second term in Eq. 1 represents the coupling between the free electron and the nucleotide displacement field, where y i is the displacement (dark dotted line) of the electronic cloud from the equilibrium in the nucleotide (light dotted line), Fig. 1(d). The last term H b represents the interaction of the electron with the nucleotide: where D i and a i are parameters of the Morse potential, k v is the spring constant of the anharmonic interaction between two contiguous base-pairs. ρ and α are the parameters for modifying k v in order to evaluate long-range cooperative electronic behavior [1].
We study the electronic part H e and H eb of the Hamiltonian in Eq. 1 computing the eigenvalue E k and eigenvectors φ k i , i, k = 1, ..., 2n, of the matrix .
This matrix is similar to the one suggested in [36,45], except for the electron base component In order to estimate y i , we consider the self-consistency condition, given by where < ... > represent the average over the free electrons in the system. The iteration method for solving Eqs. 5 and 6 is described in [1], and it consists of the follow procedure.
Given a initial condition for {y i }, we diagonalize the matrix 5 in order to compute the electronic occupation in each site < n i >, where n i = ne k=1 |φ k i | 2 and n e is the number of electrons in the system. This set of < n i > will be used in the Langevin equation calculated from Eq. 6. We update the values of {y i }, using fourth-order Runger-Kutta method for the Langevin equation. The new {y i } set is inserted again in the matrix of Eq. 5. We repeat the iteration until we achieve the minimum local adiabatic electronic and structural configuration. The computation were done using R with the package deSolve for the Runger-Kutta algorithm [51]. The choices of the model parameters are in the appendix.
In this work, we estimate spatial distribution of electrons, energy level and displacement field only considering n e = n. Thus, the valence band is always filled with electrons and the conductor band is empty. Our model does not have periodic boundary condition. So, the elected regions for analysis must be larger in order to avoid boundary effects. We analyze only nucleotide sequences at least 10bp distant from the beginning and ending of the sample.
We apply the proposed model in poly(C)-poly(G) and poly(T)-poly(A) sequences with 63 base pairs in order to understand the behavior of the electrons dispersed along the DNA chain.
According to the literature [36,44,45,52], we do expect a gap in the energy band in the test sequences (C) 63 and (T ) 63 as can be seen in Fig. 2(a). The gap between the valence and conductor band in (T ) 63 is narrower than in (C) 63 . Furthermore, the gap of the pure (C) 63 sequence can be modulated, when we introduce one single T in the position 32. One HOMO and LUMO appear in the gap of the energy band, marked as H and L in Fig. 2(d).
Moreover, the HOMO and LUMO electronic cloud, dispersed in pure (C) 63 , black lines in Fig. 2(b,e), becomes localized in the introduced T (red lines in Fig. 2(b,e)). We remark that the electronic cloud of HOMO is dispersed in a pure (C) 63  On the other hand, when we substitute one thymine by cytosine in a (T ) 63 , the HOMO electronic cloud will be localized in the replaced nucleotide, Fig. 2 We apply the procedure described in the previous section and we estimate the eigenvalues E k and eigenvectors φ k i of the total Hamiltonian in Eq. 1 for the sequences in Table 2. The criteria for the sequence selection as well as the method for nucleotide alignments are described in the appendix. The alignments in Table 2    We start with the analysis of the position of BOMOs looking for the values of |φ k i | 2 with k close to 1. When we consider n = 50 as in MOAB and EGR1, the analysis of the first 8 eigenvectors are usually sufficient to spot the relevant ones. The electronic cloud n i , 0 ≤ n i ≤ 1, Eq. 1, is strand dependent, but we do not observe any strand related pattern for individual electrons. Thus, we sum the probabilities of the direct and the complementary strands to find the local electronic cloud. BOMOs could be degenerated in many electrons along the nucleotide sequence, but we should focus just in those around the binding sites, yellow and black lines in the valence band |φ k i | 2 , Fig.3(c,d). Note that the sum of these two degenerated BOMOs k |φ k i | 2 δ(E 0 − E k ) will result in the LDOS of the binding site, which is proportional to the differential tunneling conductance [1]. At low temperature, this quantity could be measured by scanning tunneling microscope (STM) [42]. The zinc fingers of SP1 and EGR1 transcription factors act as tips of an STM, scanning binding sites along the DNA chain. Finally, we mark the nucleotides with at least 10% of probability of electronic presence in gray and yellow in Fig.3(g,h).
The procedure for localizing the electronic cloud associated with HOMO and LUMO is very similar to spot BOMO probability distributions, except that k of HOMO and LUMO are close n. In order to find the electronic clouds, we need to consider k from 46 to 50 for HOMO and 51 to 54 for LUMO, when n = 50. The electronic cloud associated with LUMO is always close to the HOMO, with a maximum of ±6bp distance. Since the probability of finding one HOMO or LUMO electrons are strand independent, we add both direct and The orange lines in Fig. 3(c,d) and red lines in Fig. 3(e,f) are HOMO and LUMO, respectively. We can also measure the LDOS of HOMO and LUMO with STM, using the same approach for BOMOs. The nucleotides with at least 10% of probability of electronic presence is denoted by orange and red boxes in Fig. 3(g,h). Table 2, where all BOMOs are marked in gray and yellow as well as the HOMO and LUMO electrons are in orange and red boxes. Looking Table 2, the electronic distribution patterns for the binding sites for SP1 and EGR1 transcriptor factors are clear.

Now we return to
In the case of SP1, BOMO clouds are over the first (5'-ggg-3') and third triplets (5'-ggg-3') of the consensus sequence, light green in Table 2. These triplets are related with the first and third ZF binding positions of SP1. Moreover, the first BOMO electronic cloud has values from 4 to 5 bp, while the second ranges from 2 to 4bp. The eigenvalue of these BOMOs values 7.98±0.05eV. The energy level of HOMO electrons are fixed at 8.52±0.02eV and the electronic cloud sizes spans between 1 and 2 bp. We observe some fluctuation in the eigenvalue in LUMO for SP1, which values 9.3 ± 0.1eV. The LUMO electrons envelop 2 to 5 base pairs. The positions of HOMO and LUMO associated electrons are always before the first electron and these electrons are placed from -12 to 1.
For the EGR1 the first BOMO spans from the position 3 to 7 over the second triplet (5'-ggg-3'), and the probability in finding this particular electron spans over 2 or 4 bp. The second triplet is the binding site of the second ZF of the early grown response protein 1. The second electron is after the second triplet and is dispersed between the nucleotide position 7 to 15, covering the third triplet. The electronic cloud size ranges from 2 to 4bp. All BOMO energies in Table 2 values 7.99 ± 0.03eV. The HOMO and LUMO electronic cloud is over the second electron. All E HOMO in Table 2  Considering the HOMO and LUMO distributions, we believe that they may play some role in SP1 and EGR1 binding. These proteins bind DNA, embracing the major grove of the double helix as guide. In the case of SP1, the head may interact with nucleotides between position -11 to -1. The behavior for EGR1 is more elusive because HOMO and LUMO are completely disperse over the nucleotides 5 to 20. Despite the description emphasizing the similarity between the ZF and nucleotide interaction in the literature over the consensus nucleotides [5], the mechanisms of protein attachment in EGR 1 and SP1 are not the same [61,62].
The HOMO and LUMO electronic clouds are frequently overlapped, and the main reason for electrons of HOMO and LUMO are always in adenine and thymine rich sequences is as follows. The electrons from the highest occupied molecular orbital in the valence band should move to the nearby lowest unoccupied molecular orbital in the conductor band, when the system is disturbed. And the easiest way for this movement is placing the electron in regions with higher excitability, i.e. AT rich domains. We may conjecture that this jump of the electron in the HOMO to the LUMO has a unknown role in the transcriptor factor SP1 and EGR1.
On the other hand, the less mobile electrons are those in the CG rich domain, since they are at the bottom of the DOS. So, we expect to spot BOMOs in CG rich-sequences instead of AT rich-regions as we see for (T ) 63 with one C in the position 32, described in the previously. Furthermore, these BOMOs are degenerated, i.e. all electrons present the same energy level. Thus, these cytosine and guanine rich-regions, typical in promoters, are ideal landmarks for SP1 and EGR1 binding sites. The absence o excitation in the lowest states is vital for ZF transcription factors, because nucleotides with mobile electrons may change the position of the beginning of the gene reading, altering the gene expression. We are aware about the those eigenvalues between BOMO and HOMO, but we still do not find any obvious pattern associated with SP1 and EGR1.
We never observe overposition between the probabilities of BOMOs and the overlapped LUMO-HOMO electronic clouds using the criteria of a minimum 10% of the localization probability of one particular electron at the samples in Table 2.

THE COLLECTIVE ELECTRONIC BEHAVIOR
The electronic probabilities |φ k i | 2 of individual electrons, discussed in the previous section, are strand independent. However, the collective electronic probabilities n i and the field displacement y i depend of the strand.
When we have n e = n electrons, we fill only the valence band and usually observe in all analyzed sequences, Table 2, 100% of probability of electronic presence in purine (adenine or guanine) and the absence of an electron (hole) in pyrimidines (thymine or cytosine) in agreement with the DFT analysis [41]. Figs. 3(i,j) are the probabilities n i associated with finding one electron in one nucleotide for the MAOB SP1d and EGR1 binding site sequences. The electronic presence in each purine gives us a new biological interpretation of Peng et als. contribution [11,12]. Using exon and intron rich segments of the eukaryotic genome, they construct a DNA walk using purine and pyrimidine as criteria for steps. Then they report self-affine fractality in the walk, showing long-ranged correlation in the purine and pyrimidine distribution. When we look to Figs. 3(i,j), purines and pyrimidines reflect the electronic distribution along the DNA chain. This electronic distribution is related to BOMO, HOMO and LUMO distributions, which work as ZF binding sites, for example. It is important to stress that they report a self-affine fractal, but not a self-similar fractal. The self-similarity is related to the palindromic sequences, connected with DNA-loop structures as tRNA and rRNA [63,64], while self-affinity is related to introns [11]. Furthermore, the evidence of polynomial long-ranged nucleotide interaction is also supported by [13,14].
In this work the long contiguous sequences are represented by a sequence of 0 and 1 for noncoding and coding nucleotides, where noncoding nucleotides are intergenic regions and introns and coding nucleotides are genes and regions for metabolic controls. We made an auto-correlation analysis over the binary sequence and report correlation between two coding nucleotides at least 20 thousand bp distance apart.
The existence of long-ranged correlations has another consequence in the model. The second term in Eq. 4, the stacking interaction between adjacent base pair, mimics the bending of DNA as polymer. But, we will see that the short-ranged ρe −α(y i +y i+1 ) in Eq. The presence of electrons in purines have a profound impact in the ZF binding. We show the consensus nucleotide sequence and the core zinc finger binding amino acids in Fig.1(b) and (c) for EGR1 and SP1, respectively.
The EGR1 amino acid sequence is available in the Universal Protein Resource databank (UniProt) with the accession code P18146 [66]. The three zinc fingers of the human EGR1 are positioned between position 338 to 362, 368 to 390 and 396 to 418 of the 543 long amino acid sequence [66]. The dotted lines in Fig. 1(b) are the hydrogen bonds between aspartic acid (D) and adenine or cytosine, which stabilize the first guanine-argine(R) hydrogen bond of ZF. The positive charged basic argine (R), histidine (H) and lysine (K) responsible for the DNA-protein are highlighted in gray, while the negative charged weak acid threonine (T) and strong acid glutamine (E) are in yellow. Each red line in Fig. 1(b) is the binding of one particular nucleotide with its respective opposite charged amino acid of the core zinc finger segment of the EGR1 [61].
The 785 amino acid long SP1 transcriptor factor, UniProt accession code P08047, has three tandem ZFs between 626 to 650, 656 to 680 and 686 to 708 [46,66]. The dotted lines are the hydrogen bond that stabilize the first ZF guanine-argine(R) or guanine-lysine(K) bonds. The binding between core ZF amino acids and the correspondent nucleotides are indicated by red lines in Fig.1(c) [5]. Again, each nucleotide is connected with opposite charged amino acid .
Concerning the electrical charges of the SP1 zinc finger tips, the middle amino acid that bonds with the middle nucleotide of the triplet, there is one motif associated with the position of BOMOs, described in the previous section. The pattern of positive and negative charges along the nucleotide sequence coincide with the position of the three ZF tips. Since BOMOs are the most stable electrons, they guide the fingers as holder for fixing SP1 to the dDNA. We observe the same phenomenon for the EGR1.
When we compare the strand independence analysis of the previous section with the electronic strand dependence, one may suggest the existence of a contradiction between the presence of BOMO in the complementary strand 3'-ccc-5' at EGR1 in Fig. 3(h). Actually BOMO in this case is at the direct strand 5'-ggg-3'. We have the impression that BOMO is in the 3'-ccc-5', since we sum the electronic cloud of direct and complementary strand in the previous section, seeking the electronic motif of BOMO.
The collective probabilities n i are not the unique strand dependent variable in SP1 and EGR1. The field displacement y i of the Morse potential is also strand dependent. This means the electronic cloud y i , given by the Morse potential in Eq. 4, usually contract when n i = 1. i.e. in the presence of purines. The contraction of the electronic cloud is more intense in adenine (y i = −0.125 ± 0.001Å) than guanine (y i = −0.114 ± 0.001Å). The simultaneous measurement of the size of the electronic cloud of the direct and complementary strands y i mirror the nucleotide order and may lead to a new sequencing method.
The consensus sequences, the light and dark green lines in Figs. 3(g,h), reflect in y i and Figs. 3(i-l). We usually observe the absence of the electronic cloud in the middle cytosine of the direct strand of the SP1 and EGR1 binding sites, black lines with circle in Figs. 3(i,k), as well as the opposite behavior in the complementary strand, red lines with plus in Figs.   3(i,k). But, we should be cautious about, because this is not true for the purine sequences with one replaced pyrimidine or vice versa in the same way of n i , as mentioned before.

CONCLUSION
In the extended ladder model proposed in this article, the Morse potential is the key components for the electronic behavior in the double helix DNA chain. But, the stacking interaction between adjacent base pairs in the Zhu et al. [1] has limited influence on the results, since this interaction is short-ranged.
BOMO, HOMO and LUMO show an electronic motif behind the SP1 and EGR1 binding sites, compatible with the consensus multiple alignments. In the case of SP1, there is one BOMO in the first and another in the third zinc finger binding site, and the HOMO and LUMO positions are before the consensus sequence. The first BOMO is distributed for EGR1 over the second zinc finger binding position and the second BOMO is after the consensus sequence. The HOMO and LUMO are over the second BOMO. BOMOs are degenerated with 7.98 ± 0.05eV and 7.99 ± 0.03eV for SP1 and EGR1, respectively. The HOMO eigenvalues are 8.52 ± 0.02eV (SP1) and 8.52 ± 0.01eV (EGR1). The LUMO energy levels are 9.3 ± 0.1eV (SP1) and 9.4 ± 0.1eV (EGR1).
When the valence band is filled, we observe a 100% probability in electronic presence in purines (adenine and guanine) and its absence in pyrimidines (thynine and cytosine).
Furthermore, the sequence of electrons and holes coincide with the basicity and acidity of the DNA-protein binding animo acids in the zinc fingers. In particular, the sequence of positive and negative charges of the tips of SP1 and EGR1 coincide with BOMO cloud distribution. Finally, the collective electronic behavior for the filled valence band DNA chain will result in a sequence of electronic clouds around purine π-orbitals, dashed-dotted lines in Fig. 1(d).

Appendix A: The zinc-fingers in SP1 and EGR1
The Cys 2 His 2 zinc finger unit is composed by one zinc ion between two β-sheet and one α-helix [61], Fig. 1(a). There are two cysteine at one end of the β-sheet and two histidines in the C-terminal portion of α-helix [61]. Each ZP bonds with three nucleotides, Fig. 1 (b,c). Transcription factors use typically between 2 to 4 ZFs for identifying specific sites along the DNA [61]. In particular, the EGR1 interact with 1-3ZF, fitting itself in the major groove of the dDNA, Fig. 1(a). The interaction in situ of each Cys 2 His 2 zinc finger of the mouse EGR1 with the double DNA chain (dDNA) is detailed in [67,68]. The EGR1 gene is a nuclear protein, related to the cell differentiation and mitogenisis and is localized in the position 5q31.1 of the human cytogenetic map. SP1 acts in a large number of CG rich promoters dispersed along the genome, [46]. This ZF transcription factor acts as enhancer and histone deacetylase binder, increasing the strength of the chromatin wrapping by histones, and modulates DNA-binding specificity. The gene responsible for SP1 is in 12q13.1. Here we focus on transcription factors with few zinc fingers as for those with many ZF as transcription the mechanism of binding is not well-understood [5,22,61].

Appendix B: The model parameter choices
When we transform the uni-dimensional Zhu et al. model [1] in extended ladder, each site will represent one nucleotide, instead of a base pair. We also alter the concept of the displacement field y i [36,45]. The displacement field in the DNA melting model is the ratio of the electronic cloud from the hydrogen bridge between two nucleotides [1]. But, now the displacement field will be the variation of the electronic cloud ratio of the nucleotide πorbital. So, according to the density functional theory (DFT) studies in the nucleotides [40,41], D A , D T , D C and D G in the Morse potential H b , Eq. 4, are respectively 0.25eV, 0.44eV, 0.33eV and 0.45eV. The DFT analysis, supported by experimental anion photoelectron spectra, also suggests that a A , a T , a C and a G in the Morse potential are correspondingly around 3.0Å −1 , 3.0Å −1 , 3.0Å −1 and 2.5Å −1 [38,39]. The k v comes from DNA vibrational spectra using Raman spectroscopy [69] and we fix this to 0.0125eV. The electronic hopping rates in the free electron Hamiltonian H e , Eq. 3, are the same in the literature [36,[43][44][45].
Thus, we use Table 1 for hopping rate t G//C values 0.055 eV and t A//T is fixed to 0.047 eV. The values for the onsite potential ǫ A , ǫ T , ǫ C and ǫ G are respectively 8.631eV, 9.464eV, 9.722eV and 8.178eV.
In the electron-base Hamiltonian H eb , Eq. 3, we may control the gap size between HOMO and LUMO varying α v . Thus, we compare the gap in our spectra with those reported in literature [36,[42][43][44][45]52], and we fix α v to 1.0.

Appendix C: Selection of GenBank files and nucleotide alignment
We use the DNA sequence from the human reference map, annotation release 106 (build GRCh38) [70]. The criteria for selecting the binding sites in this work are as follows. The binding sites must have experimental confirmation in vitro. We remark that we usually observe many single nucleotide polymorphisms (SNPs) between the reference map and the reported experimental samples, because the reference map is basically a consensus sequence from 9 individuals [71] while the samples in experimental binding site data belongs to one individual. Nested binding sites is a common occurrence, but we try to avoid overpositioned binding sites in order to simplify the search for an electronic motif. The binding site of these transcrition factors is in the promoter, a region between 500 to 2000bp distant from the beginning of the gene. We spot similar SP1 and EGR1 binding sites, TATA box and other structures reported in individual samples in the GenBank reference map as well as in databanks as in the Eukaryotic Promoter Database for SP1 and EGR1 [72][73][74]. Finally, we use the nucleotides sequences in FASTA and GenBank flat file format, since the nucleotides are just nucleotides with the phosphate group as we mentioned in the introduction.
We select 16 binding sites in 10 different genes, see Table 2. We remark that the IL2 [53] and TNF genes [54] have binding sites for both SP1 and EGR1. The selected binding site regions of the human genome have 50bp of length as MOAB SP1p,d [55], IL2 SP1 and TNF SP1. We consider segments 70bp long around CDC25A SP1 [56] and PGR SP1p,d [57]. The segment around the three SP1 binding sites a, b and c of the gene SREBF1 (also known as SREBP-1a) has 90 bp [58]. We extract the sequences with 50bp of length for the EGR1 binding sites for the genes EGR1 [47], SYN2 [59] and TP53 (aka P53) [47], but 80bp long for VEGFA EGR1 binding site [60]. By the way, EGR1 is a curious zinc finger protein, because this gene has also a EGR1 binding site [47]. The EGR1 protein can regulate its own expression. The sequences of IL2 and TNF EGR1 binding sites are the same 50bp long as IL2 and TNF SP1. To identify the binding regions we use BLAST [75], SeqinR 3.1-3 [76], Biostrings 2.34.1 [77] and ShortRead 1.24.0 [78], since the genome has four different readings: the normal direction (5 ′ − XX ′ − 3 ′ in Fig. 1 in the complementary strand (3 ′ − Y Y ′ − 5 ′ ) and in the reverse-complementary direction . We indicate respectively these reading direction with 'r' and 'c' in parenthesis for reverse and complementary in Table 2.
Once we have selected the region of interest, we made multiple sequence alignment using Clustal Omega, a software that organize the sequences using hidden Markovian model for alignments and a multidimensional embedding space for data clustering [75]. We adopt the same notation for the nucleotide positioning in the promoter [56,59]. So, there is no position zero. The counting always starts with 1 in the first nucleotide of the consensus sequence. The first nucleotide before the nucleotide of the consensus sequence is always -1. This way there is no position zero for nucleotides. We adopt the position 1 as the first nucleotide of the literature consensus sequence for SP1 and EGR1. In Table 2, the consensus sequences are in light and dark green and they are defined by a simple majority, i.e. when the number of alignment nucleotides is bigger than or equal to 6 and 4 for SP1 and EGR1, respectively.   (1-3ZF) are indicated in light and dark green [5,47,[53][54][55][56][57][58][59][60][61].   [55] and EGR1 genes [47] are respectively in the left and right columns. We remark that the EGR1 transcription factor can bind in the promoter of his own gene [47].