The Mutational Landscape of the Oncogenic MZF1 SCAN Domain in Cancer

SCAN domains in zinc-finger transcription factors are crucial mediators of protein-protein interactions. Up to 240 SCAN-domain encoding genes have been identified throughout the human genome. These include cancer-related genes, such as the myeloid zinc finger 1 (MZF1), an oncogenic transcription factor involved in the progression of many solid cancers. The mechanisms by which SCAN homo- and heterodimers assemble and how they alter the transcriptional activity of zinc-finger transcription factors in cancer and other diseases remain to be investigated. Here, we provide the first description of the conformational ensemble of the MZF1 SCAN domain cross-validated against NMR experimental data, which are probes of structure and dynamics on different timescales. We investigated the protein-protein interaction network of MZF1 and how it is perturbed in different cancer types by the analyses of high-throughput proteomics and RNASeq data. Collectively, we integrated many computational approaches, ranging from simple empirical energy functions to all-atom microsecond molecular dynamics simulations and network analyses to unravel the effects of cancer-related substitutions in relation to MZF1 structure and interactions.


INTRODUCTION
Transcription factors belonging to the SCAN zinc finger family have been implicated in a number of cellular malignancies (Monaco et al., 1998;Dong et al., 2004;Yang et al., 2011;Eguchi et al., 2015;Singh et al., 2015). SCAN domains in zinc finger transcription factors are crucial mediators of protein-protein interactions (Williams et al., 1995;Sander et al., 2003;Edelstein and Collins, 2005;Noll et al., 2008) and allow SCAN zinc finger proteins to form homo-and hetero-dimers (Edelstein and Collins, 2005). The importance of dimerization for SCAN zinc finger transcriptional activity was demonstrated through two-hybrid experiments, using ZNF174. This study demonstrated that interaction between SCAN domains synergistically activated transcription (Williams et al., 1999). SCAN domains have been identified in more than 80 zinc finger genes throughout the human genome (http://www.ebi.ac. uk/interpro/entry/IPR003309/proteins-matched?species=9606), including a number of cancer-related genes. In addition, a subset of SCAN domain only factors (SCAND), which lack the DNA binding domains, has been discovered and suggested to function as regulators of the intact SCAN zinc finger factors (Sander et al., 2003;Edelstein and Collins, 2005).
SCAN domain is a highly conserved domain of ∼80 residues and it is usually located in the N-terminal region of transcription factors (Sander et al., 2003). It is composed by at least three α-helices separated by short loop regions. SCAN domains have an extended conformation, also defined as V-shaped structure, composed by five α-helices arranged in two subdomains. The Nterminal subdomain comprises the α-helices 1 and 2 that form one side of the V-shape, whereas the C-terminal subdomain is composed by the α-helices 3, 4, and 5 that pack together forming the other half of the V-shape. In some SCAN domains, the Nterminal subdomain of one monomer packs with the C-terminal subdomain of the other monomers so the two monomers interact in a domain-swapped topology (Peterson et al., 2006).
One example of a SCAN zinc finger transcription factor with a crucial role in cancer is Myeloid Zinc Finger 1 (MZF1). There are three known isoforms of MZF1 which may play different roles in tumorigenesis-these encode protein sequences of different lengths and domain composition Eguchi et al., 2015). The shortest MZF1 isoform encodes a 290residue protein, including the SCAN domain, a portion of the regulatory linker and a unique C-terminal motif (Eguchi et al., 2015).
The existence of highly conserved SCAN domains in the SCAN zinc finger family suggests that interactions between its family members may occur through hetero-dimerization (Edelstein and Collins, 2005). MZF1 has been shown to interact with its family members such as ZNF24, ZNF174, and ZNF202 through SCAN-SCAN interactions (Noll et al., 2008) along with SCAND proteins such as RAZ1 or SCAND1/RAZ108 (Sander et al., 2000). The role of the SCAN-SCAND interaction is still unclear, however in the case of MZF1, it has been suggested that the "zinc fingerless" SCAND proteins might decrease MZF1 signaling causing a decrease in the affinity for DNA targets or other interactions (Sander et al., 2000). The N-terminal region of MZF1, which includes the SCAN domain, can also coassociate with the promyelocytic leukemia nuclear bodies (PML-NB, Bernardi and Pandolfi, 2007;Noll et al., 2008) and recruit other factors, such as ZNF24, to the PML-NBs (Noll et al., 2008).
MZF1 was initially associated with malignancies in studies on hematopoietic development, as it was found to act as a transcriptional repressor of essential genes for hematopoietic differentiation. In addition, MZF1 promotes the emergence of a leukemic phenotype (Perrotti et al., 1995;Hromas et al., 1996). On the contrary, Gaboli and co-worker suggested that MZF1 could have a suppressor function on hematopoietic cancer types (Gaboli et al., 2001). The source of these discrepancies might be related to the existence of the different MZF1 transcripts, which may have diverse functions in the cell .
The cBioPortal database (Cerami et al., 2012) reports the MZF1 gene to be significantly amplified in certain solid tumors, such as breast, uterine, lung, and bladder cancers (Eguchi et al., 2015). A complex and heterogeneous role of MZF1 in tumors is supported by studies in cellular and animal models of different types of cancer (Hsieh et al., 2007;Mudduluru et al., 2010;Rafn et al., 2012;Chen et al., 2014;Tsai et al., 2015;Vishwamitra et al., 2015;Nan et al., 2016). The effects mediated by MZF1 seem to be strictly dependent on the cancer type. Thus, the interactions and functions involving this transcription factor will require further elucidation on a detailed structural level. Indeed, this is a critical step for the understanding of the molecular mechanisms involved in cancer initiation and development, as well as in the design of new therapeutic approaches.
Despite the many questions surrounding MZF1 and its role in cancer, no systematic studies have been devoted to understanding the structural and functional impact of cancer-related mutations involving this SCAN domain. A NMR structure of the MZF1 SCAN domain [Protein Data Bank (PDB) entry: 2FI2; (Peterson et al., 2006)] in its dimeric form is publicly available and we here use it to study the cancer-related mutational landscape of MZF1.
We analyzed the network of experimentally characterized protein-protein interactions of MZF1 and predicted the candidates for SCAN-SCAN or SCAN-SCAND interactions using an in silico high-throughput approach based on the PRISM energy function (Tuncbag et al., 2011;Baspinar et al., 2014). The expression levels of MZF1 and its protein partners were compared in normal and tumor samples using RNASeq data from cancer patients. In particular, we applied a Pan Cancer approach to 24 different RNASeq data sets obtained from The Cancer Genome Atlas (Tomczak et al., 2015).
We also integrated different computational techniques, such as sequence-and structure-based methods to predict the effects of mutations on protein stability and protein-protein interactions (Schymkowitz et al., 2005), microsecond atomistic molecular dynamics (MD) simulations and methods inspired by graph theory applied to MD Tiberti et al., 2014;Papaleo, 2015;Papaleo et al., 2016). In addition, we used high quality NMR data, such as Nuclear Overhauser Effect (NOE) experiments and chemical shifts (BMRB accession number: 6957, Peterson et al., 2006) to cross-validate our MD structural ensembles and compare the results achieved by the different MD force fields.

Protein-Protein Interactions of MZF1 SCAN Domain
We used the Interologous Interaction Database (I2D; Kotlyar et al., 2016), downloaded on January 8th 2016, to retrieve known interaction partners of human MZF1 (Uniprot identifier P28698). Gene ontology enrichment was performed using the gprofiler (Reimand et al., 2007), GOsim (Fröhlich et al., 2007), and corrplot (Friendly, 2002) packages for R. The interaction network was filtered to remove redundant edges and visualized using Cytoscape version 3.3.0 (Shannon et al., 2003). For each MZF1 partner sequence, we performed a search for SCAN and zinc finger domains using InterPro (Mitchell et al., 2014) and Pfam (Finn et al., 2016).
We then used PRISM (Tuncbag et al., 2011;Baspinar et al., 2014) which is a template-based software to predict the structure of protein-protein complexes. PRISM includes a rigid-body structural comparison of target proteins to known templates of protein-protein interfaces and a flexible refinement using a docking energy function. We used the standalone version of the program to screen all the interactors collected by highthroughput experimental studies for their interaction with the SCAN domain of MZF1. The interaction partners selected in the first PRISM round were then submitted to PRISM 2.0 and evaluated on the basis of docking energy scores using as a cutoff for favorable interactions a docking energy value of −2.39 kcal/mol, as previously suggested (Baspinar et al., 2014;Tuncbag et al., 2011).
The flexible refinement in PRISM was performed using Fiberdock (Mashiach et al., 2010) including a step to remove steric clashes. The resulting complexes are then ranked by binding energy scores (i.e., Docking energy) calculated using the CHARMM force field (MacKerell et al., 1998).

Analyses of the Cancer Genome Atlas (TCGA) RNASeq Data
Level 3 RNASeq data (RSEM counts) for all the cancer studies deposited at The Cancer Genome Atlas (TCGA) were downloaded and pre-processed using the R-package TCGAbiolinks (Colaprico et al., 2015;Silva et al., 2016). The RNASeq data used for analyses had been produced using the Illumina HiSeq 2000 mRNA sequencing platform. For the analysis, we retained only those data sets for which both tumor and normal samples were available. A summary of the analyzed samples and cancer types is reported in Table S1.
Before the analyses, sample outliers were removed using the TCGAAnalyse_Preprocessing function from TCGAbiolinks, which is a function that estimates the Pearson correlation coefficient among all pairs of samples (38). Samples with a correlation lower than 0.6 within the same cancer study were discarded (38). Next, we normalized the expression data using the TCGAnalyze_Normalization function from TCGAbiolinks. Normalization procedures included adjusting for GC-content, gene-length effects on read counts and full quantile filtering of datasets using a cutoff value of 0.25 (Lee et al., 2011;Risso et al., 2011). Normalized samples were batch-corrected in order to remove artifacts from well-plates and/or the sequencing center. The batch correction was performed using Combat (Johnson et al., 2007) (40), implemented in the R-package, SVA (Leek et al., 2012). In-house R scripts were subsequently used to extract the processed data from MZF1 transcripts and for the transcripts of its interactors. We carried out the analyses using all samples within a tumor-type, as well as matched paired samples only (i.e., tumor and normal samples from the same patient), to evaluate how this would potentially affect results. The normalized counts were log2 transformed (pseudo count = 1) to overcome the problem of extreme values due to differences in sequencing depth (Lee et al., 2011).
We then incorporated the gene expression data in the MZF1 protein-protein interaction network using Cytoscape version 3.3.0 (Shannon et al., 2003). In each network, the absolute value of the difference between the medians of the counts per gene in the normal samples and tumor samples was used to represent the node colors, upon log2 transformation. The color shade of the edges represented the value of the Pearson correlation coefficient calculated for each MZF1 interactor-pair according to the counts presented in the tumor samples with respect to the normal samples.

Structure-Based Prediction of Impact on Protein Stability and Binding Interface
We employed the FoldX energy function from the newest release of the FoldX suite (27) to carry out in silico saturation mutagenesis using a Python wrapper that we recently developed (to be published and available on request). The wrapper allows for the introduction of all possible 19 point mutations at each position of the protein using multithread calculations. Calculations with the wrapper resulted in an average G (differences in G between mutant and wild type variant) for each mutation over the whole NMR ensemble of 20 conformers for both the MZF1 monomer (PDB entry 2FI2, chain A or B only) and the dimer (PDB entry 2FI2, both chain A and B). The ensemble was used to account for flexibility in the protein since FoldX energy function only allows for local conformational changes. We calculated the G between mutants and wild type variants associated with protein stability and in relation to the formation of the dimeric MZF1 complex. We applied the BuildModel module from the FoldX suite and five independent runs for mutations in our scan. The typical prediction error of FoldX is about 0.8 kcal/mol (Guerois et al., 2002). Twice the prediction error (i.e., 1.6 kcal/mol) was used as a threshold to discriminate between neutral and deleterious mutations in the analyses.
Moreover, we implemented a correction to the FoldX energy values, as defined by Tawfik's group (Tokuriki et al., 2007) to make the G FoldX values more comparable with the expected experimental values (i.e., to achieve more overlapping distributions). We estimated the G of unfolding for each mutant variant ( Gu mutant ) from the equation: where Gu mutant is the estimated unfolding G of the mutant variant, Gu WT_exp is the experimental unfolding G of the wild type variant, G Foldx is the G estimated by FoldX, m and b are correction terms. In particular, m and b where selected according to the work by Tawfik et al. (Tokuriki et al., 2007), in which the correlation between the experimental values G values and those predicted by FoldX was calculated for 1285 mutations of 10 different proteins available in the ProTherm database (Kumar, 2006). The authors applied a linear correction to the FoldX G values to optimize the overlap between the distributions of the experimental and in silico data, using parameters (m and b) derived either from linear regression or Principal Component Analysis (PCA). The latter correction led to essentially identical distribution between the experimental and computational data, therefore all the FoldX values were corrected using the PCA equation G FoldX = −0.078 + 1.14 G Experimental (Tokuriki et al., 2007). Therefore, we here used the inverse of this relation, from which b and m were derived, to normalize our FoldX Gs before adding them to the experimental unfolding free energies. Reversing the equation G FoldX = − 0.078 + 1.14 G Experimental , we obtained: Where b = 0.068 and m = 0.877 to be used in Equation (1). In our calculations, we cannot use an experimental value of unfolding G for the wild-type variant of MZF1 SCAN domain since there are no experimental data available in the literature at the best of our knowledge. Nevertheless, values in the range of 5-15 kcal/mol are generally obtained for the net free energy of unfolding of proteins (Privalov, 1979;Fersht and Serrano, 1993). And other helical proteins have folding G ∼5-7 kcal/mol according to the data reported in ProTherm. We thus here used as an arbitrary reference a value of 5 kcal/mol as Gu WT_exp in the calculation of Gu mutant .

Generation of the MD Ensembles
We used the first conformer of the NMR ensemble (PDB entry 2FI2, Peterson et al., 2006) as a starting structure for all-atom explicit solvent MD simulations with Gromacs 4.6 (Hess et al., 2008). We carried out MD simulations with five different force fields belonging to different force field families to assess the robustness of the results, i.e., Amber-ff99SB * -ILDN (Best and Hummer, 2009;Lindorff-Larsen et al., 2010), Amber-ff99SB-NMR-ILDN (Li and Brüschweiler, 2010), CHARMM22 with the CMAP backbone corrections (herein termed CHARMM27; Mackerell et al., 2004;Bjelkmar et al., 2010), CHARMM22 * (Piana et al., 2011), and the modified OPLS-AA/L force field RSFF1 (Jiang et al., 2014). We used TIP3P adjusted for CHARMM force fields (MacKerell et al., 1998) and TIP4P-Ew (Horn et al., 2004) water model for Amber/CHARMM and RSFF1 force fields, respectively. The protein was solvated in a dodecahedral box with a minimum distance between protein and box edges of 1.2 nm applying periodic boundary conditions. His51 was simulated as the Nε2-H tautomer, according to NMR CD2-HD2 chemical shift (BMRB accession number: 6957) which is <122 ppm (Sudmeier et al., 2003). The system was equilibrated according to a protocol previously applied to other cases of study (Papaleo et al., 2014b;Tiberti et al., 2015a). Productive MD simulations were carried out in the canonical ensemble at 298 K using velocity rescaling with a stochastic term (Bussi et al., 2007). The LINCS algorithm (Hess et al., 1993) was used to constrain the heavy atom bonds to use a time-step of two fs. Long-range electrostatic interactions were calculated using the Particle-mesh Ewald (PME) summation scheme (Essmann et al., 1995). Van der Waals and Coulomb interactions were truncated at 0.9 nm.

Comparison of MD Ensembles with NMR Data
We selected 501 frames, which were equally spaced in time, from each of the five different MD simulations to calculate backbone and side-chain chemical shifts, as well as NOEs. We used backbone and side-chain chemical shifts for 90 residues. Moreover, we employed 4063 NOE-derived distances of which 1018 long-range (i.e., separated by more than four residues in the primary sequence), 2422 short-range, and 623 intermolecular.
For each of the MD frames, we predicted the chemical shift using PPM One (Li and Brüschweiler, 2015), after which we compared the predicted chemical shift with the experimentallyderived values (deposited in BMRB entry 6957) using a χ 2 -like approach according to the following equation: We calculated the pair-wise distances between the atoms for which experimental NOEs were available and then averaged them over the 501 frames extracted from each MD ensemble. Differences between average and experimentally obtained distances were then calculated.
For each MD ensemble we also predicted the resolution value using Resprox (Berjanskii et al., 2012) and compared it to the predicted resolution of the NMR ensemble deposited in the PDB (PDB entry 2FI2).

Network Analyses of MD Ensembles
A protein structure network (PSN) approach was used as implemented in the Pyinteraph framework . The residues that have zero edges are termed as "orphans" and those involved in more than three edges are referred as "hubs." The node inter-connectivity was used to identify the so-called "connected components, " i.e., a cluster of connected residues in the graph. The node clustering procedure was carried out so that each node was iteratively assigned to a cluster if the node could establish at least a link with another node of the same cluster. We tested two different distance cutoffs to define the existence of a link between the nodes (i.e., 0.5 and 0.55 nm). We then monitored the distribution of the hubs and the elements belonging to the first five more populated connected components to identify a suitable cutoff for the analyses. The distance is measured between the centers of mass of the residue side chains, except for glycine residues (these are not included in the analysis). Since MD force fields are known to have a different mass definition, we used PyInteraph mass databases for each of the MD ensembles.
To obtain a single PSN for each MD ensemble, we included, in the final graph, only those edges which were present in at least 20% of the simulation frames (p crit = 20%), as previously applied in other cases of study (Papaleo et al., 2012a(Papaleo et al., ,b, 2014aJónsdóttir et al., 2014;Tiberti et al., 2014;Lambrughi et al., 2016a,b). For each pair of nodes in the PSN graph, a variant of the depth-first search algorithm was employed to identify the shortest path of communication. The distance between two nodes (i.e., residues) that are directly connected in the graph was considered to be one. The shortest path was defined as the path in which the two residues were non-covalently connected by the smallest number of intermediate nodes. The calculations were performed using the PyInteraph suite of tools  and analysis of the output was performed using in-house Python scripts (available on request). To evaluate the convergence of the PSN properties over the simulation time, we used the Jack-Knife resampling method (Miller, 1974). In particular, we calculated the hubs and connected components from an average PSN generated by the whole MD ensemble and by smaller MD ensembles obtained discarding 10% of the simulation frames at regular time intervals.

RESULTS AND DISCUSSION
The Landscape of Protein-Protein Interactions Mediated by MZF1 SCAN Domain We retrieved 17 MZF1 interaction partners from the I2D database ( Figure 1A). For each of them we collected information on the experimental techniques used to probe the interaction and the presence of SCAN or zinc finger domains (Table  S2). Among the identified interactors, five proteins (ZNF202, ZNF24, ZNF174, and ZSCAN2) were SCAN-containing and harbored multiple zinc finger motifs, while only SCAND1 was a SCAND protein. Gene ontology analyses of the MZF1 interaction network pointed to a major role in gene transcription as well as in the regulation of other biosynthetic processes ( Figure 1B).
To identify those interactors, which form a complex with the MZF1 SCAN domain, we employed the PRISM approach. As a target ensemble for PRISM, we used the structures of each of the Frontiers in Molecular Biosciences | www.frontiersin.org MZF1 SCAN domain monomers (PDB entry 2FI2, chain A and chain B). Moreover, we used the interaction partners, from I2D, for which at least one experimental structure was available in the PDB. We also included two additional SCAN domain proteins, the Zfp206 and Peg3 [PDB entries 4E6S  chain A and 4BHX (Rimsa et al., 2013) chain A, respectively] to further investigate the role of SCAN-mediated heterodimerization. For analyses, we retained only those complexes with a predicted docking energy lower than −2.39 kcal/mol (Figure 2).
The PRISM analyses suggest that all the SCAN domains, and MZF1 especially, can form homo-and hetero-dimers with other SCAN domains with a very low predicted docking energy (< −33.46 kcal/mol; Figure 2E). These results are confirmed by the fact that the docking energy for MZF1 homo-dimerization, for which the structure is known (PDB entry 2FI2), is within the same range of other homo-and hetero-dimeric complexes of SCAN domains predicted by PRISM. All the predicted homoand hetero-complexes of SCAN domains have similar interaction interfaces mainly comprising residues in the α-helices 2, 3, and 5 (residues 58-73, 80-95, 112-123 from PDB entry 2FI2). Residues known to contribute to MZF1 homo-dimerization are also conserved at the interface of other SCAN-SCAN complexes predicted by PRISM, such as F47, R48, Y52, P58, A61, L65, R66, W72, L73, P75, K82, L84, V88, Q93, P97 (2FI2 numbering, Figure 2). Interactions with the Cdk4 kinase, CD14, and the Nfyc transcription factor are also predicted within the significance threshold, suggesting an even broader network of interactions and diversity in signaling and expression regulation, promoted by the SCAN module.

Expression Levels of MZF1 and Its Interactors in 24 Cancer Studies from the Cancer Genome Atlas (TCGA)
To evaluate the relationship between the MZF1 network (i.e., MZF1 and its interactors) and different types of cancer, we analyzed genomic profiling data from 10173 samples stratifying into 24 different cancer studies-deposited in TCGA (Table S1).
MZF1 is expressed at an overall higher or lower level in tumor vs. normal samples in 19 out of the 24 cancer studies that we analyzed, strongly supporting its crucial but diverse role in cancer (Figure 3, Figure S1 and Table S1). The majority of cases are characterized by higher levels of MZF1 within tumors, with the exception of Adrenal Gland and KICH/KIRC Kidney tumors, in which MZF1 has lower expression levels in tumor compared to normal samples. In many cancer types we observed a high variability in the expression levels of MZF1, emphasizing the importance of using paired tumor-normal samples in the comparisons. We then investigated if any correlation was observed between MZF1 changes and changes in the expression levels of its interactors. We observed as a common feature that changes in CDK4 and ZNF688 positively correlate (red dotted lines in the Figure 3) with changes in MZF1 (Figure 3). MZF1 has a predicted phosphorylation site for CDK kinases in its Nterminal region (Ser8) and we thus hypothesize that the MZF1 SCAN domain acts as a docking site for CDK4 kinase, which is in turn able to recruit the MZF1 disordered N-terminal tail for phosphorylation. The CDK4 binding site on MZF1 predicted by PRISM (Figure 2) partially overlaps with the SCAN-SCAN dimerization interface and thus the binding to the kinase would compete with the dimer formation according to our models.
We also noticed that the correlations between expression levels of MZF1 and the other SCAN domains vary from one cancer type to the other, unveiling a complex network of interactions and diverse cellular signaling that can be elicited in different cancers. The SCAND-only SCAND1 protein has been experimentally reported to interact with MZF1 and its expression levels and MZF1 expression levels are tightly correlated in the majority of the cancer types according to our analyses (Figure 3). SCAND1 is thus likely to act as a regulator of MZF1 activity, in agreement with the hypothesis that SCAND1 could decrease MZF1-mediated signaling by altering its affinity for DNA targets (12). In certain cancer types, when MZF1 level increases, SCAND1 is also up-regulated, whereas in other cancer types they are inversely correlated, i.e., MZF1 levels increase but SCAND1 expression decreases compared to the normal samples. The latter scenario occurs in KIRC, LUSC, PRAD, and STAD cancer studies annotated in TCGA (Table S1, Figure 3). In these cases, the regulatory function of SCAND1 on MZF1 activity is likely to be lost altering the downstream effects mediated by this transcription factor. The same heterogeneity in different cancer types is observed for the other SCAN-containing proteins.

Sequence-Based Prediction of the Effects Induced by Mutations in the MZF1 SCAN Domain
Twenty three cancer-related mutations located within the MZF1 SCAN domain were identified through the analyses of various cancer databases (see Section Materials and Methods) along with 21 mutations which have not been associated with cancer so far (Table S3).
We subsequently employed both sequence-and structurebased methods to predict the impact of these mutations on function, stability and protein-protein interactions.
We integrated seven sequence-based methods mainly based on evolutionary information to assess the pathogenic potential of the mutations in the MZF1 SCAN domain. The methods employed were in agreement for most of the mutations even if a complete consensus could only be identified for nine mutations (Figure 4). MutPred, Provean PON-P2, Polyphen 2, and Mutation Assessor predictors yielded similar results, as illustrated by the hierarchical clustering reported in Figure 4. A subset of predicted deleterious amino acidic substitutions included E41K, R44G, F47S, R48L, Y52H, G57W, P58L, C69Y, R74C, S79P, Q82R, P97T, and I100N. In addition, most methods (five out of seven) identified another group of substitutions with neutral effects on protein function i.e., P40S, R51C, R51H, R70H, R78H, G94S, A102T, R103C, R103Q, R108L, R124Q, P126L, P126T, G127S, and G128R. MutPred also predicted with significant p-values (p < 0.05) a gain or a loss of Post-Translational Modifications (PTMs) upon certain mutations, such as gain of glycosylation  in E41K (assuming that the lysine is hydroxylated), gain of phosphorylation at F47S and loss of methylation at R48 and R51.

Effects of MZF1 Mutations on Structural Stability and Dimer Formation of the SCAN Domain
Due to the intrinsic limitations in sequence-based methods to predict mutational impact, we turned our attention to structure-based methods. Specifically, we carried out an in silico saturation mutagenesis scan based on the FoldX energy function using the whole NMR conformational ensemble of the MZF1 SCAN deposited in the PDB (Figure 5). The FoldX energy function provides a quantitative description of the intermolecular interactions that stabilize a protein in order to predict the change in thermodynamic folding stability or in the free energy of protein complex formation ( G) with respect to the wild type. The saturation scan allowed us to assess the impact of the mutations on protein structural stability (the scan is performed on the MZF1 monomers) and local effects influencing the interface for monomer-monomer interaction (when the scan is performed on the MZF1 dimer). Negative G values indicate variants that are more stable than the wild type, whereas positive values indicate that the mutant variants are less stable than the wild type MZF1 protein. Thus, mutant variants with G >0 upon the monomer scan have a higher population of (partially) unfolded structures that are prone to aggregation, misfolding or degradation. Mutant variants with G >0 upon the proteinprotein binding scan have a decreased monomer-monomer binding affinity. The high-throughput mutation scan allowed us not only to assess the impact of MZF1 mutations collected from the mutation databases, but also to predict the effect of any other possible amino-acid substitution (Figures 5A-D). Thus, the full data set provides a valuable source of information such as (i) the predicted impact of MZF1 mutations potentially identified in future studies related to disease and (ii) critical hotspots for protein structural stability and/or protein-protein binding. The data sets comprised 1786 and 3572 mutated variants for monomer and dimer scans, respectively. The distribution of G values for the data sets ( Figure 5E) is similar to those of other proteins with different folds investigated using the same pipeline (Papaleo et al., 2014a;Mathiassen et al., 2015). The distribution of G values for complex formation turned out to be much narrower with values rarely higher than 5 kcal/mol. This observation may be partially explained by the fact that the FoldX energy function is only capable of accounting for local effects of mutations. Indeed, as there are no major conformational changes of the backbone during FoldX calculations, the method will neglect the contribution of mutations located at distal sites with respect to the monomer-monomer interface. The heatmaps depicting the results of the saturation scan are reported in Figures 5A-D, while the analyses of the MZF1 mutations reported in online databases (Table S3) are shown in Figure 5F. Based on Figure 5F, the only substitutions which did not affect protein stability but had a marked impact on the dimer formation were P58L and Y52H. These two sites are predicted to be sensitive to any amino acid substitution at the interface, as observed from the heatmap (Figures 5C,D). A proline at position 58, due to the intrinsic rigidity of its side chain, is crucial for the proper orientation of the upstream turn motif. This motif allows a glutamate (E54) within one of the monomers (blue in Figure 5G) to participate in an electrostatic intermolecular network with the residues R123 and D120 of the other monomer (orange in Figure 5G). Our scan also pointed to a subset of mutations with detrimental effects on both protein stability and the formation of the dimeric complex e.g., C69Y, P112R, and G57W. The majority of the remaining deleterious mutations including F47S, I100, L86V, P97S, P40S, R44G, R44H, and R74C mainly had an impact on protein stability. The deleterious amino acidic substitutions mentioned above were also located in sensitive structural hotspots where most substitutions are not tolerated, as seen by the heatmap (Figures 5A,B).
To better appreciate the effects of these mutations on protein stability, we also estimated a G of unfolding (∆Gu mutant ) of the mutant variants reported in Table S3, as explained in details in the Materials and Methods. The ∆Gu mutant values are also reported in Table S3. These data could be useful for comparison with future experimental determination of changes in the free energies of folding/unfolding upon mutation of MZF1 and overall confirm the scenario described above.
Moreover, we noticed that the mutations predicted to be damaging for protein structure and function had previously been identified in cancer studies, as well as in databases that contain mutations not yet classified for their clinical relevance (Table S3). Our findings suggest that these unclassified mutations could exert damaging effects on MZF1, highlighting the need for further experimental studies and greater efforts in genomic profiling of cancer patients. Collectively, our results demonstrate the utility of structure-based computational tools to discriminate between substitutions more likely to be cancer drivers and those exerting only neutral effects. F47S, G57W, and C69Y are of particular interest as these were predicted to affect protein stability or dimerization interface and have previously been identified in patients with bladder cancer, kidney cancer, colorectal cancer, and in established cancer cell lines (Table S3).

Validation of MZF1 MD Ensembles with NMR Data
One of the limitations of the high-throughput saturation mutagenesis performed above is that only local structural changes can be modeled. Thus, the information on structural changes promoted by distal sites are lacking. To overcome this issue, we carried out microsecond all-atom explicit solvent MD simulations using state-of-the-art force fields. We accounted for differences in the formulation of the physical models employed in MD on the simulated dynamics, using five different force fields, i.e., CHARMM22 * , CHARMM27, Amber99-SB * -ILDN, Amber99-SB-NMR-ILDN, and RSFF1. Even minor changes in the torsional potential of protein backbone and side chains in the force fields have a major impact on the dynamics and structure described by MD simulations (Guvench and MacKerell, 2008;Lange et al., 2010;Lindorff-Larsen et al., 2012;Martín-García et al., 2015;Tiberti et al., 2015b;Unan et al., 2015) and the structural consequences upon selecting a certain set of force field parameters are hard to predict.
The MZF1 SCAN domain has been studied by NMR spectroscopy resulting in a number of high quality probes of protein dynamics in solution, such as backbone and sidechain chemical shifts, as well as short-and long-range NOEs (Peterson et al., 2006). These data are a valuable resource for experimental validation of the conformational ensembles collected in our MD simulations as well as for comparison of MD force fields (Guvench and MacKerell, 2008;Lange et al., 2010;Beauchamp et al., 2012;Best et al., 2012;Dror et al., 2012;Lindorff-Larsen et al., 2012;Piana et al., 2014;Papaleo et al., 2014b;Henriques et al., 2015;Martín-García et al., 2015;Papaleo, 2015;Unan et al., 2015;Tiberti et al., 2015b). Thus, we calculated these NMR parameters from each MD ensemble and compared them to the experimental values.
Backbone and side-chain chemical shifts are known to report on motions occurring on a heterogeneous range of time scales (Robustelli et al., 2012;Case, 2013;Palmer, 2015). The calculated chemical shifts from our simulations are in good agreement with the experimental values as it can be appreciated by the low χ 2 values (Figure 6, Figures S2, S3). They converge rather quickly during the simulation time (after ∼100-200 ns). NOEs can be used to report on either secondary structure (short-range NOEs) or tertiary contacts (long-range NOEs). Also in this case we did not observe remarkable deviation with respect to the original NMR ensemble, suggesting that the selected MD force fields describe with good accuracy the structure and dynamics of MZF1 (Figure 7). FIGURE 7 | Density plot of the average distance difference of NOE pairs between MD simulations and measured NOEs. For all the panels, the distances were averaged over the trajectory and subtracted from the measured NOE (see Section Materials and Methods). The distance differences between the all measured NOEs (A) are very similar for all of the force fields (black; NMR conformers, blue; Amber-ff99SB-NMR-ILDN, green; Amber-ff99SB*-ILDN, red; CHARMM22*, cyan; CHARMM27, magenta; RSFF1) as well as for the 20 NMR conformers (PDB:2fi2). Noticeable is a slightly lower average distance between the NOE pairs compared to the experimentally obtained values. The similarity between the force fields is observed when plotting only long range NOEs (B), short range NOEs, (C) as well as the intermolecular NOEs (D).
The similarity between the different MD ensembles was also highlighted by the overlap between the first 20 principal components of the covariance matrix of Cα atomic fluctuations estimated in terms of Root Mean Square Inner Product (RMSIP; Hess, 2002). RMSIP was continuously higher than 0.85 for any pair-wise comparison of the MZF1 MD ensembles, suggesting a high overlap between the conformational spaces described by our simulations.
The prediction of structural resolution (R) for the MD ensemble (Figure 8) allowed us to better discriminate between the different force fields. We observed a shift to lower R values and thus higher structural quality in MD ensemble with the CHARMM force-field family. The predicted R-value quantitatively accounts for different structural quality parameters (Berjanskii et al., 2012) such as population of χ 1 sidechain dihedral angles, side-chain rotamers that lie outside the distribution described by the penultimate rotamer library (Lovell et al., 2000), outliers in the Ramachandran plot, packing of the protein core, hydrogen-bond networks and atomic clashes. Thus, the predicted R-value is sensitive toward different structural properties with respect to chemical shifts and NOEs and is a powerful complementary metrics to evaluate MD ensembles.

MZF1 Cancer-Related Residues Are Hubs in the Protein Structure Network and Mediate Long-Range Communication
A protein structure network (PSN) approach (Di Paola et al., 2013;Papaleo, 2015;Papaleo et al., 2016) was applied in the analyses of MZF1 molecular dynamics, as detailed in the Materials and Methods Section. The PSN method employs the graph formalism to identify a network of interacting residues in a given protein from the number of non-covalent contacts in the protein.
Two main properties of a PSN are the hub residues, i.e., residues that are highly connected within the network and the connected components, i.e., clusters of residues which are inter-connected but do not interact with residues in other clusters.
We evaluated the convergence of these two properties in our MD simulations by comparing their distribution in the average PSN and in PSNs generated by the shortest subsets of the trajectories (see Section Materials and Methods; Figures S4, S5). The analyses were carried out using two different distance cutoffs (0.5 and 0.55 nm) for PSN edge definition. We observed that if the distance cutoff is higher than 0.55 nm most of the protein residues are grouped in the same connected component ( Figure S4), suggesting that this value is too high to achieve a proper network description of the system and that 0.5 nm is a suitable cutoff for the PSN analyses of the MZF1 MD ensembles. The result is force field independent since we observed the same behavior in all the MD simulations, i.e., a loss in the number of elements in the connected components with indexes higher than one.
The connection degree of the hub residues in MZF1 is not higher than five in all the simulations and the hub distribution profiles are also independent on the force field used for the simulations, especially when force fields of the same family are compared ( Figure S5), such as AMBER or CHARMM. Hubs in the networks have the important role of shortening the "communication" between distal nodes, and thus, they can play a crucial role in mediating structural effects over long distances in the protein structure. Furthermore, hubs preserve robustness of the network. Changes in the small degree nodes should not have a marked effect on the network integrity whereas, if important hubs are affected (especially the largest one), the network integrity can be easily compromised. We therefore evaluated if each of the mutation sites investigated so far constitutes a hub in the MD-derived PSNs of MZF1 (Figure 9). The MZF1 hub residues are generally located at the interface between the two monomers, highlighting the importance of the dimer architecture for the function of SCAN domains. Moreover, MZF1 hub residues are enriched in arginine or proline residues and mutation sites such as P112, P58, C69, R44, R74, and F47 have been identified as hub residues in the MZF1 simulations and are among the mutations that are predicted to affect the monomer stability or the protein-protein interface ( Figure 5E). Of these residues, P112, P58, and C69 are not only the residues with more deleterious effects in terms of free energy of binding and stability but also hubs with the highest connection degree in the PSN.
We then calculated the communication paths between the MZF1 mutation sites and the interface for SCAN dimerization in the CHARMM27 simulation that is the MD ensemble associated to the highest predicted structural resolution (Figure 8). We selected the residues (Figures 5C,D) that encompass the highest impact on binding free energies (Y52, P58, L62, C69, L84, L87, and A95) as interface residues for the path calculation according to the saturation mutagenesis scan in silico. The mutation sites involved in distal communication to the SCAN dimerization interface are mainly proline (P58, P112, and P97) and arginine sites (R124, R44, R66, and R70) along with C69. C69 is central in the MZF1 network and has a key role in the path communications over long distances since it allows the SCAN dimerization region to mediate effects over long distances to solvent exposed sites, which might work as recruitment point for other cofactors (Figure 10). Indeed, C69 allows the dimerization interface to communicate to three different surface hotspots where R66 (Figure 10A), E41 (Figure 10B), and a cluster of charged residues (E54, D120, R123, and R124, Figures 10C,D) are located. The shortest paths that transmit structural effects from the SCAN dimerization interface to these surface hotspots are involving inter-molecular contacts between the two monomers emphasizing the importance of a properly folded dimeric SCAN domain for MZF1 functionality. The intermediate or the end nodes of the paths mediated by C69 are also known mutation sites of MZF1, such as R48 which is mutated in head and neck cancer, E41 in bladder cancer and F47 that is mutated in different cancer types.

CONCLUDING REMARKS
We integrated a plethora of different computational methods to unveil the role of MZF1 alterations in different cancer types. In particular, we focused on the SCAN domain, which is an important building block for protein-protein interaction in transcription factors. A focus on SCAN domains in cancer is of notable interest if we consider that both SCANonly regulatory proteins and a SCAN-only short isoform of FIGURE 9 | Hub localization on MZF1 dimeric structure upon PSN analyses of the MD simulations. Since in a PSN a hub is defined as a residue connected by at least three edges, all the residues with a degree lower than three are set at zero. The structure is depicted as ribbon with rainbow shades of colors from blue to red according to the node degree. The MZF1 residues for which mutations have been collected from different databases reported in Table S2 are   mzf1 with disordered N-and C-terminal tails are known to play major roles in cancer biology. The SCAN domain of MZF1, and SCAN domains in general, can serve as a powerful combinatorial network for regulation of gene expression since they can provide a great diversity in the signaling thanks to their heterodimeric complexes, complexes with other signaling proteins and further regulation by posttranslational modifications as Cdk4-mediated phosphorylation predicted here.
Our results show a complex and heterogeneous role of MZF1 and its interaction network in cancer. The alterationsand mutational landscape of MZF1 is strongly cancer-type dependent. We observed a marked variability in the levels of mzf1 among patients with same cancer types. The same observation holds for several of the MZF1 interactors. Our results highlight the need for more studies on MZF1 alterations in cancer with a specific focus on different cancer subtypes and other available clinical data. The integration of information available about MZF1 biological partners in the same cancer studies is also crucial since these proteins can exert marked effects, contributing to reshape and modulate the MZF1-mediated effects in the cell.
Our work provides a computational framework that allows to bridge global changes, such as changes in the expression levels or mutations of a gene and its interactors, to a detailed atomistic and structural understanding of the effects induced by these alterations. Moreover, since we performed the mutation scan at a high-throughput level, we had the possibility to assess the impact of the cancer mutations and of any other possible amino acidic substitutions. These data sets provide a valuable source of information to predict the impact of MZF1 mutations that will be identified in future studies related to disease. The most deleterious cancer-mutations are in hotspots where all mutations would not be tolerated suggesting that any modifications at these sites could harbor a pathogenic effect. In a broader context, this also suggests that other similarly critical hotspots (where no mutations are tolerated) are likely candidates for cancerrelated mutations and structural analyses. The ones presented here are a valuable addition to the molecular characterization of the mutational landscape of oncogenes or tumor suppressors. We could also identify a subset of cancer passenger mutations, i.e., mutations found in cancer patients but unlikely to have any major impact on cancer pathways since these have neutral effects on structure, interactions and protein dynamics.
However, caution has to be taken since the energy function employed here can only predict local effects and will neglect all the long-range contributions that can perturb in a cascade of conformational changes to very distal sites. To overcome this issue, the integration of experimentally-validated molecular dynamics, to better account for conformational heterogeneity of the protein, and graph analyses, to identify paths of communication between distal sites, can help to achieve a more complete description of the intricate structural mutational landscape of cancer-related proteins.
In particular, our study pointed out a major role of a cysteine residue (C69) at the biological interface for protein-protein interactions mediated by the SCAN domain of MZF1. Alterations at this site have only been reported in cancer cell lines so far, but our predictions suggest that it can be a critical diseaserelated hotspot. Indeed, it is at the cross-road between multiple key paths of communication from the biological interface for protein-protein interactions to other hotspots on the MZF1 surface which can act as recruitment sites for other biological partners.

ACKNOWLEDGMENTS
This work was supported by the ISCRA-CINECA Grants HP10C2TOOC (NetDyn) and HP10CWP9KW (ALLO-PCM), as well as by Innovation Fund Denmark PhD Scholarship to EP group. VS and MD have been supported by the Erasmus plus traineeship initiative 2015-2016 in EP group. The work was also supported by grants from the Danish National Research Foundation (DNRF125) and the European Research Council (AdG 340751) to MJ. The authors would like to thank Wouter Boomsma and Antonio Colaprico for fruitful discussion and comments and Maria Francesca Allega for careful proof-reading of the manuscript.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fmolb. 2016.00078/full#supplementary-material