Prediction of Protein Mutational Free Energy: Benchmark and Sampling Improvements Increase Classification Accuracy

Software to predict the change in protein stability upon point mutation is a valuable tool for a number of biotechnological and scientific problems. To facilitate the development of such software and provide easy access to the available experimental data, the ProTherm database was created. Biases in the methods and types of information collected has led to disparity in the types of mutations for which experimental data is available. For example, mutations to alanine are hugely overrepresented whereas those involving charged residues, especially from one charged residue to another, are underrepresented. ProTherm subsets created as benchmark sets that do not account for this often underrepresent tense certain mutational types. This issue introduces systematic biases into previously published protocols’ ability to accurately predict the change in folding energy on these classes of mutations. To resolve this issue, we have generated a new benchmark set with these problems corrected. We have then used the benchmark set to test a number of improvements to the point mutation energetics tools in the Rosetta software suite.


INTRODUCTION
The ability to accurately predict the stability of a protein upon mutation is important for numerous problems in protein engineering and medicine including stabilization and activity optimization of biologic drugs.To perform this task a number of strategies and force fields have been developed, including those that perform exclusively on sequence (Casadio et al., 1995;Capriotti et al., 2005;Kumar et al., 2009) as well as those that involve sophisticated physical force fields both knowledge based (Sippl, 1995;Gilis and Rooman, 1996;Potapov et al., 2009), physical models (Pitera and Kollman, 2000;Pokala and Handel, 2005;Benedix et al., 2009), and hybrids (Pitera and Kollman, 2000;Guerois et al., 2002;Kellogg et al., 2011;Jia et al., 2015;Park et al., 2016;Quan et al., 2016).
To facilitate the development of these methodologies and provide easy access to the available experimental information the ProTherm database (Uedaira et al., 2002) was developed.This database collects thermodynamic information on a large number of protein mutations and makes it available in an easy to access format.At the time of this writing it contains 26,045 entries.
Due to its ease of access the ProTherm has served as the starting point for a number of benchmark sets used to validate different stability prediction software packages, including those in the Rosetta software suite.However significant biases exist in the representation of different types and classes of mutations in the ProTherm, as it is derived from the existing literature across many types of proteins and mutations.The most obvious example of this is the large number of entries involving a mutation from a native residue to alanine as making this type of mutation is a common technique used to find residues important for protein function.Therefore a large number of the benchmark sets derived from the ProTherm, which did not account for this bias, have significantly under or overrepresented these classes of mutations.These findings suggest previous reports on the accuracy of stability prediction software does not accurately reflect these tools' ability to predict stability changes across all classes of mutations.
To address this issue we have generated a novel benchmark subset which accounts for this bias in the ProTherm database (Supplementary Table 1).We then used this benchmark set to validate and improve upon an existing free energy of mutation tool within the Rosetta software suite, "Cartesian G, " first described in Park et al. (2016).

RESULTS
In order to benchmark our Rosetta-based stability prediction tools we classified the possible mutations into 17 individual categories as well as reported results on four aggregate categories.We analyzed five previously published benchmark sets to determine their coverage across the different classes of mutations and found them inadequate in a number of categories, especially involving charged residues (Figures 1A-E).For example, the number of data points for mutational types ranged from 0 to 24 for negative to positive, 0 to 50 for positive to negative, 3 to 28 for hydrophobic to negative, and 3 to 44 for hydrophobic to positive entries across the benchmark sets tested.Mutations to and/or from hydrophobic residues dominated the benchmark sets ranging from 75 to 92% of the total entries.
To compare the composition of these benchmark sets to that of the database we examined the curated ProTherm (ProTherm * ) provided by Ó Conchúir et al. (2015) 1 which is a selection of entries containing only mutations which occur on a single chain and provide experimental G values (Supplementary Table 2).We find that significant biases still exist here, with several categories having fewer than 50 unique mutations.These include: positive to negative, 42; hydrophobic to negative, 43; and noncharged polar to positive, 47.Mutations involving hydrophobic residues are still overrepresented, with 62.2% of all mutations in the database being mutations to hydrophobic residues, compared to the expected 39.8% if mutations from the starting structures were chosen randomly (Figures 1F-G).
We also analyzed the benchmark sets with respect to the number of buried vs. exposed residues in the data sets.No large 1 https://guybrush.ucsf.edu/benchmarks/benchmarks/DDGbiases were observed.All benchmark sets were within 6% of what would be expected if mutations were random (data not shown).
To sample more broadly across all types of mutations and remove sources of bias in our algorithm development we created a new benchmark set of single mutations that are more balanced across mutational types and avoid other biases.To generate this set we performed the following operations: (1) Removed any entries from the curated ProTherm * that occur on the interface of a protein complex or interact with ligands-the energetics of these mutations would include intra-protein and inter-molecular interactions that would alter the desired intra-protein energetics of a free energy calculation.
(2) We removed entries of identical mutation on similarsequence (>60%) backbones.For mutations occurring at the same position in similar sequences, if the mutation is identical (e.g., L → I) and the sequence identity >60%, then that mutation is included only once in the database; if the mutation is not identical (L → I in one protein and L → Q in another) then the mutation is included.(3) We populated each mutation category, excluding small to large, large to small, buried, and surface, with 50 entries except for the cases where insufficient experimental data points exist.Statistics on the excluded categories were derived from data points that were already present in the other categories.(4) When multiple experimental values (including identical mutations as identified in point 2 above) were available we chose the G value taken at the pH closest to 7.
The resulting benchmark set contains 767 entries across a range of different types and classes of mutations (Figure 2).This constitutes a reduction from the 2,971 total entries in the curated ProTherm * , with mutations to hydrophobics being the most frequently being eliminated.This reduction does eliminate potentially useful data, and introduces a slight bias toward solvent exposed residues: 66% of the mutations being on residues with greater than 20% solvent exposed surface area compared to 54% if chosen randomly.This change is useful to reduce bias toward favoring hydrophobic mutations and has been controlled for by checking our algorithm's performance when residues are classified by burial.
We tested Protocol 3 described in Kellogg et al. (2011) on this benchmark set (Table 1).To assess method quality, we analyzed prediction power by a number of different methods including Pearson's R, Predictive Index (Pearlman and Charifson, 2001), and Matthews Correlation Coefficient (MCC) (Matthews, 1975).We also analyzed classification errors instead of correlation.A mutation is classified as stabilizing if the change in free energy is ≤-1 kcal/mol, it is classified as destabilizing if the change is ≥1 kcal/mol, and neutral if it falls between these values.Each mutation is assigned a value of 0 for destabilizing, 1 for neutral, and 2 for stabilizing.We then scored each entry by taking the absolute value of the difference between the value for the experiment and the prediction.A value of 0 indicates the prediction was correct, 1 indicates the prediction  was moderately incorrect, i.e., the mutation is destabilizing and the prediction was neutral, and 2 indicates the prediction was egregiously wrong.
To address some metrics on which Protocol 3 performs poorly, we were interested in using a more modern Rosetta G protocol, Cartesian G, first briefly described in Park et al. (2016).We refactored the Cartesian G code to utilize the Mover framework described in Leaver-Fay et al. (2011), keeping the underlying science the same (other than changes highlighted here) while eliminating bugs and improving efficiency and modifiability (Supplementary Table 3).
We tested changes to the preparation phase of the protocol relative to what was used in Park et al. (2016; Figure 3).To improve the preparation step (step 1), we tested Cartesian Relax (as opposed to traditional torsion space Relax, used in Kellogg's Protocol 3) (Kellogg et al., 2011) both with and without all atom constraints and found that Pearson's R correlations were worse when models were prepared without constraints but the Predictive Index and MCC improved.The variability between runs also dropped when models were prepared without constraints.The biggest impact was on the classification of mutations, however, with the number of egregiously wrong predictions falling from 34.00 ± 2.0 to 24.67 ± 0.6 (Supplementary Table 2).This likely has to do with the use of Cartesian minimization during step 4, and the importance of preparing a structure with similar sampling methods to those used during mutational energy evaluation.We consider the MCC and Predictive Index improvements more valuable and thus recommend model preparation without all atom constraints.We also examined a potential runtime improvement for Cartesian G.In Park et al. (2016), the final energy for a mutation is the average of three replicates.We examined a multirun convergence criterion, described in further detail below, and settled on the convergence criterion method due to its equivalent accuracy with reduced run time.
Finally, we tested adding increased backbone sampling around residues that are being mutated to or from proline, which had no impact on the Pearson's R, Predictive Index, and MCC, but reduced the number of egregious errors slightly from 25.33 ± 0.6 to 24.67 ± 0.6 (Supplementary Table 3).
This updated Cartesian G algorithm has improved performance overall when compared to Protocol 3, especially in the ability to accurately classify mutations including the large reduction of egregious errors in classification (Tables 1, 2).For example, the number of mutations predicted as stabilizing when they are destabilizing or vice versa fell from an average of 53 with Protocol 3 to an average of 24.6 across three replicates."Off by 1" errors are also lower (317.3 vs. 289.3)(Supplementary Table 2).This trend is much stronger than the improvement in correlations, and more importantly reflects the practical value of correctly classifying mutational categories.For example in protein engineering, a protein designer's practical interest is whether any given mutation is stabilizing at all, more than which of two mutations is more stabilizing.
The overall level of accurate classification predictions increases from 51.7 to 59.1% from the Protocol 3 to Cartesian This table contains the Pearson's R correlations for each class of mutations in our benchmark set for both Protocol 3 and Cartesian G.Each is repeated three times using the same inputs and the average and standard deviation are shown.Given the sensitivity to outliers of Pearson's R we also report it as Pearson's R filtered after removing up to five outliers from each set.An outlier is defined as any single entry which, when removed, changes the correlation coefficient by 0.025 or greater.We also report the Predictive Index and the Matthews Correlation Coefficient which are less sensitive to the absolute free energy of a prediction but rather whether it can be correctly classified.Cartesian G significantly outperforms Protocol 3 both in the unfiltered Pearson's R, Predictive Index, and Matthews Correlation Coefficient.In each analysis metric the higher value indicates greater accuracy.
G Rosetta methods.We also note that over all charged residues the category predictions accuracy was 47.9% for protocol 3 and increases to over 60.5% with Cartesian G.The Cartesian G algorithm is more broadly useful across any type of protein mutation, while Protocol 3 had uneven applicability.

DISCUSSION
Here we describe a number of issues in previous benchmark sets used to assess the quality of protein stability prediction software.In particular we have found a lack of adequate experimental data being included for mutations involving charged residues.
Using these updated benchmarks we show that protein stability prediction tools in Rosetta vary widely across different types of mutation classes.In addition, given that this problem is pervasive throughout the field, it is likely that the reported accuracy of many methods for stability prediction may not reflect the diversity of possible mutation types.We encourage other developers to analyze the performance of their tools across different types of mutations using our benchmark set or one which has appropriately accounted for the biases that exist within the databases (Supplementary Table 1).The reduced size of this data set may also be useful for rapid training or situations with computational limitations.
Last, we have refactored the Cartesian G protocol code to improve consistency and modifiability, and have also made minor modifications to the structure preparation and analysis step as well as to how mutations involving proline are sampled.By analyzing these algorithms with the new benchmark set, and This table shows the ability of Protocol 3 and Cartesian G to correctly classify a mutation.Mutations are assigned a value of 0 for destabilizing, 1 for neutral, and 2 for stabilizing.The absolute value of the difference between the predicted class and the experimental class represents no error, mild error (off by one class), or egregious error (off by two classes).Performance across the benchmark is reported here as a percentage of mutations in each class.Cartesian G correctly classifies more entries (59.1 vs. 51.7%),and produces fewer catastrophic off by two errors (3.2 vs. 6.9%).
focusing on previously underrepresented categories of mutations (e.g., uncharged to charged), we are able to demonstrate the Cartesian G algorithm has improved correlation to experimental values and improved ability to correctly classify (stabilizing/destabilizing/neutral) a mutation relative to the older Protocol 3 methodology.These results show the importance of diverse datasets in algorithm benchmarking, and the need to look beyond the surface when analyzing the results of these algorithms.

Benchmark Set Pruning
To create our benchmark set, we began by making a copy of the curated ProTherm * database (Ó Conchúir et al., 2015) and began removing entries that were unsuitable.Because we wished to develop a point mutation algorithm without the complexities of multiple mutation interactions, we excluded any entry which did not represent a single mutation.Because the algorithm is intended to represent G of monomer folding and not binding interactions, we also removed entries on the interface of a protein-protein complex, or interacting with a non-water ligand.Interactions were defined as any atom in the mutated residue within 5 Å of an atom not on the same chain.To increase experimental diversity, we wished to remove duplicate mutations.
To identify duplicates, we performed an all to all sequence alignment to find parent backbones with ≥60% sequence identity.Within these clusters of sequences, any entries in which the same native residue is mutated to the same target were treated as identical.When multiple experimental G values were available for an identical mutation we chose the value taken at closer to neutral pH.

Benchmark Category Population
We identified 21 categories of mutation type by combinations from nine residue type classifications (Table 3).We then populated each narrowly defined category (e.g., polar noncharged to negative) with up to 50 entries.Some categories (large to small, small to large, buried, and surface) are supersets of the more narrowly defined categories and were sufficiently populated by the experiments selected from the other groups.
A few categories involving charged residues (positive to negative, negative to positive, non-charged polar to positive, hydrophobic to negative, hydrophobic to positive, and like to like charge) did not have enough data to hit 50 entries so every available unique experiment was added.

G Prediction
To prepare models for G calculations, structures were stripped to only the chain in which the mutation occurs.Rosetta local refinement, consisting of alternating cycles of side chain packing and all atom minimization ("Relax"), was then performed 20 times on the chain of interest and the model with the lowest Rosetta energy was selected as input.As noted in the Results, this was done without all atom constraints and in Cartesian space, not torsional space.
G predictions were then performed using Protocol 3 described in Kellogg et al. (2011), the version of the Cartesian G application described originally in Park et al. (2016), or the refactored and improved version of Cartesian G elaborated upon here.
To provide context for our modifications, a brief description of the Cartesian G protocol as presented in Park et al. (2016) is warranted.Cartesian G calculates the change in folding energy upon mutation by taking the prepared starting structures, then mutating the target residue.This residue and its neighbors within 6 Å are then repacked.After repacking the mutated residue, the side chain atoms of residues within 6 Å of the target residue and the side chain and backbone atoms of sequence-adjacent residues are minimized in Cartesian space.The same optimization, without the change in sequence, is done on the starting structure to determine the baseline energy.The process is performed three times for both the mutant and the wild type sequence and the G is calculated from the average of each.There is no particularly different handling of mutations involving proline.
Our modifications to the Cartesian G tool include a change to the analysis and a change to proline handling (Figure 2).In the analysis step, we changed the number of mutant models generated using the following convergence criterion: the lowest energy 2 structures must converge to within 1 Rosetta Energy Unit, or take the best of 5 models, whichever comes first.In either case the lowest, not the average, energy is used.In order to address changes in the backbone resulting from mutations to and from proline we added additional fragment based sampling around mutations involving proline.By default 30 fragments of 5 residues in length, centered on the mutation, are sampled and the best scoring structure is carried forward for analysis.This uses the Cartesian Sampler system described in Wang et al. (2016).Command line flags and XML files can be found in Supplementary Material.

FIGURE 1 |
FIGURE 1 | This figure shows the population of different mutation classes used to benchmark a number of methods ability to predict the change in free energy upon mutation.The citations for these benchmark sets are as follows: (A) Guerois et al. (2002), (B) Potapov et al. (2009), (C) Kellogg et al. (2011), (D) Jia et al. (2015), (E) Quan et al. (2016), curated ProTherm* (F) Ó Conchúir et al. (2015).The probability of these classifications occurring given the amino acid composition of the structures in the Curated ProTherm* database are shown in (G).Classes involving charged residues are colored in red.All data sets are significantly biased in their types of mutations present, especially when it comes to mutations to hydrophobics.All data sets contain greater than 27% hydrophobic to hydrophobic mutations vs. the expected 18.4% (G).

FIGURE 2 |
FIGURE 2 | Categorically Balanced Benchmark mutational category statistics.This figure shows the metrics of our new benchmark set selected to provide a more balanced representation of different mutation classifications.Classes involving charged residues are shown in red.

FIGURE 3 |
FIGURE 3 | Diagram of Protocol 3 and Cartesian G.This figure diagrams the steps involved in the older Protocol 3 as well as the Cartesian G protocol.Novel changes described in this paper include the removal of constraints during step 1 of the Cartesian G protocol, the addition of Step 2.1 for mutations involving proline as well as the choice to repeat testing until the protocol converges on a lower energy score instead of a fixed number (3) of times.

TABLE 1 |
Correlations and Predictive Index for Protocol 3 and our improved Cartesian G across different mutation categories.

TABLE 2 |
The ability of Protocol 3 and Cartesian G to correctly classify mutations.

TABLE 3 |
Residue category assignments and category combinations.On the left, we list the nine residue groupings considered in this benchmark and annotate which residues go in each class.At center and right, we list the 19 mutational types considered by combining these classes.