Computing formation enthalpies through an explainable machine learning method: the case of lanthanide orthophosphate solid solutions

In the last decade, the use of Machine and Deep Learning (MDL) methods in condensed matter physics has seen a steep increase in the number of problems tackled and methods employed. Distinct MDL approaches have been applied to many different topics, from the prediction of materials properties to the computation of Density Functional Theory potentials and interatomic force fields. In many cases the result is a surrogate model that returns promising predictions but is opaque about the inner mechanisms of its success. The typical practitioner, on the other hand, looks for answers that are explainable and provide clear insight into the mechanisms governing a physical phenomenon. In this work, we propose a sophisticated combination of traditional Machine Learning methods to obtain an explainable model that outputs an explicit functional formulation for the material property of interest. We demonstrate the effectiveness of our methodology by deriving a new, highly accurate expression for the enthalpy of formation of solid solutions of lanthanide orthophosphates.

1. Introduction. In recent years, the use of existing Machine Learning (ML) methods has experienced tremendous growth in many scientific computing domains, including Materials Science and Quantum Chemistry [20,28,36]. Concurrently, revolutionary new methods and algorithms have appeared that expanded the range of applicability of existing state-of-the-art techniques [17,3]. This trend led to a high interest in general ML applications which, in turn, left the scientific community struggling to reconcile the need for developing refined tools with the assessment of their usefulness when applied to specific problems [40]. Such an assessment is particularly relevant when the target is an explainable learning method [29] and the domain knowledge is integrated in the final model [39].
In recent years, several machine learning methods have been proposed to predict enthalpies of formation for several categories of materials [34,6,37]. Despite their progress, none of these efforts provides a fully explainable model that outputs a mathematical expression building on existing knowledge and improving it with additional, statistically inferred terms.
In this article we propose a three-step approach that uses traditional machine learning tools to arrive at a scientifically consistent explainable method [29]. First, we use Kernel Ridge Regression (KRR) [11] to assess which, out of a number of different kernels, provides the most reliable and transparent model. Second, we make a post-hoc interpretation of the model and proceed to reverse engineer it, so as to find which coefficients of the model are the most relevant for recovering a fully explainable mathematical expression for the target property. Finally, we integrate domain-specific knowledge by enforcing scientific consistency through a constraint on how the input variables can be combined. The end result is a mathematical expression that relates the target property to the input variables through a functional dependence that replicates known results and adds further terms, substantially improving the accuracy of the final expression.
Our methodology is in part inspired by the work of Ghiringhelli et al. [9,8], which uses the Least Absolute Shrinkage and Selection Operator (LASSO) [32] together with a sparsification process they term LASSO+ℓ0 to learn the most relevant descriptors for a given target property. In this work, we go beyond their approach by constraining the functional form of the prior using knowledge coming from both the algorithmic model (the assessment of the best kernel) and the integration of domain knowledge (to ensure scientific consistency). To demonstrate the feasibility of our approach, we applied it to the specific problem of computing the excess enthalpy of formation of solid solutions (enthalpy of mixing) of lanthanide orthophosphates (LnPO4). We investigate the functional dependence of the mixing enthalpy for binary solid solutions of two distinct lanthanide cations (Ln), taking into account the two distinct phases these materials form: monazite and xenotime [14].
1.1. Excess enthalpy of solid solution formation. Monazite is a phosphate mineral that contains rare-earth elements. Among these, lanthanide phosphates (LnPO4) are the most widely distributed. They form a monazite-type structure for Ln = La, ..., Tb and a xenotime-type (zircon) structure for heavier lanthanides [5,24,22,33]. Among other potential applications, synthetic (monazite-type) ceramic solid matrices are suitable for the conditioning of long-lived radionuclides such as minor actinides (Np, Am, Cm) or Pu in the form of a synrock [7,35,19]. However, before these ceramics can be used in nuclear waste management, their physical and chemical properties and, most importantly, their thermodynamic parameters have to be well characterized and understood.
A solid solution is formed when two or more cations are incorporated into a solid host matrix on the same crystallographic site. When atoms of the solute solid phase are incorporated into the solvent solid host phase, the whole process can be interpreted as a sort of impurity inclusion into the host phase [10,27]. Here, we consider a combination of two cations within a single phase, either monazite or xenotime. In reality, however, when lighter (monazite as stable structure) and heavier (xenotime as stable structure) lanthanides are mixed, such a solid solution has a wide miscibility gap, i.e., it is thermodynamically unstable over a wide solid solution range, with different stable phases of the solid solution endmembers (solute and solvent). In these cases, the mixing enthalpy of single-phase solid solutions is a key factor for describing the two-phase cases, such as the monazite-xenotime system [21,14].
Whether a single-phase solution stays stable or not is driven by the excess enthalpy of mixing [10,18]. The latter is defined as the difference between the formation enthalpies of the mixed compounds and those of the solid solution endmembers, and can be measured [38] or accurately computed [2,31,1]. Single-phase solid solutions such as the monazite-type closely resemble symmetric solid solutions and are well described by a simple model, H_E = m(1 − m)W, with W being the Margules interaction parameter and m the solid solution ratio [26,18]. With systematic DFT-based calculations, it was found [16] that for the monazite-type solid solutions the Margules interaction parameter W is a function of the Young's modulus Y, the unit-cell volume V, and the unit-cell volume difference between the solid solution endmembers (solvent and solute), ΔV, namely [13]

(1.1)    W ∝ Y V (ΔV/V)²,

up to a numerical prefactor. The relationship between the excess enthalpy of mixing and the physical parameters has been the topic of various studies [21,18,16]. Among these, the ionic radius of the mixing cations is often used as the main discriminant parameter [21]. Such a choice, however, makes the thermodynamic parameters of the mixing only weakly material dependent. As such, the excess enthalpies of mixing of monazite- and xenotime-type solid solutions are described by very similar relationships as a function of ΔR/R (Fig. 6 of [21]).
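As a minimal illustration of the regular-solution model just quoted, H_E = m(1 − m)W can be coded in a few lines; the value of W below is purely illustrative, not one fitted in this work.

```python
# Regular (symmetric) solid-solution model: H_E = m(1-m)W.
# The Margules parameter value below is purely illustrative.

def excess_enthalpy(m: float, W: float) -> float:
    """Excess enthalpy of mixing; m is the mixing ratio, W the Margules parameter."""
    return m * (1.0 - m) * W

W = 20.0  # illustrative Margules parameter, kJ/mol
# H_E vanishes at the endmembers (m = 0, 1) and peaks at m = 0.5:
assert excess_enthalpy(0.0, W) == 0.0
assert excess_enthalpy(1.0, W) == 0.0
assert excess_enthalpy(0.5, W) == 5.0
```

Note that the model is symmetric under m → 1 − m, which is exactly the endmember-exchange symmetry required of the descriptors introduced later.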
In Figure 1 we illustrate how existing models describe the functional dependence of the excess enthalpy on physical parameters for the data used in this work. Plot (a) shows the case of monazite, for which the models of [16,4] reproduce the data reasonably well. This is in part because both models use the difference in volumes of the endmembers as a parametric variable, while the model of [21] uses the difference in ionic radii.
The situation is diametrically different in the case of xenotime-type solid solutions (plot (b) of Fig. 1). Here, the models of [4] and [16] give predictions that are inconsistent with the ab initio data by a factor of ∼2. This points towards the possibility of another, unaccounted-for term in the Margules parameter that could be quite relevant in the case of xenotime-type solid solutions but of minor importance for monazite-type solid solutions.

Fig. 1: The excess enthalpy of mixing for monazite- and xenotime-type solid solutions computed ab initio and from the models of [4], [16] and [21].

A combination of ab initio and calorimetric studies [23] has shown that the ab initio data themselves are not enough to constrain the values of the Margules parameter W, and that understanding the dependence of W on the selected physical parameters is crucial for precise modeling of the stability of solid solutions. As such, the study of the excess enthalpy for this type of solid solutions lends itself perfectly to testing the explainable machine learning methodology we have devised.

2. Methodology and learning algorithms.
In predicting materials' properties, one needs a set of curated training data organized into input variables x and target properties y. The set of input variables has to be represented in terms of a number of independent descriptors that are invariant under known mathematical and physical symmetries, which usually requires a considerable amount of domain expertise. In this work, the input variables are represented by properties of the elements constituting a solid solution (e.g., electron orbitals, nuclear charge, etc.) and the target property by the excess enthalpy of formation of the solution.
2.1. Elemental properties and descriptors. As mentioned in other works [9,37,34], and based on the physics of the solution process, properties associated with the electron orbitals and the nuclei are expected to carry most of the weight in determining the value of the enthalpy of formation. Moreover, all solid solutions that are part of our data set have the same phosphate group (PO4) in common. Consequently, properties of the atomic elements of that group are not taken into consideration. Based on the online database http://www.knowledgedoor.com/, we list in Table 1 a meaningful set of elemental properties (elementals) ε_k that are available for each and every lanthanide element.

Table 1: List of elemental properties and their physical units (columns: Name, Symbol, Unit).

These elementals can be arranged in an abstract vector ε = (ε_1, ε_2, ...)^T = (Z, m, R_8, R_9, IP^2+, IP^3+, ...)^T. For each lanthanide L_i ∈ (La, Ce, Pr, ...) there is one such vector. We build descriptors out of elementals. Since we are investigating solid solutions made of two lanthanides, our descriptors x_k[ε_k(L_i), ε_k(L_j), m_i, m_j] are functions of elementals from two different lanthanides together with their mixing ratios m_i. The inclusion of the m_i is necessary to distinguish between different solution ratios. However the descriptor is defined, it should be invariant under simultaneous permutation of lanthanides and mixture ratios. Noting that m_i + m_j = 1, we can use only one mixing ratio, m = m_i, so that m_j = 1 − m.
Invariance under permutation can then be expressed as

x_k[ε_k(L_i), ε_k(L_j), m] = x_k[ε_k(L_j), ε_k(L_i), 1 − m].

It is important that a descriptor x changes significantly with the mixing ratio. Additionally, we should include descriptors capturing processes where the enthalpy depends strongly on which lanthanide has the largest abundance. Last but not least, descriptors have to be homogeneous functions of elementals and cannot mix elementals with different physical units (unless conversion constants are involved). For the reasons above we selected three types of descriptors, x^(1), x^(2) and x^(3), listed in Table 2, for every elemental ε_k and every lanthanide pair (L_i, L_j).
Notice that the quadratically weighted mean is not quadratic in the actual values of the elementals ε_k but quadratic in the mixture ratio m. This means the descriptor x^(2) leans heavily toward the value of the elementals of the more abundant lanthanide. For each combination of lanthanide pairs, the ten elementals ε_k are organized in a descriptor vector x made of thirty descriptors in total.
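The permutation invariance above can be illustrated with a small sketch. The three forms below (weighted mean, quadratically weighted mean, absolute difference) are plausible stand-ins for the descriptor definitions of Table 2, which is not reproduced here; they are assumptions used only to demonstrate the symmetry.

```python
# Illustration of the permutation invariance
# x[eps_i, eps_j, m] == x[eps_j, eps_i, 1-m].
# The three descriptor forms are plausible stand-ins for Table 2.

def descriptors(eps_i: float, eps_j: float, m: float):
    x1 = m * eps_i + (1 - m) * eps_j          # weighted mean
    x2 = m**2 * eps_i + (1 - m)**2 * eps_j    # quadratically weighted mean
    x3 = abs(eps_i - eps_j)                   # absolute difference
    return (x1, x2, x3)

# Swapping the lanthanides while replacing m by 1-m leaves all three unchanged:
a = descriptors(57.0, 71.0, 0.25)   # e.g. nuclear charges of La and Lu
b = descriptors(71.0, 57.0, 0.75)
assert a == b
```

Note also that the quadratically weighted form is quadratic in m but linear in the elementals, as stated above.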
Each vector x ∈ X of size d = 30 makes up the input variables for the learning algorithm. The target value y is the excess enthalpy of formation H_E. For each choice of lanthanide pair (L_i, L_j) and mixing ratio m we have a data point (x, y). All data points together constitute a set holding N data points. We will see at the end of this section how this set of points is generated.

2.2. Learning algorithms.
Since the data points in our set have both input values and a target value, we use a common type of supervised learning algorithm: kernel ridge regression (KRR) [11]. Kernel ridge regression is a non-linear regression algorithm with a regularization term (from which the name "ridge") that is comparable to the well-known Support Vector Machine algorithm.
The simple linear regression algorithm aims at finding the unknown coefficients β of the function f(z) = ⟨β, z⟩ minimizing the error E[f(z) − y] (also known as the loss function) over the entire set of data. In order to alleviate overfitting of the data, a regularization term is usually added. In ridge regression, the regularization amounts to adding a penalty term to the minimization problem. Choosing the squared error as loss function leads to the following minimization problem

(2.1)    argmin_β Σ_{i=1}^{N} (⟨β, x_i⟩ − y_i)² + λ‖β‖₂².

By introducing a function z → φ(z) which maps the input space X to a feature space H, the use of kernels generalizes the linear to non-linear regression (see [30] for a didactic introduction). In this context a kernel is an inner product in feature space, k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩. The advantage of using kernels is that the function f(z) = Σ_i α_i k(x_i, z) is no longer expressed as a sum over the dimension d of the input space, but instead as a sum over the number N of data points making up the training set.
With this setup, the minimization problem to be solved becomes

(2.2)    argmin_{α ∈ R^N} ‖y − Kα‖₂² + λ α^T K α.

In practice, the kernel function is expressed as a matrix of inner products between points of the training data set in feature space, k(x_i, x_j) = K_ij. Eventually, the solution of the minimization problem can be expressed by the linear equation

(2.3)    (K + λI)α = y,

with α ∈ R^N being the vector that contains the information learned.
Almost no Machine Learning method works directly out of the box; most have a number of parameters that have to be fixed. In the case of KRR, the level of regularization, through the parameter λ, is tuned for the dataset at hand and the selected kernel. Additionally, almost every kernel has some extra parameters that must also be tuned. The entire set of these adjustable parameters is called the hyperparameters.
Given a kernel, its computation can still be performed in input space even though its value describes the inner product in feature space. In this work, we employed three different kernels with the same set of data: the polynomial, the Gaussian, and the Laplacian kernels, which are based on the inner product, the ℓ2-norm and the ℓ1-norm, respectively:

(2.4)    k_poly(x_i, x_j) = (⟨x_i, x_j⟩ + c)^p,    k_Gauss(x_i, x_j) = exp(−γ‖x_i − x_j‖₂²),    k_Lap(x_i, x_j) = exp(−γ‖x_i − x_j‖₁).

The actual computation of α amounts to solving a positive-definite numerical linear system Aα = y. Once α is computed, it is used to compute predictions for any new data point with ŷ = Σ_i α_i k(x_i, x). For validation purposes, the results of such predictions are typically presented in a scatter plot where the predicted and computed target values are represented on the x and y axes, respectively (see plots of Fig. 2, for instance).
The Laplacian and Gaussian kernels are by far the most used in KRR because they provide the most effective map: they use a virtually limitless number of functions as the prior of our statistical model. The downside is that we cannot recover the explicit functional form which expresses the target value in terms of the input variables. This is why we have also included polynomial kernels, since they can be virtually inverted to return a functional expression for the coefficients α in terms of descriptors. Being able to statistically infer such a functional expression allows us to go beyond the prediction of target values for new solid solutions and understand which descriptors are more relevant and contribute the most to determining the target values.
In this sense, the polynomial kernel that is closest in error to the Gaussian or Laplacian kernels provides a hint on the order of polynomial functions that should be included in our prior. With this information we can manually construct thousands of candidate functions of the original descriptors x^(t)_i that could faithfully represent the underlying surrogate space. In our application, we denote these candidate functions as

(2.5)    v = (v_1, v_2, ..., v_M),

where M can range from O(10³) to O(10⁴). We then apply a sparsification technique, which amounts to finding the most relevant among the v_k by forcing as many coefficients of the statistical model as possible to be zero. In Section 3.1, justified by the KRR results, we show how each distinct v_k is constructed from a low-degree p polynomial function of the original descriptors.
The objective of sparsification is to find the most relevant term(s) among the v_k, i.e., those which contribute the most to the target values. Moreover, the number of relevant terms should be controllable. A straightforward sparsification technique is the LASSO [32] approach combined with an ℓ0 regularization. This combination is able to sparsify the coefficients of the LASSO regression down to a small, predetermined number. The minimization problem to be solved is

(2.6)    argmin_{γ ∈ R^M} ‖y − Dγ‖₂² + λ‖γ‖₀,

where D is the matrix whose columns are the candidate functions v_k evaluated on the data points. The ℓ0-norm of a coefficient vector γ is defined as

(2.7)    ‖γ‖₀ = #{k : γ_k ≠ 0},

indicating the number of non-zero elements of the γ vector. A vector γ is called κ-sparse if ‖γ‖₀ ≤ κ. While LASSO+ℓ0 is the exact problem we want to solve, it has a significant drawback: this minimization problem is not convex. This leads to an NP-hard problem, which is infeasible when M is large. Therefore, LASSO+ℓ0 cannot be directly applied to sparsify the candidate functions v. In order to compromise between convexity and sparsity of the coefficient vector, we first utilize a LASSO with a Manhattan ℓ1-norm regularization λ‖γ‖₁ (LASSO+ℓ1 in this paper) to carry out an initial feature selection, out of which we can achieve the sparsification of v [8].
(2.8)    argmin_{γ ∈ R^M} ‖y − Dγ‖₂² + λ‖γ‖₁.

This latter optimization problem is convex and also promotes sparsity. In equation (2.8), λ > 0 is the penalty parameter which controls the sparsity of the coefficient vector: the larger λ is, the smaller the ℓ1-norm ‖γ‖₁ will be, hence higher sparsity is achieved and more candidate functions are eliminated.
In our application, we start from a very small λ and then increase it in fixed steps. With the increase of λ, we observed that a large number of the constructed functions in v are quickly eliminated, and the curve rapidly flattens, stagnating even for much larger λ values; in other words, the minimization process with the ℓ1-norm reduces the vector γ to be at most κ-sparse, with κ usually smaller than 20 in our application. In practice, we compressed a very large feature space spanned by v into a tiny one spanned by v′, where v′ is a subset of v whose number of elements is smaller than κ. Thanks to the low dimension of the feature space selected by LASSO+ℓ1, Equation (2.6) of LASSO+ℓ0 can be solved rather effectively with a brute-force approach. More details on the numerical results are given in Section 3.
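The two-stage procedure just described (LASSO+ℓ1 screening followed by brute-force ℓ0 over the survivors) can be sketched on synthetic data. Sizes, the penalty value and the planted two-term ground truth below are illustrative, not those of this work.

```python
import itertools
import numpy as np
from sklearn.linear_model import Lasso

# Two-stage sparsification sketch on synthetic data: LASSO+l1 first shrinks
# a large candidate-function matrix D to a few surviving columns, then a
# brute-force l0 search over the survivors picks the best small subset by
# least squares.

rng = np.random.default_rng(1)
D = rng.normal(size=(200, 500))        # 200 data points, 500 candidate functions
y = 3.0 * D[:, 7] - 2.0 * D[:, 42]     # ground truth uses columns 7 and 42 only

# Stage 1: LASSO+l1 -- a large enough penalty leaves few non-zero coefficients.
lasso = Lasso(alpha=0.5, fit_intercept=False, max_iter=10000).fit(D, y)
survivors = np.flatnonzero(lasso.coef_)          # reduced feature space

# Stage 2: brute-force l0 -- exhaustively test every pair of survivors.
best, best_err = None, np.inf
for pair in itertools.combinations(survivors, 2):
    coef, *_ = np.linalg.lstsq(D[:, pair], y, rcond=None)
    err = float(np.mean((D[:, pair] @ coef - y) ** 2))
    if err < best_err:
        best, best_err = tuple(pair), err
```

Because the ℓ1 stage only has to keep the true support (not estimate its coefficients without bias), the subsequent exhaustive ℓ0 search over the handful of survivors recovers the planted pair exactly.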
This method, combining LASSO with a sparsification of the coefficient vector, was developed in the context of compressed sensing [8], where one is interested in reproducing the gathered data with as few degrees of freedom as possible. Unlike [8], our work starts with a Kernel Ridge Regression and maps the descriptors x_i to a much larger but finite-dimensional input space where the new features v_i are made of a finite number of selected functions of the original descriptors x_i. We additionally integrate domain knowledge by modifying the initial selection of descriptors and features to ensure scientific consistency with the existing knowledge of the physics of solid solutions. The necessity of tuning the descriptors and features is also confirmed by the inefficiency of the LASSO+ℓ1 method when the original descriptors x_i are used.
In the next section, we see that the KRR polynomial kernel of degree 3 returns one of the lowest statistical errors. We capitalize on this result and make the plausible hypothesis that a polynomial map of degree at most 3 for LASSO should identify the most promising functions of the descriptors x^(t)_k. We then show that the results of feature selection by LASSO+ℓ1 suggest the selection of a modified set of descriptors that guarantees scientific consistency. Finally, we determine the desired functions which map the elemental descriptors to the excess enthalpy for both monazite and xenotime.

2.3. Data set generation.
The data set used by the selected learning methods was computed with the Quantum-ESPRESSO code using the approach of [18,16]. The solid solutions were modeled with special quasi-random structures constructed using the procedure of [41]. All the structures were modeled with supercells containing 192 atoms (32 formula units). We applied the PBEsol exchange-correlation functional [25], and the f electrons were included in the pseudopotential core. It has been shown that this setup results in very good predictions of structures and thermodynamic parameters of lanthanide phosphates, including formation enthalpies [2,12,15,1]. The excess enthalpies of mixing and the Margules interaction parameters were derived from the differences between the total energies of the mixed-cation structures and the weighted (according to the solid solution composition) sums of the endmembers.
The dataset consists of excess enthalpies of formation between all 15 lanthanides, which leads to 105 possible combinations (Σ_{i=1}^{14} i = 105). Those 105 combinations were then modeled for five different mixture ratios, m = 0.25, 0.375, 0.50, 0.625, 0.75, giving a total of 525 data points. Two distinct data sets were generated for the two lanthanide orthophosphate phases, monazite and xenotime, which correspond to the two possible coordination numbers of the lanthanides (see Table 1). In the following, we will test our models on three distinct configurations of these data: monazite only (525 data points), xenotime only (525 data points), and fused (1050 data points). Not all these points are used to train the learning model; a subset is reserved for testing and validation purposes.
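The bookkeeping above is easy to verify:

```python
from itertools import combinations

# Sanity check of the data-set counts: 15 lanthanides give 105 unordered
# pairs; five mixing ratios per pair give 525 points per phase, hence
# 1050 for the fused monazite + xenotime configuration.

lanthanides = ["La", "Ce", "Pr", "Nd", "Pm", "Sm", "Eu", "Gd",
               "Tb", "Dy", "Ho", "Er", "Tm", "Yb", "Lu"]
ratios = [0.25, 0.375, 0.50, 0.625, 0.75]

pairs = list(combinations(lanthanides, 2))            # 105 cation pairs
points = [(a, b, m) for (a, b) in pairs for m in ratios]   # 525 per phase
```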
3. Regression, sparsification and interpretable formulas. The first step before determining the coefficients α of our KRR models is to determine the optimal values of the corresponding hyperparameters. To this aim, the entire given dataset is typically split in two parts. The training dataset (x_i, y_i) ∈ T is used to compute the actual coefficients α of the regression that appear in Eq. (2.2). The testing or prediction dataset (x_i, y_i) ∈ P is kept separate and is used to evaluate the quality of a given set of hyperparameter values for predicting unseen data, ŷ = α^T K′, where K′_ij = k(x_i, x_j) with x_i ∈ T and x_j ∈ P. The optimal set of parameters is selected when the predictions (i.e., ŷ_i) are in the best possible agreement with the known values from the testing dataset (i.e., y_i). Different error functions are commonly used to quantify the quality of a prediction. The most common, which we also use in this work, are the Mean Squared Error (MSE), the Mean Absolute Error (MAE), and the Maximum Error (ME).
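For reference, the three error measures can be written down in a few lines and evaluated on a toy pair of vectors:

```python
import numpy as np

# Mean squared error (MSE), mean absolute error (MAE) and maximum error (ME)
# between predicted (y_hat) and reference (y) target values.

def mse(y_hat, y):
    return float(np.mean((np.asarray(y_hat) - np.asarray(y)) ** 2))

def mae(y_hat, y):
    return float(np.mean(np.abs(np.asarray(y_hat) - np.asarray(y))))

def me(y_hat, y):
    return float(np.max(np.abs(np.asarray(y_hat) - np.asarray(y))))

y_ref  = [1.0, 2.0, 3.0]   # toy reference values
y_pred = [1.1, 1.8, 3.0]   # toy predictions
```

MSE penalizes large residuals quadratically, MAE weights all residuals equally, and ME tracks the single worst prediction; reporting all three, as done below, gives complementary views of the fit quality.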
Because possible hyperparameter values may span many orders of magnitude, brute-force methods that quickly scan the entire space in a grid-like pattern are preferred to conventional minimization methods. Once an approximate local minimum is found, local optimizations are used to refine the values of the hyperparameters. In the following we do not report the hyperparameter search results, which is a standard procedure in the use of ML algorithms. In all the scatter plots and tables it is implicitly understood that all hyperparameters have been appropriately optimized following the procedure just described.

3.1. Predicting excess enthalpies with KRR. In the search for optimal hyperparameters, we have split the total set of data in two parts, T and P. The ratio of the data size between T and P is 4:1. Once the hyperparameter search is completed, we fitted the data of the training set T using all three distinct kernels for the KRR model.
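The coarse-to-fine hyperparameter search described above might look as follows with scikit-learn; the data are synthetic and the grid bounds are illustrative, not those used in this work.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Coarse-to-fine hyperparameter search sketch for a Gaussian-kernel KRR:
# a log-spaced grid scan over (lambda, gamma), then a finer grid centered
# on the coarse optimum.

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 4))
y = np.cos(X).sum(axis=1)          # toy target

coarse = {"alpha": np.logspace(-6, 2, 9),   # sklearn's alpha plays the role of lambda
          "gamma": np.logspace(-3, 1, 5)}
search = GridSearchCV(KernelRidge(kernel="rbf"), coarse, cv=5).fit(X, y)

best = search.best_params_                  # refine locally around the coarse optimum
fine = {"alpha": best["alpha"] * np.logspace(-1, 1, 5),
        "gamma": best["gamma"] * np.logspace(-1, 1, 5)}
refined = GridSearchCV(KernelRidge(kernel="rbf"), fine, cv=5).fit(X, y)
```

Since the fine grid contains the coarse optimum itself, the refined score can only match or improve on the coarse one.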
In Table 3 we report the results obtained for all kernels with all three types of error (MSE, MAE, ME) for both sets T and P on the fused data set configuration (similar results are obtained for the other two configurations). Despite its remarkably low errors on the T set, the Laplacian kernel does not return a satisfactory result on the set of prediction data P. In fact, no choice of hyperparameters returns a reliable regression for unseen data: the search-space minimization returns a value for λ numerically indistinguishable from zero. This is a typical sign that KRR with this kernel is overfitting the data, returning an in-sample error much smaller than the out-of-sample error. For this reason we discarded the learning model using the Laplacian kernel.
The low-order polynomial and the Gaussian kernels return a much nicer picture in terms of the errors. Already the degree 2 polynomial kernel is able to fit the data quite well: its MAE differs by only ≈ 0.01 kJ/mol from the errors of the degree 3 and Gaussian kernels. The fact that the degree 3 polynomial kernel returns a MAE as low as ≈ 0.02 kJ/mol for the P set indicates that the underlying function for the excess enthalpy could be represented by functions of the descriptors having up to cubic terms. Since the actual values of the excess enthalpy of formation for the lanthanide orthophosphates span a range from 0.5 to 10 kJ/mol, the relative errors of our model represent the same level of uncertainty returned by the DFT simulations. In other words, the predictions provided by the KRR model with either a degree 3 polynomial or the Gaussian kernel are indistinguishable from the finite accuracy of the in silico simulations used to generate the data in both sets T and P (see Fig. 2). What distinguishes the Gaussian and the degree 3 polynomial kernels are the values of the hyperparameters: the degree 3 kernel requires quite large values of λ and γ with respect to the Gaussian. This is not necessarily a negative result, but it points out that our choice of descriptors may not be ideal when the kernel represents a finite set of prior functions, as in the case of the degree 3 polynomial. When the set of prior functions becomes virtually infinite (the Gaussian kernel can be seen as an infinite series of polynomials), the descriptor choice becomes unimportant. We will see in the following subsection how the choice of descriptors becomes significant when one would like to sparsify the vector of coefficients α starting from a finite set of prior functions of the descriptors.
To ensure that our KRR models provide a good fit for all data independently of how they are split between T and P, we have used cross validation. In practice, we run the KRR model fits for several different choices of training and testing datasets, always keeping the same values for the previously selected hyperparameters. If the results with the original dataset were truly a product of chance, a fit with an entirely new combination of data points would show a different performance. Table 4 shows the results of repeating the KRR with the polynomial kernel of degree 3 for five different subsets, some of which even showed slightly better performance than the original regression.

3.2. LASSO+ℓ0. In the previous section, we concluded that KRR with the Gaussian and degree 3 polynomial kernels performs very similarly, with a slight advantage for the latter. In this section we pursue the road of finding a surrogate model that is explainable: we aim at a formulation of the excess enthalpy of mixing that can provide the domain scientist with an understandable function of a small number of descriptors. We are driven to recover this result by the observation that polynomial functions of degree 3 already provide enough prior to get an accurate KRR model. We achieve this result through a so-called sparsification process.

3.2.1. Sparsification with Descriptors in KRR.
A large number of candidate functions has been built from the polynomial kernel of degree 3 based on the 27 elemental descriptors introduced in Section 2.1. Denoting the group of 27 elemental descriptors as D_1, a group of candidate functions with polynomials of degree 2, denoted as D_2, can be defined via the commutative element-wise vector products d_i d_j, with d_i, d_j ∈ D_1. The number of descriptors in D_2 is 378. The group of candidate functions with polynomials of degree 3, denoted as D_3, can be defined in a similar way, as the commutative element-wise products of three vectors d_i d_j d_k, with d_i, d_j, d_k ∈ D_1. The number of descriptors in D_3 is 3,303. All the candidate functions together are the union of D_1, D_2 and D_3, which is denoted as D. D is a dense matrix of size 1,050 × 3,708 for the fused data set, and of size 525 × 3,708 for both the monazite and xenotime data set configurations.
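The assembly of D from commutative element-wise products can be sketched on toy data. Note that the degree-2 count matches exactly, C(28, 2) = 378, while the raw degree-3 multiset count is C(29, 3) = 3,654, slightly above the 3,303 quoted above, which suggests some degree-3 products were discarded (e.g., redundant columns); the sketch below keeps the raw count.

```python
import numpy as np
from itertools import combinations_with_replacement

# Build D = D1 u D2 u D3 from commutative element-wise products of the
# 27 base descriptor columns (toy data; only the counts matter here).

X = np.random.default_rng(4).normal(size=(10, 27))   # 10 toy points, 27 descriptors
D1 = [X[:, i] for i in range(27)]
D2 = [X[:, i] * X[:, j]
      for i, j in combinations_with_replacement(range(27), 2)]
D3 = [X[:, i] * X[:, j] * X[:, k]
      for i, j, k in combinations_with_replacement(range(27), 3)]
D = np.column_stack(D1 + D2 + D3)                    # dense candidate matrix
```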
As described in Section 2.2, a feature selection step has been performed on D through the LASSO+ℓ1 method. Increasing the penalty parameter λ, the size of the feature space is reduced as more candidate functions are removed. In the numerical experiments, λ is increased from 0.001 to 0.096 in incremental steps of 0.005. We performed the same feature selection step not only for the fused data, but also separately for the monazite and xenotime data configurations.

Fig. 3: LASSO+ℓ1 results with the descriptors used in KRR for the monazite and xenotime data configurations.

The results are plotted in Fig. 3, which shows the changes of the errors (MAE, MSE, ME) and the size of the reduced feature space (marked as desc.). From Fig. 3 we infer that, with the increase of λ, the feature space size can be quickly reduced to less than 30. This is good news, since sparsifying a feature space of size ∼30 through LASSO+ℓ0 into a determined size smaller than 5 is still feasible. At the same time, we observe that all the errors increase with λ. This behavior is to be expected and is the direct consequence of the reduction of the feature space size. Unfortunately, the errors increase too quickly, growing steeply before plateauing already at small values of λ. For example, for the fused data, the MAE increases from ∼0.03 to more than 0.25 when λ is only increased from 0.001 to 0.021, while the MSE and ME increase to 0.20 and 2.5, respectively. When increasing λ from 0.021 to 0.096, the feature space size continues to be reduced; however, the changes in the errors flatten. Similar trends can be observed separately for the monazite and xenotime data. This result signals that the generated feature space does not capture well the functional dependence of the excess enthalpy on the descriptors. In other words, we cannot find a simple functional dependence for H_E and need many hundreds of functions to faithfully predict the enthalpy.
To confirm our concerns, we performed a LASSO+ℓ0 step on the reduced feature space separately for the fused, monazite, and xenotime data set configurations. At most three leading terms have been selected; they are listed in Table 5, along with their corresponding MAEs, MSEs, and MEs. As expected, the errors of the candidate functions selected by LASSO+ℓ0 are more than one order of magnitude larger than the ones obtained by KRR. What is more remarkable is the fact that the errors do not seem to decrease much as we consider more terms. In fact, for the fused data configuration, the MAE and ME actually increase going from 2 to 3 leading terms. The errors for the other data configurations seem a bit better but are still far from what we would like to observe. From these tables and the plots in Figure 3, we conclude that the choice of prior candidate functions cannot be effectively exploited by the sparsification process.

Table 5: LASSO+ℓ0 results for the fused, monazite and xenotime data configurations with the descriptors used in KRR.

Sparsification with Descriptors built with Prior Knowledge.
In order to overcome the shortcomings of the sparsification process, seemingly caused by the choice of descriptors, we simplify the form of the descriptors and exploit existing knowledge from the application domain. Based on the insight provided by the simple model with the Margules interaction parameter W and the expression in Eq. (1.1), we make three additional hypotheses: i) the polynomial degrees of m and 1 − m need not be in accord with each other or with the polynomial degree of the elemental property k; ii) negative powers of the descriptors may appear in the Margules parameter; and iii) the volume V, ionic radius R, and mixing ratio m play a special role compared to the other elemental properties and may contribute with monomials of degree higher than 3.
The direct consequence of i) is that m and 1 − m are decoupled from the elemental descriptors and included as descriptors in their own right. This may seem a strange choice, since they do not depend on the lanthanides but are the same for all. On the other hand, decoupling the mixing ratio allows more freedom in the way it appears in the degree-3 polynomial functions that are part of D. The indirect consequence is that we no longer have three types of descriptors for each elemental property k but only two: we drop the weighted quadratic mean, convert the weighted mean to a simple arithmetic mean k̄, and keep the absolute difference ∆k (see Table 7).

Table 6: LASSO+ℓ0 results for fused, monazite and xenotime data with descriptors defined only by arithmetic means and absolute differences.
The second hypothesis implies that the inverses of each elemental property descriptor, 1/∆k and 1/k̄, are also included explicitly among the descriptors. Additionally, due to the third hypothesis, we include as descriptors monomials of degree higher than one for m, V, and R. In particular, for m and 1 − m we include up to quadratic terms (and their inverses), and for V and R we include up to cubic terms (and their inverses). All these descriptors build up a basic feature space D 1 of size 58. Analogously to the original descriptors, the groups D 2 and D 3 are built as the element-wise products of two or three features out of D 1, respectively. Feature space D 2 has size 1647 after removing 6 features with zero standard deviation, such as the element-wise product of 1/m and m. The size of feature space D 3 is 30856. All candidate functions are collected into D = D 1 ∪ D 2 ∪ D 3. The size of the data matrix D is 1050×32561 for the fused and 525×32561 for the monazite/xenotime data configurations. We run the same LASSO+ℓ1 as in Section 3.2.1; the results are shown in Fig. 4. Compared to Fig. 3, the results we obtain are much more promising. For the fused data in Fig. 4, the size of the reduced feature space drops below 40 already at very small values of λ. Afterwards, as λ increases, the decrease in the size of the reduced feature space slows down, which implies that the reduced feature space always retains features important for the target problem. Meanwhile, as λ increases, the errors grow moderately and linearly, a second sign that the feature space is reduced to a good choice of representative functions.
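The construction of D 2 and D 3 as element-wise products of two or three distinct D 1 features, with zero-variance products such as (1/m)·m removed, can be sketched as follows. The five basic features below are a toy subset, not the full 58-dimensional D 1:

```python
import numpy as np
from itertools import combinations

def expand_features(D1, names):
    """Build D = D1 ∪ D2 ∪ D3, where D2 (D3) holds element-wise products of
    two (three) distinct columns of D1. Products with zero standard deviation,
    e.g. (1/m) * m = const, are discarded."""
    cols, labels = list(D1.T), list(names)
    for k in (2, 3):
        for idx in combinations(range(D1.shape[1]), k):
            f = np.prod(D1[:, idx], axis=1)
            if f.std() > 1e-12:              # drop constant candidate features
                cols.append(f)
                labels.append("*".join(names[i] for i in idx))
    return np.column_stack(cols), labels

# Toy basic feature space: mixing-ratio and (hypothetical) volume descriptors.
rng = np.random.default_rng(0)
m = rng.uniform(0.1, 0.9, size=100)
Vbar = rng.uniform(60.0, 80.0, size=100)    # placeholder mean volume
dV = rng.uniform(0.5, 5.0, size=100)        # placeholder volume difference
D1 = np.column_stack([m, 1.0 - m, 1.0 / m, Vbar, dV])
names = ["m", "1-m", "1/m", "Vbar", "dV"]

D, labels = expand_features(D1, names)
```

For the paper's 58 basic features the same scheme gives C(58,2) = 1653 and C(58,3) = 30856 candidate products before filtering, consistent with the reported sizes of 1647 (after removing 6 constant features) and 30856.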
After applying LASSO+ℓ0 to the vector of descriptors spanning the reduced space, the first three leading terms and their corresponding errors for the fused, monazite and xenotime data configurations are shown in Table 6. It is important to notice that, while the errors may still be quite high when only the leading term is kept, they decrease rapidly as higher terms are included. Moreover, the first leading term for the monazite and xenotime data configurations resembles very closely the expression of Equation (1.1). We analyze the results in more detail and interpret their physical meaning up to five leading terms in the next section.
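The LASSO+ℓ0 step amounts to an exhaustive best-subset search over the ℓ1-reduced space, which is tractable only because few columns remain. A minimal sketch on synthetic data (the feature matrix and signal below are invented for illustration):

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """l0-style step (sketch): among all k-column subsets of the already-reduced
    feature matrix X, return the subset minimising the least-squares MAE."""
    best = None
    for idx in combinations(range(X.shape[1]), k):
        A = X[:, idx]
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        mae = np.abs(A @ coef - y).mean()
        if best is None or mae < best[0]:
            best = (mae, idx, coef)
    return best

# Synthetic reduced feature space: columns 1 and 4 carry the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 2.0 * X[:, 1] - 1.0 * X[:, 4] + 1e-3 * rng.normal(size=100)

mae, idx, coef = best_subset(X, y, 2)
```

Because the search is combinatorial, it is only run after the ℓ1 pass has reduced the candidate pool; on the full feature space of ∼30000 functions it would be hopeless.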
4. Numerical results. As illustrated in Sec. 1.1, the discrepancies between existing models and the data make the computation of the excess enthalpy for solid solutions of both monazite and xenotime a clear-cut example to demonstrate the validity of our statistical approach in retrieving an explainable expression for H E. We will (1) show which descriptor, between the ionic radii of the mixing cations and the volumes of the pure phases, provides the most reliable leading term for the mixing enthalpy, (2) identify the first sub-leading term, which enhances the accuracy of the H E prediction for xenotime-type solid solutions, and (3) provide an exhaustive statistical analysis of additional sub-leading terms. Statement (1) is inferred directly from inspection of the tables for cases (ii) and (iii) applied to the fused data set (Tables 8 and 9). From these two tables, one observes that the functions with one up to five leading terms are exactly the same, except that wherever the ionic radius R appears in one table, the volume V appears in the other in exactly the same terms. Moreover, all the coefficients of the leading terms are exactly the same across the two tables, as are all the errors. This points to the fact that only either the ionic radius or the volume should be included in the list of elemental properties. In order to understand which of these two elemental properties should be eliminated, we next look at cases (ii) and (iii) on the monazite and xenotime data sets. Comparing the errors, it is immediately obvious that descriptors without the volume elemental V always return larger errors than descriptors without the ionic radius elemental R for both monazite and xenotime data sets. This is a strong indication that V should be preferred over R as the elemental property of choice. This concludes statement (1) and excludes cases (i) and (iii) from further analysis. Next, we look at cases (ii) and (iv) applied to all three sets of data (Tables 14, 10, 15, 12, 16, and 8). The very
first observation is that, no matter which data set or case one looks at, all the one-term functions are of the form m(1 − m)(∆V)². Unequivocally, this is the leading term describing the excess enthalpy H E. We also notice that the sub-leading terms for the monazite and xenotime data sets are dominated by the volume elemental, but this dominance manifests itself differently for each data set and with errors that vary from case to case. In the case of xenotime, the prediction of H E with only one term is not sufficient, and additional terms are needed to reach satisfactory accuracy. While the leading term is also m(1 − m)(∆V)², the prefactor coefficient is substantially larger than the one obtained for monazite (1.23 vs. 0.94). Although this difference could possibly be explained by the on-average larger Young's moduli of the xenotime phases compared to monazite (see Eq. 1.1), it demonstrates that additional terms are essential to obtain an equation that consistently describes both structures. When we look at the solutions for xenotime alone, different terms appear, usually containing some power of the difference and average of the volume and/or the Young's modulus. Only a simultaneous fit to the monazite and xenotime data sets can shed light on the composition of the second leading term. Looking at the fused data, it becomes clear that not only the volume but also the Young's modulus elemental Y plays a central role in the sub-leading terms. In addition, comparing Tables 16 and 8, it is immediately clear that using descriptors based on just V and Y consistently returns smaller errors for all sub-leading terms. Incidentally, Young's moduli and volumes/ionic radii are the elemental properties used in the models of [21] and [16]. From this analysis we infer that we should focus on the leading and sub-leading terms provided by Table 16, where all elementals apart from V and Y are discarded. The second leading term contains the product of ∆V and ∆Y. With the combination of m(1 − m)(∆V)² and ∆Y∆V/V̄² one can
satisfactorily express H E for both sets of data. This indicates that, in addition to the difference in volume, the difference in Young's modulus plays a significant role in determining the excess enthalpy of mixing. To better understand the role of this second term, we plot in Fig. 5a the Young's modulus as a function of the volume for the two phases. For the considered range of elastic moduli and volumes, there is a clear linear relationship between the two. However, the linear relationship is much steeper for xenotime than for monazite, with a slope about three times higher in the former case. This makes the ∆Y∆V/V̄² term much larger (by a factor ∼3) for xenotime, putting it on an equal footing with the leading term. This explains why, when this term is missing, there is a factor of ∼2 difference between the ab initio data and the prior theoretical models discussed above (see Fig. 1).
In general, adding more terms improves the fit only marginally (see Fig. 5b). On the other hand, one can observe from Table 16 that, for any choice of the number of terms, the factor ∆Y∆V/V̄² is always present, confirming its importance in the expression of H E. This concludes statements (2) and (3).
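Once the two candidate terms are fixed, their coefficients follow from an ordinary least-squares fit. The sketch below shows only this fitting step; all numbers are synthetic placeholders rather than the paper's ab initio values, and taking the second term without an m(1 − m) prefactor is an assumption made here for illustration:

```python
import numpy as np

# Hypothetical inputs: mixing ratio, volume difference/mean, Young's modulus
# difference. None of these numbers come from the paper's dataset.
rng = np.random.default_rng(1)
n = 200
m = rng.uniform(0.05, 0.95, n)
dV = rng.uniform(0.5, 5.0, n)
Vbar = rng.uniform(60.0, 80.0, n)
dY = rng.uniform(5.0, 40.0, n)

a_true, b_true = 1.1, 0.8                  # placeholder coefficients
HE = a_true * m * (1 - m) * dV**2 + b_true * dY * dV / Vbar**2

# The two candidate terms identified by the sparsification, stacked as columns.
F = np.column_stack([m * (1 - m) * dV**2, dY * dV / Vbar**2])
coef, *_ = np.linalg.lstsq(F, HE, rcond=None)
```

On data generated exactly by the two-term model, the fit recovers the placeholder coefficients; on real data the residual of this fit is what the additional sub-leading terms would absorb.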

5. Summary and conclusions.
In this work we report on a 3-step approach that combines distinct methods from classical Machine Learning into a scientifically consistent, explainable methodology. First, we use a KRR approach on the generated data and evaluate which kernel returns the smallest error. Then, using the results of KRR, we reverse-engineer the model and sparsify the vector of coefficients using a so-called LASSO+ℓ0 procedure. Finally, we integrate domain-specific knowledge to enforce an a-posteriori scientific consistency of the reversed model.
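A minimal sketch of the first step, KRR with a Laplacian kernel fitted in closed form. The toy 1-D data, kernel choice and hyperparameters here are illustrative, not the paper's:

```python
import numpy as np

def laplacian_kernel(A, B, gamma):
    """K_ij = exp(-gamma * ||a_i - b_j||_1)."""
    return np.exp(-gamma * np.abs(A[:, None, :] - B[None, :, :]).sum(-1))

def krr_fit(X, y, lam, gamma):
    """Closed-form Kernel Ridge Regression: alpha = (K + lam*I)^{-1} y."""
    K = laplacian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, gamma, X_test):
    return laplacian_kernel(X_test, X_train, gamma) @ alpha

# Toy 1-D regression problem standing in for the enthalpy data.
X = np.linspace(0.0, 3.0, 40).reshape(-1, 1)
y = np.sin(2.0 * X[:, 0])
alpha = krr_fit(X, y, lam=1e-4, gamma=1.0)
train_mae = np.abs(krr_predict(X, alpha, gamma=1.0, X_test=X) - y).mean()
```

As the remark accompanying Table 3 warns, driving lam toward zero yields a near-perfect training fit but invites overfitting; the kernel and regularization strength should be selected on held-out data.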
In order to demonstrate the feasibility and potential of this methodology, we applied it to the computation of the excess enthalpy of formation H E of solid solutions of lanthanide orthophosphates. This particular class of materials is present in nature in two distinct crystal structures, monazite and xenotime, for which no single formula was capable of accurately describing H E. Applying our machine-learning-based 3-step method to a set of in silico data, we were able to retrieve sub-leading corrections to known expressions for H E, which represent an important step in resolving the conflicting descriptions of the excess enthalpy for the two types of structures. We expect that the importance of accounting for the gradient of elastic moduli when estimating the excess enthalpy of mixing will trigger follow-up theoretical studies.

Fig. 4: LASSO+ℓ1 results with descriptors defined only by arithmetic means and absolute differences.
Fig. 5: (a) The Young's modulus as a function of volume for monazite (green circles) and xenotime (blue squares) phases; the lines represent linear regression fits. (b) The mean absolute error in kJ/mol of the LASSO+ℓ0 sparsification for the monazite, xenotime and fused cases obtained with sets of 1 to 5 leading terms.

Table 2: Descriptor types depending on lanthanide pairs (L i, L j), elemental property k, and mixing ratio m.

Table 3: Kernel Ridge Regression results on the excess formation enthalpy dataset for different kernel types. The first two rows are polynomial kernels of degree 2 and 3, respectively. The errors displayed are the Mean Squared Error (MSE), Mean Absolute Error (MAE) and Maximum Error (ME). T = training dataset, P = prediction dataset. Units are [kJ²/mol²] (MSE) or [kJ/mol] (MAE & ME). Note that the Laplacian kernel has by far the lowest regularization strength (λ = 10⁻²⁰), which leads to a perfect fit on the training data but also to the worst performance on the testing dataset. This is a strong sign of overfitting.

Table 4: Errors for cross-validation of the polynomial kernel of degree 3 with five different combinations of data points evenly split between T and P. For each new regression, we used the same hyperparameters λ, γ and c as in the original one. The MAE over the five different sets ranges between 0.052 and 0.082 kJ/mol, confirming the quality of the original regression.

Table 7: Two basic descriptors of the elemental property k.

Table 9: Fused: all elemental descriptors except volume.

(ii) use all modified descriptors excluding the ones based on the ionic radius elementals R 8 and R 9, (iii) use all modified descriptors excluding the ones based on the volume elemental V, and (iv) use only descriptors based on the volume V, mixing ratio m and Young's modulus Y elementals. In addition, we used three different sets of data (monazite, xenotime, fused) for each of the four cases (i)-(iv). In total, we produced 12 separate sparsification scenarios, each specifying functions with up to five leading terms and their relative errors.

Table 10: Monazite: all elemental descriptors except ionic radius coordination