Machine learning prediction of the mechanical properties of refractory multicomponent alloys based on a dataset of phase and first principles simulation

In this work, a dataset of the structural and mechanical properties of refractory multicomponent alloys was developed by combining calculation of phase diagrams (CALPHAD) with density functional theory (DFT). The refractory multicomponent alloys, also called refractory complex concentrated alloys (CCAs), contain 2–5 types of refractory elements and were constructed as Special Quasi-random Structures (SQS). The phases of the alloys were predicted using CALPHAD, and the mechanical properties of alloys with a stable single body-centered cubic (BCC) phase at high temperature (over 1,500°C) were investigated using DFT-based simulation. As a result, a dataset with 393 refractory alloys and 12 features, including volume, melting temperature, density, energy, elastic constants, mechanical moduli, and hardness, was produced. To test whether the dataset can support machine learning (ML) studies of CCA properties, the CALPHAD and DFT calculations were compared with the principal components analysis (PCA) technique and the rule of mixture (ROM), respectively. The CALPHAD and DFT results are more in line with experimental observations of the alloy phase and of the structural and mechanical properties. Furthermore, the data were used to train a variety of ML models to predict the performance of certain CCAs with advanced mechanical properties, highlighting the usefulness of the dataset for ML-based CCA property prediction.


Introduction
Complex concentrated alloys (CCAs) (Yeh et al., 2004a;Tsai and Yeh, 2014;Ye et al., 2016;Miracle and Senkov, 2017), the multicomponent alloys containing five or more elements with equal or near-equal concentrations, have recently received increased attention due to their new and important properties, such as high strength at both room temperature and elevated temperatures (Senkov et al., 2011;Kang et al., 2018), exceptional ductility (Yao et al., 2014), and toughness (Patriarca et al., 2016). Numerous studies on CCAs were motivated by the possibility that the high configurational entropy may simply favor a single phase, such as face-centered cubic (FCC) or body-centered cubic (BCC) phases (Yeh et al., 2004b). Thus, research on CCAs has become tightly associated with finding single-phase solid solutions by controlling their configurational entropy.
Based on the elements they contain, CCAs can be classified into: 1) 3d transition metal alloys (formed of 4 or more of the following elements: Al, Co, Cr, Cu, Fe, Mn, Ni, Ti, and V), 2) refractory metal alloys (formed of 4 or more of the following elements: Cr, Hf, Mo, Nb, Ta, Ti, V, W, Re, and Zr), and 3) other alloys that include light metals and lanthanide transition metals (Miracle and Senkov, 2017). 3d transition metal alloys, for example Ni-based alloys, have been developed for high temperature applications in aircraft, power generation turbines, rocket engines, and other challenging environments (Ezugwu et al., 1999; Griffiths, 2019; Morinaga, 2019). More recently, the goal of producing metallic alloys with high melting temperatures, which could potentially be employed in nuclear reactors and comparable applications, has motivated the development of refractory alloys. Alloys with single phase or dual phase were reported to have high strength (Li et al., 2016; Singh et al., 2018; Maresca and Curtin, 2020) and high hardness (Borkar et al., 2016). Additionally, the variety of refractory elemental characteristics offers significant design flexibility for refractory multicomponent alloys. For instance, BCC MoNbTaVW has demonstrated a high Vickers micro-hardness of 11.4 GPa at 1,150°C (Xin et al., 2018) and a high yield strength of 1,246 MPa at room temperature, which decreases to 842 MPa at 1,000°C (Senkov et al., 2011). These results demonstrate the promising mechanical properties of refractory CCAs.
CCAs with simple crystal symmetry and remarkable mechanical properties draw the attention of scientists worldwide. The principal components analysis (PCA) technique has been employed to predict the single phase of multicomponent alloys (Zhang et al., 2008; Murty et al., 2014; Zhang et al., 2014). With this statistical technique, the variables of a dataset are reduced to principal components, which are orthogonal linear combinations of the original variables and preserve as much of the original information as the correlations allow. Based on the PCA technique, the mixing entropy (ΔS_mix), valence electron concentration (VEC), atomic size difference (δ), and mixing enthalpy (ΔH_mix) were utilized as critical conditions for the formation of CCA solid solutions. The formation of BCC CCAs requires that the following conditions are satisfied (Zhang et al., 2008; Zhang et al., 2014): −15 ≤ ΔH_mix ≤ 5 kJ/mol, δ ≤ 6.6%, 12 ≤ ΔS_mix ≤ 17.5 J/(K·mol) for CCAs that contain 5 or more elements, and VEC < 6.87. On the other hand, the calculation of phase diagrams (CALPHAD) has been widely used to predict the phase stability of CCAs and to understand their formation mechanisms. The Thermo-Calc software, together with its high entropy alloy database (Andersson et al., 2002; Chen et al., 2018), has been used for such calculations. These have been claimed to lead to good agreement with the experimental observations on the phase of refractory CCAs, such as MoNbTaTiVW and Ti_xNbMoTaW (Andersson et al., 2002; Gao et al., 2015; Zhang et al., 2015; Yao et al., 2016b; Yao et al., 2017; Chen et al., 2018; Han et al., 2018).
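The solid-solution criteria quoted above can be evaluated with a few lines of code. The sketch below is only an illustration, assuming the commonly used definitions ΔS_mix = −R Σ c_i ln c_i, ΔH_mix = Σ_{i<j} 4 H_ij c_i c_j, δ = 100·sqrt(Σ c_i (1 − r_i/r̄)²), and VEC = Σ c_i VEC_i. The function names are hypothetical, and the atomic radii and binary mixing enthalpies are illustrative inputs that should be replaced with values from a tabulated source.

```python
import math
from itertools import combinations

R = 8.314  # gas constant, J/(K*mol)


def solid_solution_parameters(fractions, radii, vec, pair_enthalpy):
    """Return (dS_mix, dH_mix, delta, VEC) for one composition.

    fractions:     {element: atomic fraction}, summing to 1
    radii:         {element: atomic radius}, any consistent unit
    vec:           {element: valence electron count}
    pair_enthalpy: {frozenset({A, B}): binary mixing enthalpy, kJ/mol}
    """
    ds_mix = -R * sum(c * math.log(c) for c in fractions.values() if c > 0)
    dh_mix = sum(
        4.0 * pair_enthalpy[frozenset((a, b))] * fractions[a] * fractions[b]
        for a, b in combinations(fractions, 2)
    )
    r_mean = sum(c * radii[el] for el, c in fractions.items())
    delta = 100.0 * math.sqrt(
        sum(c * (1.0 - radii[el] / r_mean) ** 2 for el, c in fractions.items())
    )
    vec_mix = sum(c * vec[el] for el, c in fractions.items())
    return ds_mix, dh_mix, delta, vec_mix


def satisfies_bcc_criteria(ds_mix, dh_mix, delta, vec_mix, n_elements):
    """Apply the thresholds quoted in the text."""
    ok = (-15.0 <= dh_mix <= 5.0) and (delta <= 6.6) and (vec_mix < 6.87)
    if n_elements >= 5:  # the entropy window is stated for 5+ elements only
        ok = ok and (12.0 <= ds_mix <= 17.5)
    return ok


# Illustrative inputs for equiatomic MoNbTaVW (assumed values, not from the paper).
fractions = {el: 0.2 for el in ("Mo", "Nb", "Ta", "V", "W")}
radii = {"Mo": 1.363, "Nb": 1.429, "Ta": 1.430, "V": 1.316, "W": 1.367}
vec = {"Mo": 6, "Nb": 5, "Ta": 5, "V": 5, "W": 6}
pairs = {
    ("Mo", "Nb"): -6, ("Mo", "Ta"): -5, ("Mo", "V"): 0, ("Mo", "W"): 0,
    ("Nb", "Ta"): 0, ("Nb", "V"): -1, ("Nb", "W"): -8,
    ("Ta", "V"): -1, ("Ta", "W"): -7, ("V", "W"): -1,
}
pair_enthalpy = {frozenset(k): float(v) for k, v in pairs.items()}

ds, dh, delta, vec_mix = solid_solution_parameters(fractions, radii, vec, pair_enthalpy)
is_bcc_candidate = satisfies_bcc_criteria(ds, dh, delta, vec_mix, len(fractions))
```

Keeping the entropy window conditional on the number of elements mirrors the wording of the criteria, which state that window only for alloys with 5 or more elements.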
Thanks to the increase in computational capacity, machine learning (ML) has accelerated the study of CCA phases (Lederer et al., 2018; Huang et al., 2019; Zhou et al., 2019; Zhang et al., 2020). ML has also been used to predict CCAs with predefined properties, such as high strength and high hardness (Chang et al., 2019; Himanen et al., 2019; Wen et al., 2019; Hu et al., 2022; Vazquez et al., 2022) and high elasticity (Kim et al., 2019). However, designing CCAs with desirable properties by ML urgently requires statistical analysis of these alloys. Many of the current databases for ML studies were built using only mathematical models such as the rule of mixture (ROM) (Couzinié et al., 2018; Roy et al., 2020; Li et al., 2021). ROM is a weighted-mean method for predicting the properties of alloys: the parameter of an alloy, $f_{mix}$, is estimated by $f_{mix} = \sum_i C_i f_i$, where $C_i$ and $f_i$ are the atomic fraction and the parameter of element i. However, when alloys form, lattice distortion may occur because of atomic-level mismatches between components (e.g., atom size, valence electrons, etc.). In this case, ROM may miss the potentially novel mechanical, electronic, and thermal properties of CCAs, leading to significant deviation of the mathematical models from reality. Therefore, physics-based optimizations which accurately characterize atomic interactions and atomic-scale features are critically needed for building databases. For example, Lederer et al. (2018) used the Lederer-Toher-Vecchio-Curtarolo (LTVC) method to create a dataset for predicting refractory alloys with a stable single phase by incorporating ab initio computed energies into a mean-field statistical mechanics model. To create a comprehensive dataset that can be used to train ML models to predict performance, more computational studies of CCAs are needed.
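As a small illustration of the rule of mixture described above, the following sketch computes the composition-weighted mean of an elemental property. The function name is hypothetical, and the elemental melting points are standard handbook values used here only as example inputs; they are not taken from the dataset.

```python
def rule_of_mixture(fractions, values):
    """Composition-weighted mean: f_mix = sum_i C_i * f_i.

    fractions: {element: atomic fraction}, expected to sum to 1
    values:    {element: elemental property value}
    """
    total = sum(fractions.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"atomic fractions sum to {total}, expected 1")
    return sum(c * values[el] for el, c in fractions.items())


# Elemental melting points in degrees Celsius (example inputs).
melting_c = {"Mo": 2623.0, "Nb": 2477.0, "Ta": 3017.0, "V": 1910.0, "W": 3422.0}

# Equiatomic MoNbTaVW: every element has atomic fraction 0.2.
equiatomic = {el: 0.2 for el in melting_c}
tm_rom = rule_of_mixture(equiatomic, melting_c)
```

Because the estimate is a plain average, it carries no information about how the elements interact, which is the limitation the paragraph above points out.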
In a recent study, the phase and melting temperature of quaternary and quinary refractory CCAs with equiatomic compositions were obtained using CALPHAD, and the results were compared with those obtained by ROM (Shaikh et al., 2020), demonstrating the advantage of CALPHAD in CCA prediction. In this work, we integrated CALPHAD and density functional theory (DFT) to examine the structural and mechanical properties of not only quaternary and quinary alloys, but also many binary and ternary refractory alloys with a stable single phase. The calculated structural and mechanical properties were compared with ROM and experimental observations. A dataset was built from the calculations and was used to train ML models to predict mechanical properties of CCAs such as hardness and elastic constants.

Methodology
Since most pure refractory metals have stable BCC crystals, it is desirable that multicomponent alloys which contain only refractory elements have a predominantly BCC crystal structure. In this paper, the general procedure for building the dataset of structural and mechanical properties for BCC refractory multicomponent alloys is as follows: 1) construct possible prototype binary, ternary, quaternary, and quinary alloys based on the Special Quasi-random Structure (SQS) (Zunger et al., 1990). The binary and ternary SQSs were provided by MedeA software, and the quaternary and quinary SQSs were generated through the Alloy Theoretic Automated Toolkit (ATAT) (van de Walle et al., 2002; van de Walle, 2009) (Figure 1A). The reliability of SQS models for calculating the vibrational, electronic, and mechanical properties of alloys was validated by Gao et al. (2016) through hybrid Monte Carlo/molecular dynamics simulations; 2) analyze the possibility of forming a stable solid state for each configuration based on the critical factors for forming solid solutions of high entropy alloys; 3) calculate the phase diagram using CALPHAD to determine the solid solution phases at various thermodynamic conditions, and select the alloys that have only the BCC phase at high temperature (Figure 1B); and 4) predict the structural and mechanical properties of alloys with a stable BCC phase by DFT calculations (Figure 1C). Once built, the dataset is used to train ML models to anticipate CCAs with advanced mechanical properties (Figures 1D,E). In building the dataset by DFT calculation, two questions are addressed: 1) Do alloys with fewer than five different types of elements still adhere to the critical factors (VEC, δ, ΔS_mix, and ΔH_mix) obtained by PCA? 2) How much of an advantage do DFT calculations have over the ROM method?

FIGURE 1
Workflow of machine learning based prediction of CCAs with advanced performance.

Frontiers in Metals and Alloys
frontiersin.org

The possible SQS configurations of the alloys include AB, A3B, ABC, A2BC, ABCD, and ABCDE, in which A, B, C, D, and E represent the refractory elements Cr, Hf, Mo, Nb, Re, Ta, Ti, V, W, and Zr. There were 1,077 alloys altogether, with initial configurations for 135 binary, 480 ternary, 210 quaternary, and 252 quinary alloys. CALPHAD calculations were used to eliminate alloys that do not have a stable single BCC phase at high temperature.
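The alloy counts above follow directly from combinatorics over the ten elements, and a short enumeration reproduces them. The sketch below is an illustration only; the function name and the dictionary layout are hypothetical and are not the authors' tooling.

```python
from itertools import combinations, permutations

ELEMENTS = ["Cr", "Hf", "Mo", "Nb", "Re", "Ta", "Ti", "V", "W", "Zr"]


def enumerate_prototypes(elements):
    """List compositions for the AB, A3B, ABC, A2BC, ABCD, and ABCDE prototypes.

    Each composition is a dict {element: stoichiometric coefficient}.
    """
    protos = {"binary": [], "ternary": [], "quaternary": [], "quinary": []}

    # AB: unordered pairs of distinct elements.
    for a, b in combinations(elements, 2):
        protos["binary"].append({a: 1, b: 1})
    # A3B: ordered pairs, because A3B and B3A are different alloys.
    for a, b in permutations(elements, 2):
        protos["binary"].append({a: 3, b: 1})

    # ABC: unordered triples.
    for combo in combinations(elements, 3):
        protos["ternary"].append({el: 1 for el in combo})
    # A2BC: choose the majority element, then an unordered pair of the rest.
    for a in elements:
        others = [el for el in elements if el != a]
        for b, c in combinations(others, 2):
            protos["ternary"].append({a: 2, b: 1, c: 1})

    # Equiatomic quaternary and quinary alloys.
    for combo in combinations(elements, 4):
        protos["quaternary"].append({el: 1 for el in combo})
    for combo in combinations(elements, 5):
        protos["quinary"].append({el: 1 for el in combo})
    return protos


protos = enumerate_prototypes(ELEMENTS)
counts = {kind: len(items) for kind, items in protos.items()}
total = sum(counts.values())
```

The binary count splits into 45 AB plus 90 A3B alloys, and the ternary count into 120 ABC plus 360 A2BC alloys.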
The structural and mechanical properties of BCC refractory multicomponent alloys in their ground states were obtained by DFT (Hohenberg and Kohn, 1964; Kohn and Sham, 1965) calculations. The unit cell of each BCC alloy defined by SQS was analyzed using the Vienna Ab Initio Simulation Package (VASP 5.4) (Kresse and Furthmuller, 1996). The electron-ion interactions were described by the projector augmented wave (PAW) method (Perdew et al., 1992), while electron exchange-correlation interactions were described by the generalized gradient approximation (GGA) (Perdew et al., 1996) in the Perdew-Burke-Ernzerhof (PBE) scheme (Monkhorst and Pack, 1976). The relaxation of the alloy atomic structures was performed using the conjugate-gradient algorithm (Gonze, 1997) implemented in VASP. The energy cutoff was set to 300 eV for the plane wave basis in all calculations, and the convergence criteria for energy and force in the relaxation processes were set to 10⁻⁵ eV and 10⁻⁵ eV/Å, respectively. A smearing parameter of ~0.2 eV was used for the Methfessel-Paxton (Methfessel and Paxton, 1989) technique.
Bulk modulus (B), shear modulus (G), and Pugh's ratio (B/G) (Pugh, 1954) of all alloys selected by CALPHAD were calculated at 0 K, using the Voigt-Reuss-Hill averaging scheme (Zuo et al., 1992). In addition, Young's modulus (E) and Poisson's ratio (ν) were calculated using the following equations: $E = 9BG/(3B + G)$ and $\nu = (3B - 2G)/[2(3B + G)]$. The Vickers hardness (H_v) was obtained by Tian's model (Tian et al., 2012). Consequently, a dataset of alloys which contains properties including the mentioned features can be built. In this dataset, the melting temperature T_m indicates the alloy temperature resistance, E describes the tendency of alloys to deform when stress is applied along a given axis, B denotes the deformation in all directions, and G represents deformation at constant volume. All features are essential for quantifying the alloy resistance to deformation.
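The chain from elastic constants to moduli and hardness can be sketched as follows. This is a minimal illustration that assumes the standard Voigt-Reuss-Hill expressions for a cubic crystal and the commonly quoted form of Tian's model, H_v = 0.92 k^1.137 G^0.708 with k = G/B and moduli in GPa. The function names and the input elastic constants are hypothetical and are not values from the dataset.

```python
def cubic_vrh_moduli(c11, c12, c44):
    """Voigt-Reuss-Hill bulk and shear moduli of a cubic crystal (GPa)."""
    bulk = (c11 + 2.0 * c12) / 3.0  # Voigt and Reuss bounds coincide for cubic
    g_voigt = (c11 - c12 + 3.0 * c44) / 5.0
    g_reuss = 5.0 * (c11 - c12) * c44 / (4.0 * c44 + 3.0 * (c11 - c12))
    shear = 0.5 * (g_voigt + g_reuss)  # Hill average
    return bulk, shear


def derived_properties(bulk, shear):
    """Young's modulus, Poisson's ratio, Pugh's ratio, and Tian hardness."""
    young = 9.0 * bulk * shear / (3.0 * bulk + shear)
    poisson = (3.0 * bulk - 2.0 * shear) / (2.0 * (3.0 * bulk + shear))
    pugh = bulk / shear
    k = shear / bulk
    hardness = 0.92 * k**1.137 * shear**0.708  # assumed form of Tian's model
    return {"E": young, "nu": poisson, "B/G": pugh, "Hv": hardness}


# Hypothetical elastic constants in GPa, chosen to be elastically isotropic
# (C44 = (C11 - C12) / 2) so the result is easy to check by hand.
B, G = cubic_vrh_moduli(c11=300.0, c12=100.0, c44=100.0)
props = derived_properties(B, G)
```

For the isotropic test input the Voigt and Reuss shear bounds coincide, so the derived values can be checked against the textbook relation E = 2G(1 + ν).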
The dataset was further screened using the Pearson correlation coefficient (Schober et al., 2018), computed with the Pandas library, to determine the association between any two features: $r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,\sigma_x \sigma_y}$, where n is the sample size, $x_i$ and $y_i$ are the values of the two features for sample i, $\bar{x}$ and $\bar{y}$ are the mean values of the two input features, and $\sigma_x$ and $\sigma_y$ are the (sample) standard deviations of the two features. When the absolute value of the correlation coefficient is close to 1, the two properties are tightly connected. A correlation coefficient close to zero, on the other hand, indicates that the two features are essentially uncorrelated.
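To make the screening step concrete, the sketch below evaluates the Pearson coefficient directly from its definition, using sample standard deviations. It is a plain-Python illustration with a hypothetical function name and toy data; in pandas the same quantity is available through `DataFrame.corr()`, whose default method is Pearson.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (n - 1 in the denominator).
    std_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    std_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    if std_x == 0.0 or std_y == 0.0:
        raise ValueError("correlation is undefined for a constant feature")
    covariance_sum = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return covariance_sum / ((n - 1) * std_x * std_y)


# Toy feature columns (not data from the paper).
feature_a = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_b = [2.1, 3.9, 6.2, 7.8, 10.1]  # nearly linear in feature_a
feature_c = [5.0, 4.0, 3.0, 2.0, 1.0]   # exactly anti-correlated

r_ab = pearson_r(feature_a, feature_b)
r_ac = pearson_r(feature_a, feature_c)
```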
Using the Scikit-learn library (Pedregosa et al., 2011), the dataset was used to train the ML models, which comprise the Neural Network (NN), Random Forest (RF) regressor, Gradient Boosting Regressor (GBR), and XGBoost (XGB) (Chen and Guestrin, 2016). In the NN model, information passes from the input layer through the hidden layer to the output layer to create the output. The RF model uses a large number of decision trees in an ensemble and averages them to increase prediction accuracy and decrease over-fitting. The GBR model is an ensemble model that consists of an iteratively built collection of tree models, in which each tree learns from the errors of the preceding ones. The XGB model is an efficient implementation of boosted decision trees. 90% of the data were used to train the ML models that predict the performance of refractory alloys, and 10% were used for testing and validating the outcomes. The GridSearchCV function from the Scikit-learn package, with five-fold cross-validation, was used to tune the machine learning models. The performance of each ML model is assessed using the mean absolute error, the average coefficient of determination, and the root-mean-squared error.
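The training protocol described above (90/10 split, grid search with five-fold cross-validation, three error metrics) might look like the following in scikit-learn. This is a sketch under stated assumptions: the data are synthetic stand-ins for the alloy features, and the hyperparameter grid and scoring choice are illustrative, since the paper does not list them.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: 200 samples, 4 features, one target.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] ** 2 + 0.05 * rng.normal(size=200)

# 90% of the data for training, 10% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Illustrative grid; the actual search space is an assumption.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,  # five-fold cross-validation
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out 10%.
y_pred = search.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
r2 = r2_score(y_test, y_pred)
```

The other regressors named in the text can be swapped in by replacing the estimator and its parameter grid; the split and the metrics stay the same.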

Results
According to the PCA technique mentioned above, the parameters VEC, ΔH_mix, and δ of an alloy should meet the following requirements for a stable BCC phase: VEC < 6.87, −15 ≤ ΔH_mix ≤ 5 kJ/mol, and δ ≤ 6.6%. Based on the PCA analysis, 545 out of 1,077 alloys were predicted to have a stable BCC phase. CALPHAD was then employed to calculate the phase diagrams of the alloys. Most alloys, especially at low temperature, have more than one stable phase. Possible phases at low temperature include BCC, hexagonal close-packed (HCP), sigma phase, and so on. This observation is plausible given that some pure refractory metals (Hf, Re, Ti, and Zr) are HCP crystals and others (Cr, Mo, Nb, Ta, V, and W) are BCC crystals. Alloys made up of different element types tend to have more stable phases at low temperature due to their complex interactions, while the proportion of the BCC phase increases with temperature for most refractory alloys. As a result, 393 refractory alloys were found to have only the BCC phase at high temperature. Figure 2 shows the transition temperature and melting temperature of alloys with only the BCC phase present prior to melting, where the transition temperature is the temperature at which the other phases dissolve. Binary alloys are marked by black triangles, ternary alloys by red triangles, quaternary alloys by green triangles, and quinary alloys by blue triangles. BCC crystal formation in multicomponent alloys is promoted when only BCC type elements are present. In this work, the phase diagrams of the alloys were investigated in the temperature range from 0°C to 3,500°C, and the transition temperature is 0°C for 111 out of 393 alloys. This indicates that these 111 multicomponent alloys contain exclusively the BCC phase throughout the investigated range up to melting.
As shown in Figure 2, the melting temperatures of all multicomponent alloys are above 1,300°C. These refractory alloys exhibit extremely high temperature resistance, and 245 of them have melting temperatures above 2,000°C. Some of the high melting temperature alloys are labeled in the figure for reference. It should be noted that refractory alloys with high concentrations of W, Re, and Ta are expected to have high melting temperatures, since W has a melting temperature above 3,000°C, followed by Re and Ta. For instance, the TaW3 alloy has the highest melting temperature, 3,315°C. Among the ternary alloys, MoTaW2 has the highest melting temperature, 3,036°C. Even the quaternary and quinary alloys MoReTaW, MoReTaVW, and MoNbReTaW show melting temperatures above 2,500°C.

Phase prediction by CALPHAD and PCA
A significant mismatch was found between PCA and CALPHAD in predicting the phase of the alloys. As mentioned above, 545 alloys out of 1,077 appeared to have a stable BCC structure based on PCA correlation studies of VEC, ΔH_mix, and δ, whereas according to the CALPHAD prediction only 393 alloys have a stable single BCC phase before melting. In detail, 18.7% of binary, 20.5% of ternary, 26.5% of quaternary, and 22.1% of quinary alloys from the PCA estimation do not have a stable single BCC phase based on the CALPHAD calculations. In particular, the two methods differed significantly in predicting the phase of alloys containing Re, Hf, and Zr. This is reasonable since the PCA criteria were derived from only a small group of CCAs (fewer than 100 alloys), for which insufficient data related to Re, Hf, and Zr were collected. In contrast, the CALPHAD calculations rely on a 15-element thermodynamic database built for CCAs, and nearly all of the stable solution phases of refractory binaries and ternaries in each of the evaluated systems are present in that database. In this case, the phases predicted by CALPHAD are considered more reliable.

Structure and mechanical properties by DFT and ROM
The SQS models for BCC crystal structures are shown in Figure 1A, where elements are represented in different colors. Based on SQS, there are 8 atoms per unit cell for AB binary alloys, 16 atoms in A3B, 36 atoms in ABC, 32 atoms in A2BC, 64 atoms in ABCD, and 125 atoms in ABCDE. The calculated structural and mechanical properties, as well as the melting temperature obtained by CALPHAD, are listed in Supplementary Table S1. The structural properties of refractory alloys, including density and volume, obtained by DFT and ROM are compared in Figures 3A,B. The diagonal line in Figure 3 marks perfect agreement between the values calculated by DFT and ROM. Unsurprisingly, the alloys have large densities because they are made of dense refractory elements. Both density and volume data are very close to the diagonal lines, with trend functions of Density_DFT = 1.007 × Density_ROM + 0.131 and Volume_ROM = 0.939 × Volume_DFT + 0.859, and correlation coefficients of 0.996 and 0.995, respectively. These indicate that the structural properties predicted by DFT and ROM are similar. The maximum difference in density between the DFT and ROM calculations is 7.54%, and the maximum difference in volume between the two methods is 6.82%. Table 1 lists the DFT, ROM, and experimental densities of selected refractory alloys. The error percentages (%), which give the deviation of DFT and ROM from the experimental data, are also listed for comparison. All error percentages are below 5.5%, indicating that both DFT and ROM calculations are close to the experimental data. In particular, the overall error percentages of DFT relative to experiment are less than 1%, whereas they are greater than 2% for the ROM estimation, demonstrating the higher accuracy of DFT.
The mechanical properties obtained by DFT and ROM differ more strongly, as shown in Figure 4, where the elastic constants C_11, C_12, and C_44, bulk moduli, shear moduli, and Young's moduli determined by DFT and ROM are compared. Even though the correlation coefficients of certain parameters are close to 1, significant differences between DFT and ROM are observed. The data in each panel are scattered, with correlation coefficients below 0.97. The correlation coefficients of the shear moduli and Young's moduli are much lower, which indicates that the DFT and ROM results for these properties are poorly correlated. Furthermore, the trend lines in each panel deviate from the diagonal, with trend coefficients (slopes) below 0.925, again indicating that DFT and ROM differ in calculating these properties. The poor correlation coefficients, especially for C_44, shear moduli, and Young's moduli, reflect the difference between DFT and ROM in predicting mechanical properties. To assess the predictions of the two methods more thoroughly, Table 1 lists the Young's moduli of various refractory alloys determined experimentally, by DFT, and by ROM. The DFT calculations have a much lower error percentage than ROM, which indicates that DFT is more in line with the experimental data for Young's modulus. This is reasonable since DFT takes into account how atoms interact physically, while ROM calculations average the mechanical properties mathematically.
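The comparison between DFT and ROM rests on three simple quantities: a least-squares trend line, a correlation coefficient, and a percentage error against experiment. The sketch below shows one way to compute them in plain Python; the function names are hypothetical and the sample values are placeholders, not entries from Table 1 or the supplementary data.

```python
import math


def linear_trend(x, y):
    """Least-squares line y = slope * x + intercept, plus Pearson r."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    syy = sum((v - mean_y) ** 2 for v in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


def percent_error(calculated, reference):
    """Absolute deviation from the reference value, in percent."""
    return abs(calculated - reference) / abs(reference) * 100.0


# Placeholder values standing in for ROM (x) and DFT (y) results.
rom_values = [8.0, 10.0, 12.0, 14.0, 16.0]
dft_values = [8.2, 10.1, 12.3, 14.2, 16.4]
slope, intercept, r = linear_trend(rom_values, dft_values)

# Placeholder calculated and measured values for one alloy.
err = percent_error(calculated=12.3, reference=12.0)
```

A slope near 1 with a small intercept and r near 1 corresponds to points lying on the diagonal; a high r with a slope well below 1 signals a systematic offset between the two methods rather than agreement.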

Evaluation of the quality of the dataset
As discussed above, a dataset that contains the phase, structural, and mechanical properties of refractory alloys was built based on CALPHAD and DFT calculations. The correlations between each pair of key parameters in the dataset are shown in the heatmap in the upper part of Figure 5, and the associated scatter plots are given in the lower part of the matrix. Feature pairs whose points in the scatter plot matrix lie near the diagonal or anti-diagonal lines have correlation coefficients with absolute value close to 1, which implies that the features are highly correlated. On the other hand, a correlation coefficient close to 0 corresponds to disordered data in the scatter plot matrix, which indicates that the feature pair is not correlated or, at most, weakly correlated. As shown in Figure 5, the majority of the analyzed features have correlations between 0.3 and 0.8. These findings suggest that no irrelevant or redundant features exist in the developed database, and that the DFT dataset of refractory alloy properties could yield reliable ML predictions for new high-performance CCAs.

ML study based on the dataset
ML is a rapidly developing technique for predicting materials with advanced performance. The dataset produced in this study has been used to predict properties of CCAs such as hardness and elastic constants, following the workflow shown in Figure 1. Various ML models were trained to investigate the mechanical properties of CCAs, as illustrated in Figure 1D. Based on the dataset, the Neural Network (NN) model was trained to predict the Vickers hardness of alloys. C0.1Cr3Mo11.9Nb20Re15Ta30W20 was predicted to have a hardness of 686 HV (Bhandari et al., 2021), which is an error of around 10% relative to the experimental value of 622.60 HV (Tian et al., 2012). The dataset was further used to train various ML models, including random forest regressor, gradient boosting regressor, and XGBoost regression models, to predict the mechanical properties of CCAs. For example, the elastic constants in the dataset were used to train those ML models (Bhandari et al., 2022), which were evaluated by the root-mean-squared error, the average coefficient of determination, and the mean absolute error. The gradient boosting regressor was found to have the highest prediction accuracy for elastic constants. The elastic constants of NbTaTiV predicted by the gradient boosting regressor based on the dataset match the experimental values well (Lee et al., 2020). Both examples highlight the quality of the dataset and the potential of training ML models to predict CCA properties.
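For reference, the roughly 10% error quoted above for the hardness prediction follows from the two reported values, assuming the error is taken relative to the experimental hardness:

```latex
\[
\frac{\lvert 686 - 622.60 \rvert}{622.60} \times 100\% \approx 10.2\%
\]
```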

Conclusion
In this work, a dataset for 393 refractory alloys containing 2 to 5 different element types was assembled by combining CALPHAD and DFT simulations. For each refractory alloy, the phase type, atomic structure, and mechanical properties were determined, which include melting temperature (T_m), volume (V), density (ρ), total energy (E_tot), elastic constants (C_11, C_12, and C_44), bulk modulus (B), shear modulus (G), Young's modulus (E), Poisson's ratio (ν), and Vickers hardness (H_v). For predicting the stable single phase of alloys at high temperature, the CALPHAD and PCA techniques were evaluated. Since its database includes more information about the phases of refractory alloys, CALPHAD is more trustworthy than the current PCA results for predicting the phase of alloys at various temperatures. The structural and mechanical properties determined by DFT and ROM were compared. The DFT predictions of the structural properties of refractory alloys are comparable to those predicted by ROM, while the DFT predictions of mechanical properties are more accurate. The dataset has been employed to predict, by ML, CCAs with advanced mechanical properties such as hardness and elastic constants. Since the CCA performance predicted by ML models trained on the refractory alloy dataset is consistent with experiments, the dataset can support refractory alloy design based on ML model training and property prediction.

FIGURE 5
Heatmap diagram and scatter plots matrix for one to one correlation between features.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.