A Data-Driven Framework for the Accelerated Discovery of CO2 Reduction Electrocatalysts

Searching for next-generation electrocatalyst materials for electrochemical energy technologies is a time-consuming and expensive process, even when it is enabled by high-throughput experimentation and extensive first-principles calculations. In particular, the development of more active, selective and stable electrocatalysts for the CO2 reduction reaction remains tedious and challenging. Here, we introduce a material recommendation and screening framework, and demonstrate its capabilities for certain classes of electrocatalyst materials for low- or high-temperature CO2 reduction. The framework utilizes high-level technical targets, advanced data extraction, and categorization paths, and it recommends the most viable materials identified using data analytics and property-matching algorithms. Results reveal relevant correlations that govern catalyst performance under low- and high-temperature conditions.


INTRODUCTION
CO2 emissions are the main cause of human-made global warming (Al-Ghussain, 2019). To avert the direst consequences of this global change, the Paris Agreement calls for a net 80-95% reduction of CO2 emissions by 2050 (Rogelj et al., 2015). The rapid development of sustainable energy sources and environmentally benign storage and conversion technologies is thus a foremost goal in scientific research and technology development, pursued collectively in countries around the world.
CO2 can be used as a renewable feedstock for the production of synthetic fuels or fuel precursors such as CO, CH3OH, and CH4, addressing the problem of the intermittency of renewably generated energy from wind turbines and solar cells (Qiao et al., 2014; Lu and Jiao, 2016; Zhu et al., 2016; Liu et al., 2017; Wang et al., 2017). This energy storage pathway renders the CO2 reduction reaction (CO2RR) a crucial and extensively researched electrochemical process (Lin et al., 2020; Mandal, 2020).
CO2RR processes inside an electrochemical cell require stable, cost-effective and high-performing electrocatalyst materials. The challenge of optimizing catalytic materials, electrodes and devices for the CO2RR calls for further investigation into the factors that control their catalytic activity and stability. The electrocatalytic media are usually heterogeneous composites that embed the active material into a host medium with suitable transport properties for gaseous reactants, liquid products, dissolved ions, and electrons. These media can undergo significant changes in structure and composition under operation through various phenomena such as Ostwald ripening, particle detachment or coagulation in nanoparticle-based catalyst layers; surface reconstruction, oxidation or passivation by irreversible adsorption; or electrolyte disintegration. In addition, inhibited mass transport due to non-optimal wetting of the porous electrode could cause additional voltage loss or limit the current density (CD) that a cell could attain.
A recently performed cost-benefit analysis has shown that electrochemical CO2 conversion processes need to be economically viable at the system level, while the faradaic efficiency (FE) and energy efficiency (EF) must be maximized at the component and cell levels (Kibria et al., 2019; Lin et al., 2020). The hydrogen evolution reaction (HER) is an unwanted parallel process in CO2 reduction cells, which reduces the yield of synthetic fuel or fuel precursor production (Goyal et al., 2020). Minimizing hydrogen production requires electrocatalyst materials that are highly selective toward the desired reaction pathway.
The integration, testing and qualification of new catalyst materials is a tedious and time-consuming process as there are limitations even for the best catalysts due to specific compatibilities that are required with other components in a membrane electrode assembly (MEA), single cell or stack of the electrochemical device. Challenges in this context involve reactant and product transport as well as water and heat balances. These phenomena are coupled across multiple components and interfaces in a cell, and they determine 3D distributions of local reaction conditions in active electrode media. Assessing the impact of a new catalyst material on performance at cell and stack levels is thus a complex undertaking. An electrocatalyst material may show markedly improved activity and selectivity in a well-defined lab set-up under precisely controlled reaction conditions; but this improvement may not transpire when the material is incorporated into a real cell and tested under realistic conditions. Complex electrocatalytic media cannot be studied solely with computational studies based on density functional theory (DFT). Usually, the complexity of materials, components and physicochemical phenomena to be considered as well as the interplay of solvation effects, charge transfer, and electric field effects at the interface, warrant a well-devised hierarchical framework in modeling and simulation. This framework should interweave computational approaches, including DFT as well as classical simulations, microkinetic modeling of reaction mechanisms, interface and charge transfer theory, and continuum modeling of transport processes at the electrode level, to rationalize local reaction conditions, decipher reaction mechanisms and calculate reaction rates. 
Considering all of these aspects, the theory-driven approach towards the development of highly active, selective, and stable electrocatalysts for the CO2RR remains a highly challenging task (Elouarzaki et al., 2019; Ju et al., 2019).
The discovery and scale-up of integrated materials, i.e., those materials that are integrated into a component, cell, and device to fulfill certain functionalities at the device level, require significant capacities for characterization, testing, and optimization at all structural levels. The discovery-to-demonstration pipeline of new electrocatalyst materials, including fabrication scale-up and integration with other cell components is thus more complex to follow through than it is for simpler, so-called "molecular materials," where minimal integration and optimization is required beyond materials properties (Elouarzaki et al., 2019;Ju et al., 2019). Apart from performance metrics related to activity, yield and selectivity, the degradation of cell components, overall system durability and overall cell lifetime present essential issues to be addressed, which are related to the stability of a catalyst material for relevant environmental conditions and operating regimes.
The key attributes of a successful design of CO2 reduction cells include high mass activity of electrocatalysts to perform well at low overpotential and reasonable materials cost, a catalyst layer microstructure that facilitates charge and mass transfer, well-attuned wettabilities of porous transport media to optimize the water distribution across the cell, and mechanical and chemical durability. New approaches in materials design and integration are needed to realize the selective transformation of CO2 into desired products in scaled-up pilot or industrial setups.
Numerous investigations have recently been undertaken to design, synthesize and develop new CO2RR electrocatalysts (Lu and Jiao, 2016; Liu et al., 2019). Machine learning (ML) and data-driven methods provide a powerful set of tools to accelerate materials discovery (De Luna et al., 2017; Cao et al., 2018). Fundamentally, ML is the practice of using statistical algorithms to parse data, learn from a set of indicators (performance metrics) and then make a fast determination or prediction of target performance properties for new data sets. ML in materials science is mostly concerned with supervised learning. The selection of high-quality (accurate) datasets and of an appropriate set of descriptors is more important than the selection of the ML algorithm itself, and it is the first step in building any ML application. The suitable ML model, denoted as classification, regression, or rank-ordering model, depends on the desired outcome (Elouarzaki et al., 2019).
Describing all the complexities of the electrochemical interface within a DFT model, considering the number and type of components (catalyst, solvent molecules, ions, etc.), as well as the fundamental physics involved (electric fields, solvation free energy, charge transfer kinetics etc.), is challenging due to computational limitations.
Classification models are designed to allocate a substance to one of a given number of categories, such as active and inactive catalysts; they can be used to separate groups of molecules or materials according to the presence or absence of a target property. For instance, CO2RR electrocatalysts can be classified based on their faradaic efficiency or product selectivity. Regression models, in contrast, attempt to determine a function that represents a continuous hypersurface relating indicator variance to observable electrocatalytic properties. They are used where prediction and discovery of a missing physicochemical property such as performance or selectivity are needed (Varnek et al., 2007). Ranking models output the order of electrocatalysts for a specific property; they are highly useful for electrocatalyst design and discovery, where the priority of one property over another is more important than its exact value (Goldsmith et al., 2018; Lamoureux et al., 2019; Schleder et al., 2019).
ML models have been utilized in a variety of energy material applications to design and discover novel electrocatalyst materials with superior performance (e.g., higher energy density and higher energy conversion efficiency) (Meyer et al., 2018; Zahrt et al., 2019). Such models can have a transformative impact on the development of low-cost CO2RR catalysts with high product selectivity and maximal performance (Goldsmith et al., 2018; Kitchin, 2018; Schlexer Lamoureux et al., 2019; Gusarov et al., 2020; Smith et al., 2020). For instance, ML models have been used to disentangle catalyst-adsorbate interactions for various reactions, including CO2RR (Ma et al., 2015; Tran and Ulissi, 2018). A combination of advanced optimization tools based on ML and other conventional approaches has been developed to predict electrocatalyst performance for CO2 reduction and H2 evolution (Tran and Ulissi, 2018).
In this work, we demonstrate a data-driven framework for materials screening, which is applied in particular to low- and high-temperature catalysts for CO2 reduction (Garza et al., 2018; Kibria et al., 2019; Malek et al., 2019; Chou et al., 2020). A viable electrocatalyst for the CO2RR must satisfy performance metrics related to current density, faradaic efficiency, energy efficiency, overpotential, production rate, and chemical stability. Correlations among these performance metrics at low or high temperature remain largely unknown and require extensive data analytics.
Our data-driven methodology is designed with the objective of integrating domain-specific data sources in order to eliminate difficulties in data collection and interpretation from multiple sources and data types. The integration process consists of a combination of "modular" sub-processes to build "standardized energy materials data" in real time with advanced filtering, scale-up and cognitive insights, ML, and fundamental data analytics functionalities, including visualization and big-data management tools. The recommendation system and decision module utilize high-level technical targets as input data, which can be displayed in the form of radar (or spider) charts; advanced data extraction and categorization using deep learning techniques; property-matching algorithms to search for the best viable materials that satisfy selected high-level technical targets; and finally a multi-parameter optimization to recommend top choices in connection with ML algorithms.

Application-Driven Architecture
In order to offer scale-bridging capabilities to connect crucial steps in materials design-to-device integration, an application-driven architecture has been introduced and demonstrated (Malek et al., 2019). The central part of this architecture is an embedded master data lake, consisting of large-scale metadata for electrocatalyst materials, which is assembled from various types and sources of materials data. Key technical targets such as activity (i.e., the faradaic efficiency), stability, and selectivity are usually defined at cell and device level and may also correlate differently at low or high temperatures with physicochemical properties of electrocatalysts and the operating conditions at cell or device levels (Chan and Li, 2014; Nitopi et al., 2019). Figure 1 illustrates the functional layers of the ML-enabled data analytics approach and its underlying workflow. The workflow comprises various layers including user-defined or default data sources and databases, analytics modules, and self-driving algorithms, which are commonly used in materials discovery approaches, regardless of the corresponding field of application. The complexity of scale-up and discovery of integrated materials also implies the need for ad hoc communication among parallel or serial synthesis and characterization steps or equipment, in-device component integration, and device testing or validation. This all-embracing workflow along the complete development pipeline can enhance data communication and promote understanding of correlations among structure, functional properties, and performance indicators at all scales from materials discovery to device performance and optimization.

Master Data Lakes
A vital prerequisite for any type of ML application is the provision of a suitable dataset for a given domain. The search for new electrocatalyst materials essentially needs a minimal and sufficient set of performance indicators from the "chemical domain" and the "property domain" of different electrocatalyst materials (Flores-Leonar et al., 2020).
The master database is built from materials datasets collected from a wide range of sources and user types, namely 1) unpublished records of academic researchers, 2) published articles, and 3) other public records and industry reports. The details of the data retrieval from images, tables and texts are described in Malek et al. (2019). The resulting database undergoes metadata preprocessing and cleaning and is stored in Excel or CSV format with predefined and standardized headers.
In this article, the CO2RR experimental databases were generated from literature sources on the basis of seven input variables: electrocatalyst type, faradaic efficiency, applied potential, current density, type of electrolyte, major product, and temperature. Each experimental data point is characterized by a set of performance indicators for catalyst formulation and reaction conditions, either as continuous values (such as current density) or as categorical values (such as catalyst type). The ranges of the corresponding input variables are summarized in Table 1.
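As a rough illustration of how one such data point might be held in code, the sketch below loads CSV rows into a small record type covering the seven input variables. The column names, the `CO2RRRecord` class and the two sample rows are hypothetical placeholders chosen for this example; they are not the headers or values of the actual master database, and no units are implied.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CO2RRRecord:
    """One experimental data point (field names are illustrative assumptions)."""
    catalyst_type: str                    # categorical
    faradaic_efficiency: Optional[float]  # continuous, None if not reported
    applied_potential: Optional[float]    # continuous, None if not reported
    current_density: Optional[float]      # continuous, None if not reported
    electrolyte: str                      # categorical
    major_product: str                    # categorical
    temperature: Optional[float]          # continuous, None if not reported


def _to_float(cell: str) -> Optional[float]:
    """Convert a CSV cell to float; empty cells become None (missing value)."""
    cell = cell.strip()
    return float(cell) if cell else None


def load_records(csv_text: str) -> List[CO2RRRecord]:
    """Parse CSV text with standardized headers into typed records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        CO2RRRecord(
            catalyst_type=row["catalyst_type"].strip(),
            faradaic_efficiency=_to_float(row["faradaic_efficiency"]),
            applied_potential=_to_float(row["applied_potential"]),
            current_density=_to_float(row["current_density"]),
            electrolyte=row["electrolyte"].strip(),
            major_product=row["major_product"].strip(),
            temperature=_to_float(row["temperature"]),
        )
        for row in reader
    ]


# Synthetic placeholder rows, NOT taken from the paper's database.
SAMPLE_CSV = """catalyst_type,faradaic_efficiency,applied_potential,current_density,electrolyte,major_product,temperature
Cu,60.0,-1.1,5.0,KHCO3,CH4,25
Cu,,-0.9,3.0,KHCO3,C2H5OH,25
"""

records = load_records(SAMPLE_CSV)
```

Keeping missing entries as `None` rather than dropping the row leaves them available for later filling by a prediction model.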

Machine Learning Algorithms
ML classification models can be used to identify and classify materials or map them in terms of their properties (descriptors), which is the first essential requirement prior to any ML-based prediction. We use the Scikit-learn package in the ML modules (Pedregosa et al., 2011). The ML algorithms employed for classification of electrocatalyst and product type include logistic regression (LR), linear discriminant analysis (LDA), k-nearest neighbors (KNN) classifier, and random forest (RF) classifier. In addition, we tried to classify groups of products by putting all possible products into two or three larger groups of products. In order to compare the predictive capability of different models for finding missing data, four ensemble ML algorithms were evaluated: Bagging Regression (BR), Gradient Boosting Regression (GBR), Random Forest Regression (RFR), and Extra Trees Regression (ETR). BR is an ensemble method that fits regressors on random subsets of the original dataset and makes a final decision based on the aggregated prediction. The bagging method increases the robustness of the original set of models by introducing randomness during the training process and then ensembling their predictions. GBR builds a model in a forward stage-wise fashion, which enables optimization of any differentiable loss function. RF is a typical ensemble learning model that operates by building a set of decision trees and averaging the predictions of the individual trees. Random decision forests are superior to single decision trees because they mitigate over-fitting. Finally, extra trees implement a meta-estimator that fits several randomized decision trees on different sub-samples of the dataset and uses the mean of the trees to improve the predictive performance and reduce the variance. ETR and RFR models have been shown to be promising in the modeling of chemical systems.
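The bagging idea described above (fit base regressors on bootstrap subsets, then aggregate their predictions) can be sketched in a few lines of plain Python. This is a minimal illustration and not the Scikit-learn implementation used in the study: the base learner here is a 1-nearest-neighbor regressor chosen only for brevity, and the function names `fit_1nn`, `bagging_fit` and `bagging_predict` are hypothetical.

```python
import random


def fit_1nn(X, y):
    """Base learner (assumption for brevity): 1-nearest-neighbor regressor."""
    def predict(x):
        # Index of the training point closest to x (squared Euclidean distance).
        nearest = min(
            range(len(X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)),
        )
        return y[nearest]
    return predict


def bagging_fit(X, y, n_estimators=25, seed=0):
    """Fit one base learner per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_1nn([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Aggregate by averaging the predictions of all base learners."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)


# Toy data for illustration only.
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0.0, 1.0, 2.0, 3.0]
ensemble = bagging_fit(X_toy, y_toy, n_estimators=10, seed=1)
y_hat = bagging_predict(ensemble, [1.5])
```

Averaging over bootstrap replicas is what reduces the variance of the individual learners; random forests and extra trees add further randomization to tree-based learners on top of this scheme.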
Each algorithm was trained on the training data for the CO2 reduction reaction. The algorithms were then applied to predict faradaic efficiency, applied potential (AP), and current density for the test dataset. We used the ML hyperparameter optimization module to tune hyperparameters automatically.
The accuracy score (%) (i.e., the ratio of correct predictions to the total number of predictions) is used as a performance metric for the evaluation of each classification algorithm. The performance of each ML algorithm for prediction was evaluated by using several statistical indicators, namely the mean squared error (MSE), the root mean squared error (RMSE), and the coefficient of determination (R²):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2},$$

in which $y_i$ and $\hat{y}_i$ are the true and predicted values, respectively, $\bar{y}$ is the mean of the true values, and $n$ is the number of samples.
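For readers who want to reproduce these indicators without a library, the following sketch computes them from their standard definitions. The function names are chosen for this example; the accuracy is returned as a fraction and would be multiplied by 100 to obtain a percentage.

```python
import math


def accuracy(labels_true, labels_pred):
    """Fraction of predictions that match the true class labels."""
    correct = sum(t == p for t, p in zip(labels_true, labels_pred))
    return correct / len(labels_true)


def mse(y_true, y_pred):
    """Mean squared error between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy numbers for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(mse(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```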

Modular Design
The complexity of the materials design-to-device integration calls for a modular approach, in which various data management tasks and data analytics tools are built and tested in isolation, as stand-alone modules. The suitable modules are then called and integrated into the main platform depending upon the application area, required analysis tools, and type of metadata that the user needs for the analysis. In the following, we describe the adaptation of each module and their interdependencies for the analysis of electrocatalytic materials for the CO2RR.

Classification and Materials Data Extraction
This module utilizes a classification algorithm that categorizes catalyst materials in the form of performance range (e.g., potential or current density) or selectivity or type of products.
The reference values for high-level technical targets are based on a "performance matrix" that is provided as the default for a particular field of application or as a user-entry table for the target values. These initial values can be seen as the first set of keywords for data mining and data discovery from the literature for a given material application field or sub-classes therein such as low-temperature catalysts or high-temperature catalysts. The extracted data is then mapped on these key technical parameters and other crucial measurement conditions for each class of materials.

Materials Property Prediction
This module can predict a specific electrocatalytic property such as the faradaic efficiency as a function of input or exploratory variables using embedded ML models. The results of these ML prediction models can refine the usefulness and relevance of the user input variables. The module also helps fill missing data points related to performance indicators or target properties in the database and thus enriches the master database. In this context, electrode type, current density, voltage, polarization resistance, conductivity, electrolyte type and composition, temperature, type of product, and (rarely) faradaic efficiency are among the key factors that can influence CO 2 RR performance.

Recommendation System and Decision Models
The performance tuning algorithm is the first layer of the recommendation module that uses the complete dataset to find the best electrocatalyst material based on performance and stability metrics' target values. It displays the information using standard visualization tools, for example, using a radar chart. A radar chart is a typical visualization tool employed in benchmarking electrocatalyst materials for the purpose of quality and performance improvement of a system of materials or an electrochemical device (Basu, 2004). The use of radar charts makes two significant contributions: first, it provides a simple 2D visual representation of multiple performance indicators without the need of using dimensionality reduction on multivariate data; second, the enclosed area, formed by spikes (or axes), can be considered as an intuitive electrocatalyst performance indicator. The ML-powered recommendation module uses the power of regression modeling to predict values for the missing data as accurately as possible. Supplementary Table S1 shows the sample data statistics used to train the regression models for predicting the missing data, specifically for applied potential, current density, and faradaic efficiency. Datapoints for four types of electrocatalyst material were selected, as there was not enough data for predicting other variables in the CO 2 experimental database.

Data Matching and Validation
The ultimate criterion for ML-based predictive capabilities is experimental validation, which demonstrates how computer algorithms lead to real discoveries. After predicting the best candidate electrocatalyst material, the prediction can be validated by direct comparison to experimental data for the same or almost the same set of conditions and materials specifications (Malek et al., 2019).
In our predictive algorithm, CO 2 electrocatalyst materials are generally categorized into three main groups: metallic, nonmetallic, and molecular catalysts. Each category of electrocatalyst materials exhibits distinct physicochemical and electrocatalytic properties. Therefore, it is possible that the performance of an electrocatalyst material is restricted and limited to the group of catalyst materials it belongs to. Here, we used ML classification models to sort different electrocatalyst materials into different groups based on their performance. The numerical data are normalized between 0 and 1, and we encoded the categorical data using "OneHotEncoder" from the Scikit-learn data preprocessing module (Pedregosa et al., 2011).
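The two preprocessing steps mentioned above can be illustrated with a library-free sketch. It is a plain-Python stand-in written for this example, not the Scikit-learn `OneHotEncoder` used in the study; the helper names `min_max_scale` and `one_hot` and the sorted category order are assumptions.

```python
def min_max_scale(values):
    """Rescale numerical values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """Encode categorical labels as 0/1 indicator vectors.

    Returns the category list (sorted) and one indicator row per label.
    """
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        rows.append(row)
    return categories, rows


# Toy inputs for illustration only.
scaled = min_max_scale([2.0, 4.0, 6.0])
cats, encoded = one_hot(["metallic", "molecular", "metallic"])
```

Scaling puts indicators with different magnitudes on a comparable footing, while one-hot encoding avoids imposing an artificial order on categories such as catalyst group or electrolyte type.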
Most of the data in our Master database at low temperature are for Cu electrocatalyst, with the key properties of AP, CD, and FE, type of electrolyte, and type of product. Material properties predictions thus focus on these attributes. Figure 2 shows the workflow of material recognition. In order to identify an electrocatalyst material for a given electrochemical process, it is expected that the performance metrics of the chosen electrocatalyst meet or exceed the target values set by the user. For this purpose, one needs to consider the key performance metrics, i.e., faradaic efficiency, current density, applied potential, selectivity, and production rate, to select the best electrocatalyst material. In practice, keeping track of all variables and establishing correlations among optimization parameters in an electrochemical reaction path is a difficult task; once a set of properties is set to the optimum values, other properties of the catalyst can have values which are below user requirements. We attempt to address this challenge by introducing a penalty function for any value less than the desired value for a target application variable.

Materials Recommendation
The recommendation process shown in Figure 2 initially takes input from the user-specified target values. The recommendation then selects the "best" electrocatalyst or recommends electrocatalyst materials primarily based on the targets for the set of performance metrics defined by the user. Global target values are provided as default if no user-entry target values are available.
In order to minimize the optimization effort and to provide fast and clear identification criteria, the user is offered one of the following identification schemes: 1) find any electrocatalyst material for some desired value of a metric, with any chemical product; 2) find any electrocatalyst material for target value metrics for a specific set of chemical products; 3) find some desired performance metrics, within specific electrocatalyst material groups, with any product; 4) find some desired properties within specific electrocatalyst material groups, for a certain set of chemical products. The user supplies target values for selected metrics, electrocatalyst type, and chemical products, and is able to filter data based on products and electrocatalyst material or simply select all the possibilities. If the user provides target values for all metrics, the recommendation algorithm selects an electrocatalyst material with properties equal to (with less than 10% deviation) or better than the user target. If the user provides target values for only some of the properties, the algorithm uses default global target values for those target properties that are not provided by the user.
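The selection logic described above can be sketched as a simple filter. Several assumptions are made for this illustration and are not taken from the paper: all metrics are normalized so that larger is better, "less than 10% deviation" is read as a value of at least 90% of the target, and the default targets, field names and sample rows are hypothetical.

```python
# Hypothetical global default targets on a normalized 0-1 scale.
DEFAULT_TARGETS = {"FE": 0.8, "CD": 0.5, "AP": 0.5}


def resolve_targets(user_targets, defaults=DEFAULT_TARGETS):
    """Use the user's targets where given, global defaults otherwise."""
    merged = dict(defaults)
    merged.update({k: v for k, v in user_targets.items() if v is not None})
    return merged


def meets_targets(row, targets, tolerance=0.10):
    """True if every metric is at least (1 - tolerance) times its target."""
    return all(row[name] >= t * (1.0 - tolerance) for name, t in targets.items())


def candidates(rows, user_targets, products=None, catalysts=None, tolerance=0.10):
    """Filter rows by optional product/catalyst sets, then by the targets."""
    targets = resolve_targets(user_targets)
    selected = []
    for row in rows:
        if products is not None and row["product"] not in products:
            continue
        if catalysts is not None and row["catalyst"] not in catalysts:
            continue
        if meets_targets(row, targets, tolerance):
            selected.append(row)
    return selected


# Synthetic rows for illustration only.
rows = [
    {"catalyst": "Cu", "product": "CH4", "FE": 0.85, "CD": 0.60, "AP": 0.55},
    {"catalyst": "Cu", "product": "C2H5OH", "FE": 0.75, "CD": 0.50, "AP": 0.50},
    {"catalyst": "Ag", "product": "CO", "FE": 0.50, "CD": 0.90, "AP": 0.90},
]
any_product = candidates(rows, {"FE": 0.8})
only_ch4 = candidates(rows, {"FE": 0.8}, products={"CH4"})
```

Leaving `products` and `catalysts` as `None` corresponds to scheme 1, setting only `products` to scheme 2, only `catalysts` to scheme 3, and both to scheme 4.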
Here, a simplified, yet straightforward method for selecting an electrocatalyst material is employed by using a radar chart to identify the material, which encloses the graph's maximum area. Although this heuristic method can be seen as practically useful, it may lead to a biased selection with few performance indicators at high values, while others remain at low values. It ignores the ranking and importance of different variables.
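The area enclosed by a radar chart with equally spaced axes follows from summing the triangles between neighboring spokes, which gives A = (1/2) sin(2π/k) Σ r_i r_(i+1) for k axes with radii r_i. The sketch below uses this formula with made-up indicator values to show the bias noted above; the function name `radar_area` is chosen for this example.

```python
import math


def radar_area(radii):
    """Area of the polygon spanned by k >= 3 equally spaced radar-chart axes."""
    k = len(radii)
    if k < 3:
        raise ValueError("a radar chart polygon needs at least three axes")
    wedge = math.sin(2.0 * math.pi / k)
    # Sum the triangles formed by each pair of neighboring spokes.
    return 0.5 * wedge * sum(radii[i] * radii[(i + 1) % k] for i in range(k))


# Made-up normalized indicator values for two hypothetical materials.
balanced = [0.6, 0.6, 0.6, 0.6, 0.6]
skewed = [1.0, 1.0, 1.0, 0.1, 0.1]

print(radar_area(balanced), radar_area(skewed))
```

In this toy case the skewed material encloses the larger area even though two of its five indicators are close to zero, which is the kind of biased selection that motivates a scoring approach with explicit penalties.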
Our optimization algorithm employs a special scoring factor that assigns positive values to properties that are higher than or equal to the user target values and penalizes properties that are lower than the user target values. The magnitude of the penalty function grows for performance indicators that are significantly lower than the target values. This sub-routine recommends catalyst materials that exhibit high values in one or multiple attributes from the performance matrix table.
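The scoring factor for row $i$ of the database is defined by

$$\mathrm{Score}_i = \sum_{j=1}^{k} \mathrm{difference}_{ij},$$

where $k$ is the number of target properties ($P$) and

$$\mathrm{difference}_{ij} = \begin{cases} P_{ij} - P_{\mathrm{UserTarget},j}, & \text{if } P_{ij} \geq P_{\mathrm{UserTarget},j} \\ 5^{\left(P_{ij} - P_{\mathrm{UserTarget},j}\right)} - 1, & \text{if } P_{ij} < P_{\mathrm{UserTarget},j} \end{cases}$$

Here $P_{ij}$ is the default target value of property $j$ for row number $i$ and $P_{\mathrm{UserTarget},j}$ is the user-defined target value for property $j$.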
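A minimal sketch of this scoring step is given below. It rests on two assumptions about the formula: that the per-property terms are summed over the k target properties, and that the penalty branch is the exponential form 5^(P_ij − P_UserTarget,j) − 1, which is continuous with the reward branch at the target. It also assumes that P_ij is the normalized value of property j stored in database row i. The function names and sample rows are hypothetical.

```python
def difference(p, target, base=5.0):
    """Per-property term: linear reward at or above target, penalty below."""
    if p >= target:
        return p - target
    # Assumed exponential penalty; equals 0 at p == target and tends to -1.
    return base ** (p - target) - 1.0


def score(row, targets, base=5.0):
    """Scoring factor of one row: sum of the per-property terms (assumed)."""
    return sum(difference(row[name], t, base) for name, t in targets.items())


def rank(rows, targets, base=5.0):
    """Rows sorted from highest to lowest scoring factor."""
    return sorted(rows, key=lambda r: score(r, targets, base), reverse=True)


# Synthetic normalized values for illustration only.
targets = {"FE": 0.8, "CD": 0.5}
rows = [
    {"name": "A", "FE": 0.9, "CD": 0.4},  # exceeds FE target, misses CD target
    {"name": "B", "FE": 0.8, "CD": 0.5},  # meets both targets exactly
]
ranked = rank(rows, targets)
```

Under these assumptions a shortfall of 0.1 costs more than a surplus of 0.1 gains, so material B, which meets both targets, ranks above material A.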
The constraint for the penalty function is set at 5, representing the maximum error tolerated. Once the scoring factors for each row in the database are calculated, the algorithm recommends electrocatalyst materials with high score values, as illustrated in Supplementary Figure S1.

Low-Temperature Electrocatalyst Materials
Figure 3 shows the visualization of data, which is distributed among applied potential, current density, and faradaic efficiency for different types of electrocatalysts at low or high temperature. The diagonal panels show the density distribution of each respective feature in the form of bar charts. Among the possible choices of electrocatalysts at low temperature, mainly four types of Cu-based electrocatalysts are used for the classification task. The dataset consists of 228 data points for Cu electrocatalyst materials, which are divided into training and test datasets of 183 and 45 data points, respectively. Each data point consists of a set of properties for a given material. The same material may appear in different data points with different operating conditions. The materials space is then scanned using a set of descriptors, such as selectivity for a given product or performance indicators against a reference target range. The latter is performed using machine learning techniques. Model performance for classification of the type of electrocatalyst and type of products was evaluated through the calculation of an accuracy score. As illustrated in Table 2, the key indicators (AP, CD, FE, product selectivity) have high cross-validation scores, which can vary according to the ML algorithms. The LR and LDA classifiers are found to return the highest accuracy score of 81% for determining the type of electrocatalyst. The quadratic discriminant analysis (QDA) classifier has an accuracy score of 32%, which is markedly lower than that of the other classifiers.
As shown in Table 3, the indicators AP, CD, FE, and type of electrocatalyst yield a higher accuracy for classification of the group of two products (CH4, C2H5OH) than for the two other groups, each consisting of three different products. The RF and LDA classifiers return accuracy scores of 1 and 0.93, respectively, for all test cases. In general, the RF classifier performs best among the algorithms for the classification of the type of products, regardless of the number of products.
LR, LDA, QDA, and Gaussian naive Bayes (GNB) algorithms were unable to distinguish and single out one group of products, including those with three different products. Additionally, GNB returns an accuracy score of 26%, the lowest of all six algorithms. The ML algorithms perform better for the group with two products than for the groups with three different products, as can be seen from a comparison of the accuracy scores for classification of the type of electrocatalyst or products reported in Table 2 and Table 3. More indicators, such as the reaction conditions (pH, mass loading of catalyst, production rate, and concentration) for each reaction, would be needed to improve the performance of the classification scheme. Table 4 lists the performance of predictive analytics using MSE for various experimental numerical values, i.e., AP, CD, and FE. ETR is seen to have the best predictive capability, with the lowest error among the algorithms. In order to quantitatively obtain a prediction model for FE, AP and CD, we employed the BR, GBR, ETR, and RFR algorithms. Models were trained on the training data (80% of the full dataset), and the remaining 20% was used as test data.

Frontiers in Energy Research | www.frontiersin.org | April 2021 | Volume 9 | Article 609070
The scatter plots of the predicted outputs vs. the actual values for the training, testing, and overall data sets using the RFR and ETR algorithms are presented in Figure 4. The coefficient of determination (R²) indicates a strong correlation between the predicted and actual values for CD and AP. The AP, CD, and FE results show excellent agreement between the actual values and the RFR, GBR, and ETR predictions, with R² > 0.90 and MSE < 0.008 for all of the ensemble modeling cases. For faradaic efficiency, ETR and RFR achieve better R² and MSE values on the test data than the other regressors.
Success with ML depends on the number of descriptors and their correlations, as well as on the availability of large training datasets. The true benefit of structure-property relationships revealed through ML models lies in the multivariate correlations and their interpretation in terms of the fundamental materials properties.
The missing values in the primary database can nonetheless be filled with values extrapolated from ML by building a model that relates known indicators of materials to target properties. Our ML model has successfully predicted different properties such as FE or CD, the type of electrocatalyst, and the major products related to a specific type of catalyst. This process has been carried out iteratively. After filling missing values, the database is ready for screening electrocatalyst performance by means of analytical and visualization tools.
Utilizing all available and supplemented databases, rapid screening of electrocatalyst materials was carried out, while the user would be able to specify target values for various properties. The optimization algorithm proposed in this work uses a scoring factor based on a rank-ordering approach. The best electrocatalyst material for selected chemical products was then estimated for a class of materials or products. Figure 5 shows the radar charts of the best electrocatalyst materials based on the target attributes selected by users or directly from a global target, which is set as a default. The figure indicates that Pt is the catalyst of choice when no specific fuel products are considered.
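A rank-ordering score of this kind can be sketched as follows. The exact scoring factor is not specified here, so the block assumes one plausible variant: for each attribute, candidates are ranked by closeness to the target value, and the ranks are summed so that the lowest total identifies the recommended material. Candidate names, attribute values, and targets are hypothetical.

```python
def rank_scores(candidates, targets):
    """Sum, over all attributes, each candidate's rank by closeness to the target."""
    totals = {name: 0 for name in candidates}
    for attr, target in targets.items():
        ordered = sorted(candidates, key=lambda n: abs(candidates[n][attr] - target))
        for rank, name in enumerate(ordered, start=1):  # rank 1 = closest
            totals[name] += rank
    return totals

# Made-up candidates and user-specified targets, for illustration only.
candidates = {
    "cat_A": {"FE": 90.0, "CD": 10.0, "AP": -1.2},
    "cat_B": {"FE": 70.0, "CD": 25.0, "AP": -0.9},
    "cat_C": {"FE": 85.0, "CD": 22.0, "AP": -0.8},
}
targets = {"FE": 100.0, "CD": 30.0, "AP": -0.7}

totals = rank_scores(candidates, targets)
best = min(totals, key=totals.get)
print(totals, best)
```

Working with ranks instead of raw differences avoids having to normalize attributes with different units, at the cost of ignoring how large the gap to the target is.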

High-Temperature Electrocatalyst Materials
Despite recent advances in electrolytic systems for CO 2 conversion at high temperature (>800°C), the overall efficiency and performance of the system remain far from being sufficiently understood for commercialization and practical usage (Gorte et al., 2000). Among the technological shortcomings are low conversion efficiency and high degradation rates of materials and components, including membranes and electrocatalysts. The latter is mainly due to the fact that high catalytic conversion inherently results in low electrochemical stability of catalyst materials at higher temperatures. The fundamental understanding of the elementary kinetic processes involved in CO 2 electrochemical conversion at high temperature is a subject of ongoing research (Tran and Ulissi, 2018). Notably, the cost-effectiveness of catalytic processes at high temperature primarily depends upon the trade-off between the system efficiency and production cost of the fuel, while the operating condition of the solid oxide electrolyzer cells (SOECs) remains very narrow due to high heat requirements and the sensitivity to temperature fluctuations (Ma et al., 2015). CO is the major product as all other competing chemical reaction products are desorbed from the surface to produce CO at high temperatures. Therefore, additional downstream processes need to be performed in order to obtain other products such as methanol. For co-electrolysis of CO 2 and H 2 O, SOEC provides high flexibility in the carbon to hydrogen ratio (C/H) and, thus, state-of-the-art technologies such as Fischer-Tropsch (FT) synthesis can be utilized downstream for achieving high product flexibility (Zheng et al., 2017).

TABLE 3 | The results of cross-validation with six different classification algorithms against the type of products in three classes [groups of (CH 4 , C 2 H 4 , C 2 H 5 OH), (CH 4 , C 2 H 5 OH, C 3 H 7 OH), and (CH 4 , C 2 H 5 OH)].
Here, we present preliminary results and a discussion for a data-driven analysis of selected electrocatalyst systems in SOECs that address a few of the above technological challenges. In high-temperature electrolysis of CO 2 , the co-electrolysis process in the presence of steam is taking place at temperatures >600°C. High-temperature CO 2 electrochemical conversion using SOEC generally has a better selectivity compared to that at low temperatures. Correlations among AP, CD and FE at low or high temperature are not known yet and require extensive data analytics.
State-of-the-art high-temperature electrocatalyst materials in SOECs contain Ni-YSZ. A key factor for the stability and activity of these materials at high temperatures is a Ni content in the range of 40-60%. This range is required both for catalytic reforming and for matching the thermal expansion coefficients of the catalyst layer and the YSZ electrolyte (Gorte et al., 2000). Similar to solid oxide fuel cell (SOFC) electrodes, electrocatalytic reactions in SOECs take place at the triple phase boundary (TPB) where the Ni phase provides electrons, and YSZ particles offer the required oxygen ion vacancies for the reduction of adsorbed CO 2 and the removal of oxygen ions, respectively.
Recent progress suggests that the electrochemical reduction of CO 2 in solid oxide electrolysis cells takes place at high current densities. Degradation rates are higher in electrolysis mode than in fuel cell mode because of enhanced effects of metal particle migration and/or oxidation, carbon deposition, grain coarsening, and contamination by impurities. This adds complexity to the choice of electrocatalyst materials and thus drives significant research activity. In particular, electrochemical reduction of CO 2 in the temperature range of 573-873 K is worth exploring in order to match the temperature levels of electrolysis with the required downstream FT processes; however, there are no proper material systems for electrodes and electrolytes in that temperature regime at the current stage.
Here we consider a few conventional classes of electrode materials and explore the impact of Ni or Ti addition in various proportions on the overall catalytic activity via extensive data analytics. Figure 3B provides scatter plots and distributed values for applied potential, current density, and faradaic efficiency for Ti and Ni-YSZ catalyst systems. The Ti-based electrocatalyst exhibits different dependencies for applied potential and faradaic efficiency compared to the Ni-YSZ system, while both catalyst materials are relatively similar in terms of current densities. Overall, the Ti-based catalytic system shows high correlations between FE and CD, in particular in the range of data obtained at higher applied potentials (>2 V). Figures 3A,B reveal differences in the correlations among key attributes such as FE and AP for catalysts at low and high temperatures. The correlations are more pronounced between FE and AP for high-temperature electrocatalysts, whereas CD and AP are the main indicators at low temperature. Among all electrocatalyst materials studied at high temperatures, Ni-YSZ shows the highest correlation between FE and AP, although the correlation factors can vary depending upon Ni ratios and type of electrolytes or products, as illustrated in the binary correlations in Figure 6.
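Binary correlations of this kind can be computed per catalyst class with a few lines of code. The sketch below assumes the Pearson correlation coefficient as the correlation measure, which is not stated explicitly in the text, and uses made-up records with hypothetical group labels (`system_1`, `system_2`) rather than values from the high-temperature dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlations_by_group(records, x_key, y_key, group_key):
    """Compute the x-y correlation separately for each value of group_key."""
    groups = {}
    for r in records:
        groups.setdefault(r[group_key], []).append(r)
    return {
        g: pearson([r[x_key] for r in rs], [r[y_key] for r in rs])
        for g, rs in groups.items()
    }

# Made-up (AP in V, FE in %) records for two hypothetical catalyst systems.
ap_values = [1.0, 1.5, 2.0, 2.5]
records = (
    [{"catalyst": "system_1", "AP": ap, "FE": fe}
     for ap, fe in zip(ap_values, [50.0, 60.0, 70.0, 80.0])]
    + [{"catalyst": "system_2", "AP": ap, "FE": fe}
       for ap, fe in zip(ap_values, [80.0, 60.0, 70.0, 50.0])]
)

corr = correlations_by_group(records, "AP", "FE", "catalyst")
print(corr)
```

Grouping before correlating matters because pooling different catalyst classes can hide or even reverse the trend seen within each class.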
The dataset for high-temperature catalysts consists of 180 test data points distributed among five different catalyst types. This amount of data is insufficient for accurate prediction of missing properties in the data set, and thus further ML-based predictions are not feasible with the existing size of the dataset. Moreover, the atomic ratios of the composite electrocatalysts are not taken into consideration in these databases. The current results, however, will be expanded in the future to generate further insights into the correlation of key attributes at high temperatures using larger and more diverse training and test data sets.

Recommendation and Decision System
Here, we only focused on high-level correlations among selected indicators. Supplementary Table S2 provides the complete test data and other operational conditions that are assumed for each data point. The type of electrolyte is another important factor to be considered as it influences the extent of correlations between FE and AP for various high-temperature electrolysis technologies and the respective electrocatalysts. In particular, future work can include the analysis of the following use cases and a comparison based on phase ratios and catalyst types such as Ag, Ni|YSZ or Ag|YSZ, and for at least one cell configuration such as Ag/GDC|YSZ|YSZ/LSM|LSM [La 0.8 Sr 0.2 Cr 0.5 Mn 0.5 O 3−δ (LSCM)]. Further analysis is still ongoing to improve the test and training databases for high-temperature catalysts and provide a robust recommendation framework for this system. Here, the analysis is primarily built upon existing and extracted historical data. There is an emerging need for employing sophisticated decision algorithms and recommendation systems to "close the loop." Such algorithms will emerge from predictive models of key materials properties under different experimental conditions or modeling assumptions. They also identify weighting factors that govern specifications and limitations imposed at the component and device level. Such algorithms are trained over time as more historical data and use cases become available.

CONCLUSION
The discovery and optimization of electrocatalyst materials are driven in large part by collecting and analyzing experimental data. The ML-assisted development of electrocatalysts is still an emerging field despite its success in molecular and material science; it cannot yet lead directly to novel electrocatalyst materials.
In this article, we proposed a recommendation framework for the benchmarking of existing electrocatalyst materials. A multi-attribute decision process was adopted and mapped onto radar charts, from which the best-performing electrocatalyst is identified based on user-entered or global technological targets. This recommendation framework provides the choice of dimensions, indicators, and appropriate correlations for benchmarking purposes and the materials screening process, purely based on historical data. With the availability of reliable process and materials economic data, the latter can lead to comprehensive techno-economic insights into what performance levels are required for commercially viable electrocatalyst materials for use in electrochemical energy conversion and storage devices.
We used ML to supplement missing data in CO 2 RR databases prior to deploying ML algorithms to identify the best catalytic system. The ML module is primarily built for the classification and prediction of electrocatalyst materials. Different models for classification of the type of electrocatalyst materials and chemical products are used with reasonable accuracy within the limit of available test and training data. Among different regression algorithms, the Random Forest model showed a better capability for the prediction of electrochemical attributes. The proposed recommendation system provides interactive visual analysis of different indicators for the exploration of uploaded electrocatalyst data. High-level correlation analytics was also provided for catalyst materials at high temperatures, and the intensity of correlations was compared to that for catalyst materials at low temperature.
Finally, rapid screening and benchmarking studies of electrocatalyst materials via data-driven visualization can significantly reduce the discovery time for the best materials and help researchers understand or compare vital performance trends and correlations for given classes of materials, from initial discovery to component or device integration and full-scale component or device production. The major limitations of the framework presented here are the incompleteness of data points, lack of clarity or consistency around key numerical or categorical attributes, and missing values for the attributes that are collected from the literature. The framework, however, can be applied to other sustainable electrochemical processes such as electrochemical NH 3 synthesis through N 2 and H 2 O electrolysis.
The interactive visualization tools assist researchers in discovering trends and patterns hidden within the electrocatalyst material data, based on historical experimental and modeling results. Further ML and analytics functionalities are currently under development, which will offer higher accuracy and better interoperability of the recommendation framework for idea creation and the screening of electrocatalyst materials for various applications.

DATA AVAILABILITY STATEMENT
The original underlying data presented in the study are included in the article/Supplementary Material. Further inquiries for access to GitHub can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
All authors contributed to the writing and editing of this manuscript. KM and ME contributed equally to the design of the initial concept and implementation of the research method. AM, KM, and ME collectively led and designed the underlying data analytics concept and ML methodology, and the backbone of the data visualization and analysis tools.

FUNDING
This work was supported by the German-NRC collaboration project.