Pitting Judgment Model Based on Machine Learning and Feature Optimization Methods

Pitting corrosion seriously shortens the service life of oil field gathering and transportation pipelines and is an important subject in corrosion prevention. In this study, we collected corrosion data from pipeline steel immersion experiments and established a pitting judgment model based on machine learning algorithms. Feature reduction methods, including feature importance calculation and Pearson correlation analysis, were first adopted to identify the important factors affecting pitting. Then, the best input feature set for pitting judgment was constructed by combining feature combination and feature creation. Through receiver operating characteristic (ROC) curve and area under curve (AUC) calculation, the random forest algorithm was selected as the modeling algorithm. As a result, the pitting judgment model based on machine learning and high-dimensional feature parameters (i.e., material factors, solution factors, environment factors) showed good prediction accuracy. This study provides an effective means for processing high-dimensional and complex corrosion data, and demonstrates the feasibility of machine learning in solving material corrosion problems.


INTRODUCTION
Corrosion damage seriously reduces the strength and service life of pipelines in oil and gas fields, making pipeline corrosion an increasingly serious problem (Soares et al., 2009; Jiménez-Come et al., 2012). Among all corrosion types, pitting corrosion is one of the most destructive and dangerous forms (Bhandari et al., 2015; Kolawole et al., 2016). After an oil or gas pipeline corrodes and perforates, the leaked oil and gas can seriously pollute the environment and may explode, which directly and indirectly leads to serious economic losses and restricts the development of the oil and gas industries (Ghidini and Donne, 2009).
Reliable corrosion warning methods and advanced anti-corrosion measures are the key to ensuring the safe operation of pipelines and preventing corrosion and leakage accidents. Therefore, better judgment of the pitting corrosion of pipeline steel is of great practical significance for the research and development of anti-corrosion technology and the prediction of structural integrity (Balekelayi and Tesfamariam, 2020). Pitting, however, is a complex process that includes many complicated phenomena (such as mass transfer, metal dissolution and passivation). The factors influencing pitting corrosion are also numerous, such as metal composition, medium temperature, pressure, pH, and the type and concentration of ions (Choi et al., 2005; Li et al., 2012), which makes the modeling of pitting more difficult.
The corrosion rate at a specific location depends sensitively on many local microscale material and environmental conditions. Therefore, at the macro level, pitting often appears random and probabilistic, which is why statistical methods have been used to quantify and simulate local corrosion. In particular, the theory of extreme value analysis (Vajo et al., 2003) has been successfully applied to pitting corrosion of steel. Melchers (2008) showed that the Frechet extreme value distribution was more appropriate than the Gumbel distribution to represent the maximum pit depth. Kasai et al. (2016) proposed a method combining extreme value analysis with Bayesian inference, which accurately predicted the actual maximum corrosion depth from the maximum corrosion depth detected.
Owing to their advantages in dealing with multi-dimensional, nonlinear and uncertain characteristics, machine learning (ML) methods have been gradually applied in the field of corrosion science in recent years (Hu et al., 2014; Bi et al., 2015), and have been successfully applied in some pitting-related simulations. An ML-based pitting corrosion prediction model can not only describe the nonlinear relationship between the influencing factors and the target parameters, and thus predict pitting information accurately, but can also effectively extract from the corrosion data the important feature information that reflects the health state of the steel (Diao et al., 2021). Valor et al. (2010) established a stochastic model using Markov chains, which has been successfully applied to reproduce the time evolution of extreme pitting corrosion depths in low-carbon steel. Mohammad et al. (2013) proposed a model using an artificial neural network (ANN) to predict the characteristics of pitting corrosion, and further pointed out that increasing the corrosion concentration and prolonging the immersion time could increase the pitting density and depth. However, the judgment of pitting initiation, which is valuable in pipeline steel anticorrosion work, has rarely been reported.
In this study, we collected corrosion data of pipeline steels from immersion experiments and established a machine learning model to judge the occurrence of pitting corrosion based on steel composition, environmental parameters and solution parameters. We studied a method of processing high-dimensional and complex corrosion data by reduction, combination and creation of features, which improved the generalization ability of the model, and extracted the key corrosion factors for judging the occurrence of pitting corrosion. The feasibility and advantages of machine learning models in solving material corrosion problems are also discussed.

Establishing the Dataset
This section describes how the corrosion dataset used to train and test the machine learning models was collected. The dataset contains a total of 100 valid data samples. Among them, 40 are from the literature (Yin et al., 2007; Liu et al., 2014a; Li et al., 2012; Liu et al., 2017; Santos et al., 2021), and the other 60 are from corrosion simulation experiments accumulated in our laboratory over the years. As shown in Table 1, all the materials in the dataset are pipeline steels with a small amount of alloying elements, and each complete data sample is composed of 13 material features (i.e., C, Si, Mn, P, S, Cr, Ni, Cu, Mo, Ti, Nb, Al, V), eight solution features (i.e., Vs, Sal., Cl⁻, Ca²⁺, Mg²⁺, Na⁺, SO₄²⁻, HCO₃⁻), four environmental features (i.e., T, H₂S, CO₂, CO₂/H₂S), immersion time (i.e., t) and pitting information. The detailed dataset is shown in Supplementary Table S1.

Feature Selection
The purpose of feature selection is to simplify the feature set as much as possible and reduce the adverse effects of noisy and redundant features while maintaining the descriptive power of the feature set. This improves the accuracy, interpretability and operational efficiency of the model.
In this section, feature variables are screened by combining feature importance calculation and Pearson correlation analysis. The former is based on the random forest model (RF model), which is composed of several simple classification and regression tree (CART) models. During bootstrap sampling, some data samples are not selected for training each CART model. These samples, termed out-of-bag (OOB) samples, can be used to calculate feature importance (Zhi et al., 2019). For each CART, a disturbance is added to each input of the OOB data in turn, and the variation amplitude of the predicted results is calculated. By comparing the amplitudes of variation, the importance of different inputs to the predicted target can be obtained. Finally, the RF model averages the results of all CARTs to give the importance of each feature.
The Pearson correlation coefficient is a statistic that reflects the degree of linear correlation between two random feature variables (Waldmann, 2019). The coefficient, obtained by estimating the sample covariance and standard deviations, ranges from −1 to 1. The greater its absolute value, the stronger the correlation between the feature variables. For some machine learning models, the correlation between different feature variables has an important impact on the prediction results. Based on these two methods, some redundant information can be removed from the original feature set, thereby achieving feature reduction.
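The OOB-based importance calculation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `oob_permutation_importance`, the use of random shuffling as the "disturbance", the drop in OOB accuracy as the "variation amplitude", and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def oob_permutation_importance(X, y, n_trees=100, seed=0):
    """Average drop in OOB accuracy when each feature is shuffled (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    drops = np.zeros(n_features)
    used = 0
    for _ in range(n_trees):
        boot = rng.integers(0, n_samples, size=n_samples)  # bootstrap sample
        oob = np.setdiff1d(np.arange(n_samples), boot)     # rows never drawn
        if oob.size == 0:
            continue
        tree = DecisionTreeClassifier(
            max_features="sqrt", random_state=int(rng.integers(0, 2**31 - 1))
        ).fit(X[boot], y[boot])
        base = tree.score(X[oob], y[oob])                  # undisturbed OOB accuracy
        for j in range(n_features):
            X_perm = X[oob].copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # disturb one input
            drops[j] += base - tree.score(X_perm, y[oob])
        used += 1
    return drops / max(used, 1)                            # average over all trees


# Synthetic stand-in data (NOT the paper's dataset): only column 0 drives the label.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

importance = oob_permutation_importance(X_demo, y_demo)
pearson = np.corrcoef(X_demo, rowvar=False)  # feature-by-feature Pearson matrix
```

On this toy data the first column should receive by far the largest importance, while the Pearson matrix has ones on its diagonal and small off-diagonal values because the columns are independent.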
Feature combination is also a common method in feature engineering. Using a traditional theoretical formula or model, several original features are combined into a new feature with practical significance. In this study, on the one hand, the pitting resistance equivalent number (PREN) is calculated based on Chen et al. (2021a). PREN is a value calculated from the mass fractions of certain elements in the metal, and is usually used to compare the pitting corrosion resistance of alloys. A common PREN expression is given in Eq. 1. On the other hand, the in-situ pH (pH_IS) of the solution is calculated from environmental and solution factors using the electronic corrosion engineer (ECE) software (Jasim, 2019). Thus, two feature parameters, PREN and pH_IS, are added by feature combination.
Frontiers in Materials | www.frontiersin.org August 2021 | Volume 8 | Article 733813
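As an illustration of how a PREN-type feature can be computed from composition, the sketch below uses the widely quoted general form PREN = %Cr + 3.3 × %Mo + 16 × %N (mass percent). Whether this matches the exact expression used in this study is an assumption; nitrogen is not among the 13 material features listed, so it defaults to zero here. The function name `pren` and the composition values are hypothetical.

```python
def pren(cr_wt, mo_wt, n_wt=0.0):
    """Widely quoted PREN form: %Cr + 3.3 * %Mo + 16 * %N (mass percent).

    Assumption: the study's own PREN expression may use different terms.
    """
    return cr_wt + 3.3 * mo_wt + 16.0 * n_wt


# Hypothetical low-alloy composition in wt% (illustrative values only).
steel = {"Cr": 1.0, "Mo": 0.2}
pren_value = pren(steel["Cr"], steel["Mo"])
```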
For feature creation, we seek a feature parameter that contains the information of every element in the steel and reflects the uniqueness of different steels. In this study, two different feature creation methods are proposed for each material. Feature creation method Ⅰ is defined by Eq. 2, where Y_a represents the element mass index of a material; M_a1, M_a2, ..., M_an are the atomic masses of elements a_1, a_2, ..., a_n; and X_a1, X_a2, ..., X_an are the mass fractions of elements a_1, a_2, ..., a_n. Method Ⅱ is defined by Eq. 3, where Y_b/Y_c is defined as the mass index ratio of nonmetallic to metallic elements in a material; b_1, b_2, ..., b_n represent the nonmetallic elements and c_1, c_2, ..., c_n represent the metallic elements. Two new features are thus generated.

Experimental Procedure
In this study, we first selected the appropriate dataset division ratio and machine learning classification algorithm through testing. Specifically, 40, 50, 60, 70, 80 and 90% of the cleaned corrosion dataset were randomly selected as the training set, with the remaining data used as the test set. The training set was mainly used to optimize the classification model, and the test set was only used to assess the classification accuracy of the model. We prepared five machine learning classification models for testing: a random forest classification model (RFC), a support vector classification model with a radial basis function kernel (SVC), a gradient boosting decision tree classification model (GBC), a naive Bayes classification model (NB), and a k-nearest neighbor model (KNN). Datasets of different proportions were input into the different classification models for testing. During the training process, we used the receiver operating characteristic (ROC) curve and area under curve (AUC) to evaluate the training effect of the model. Each group of tests was repeated 100 times, and the best-performing dataset division ratio and classification model were selected according to the average score.
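The division-ratio test described above can be sketched with scikit-learn as follows. This is an illustrative outline under stated assumptions, not the authors' script: the data are synthetic, `GaussianNB` is assumed as the naive Bayes variant, stratified splitting is an added convenience, and only 5 repetitions are run instead of 100 to keep the example fast. The hyperparameters are those reported for this study.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the 100-sample corrosion dataset (assumption).
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

models = {
    "RFC": RandomForestClassifier(n_estimators=100, max_depth=40, random_state=0),
    "SVC": SVC(kernel="rbf", C=170, gamma=0.5),
    "GBC": GradientBoostingClassifier(n_estimators=100, max_depth=10, random_state=0),
    "NB": GaussianNB(),  # assumed naive Bayes variant
    "KNN": KNeighborsClassifier(n_neighbors=2),
}
train_fractions = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
n_repeats = 5  # the study repeats each test 100 times; reduced here for speed

mean_f1 = {}
for name, model in models.items():
    for frac in train_fractions:
        scores = []
        for rep in range(n_repeats):
            # A new random split for every repetition.
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=frac, stratify=y, random_state=rep
            )
            fitted = clone(model).fit(X_tr, y_tr)
            scores.append(f1_score(y_te, fitted.predict(X_te), zero_division=0))
        mean_f1[(name, frac)] = float(np.mean(scores))
```

The dictionary `mean_f1` then holds one average test-set F1 score per (model, training fraction) pair, which is the quantity compared when choosing the division ratio.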
Secondly, for feature reduction, we conducted feature importance calculation and Pearson correlation analysis for all feature parameters (i.e., 13 material features, eight solution features, four environmental features and immersion time). Noisy and redundant features were eliminated to form feature combination Ⅰ, and pitting judgment model Ⅰ was established based on this feature combination.
Thirdly, for feature combination, two feature parameters (i.e., PREN and pH_IS) were added by using traditional theoretical calculation models. For feature creation, we converted the information of each steel element into two feature parameters (i.e., Y_a and Y_b/Y_c). The four new feature parameters were combined with feature combination Ⅰ, feature combination Ⅱ was formed after removing the features that contributed less to the target parameter, and pitting judgment model Ⅱ was established. The performance of the two models was compared to demonstrate the improvement in the model's generalization ability.
In the process of feature selection, model optimization and evaluation, the F1 score was employed as the evaluation standard. In short, the F1 score is a measure for classification problems and is the harmonic mean of precision and recall. The closer its value is to 1, the better the model performs (Lim and Chi, 2021). For a binary classification problem, a 2 × 2 confusion matrix is formed from the predicted labels and the actual labels (as shown in Table 2), where a true positive (TP) refers to the correct judgment of a positive sample (e.g., a case of pitting is correctly predicted) and a false positive (FP) refers to a negative sample wrongly judged as positive (e.g., pitting is predicted for a case in which it did not occur). The false negative (FN) and true negative (TN) are defined similarly. Precision, recall and F1 score are then calculated by the following formulas:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
This research was based on the Python programming language, using Spyder 3.3.6, and all machine learning algorithms involved were executed with the scikit-learn library. The main algorithm parameters are as follows: max_depth = 40 and n_estimators = 100 in the RFC model; C = 170 and gamma = 0.5 in the SVC model; max_depth = 10 and n_estimators = 100 in the GBC model; and k = 2 in the KNN model. All other parameters are set to default values.
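The F1 calculation from the confusion matrix can be checked by hand with a small example. The labels below are made up for illustration (1 = pitting, 0 = no pitting) and are not taken from the dataset; the hand-computed value is compared with scikit-learn's `f1_score`.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels for illustration: 1 = pitting, 0 = no pitting.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # share of predicted pitting cases that are real
recall = tp / (tp + fn)     # share of real pitting cases that are found
f1 = 2 * precision * recall / (precision + recall)

f1_sklearn = f1_score(y_true, y_pred)  # should agree with the manual value
```

Here TP = 4, FP = 1, FN = 1 and TN = 2, so precision, recall and F1 all equal 0.8.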

Selection of Dataset Division Ratio and Machine Learning Models
Based on the five classification models, the influence of different training set proportions on model performance was explored, and the results are shown in Figure 1. For each specified proportion, we randomly split the dataset and repeated the test 100 times, evaluating the prediction performance of the model by the average score. On the whole, as the proportion of the training set increased, the prediction performance of the models gradually improved. This is because the amount of data in the training set is usually proportional to the effective information it contains, so a larger training set is highly likely to improve the overall prediction performance of the model. However, when the proportion of the training set increased beyond 80%, the F1 score of the KNN model decreased significantly, while the F1 scores of the SVC and NB models decreased slightly. This may be due to overfitting of these algorithms, which significantly reduces the generalization ability of the model (Deng et al., 2015). Therefore, the training set proportion selected in this study was 80%.
In determining the training set proportion, the RFC model was found to have the best overall performance. To further confirm the best model for predicting pitting, the ROC curves of the five classification models were drawn and their AUC values calculated. The ROC curve, which takes the false positive rate (FPR) as the X axis and the true positive rate (TPR) as the Y axis, describes the relationship between TPR and FPR. The closer the ROC curve is to the upper left corner, the better the performance of the model (He et al., 2021). AUC is the area under the ROC curve, and the larger the AUC, the better the model performance. Figure 2A-E shows the ROC curves of the five models (RFC, SVC, GBC, KNN and NB). Five-fold cross validation was used, and the blue line represents the average ROC curve. By comparison, the curve of the RFC model was closest to the upper left corner, indicating that this model had the best performance; its average AUC was 0.84. The other classification models were evaluated in the same way, and the average AUC values are shown in Figure 2F, where the red lines represent the error range over 100 repetitions. As can be seen from the figure, the RFC model had the best predictive performance, followed by the NB and GBC models, while the SVC and KNN models had the lowest average AUC values. Based on these results, the RFC model was selected for subsequent studies.
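A five-fold cross-validated ROC/AUC evaluation of the kind described above might look like the sketch below. The synthetic data, the stratified shuffled folds and the interpolation grid of 101 FPR points are assumptions for illustration; the RFC hyperparameters are those reported for this study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the corrosion dataset (assumption).
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

mean_fpr = np.linspace(0.0, 1.0, 101)  # common FPR grid for averaging
tprs, aucs = [], []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    clf = RandomForestClassifier(n_estimators=100, max_depth=40, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    proba = clf.predict_proba(X[test_idx])[:, 1]  # probability of the positive class
    fpr, tpr, _ = roc_curve(y[test_idx], proba)
    aucs.append(auc(fpr, tpr))                    # per-fold AUC
    interp_tpr = np.interp(mean_fpr, fpr, tpr)    # resample onto the common grid
    interp_tpr[0] = 0.0
    tprs.append(interp_tpr)

mean_tpr = np.mean(tprs, axis=0)  # average ROC curve across folds
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
```

Plotting `mean_tpr` against `mean_fpr` gives an average curve comparable to the blue line described for Figure 2, and repeating the loop with different random seeds would give the spread of the average AUC.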

Effect of Feature Engineering on Model's Performance
In the first step, feature importance calculation was used to reduce features. Specifically, the original 13 material features, eight solution features, four environmental features and immersion time were input into the RF model, and the calculated feature importances are shown in Figure 3. To ensure the generalization ability of the model, we selected only the features with importance values above 0.02 (i.e., CO₂, T, CO₂/H₂S and H₂S among the environmental features; Cl⁻, Sal., Na⁺, Ca²⁺, Mg²⁺ and HCO₃⁻ among the solution features; and t). The combined importance of the selected 11 features exceeds 0.85, so they contain most of the information related to pitting.
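The thresholding step can be illustrated with scikit-learn's built-in importances. Note that `feature_importances_` is impurity-based, which is an assumption here and may differ from the OOB perturbation measure used in this study; the 26 synthetic features and their placeholder names are likewise illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 26 synthetic features standing in for 13 material + 8 solution
# + 4 environmental features + immersion time (assumption).
X, y = make_classification(
    n_samples=100, n_features=26, n_informative=6, random_state=0
)
names = [f"f{i}" for i in range(X.shape[1])]  # placeholder feature names

rf = RandomForestClassifier(n_estimators=100, max_depth=40, random_state=0)
rf.fit(X, y)
importances = rf.feature_importances_  # impurity-based, normalized to sum to 1

threshold = 0.02  # cut-off used in the text
selected = [n for n, w in zip(names, importances) if w > threshold]
retained_importance = float(importances[importances > threshold].sum())
```

`retained_importance` plays the role of the combined importance of the selected features reported in the text.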
In terms of environmental features, CO₂ is usually present in the corrosive solution as a dissolved gas. HCO₃⁻ and H₂CO₃ are formed when CO₂ reacts with water, and the H⁺ produced by their ionization can cause local acidification and pitting corrosion (Chen et al., 2021b). The solubility of H₂S in water is higher than that of CO₂. As the concentration of H₂S increases, H₂S dissociates into more H⁺ and HS⁻, which can change the local acidity of the steel surface and promote the anodic dissolution process, thus affecting the pitting susceptibility of steel (Zhao et al., 2020). In addition, whether the corrosion process is dominated by CO₂ or H₂S, the non-dense or non-uniform corrosion products formed on the steel surface can accelerate the development of pitting corrosion (Liu et al., 2017). Temperature is also a key factor affecting pitting, as many materials do not pit below a certain temperature (the critical pitting temperature), whose existence has been demonstrated (Mendibide and Duret-Thual, 2018).
In terms of solution features, it is generally believed that Cl⁻ has a great influence on the pitting susceptibility of steel. The higher the Cl⁻ content, the looser the corrosion product scale formed on the steel surface and the more serious its cracking. The Cl⁻ reaching the steel surface through the corrosion product scale can accelerate the local anodic reaction and produce pits that develop rapidly in the longitudinal direction (Liu et al., 2014b). Ca²⁺ and Mg²⁺ can also significantly influence the pitting susceptibility of steel, given that the presence of divalent salts (i.e., CaCO₃ in the case of Ca²⁺ and MgCO₃ in the case of Mg²⁺) can reduce CO₂ solubility (Hua et al., 2018). Salinity refers to the total ion content of the solution, and an increase in salinity can also change the solubility of CO₂ and H₂S, thus affecting the development of pitting corrosion (Han et al., 2011).
Then, we calculated the Pearson correlation coefficients for the solution and environmental features in our dataset. As shown in Figure 4, the color (blue or red) indicates the direction of the relationship (positive or negative), and the intensity of the color indicates its strength (white for completely unrelated and dark blue or red for perfectly correlated). Strong correlations occur among Sal., Na⁺ and Cl⁻, mainly because the Cl⁻ and Na⁺ contents were usually very high in the solutions in the dataset, so the salinity consisted almost entirely of these two ions. Sufficient information can be obtained by selecting only one feature from a group of strongly correlated features, and the importance of a feature is usually proportional to the effective information it contains (Wang et al., 2020). Thus, Cl⁻ was retained and Sal. and Na⁺ were discarded. Another strongly correlated pair was Ca²⁺ and Mg²⁺, which have similar effects on the pitting susceptibility of steel. Following the same reasoning, Ca²⁺ was retained. Feature combination Ⅰ (i.e., CO₂, T, CO₂/H₂S and H₂S among the environmental features; Cl⁻, Ca²⁺ and HCO₃⁻ among the solution features; and t) was thus determined.
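The idea of keeping one feature from each strongly correlated group can be sketched as a simple greedy procedure. The function name `prune_correlated`, the 0.9 correlation threshold, the importance values and the synthetic concentrations are all hypothetical; the study does not state a numerical cut-off.

```python
import numpy as np


def prune_correlated(X, names, importance, threshold=0.9):
    """Keep the most important feature of each highly correlated group (hypothetical helper)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # absolute Pearson coefficients
    order = sorted(range(len(names)), key=lambda i: importance[names[i]], reverse=True)
    keep = []
    for i in order:
        # Accept a feature only if it is not strongly correlated with one already kept.
        if all(corr[i, k] < threshold for k in keep):
            keep.append(i)
    return [names[i] for i in keep]


# Synthetic stand-in data: Na and Sal are nearly linear functions of Cl.
rng = np.random.default_rng(0)
cl = rng.uniform(1.0, 100.0, size=50)
na = 0.65 * cl + rng.normal(0.0, 0.5, size=50)
sal = 1.65 * cl + rng.normal(0.0, 1.0, size=50)
temp = rng.uniform(20.0, 90.0, size=50)

names = ["Cl", "Na", "Sal", "T"]
X_demo = np.column_stack([cl, na, sal, temp])
importance = {"Cl": 0.10, "Na": 0.05, "Sal": 0.06, "T": 0.20}  # made-up values

selected = prune_correlated(X_demo, names, importance)
```

On this toy data the procedure keeps T and Cl and discards Na and Sal, mirroring the choice of Cl⁻ over Sal. and Na⁺ in the text.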
Two feature parameters, PREN and pH_IS, were added by feature combination, and two new feature parameters, Y_a and Y_b/Y_c, were obtained using feature creation methods Ⅰ and Ⅱ.
The four newly generated feature parameters (i.e., PREN, pH_IS, Y_a, Y_b/Y_c) were combined with feature combination Ⅰ, and the feature importances were calculated and sorted (Figure 5). pH_IS and Y_b/Y_c had a great influence on the pitting judgment model, especially pH_IS, while the importance values of PREN and Y_a were relatively low. Pourbaix (2009) has shown that the pitting potential of carbon steel becomes more negative as pH decreases, which increases the susceptibility to pitting initiation. In summary, the two feature parameters PREN and Y_a were removed, and feature combination Ⅱ, comprising CO₂, T, CO₂/H₂S, Cl⁻, Ca²⁺, HCO₃⁻, t, pH_IS and Y_b/Y_c, was selected as the input features of the pitting judgment model.
Based on the two groups of input features (feature combinations Ⅰ and Ⅱ), pitting judgment models Ⅰ and Ⅱ were established separately with the RF model. Table 3 lists the predictive performance of each model. Each prediction process was repeated 100 times. By comparison, pitting judgment model Ⅱ, with the added pH_IS and Y_b/Y_c, performed better, and its average F1 scores for the training set and test set reached 0.996 and 0.987, respectively. As shown in Figure 5, the performance improvement of model Ⅱ was mainly due to the two added features, especially pH_IS, which contributed greatly to judging whether pitting occurs. Therefore, we adopted this model as the preferred pitting judgment model.
As shown in Figure 3, the two most important feature parameters for judging the occurrence of pitting are CO₂ content and T. We tried to explore the law of pitting occurrence through these two feature parameters alone. The relationship between T, CO₂ content and the occurrence of pitting is displayed in Figure 6. Surprisingly, neither the 3D scatter plot nor the projection of T and CO₂ content can separate pitting from non-pitting cases. The two classes overlap, suggesting that T and CO₂ content alone are not enough to distinguish the occurrence of pitting and that other features also affect the pitting process. The development of pitting is an extremely complex process in which the influence of many factors must be considered together, which is exactly the advantage of a machine learning model over a traditional theoretical model.

Generalization Capabilities of Machine Learning Model
Twenty-five new rows of immersion test corrosion data from our laboratory (all parameters within the range) were collected as the validation set to verify the generalization ability of the model. Using the feature reduction, combination and creation methods, the data were transformed into a feature set of the same type as feature combination Ⅱ, and the occurrence of pitting in each sample was then predicted by the optimized model. As shown in Table 4, the pitting judgment model still shows high prediction accuracy.

CONCLUSION
In this study, we proposed a machine learning model based on experimental data to judge the occurrence of pitting in pipeline steel. Machine learning algorithms and related feature engineering methods were used to analyze the relationship between the occurrence of pitting and input features such as material factors, solution factors and environmental factors. For this kind of material, CO₂, T, CO₂/H₂S, Cl⁻, Ca²⁺, HCO₃⁻, t, pH_IS and Y_b/Y_c are considered to be the key factors for judging whether pitting occurs. The generalization ability of the model is enhanced by replacing alloying element contents with specific input parameters. Finally, the F1 scores of the optimized model were all greater than 0.97. Based on these results, machine learning provides an effective means for processing high-dimensional and complex corrosion data, and can be a useful tool for further exploration of material corrosion problems.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.