Development and Validation of the Predictive Model for Esophageal Squamous Cell Carcinoma Differentiation Degree

Diagnosing the degree of differentiation of tumor cells helps physicians detect disease in time and choose appropriate treatment for the patient's condition. In this study, the original dataset is clustered into two independent types by the Kohonen clustering algorithm. One type is used as the development sets to find correlation indicators and establish predictive models of differentiation, while the other type is used as the validation sets to test the correlation indicators and models. In the development sets, thirteen indicators significantly associated with the degree of differentiation of esophageal squamous cell carcinoma are found by the Kohonen clustering algorithm. The thirteen relevant indicators are used as input features and the degree of tumor differentiation is used as the output. Ten classification algorithms are used to predict the differentiation of esophageal squamous cell carcinoma. Artificial bee colony-support vector machine (ABC-SVM) predicts better than the other nine algorithms, with an average accuracy of 81.5% for the 10-fold cross-validation. Based on logistic regression and the ReliefF algorithm, five models with greater merit for predicting the degree of differentiation are found in the development sets. The AUC values of the five models in the development sets are 0.672, 0.628, 0.630, 0.628, and 0.608 (P < 0.05). The AUC values of the five models in the validation sets are 0.753, 0.728, 0.744, 0.776, and 0.868 (P < 0.0001). The predicted values of the five models are used as the input features of ABC-SVM. The accuracy of the 10-fold cross-validation reaches 82.0 and 86.5% in the development sets and the validation sets, respectively.

INTRODUCTION
Esophageal squamous cell carcinoma (ESCC) is one of the most common malignant tumors in China and has a high mortality rate (McCormack et al., 2017; Domingues et al., 2019; Hou et al., 2019). The degree of tumor cell differentiation of esophageal squamous cell carcinoma is an important reference in cancer diagnosis and treatment. High differentiation means that the tumor cells are more similar to normal cells; the tumor is less malignant and less likely to metastasize. It is less sensitive to radiotherapy and chemotherapy and has a better prognosis.
Poorly differentiated cells differ greatly from normal cells, and the malignancy of the tumor is relatively high. Such tumors metastasize easily in the clinical course and are more sensitive to radiotherapy and chemotherapy, and the prognosis is poor. With early detection and timely treatment, the speed of tumor metastasis can be slowed through integrated treatment with traditional Chinese and western medicine, and better clinical efficacy can be achieved (Cong et al., 2019).
Cancer cells have the characteristic of differentiating into normal cells (Tamaoki et al., 2018). In medicine, this feature is used by doctors to determine the degree of differentiation of tumor cells. After the patient's biopsy pathology, the malignancy and differentiation of the tumor are confirmed by observing the characteristic state of the tumor cells under a microscope. The traditional method of determining the degree of differentiation is complicated and relies on human experience to make decisions (Maehara et al., 2018; Jadcherla et al., 2019). In this paper, we aim to develop a new model to predict the degree of differentiation in esophageal cancer patients based on blood indicators and tumor size parameters. Such a prediction model can assist professional physicians in making decisions and improve the clinical treatment effect.
The original dataset is clustered into two distinct datasets by the Kohonen algorithm. The first dataset is used to develop the prediction model for the degree of esophageal squamous cell carcinoma differentiation and the second dataset is used to validate the prediction model. First, in the development sets, the Kohonen clustering algorithm is used to cluster multiple indicators associated with esophageal squamous cell carcinoma. Thirteen indicators significantly associated with the degree of esophageal squamous cell carcinoma differentiation are found. Based on these 13 indicators, 10 classification algorithms are used to predict the degree of differentiation. The results show that ABC-SVM predicts better than the other nine algorithms, with an average accuracy of 81.5% for the 10-fold cross-validation. Then, logistic regression and the ReliefF algorithm are used to find five models that have greater predictive value for the degree of esophageal cancer differentiation. The AUC values of the five models in the development sets are 0.672, 0.628, 0.630, 0.628, and 0.608, with P-values less than 0.05. The AUC values of the five models in the validation sets are 0.753, 0.728, 0.744, 0.776, and 0.868, with P-values less than 0.0001. The results show that the five models have some predictive value for the differentiation of esophageal squamous cell carcinoma. The predicted values of the five models are then used as the input features of ABC-SVM. The 10-fold cross-validation accuracy reaches 82.0 and 86.5% in the development sets and validation sets, respectively. The new features constructed from the five models are highly correlated with the degree of tumor differentiation of esophageal squamous cell carcinoma, and ABC-SVM achieves good results when predicting the degree of differentiation from them.
The main focus of this article is to investigate the indicators significantly associated with the degree of esophageal squamous cell carcinoma differentiation and to develop a model to predict the tumor differentiation of esophageal squamous cell carcinoma. By using the Kohonen clustering algorithm, the ABC-SVM algorithm, logistic regression, the ReliefF algorithm, and the ROC curve method, a method for predicting the degree of esophageal squamous cell carcinoma differentiation is proposed. The main contributions of this article can be summarized as:
(1) Thirteen indicators associated with the degree of esophageal squamous cell carcinoma differentiation are found in the development sets and are validated in the external validation sets.
(2) Five models with predictive value for esophageal squamous cell carcinoma differentiation are found in the development sets and are validated in the external validation sets.
(3) Based on the five prediction models, new features of the differentiation degree are constructed and the degree of esophageal squamous cell carcinoma differentiation is well-predicted by ABC-SVM.
The rest of this paper is organized as follows. In section 2, the original data are analyzed and clustered. Thirteen indicators that are significantly correlated with the degree of differentiation are found and validated in section 3. Section 4 details the process of developing and validating five models significantly associated with the degree of esophageal squamous cell carcinoma differentiation; the five models are then used to construct new features, which are studied in the development sets and validation sets. The conclusions are drawn in section 5.

Clustering of Data Sets
To ensure that the predictive model has predictive power and can be applied in practice, Kohonen clustering is used to cluster all the samples from the original dataset into two different categories of dataset. One type is used as the development sets to develop the prediction model for the degree of esophageal cancer differentiation. The other type is used as the validation sets to validate the developed prediction model. The Kohonen neural network contains an input layer and a mapping layer. It is able to leverage its network architecture to discover features and correlations in datasets (Pastukhov and Prokofiev, 2016). Data with similar characteristics are aggregated and clustered. The Kohonen algorithm is based on the principle of clustering objects with the same characteristics into one class. It not only handles large amounts of multivariate data with high dimensionality, but also preserves the important information implied by the original data (Kumar, 2017; Pasa et al., 2017).
According to the competitive learning algorithm, the connection weights of the winning output neuron are progressively strengthened. To reduce the distance between the winning neuron and the input vector, the connection weights of the neurons neighboring the winning neuron are also adjusted to be closer to the original input vector. In this way, different categories are gradually formed (Sun et al., 2020; Yang et al., 2020). As shown in Figure 1, a Kohonen neural network of 36-21 structure is established. The flow diagram of the Kohonen neural network algorithm is shown in Figure 2. Algorithm 1 presents the main procedures of the Kohonen neural network algorithm. In the Kohonen algorithm, N is set as 221 and m is given as 21. i is regarded as 21 and l is set as 221. As shown in Table 3, the original dataset is partitioned into two separate datasets by the Kohonen clustering algorithm. These are two different classes of datasets, one used as the development sets and the other as the validation sets.
FIGURE 1 | Kohonen neural network of 36-21 structure. η is the learning rate. k represents the k-th node of the output layer and ω is the connection weight value. X stands for the initial vector and i is the i-th node of the input layer.

Clustering of Correlation Indicators
To ensure the rapidity and validity of the predictive model, 21 indicators need to be screened. The Kohonen clustering algorithm is used to screen the indicators that are significantly associated with the differentiation of esophageal squamous cell carcinoma. In the Kohonen algorithm, N is set as 21 and m is given as 114. i is regarded as 114 and l is set as 21.
According to the clustering results, the 21 indicators are clustered by the Kohonen clustering algorithm in the development sets, and 13 indicators that are significantly associated with the degree of esophageal cancer differentiation are finally found.
Algorithm 1 | Kohonen neural network algorithm.
2: Randomly set the initial connection weights between the mapping layer and the input layer. The initial value η of the learning rate is 0.7, η ∈ (0, 1). The initial neighborhood is set to $N_{k0}$.
3: Input the initial vector X, as in (2).
4: Calculate the distance between the weight vector of the mapping layer and the initial vector, where i is the i-th node of the input layer, i = 1, 2, ..., 21, and l indexes the training data, l = 1, 2, ..., 221. E is calculated by (4), where k is the k-th node of the output layer, k = 1, 2, ..., 36, and $w_{ki}$ is the connection weight between the i-th neuron of the input layer and the k-th neuron of the mapping layer.
5: Weight learning, where t is the number of learning cycles (t = 50). $W_v$ is the weight of the connection between the neurons surrounding the winning neuron and the initial vector. η is a constant in [0, 1] that gradually decreases to 0 by (6). $\delta_{vk}$ represents the value of the proximity relationship between the neuron k and the adjacent center v, as in (7), where $D_{vk}$ represents the distance of the output neuron k from the center of the network topology to the adjacent center v. R is the radius of the winning neighborhood $N_{kt}$ of neuron k.
6: Label the winning neurons.
7: End
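The competitive-learning steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Euclidean distance, the linear decay of η and of the neighborhood radius, the Gaussian neighborhood weight, and the 6 × 6 layout of the 36 mapping nodes are assumptions, because the corresponding equations are not reproduced in the text. The function names and the toy data are hypothetical; only η = 0.7, t = 50, and the 36 mapping nodes are taken from the description above.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the index of the mapping node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda k: math.dist(weights[k], x))


def train_kohonen(data, rows=6, cols=6, epochs=50, eta0=0.7, seed=0):
    """Train a small Kohonen map; the update rules are assumptions (see text)."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Random initial connection weights between input and mapping layer.
    weights = [[rng.random() for _ in range(dim)] for _ in grid]
    radius0 = max(rows, cols) / 2.0
    for t in range(epochs):
        decay = 1.0 - t / epochs               # assumed linear decay
        eta = eta0 * decay                     # learning rate shrinks toward 0
        radius = max(radius0 * decay, 0.5)     # neighborhood shrinks to the winner
        for x in data:
            winner = best_matching_unit(weights, x)
            for k, pos in enumerate(grid):
                d = math.dist(pos, grid[winner])
                if d <= radius:
                    # Assumed Gaussian proximity between node k and the winner.
                    delta = math.exp(-d * d / (2.0 * radius * radius))
                    weights[k] = [w + eta * delta * (xi - w)
                                  for w, xi in zip(weights[k], x)]
    return weights


# Hypothetical toy data: two well-separated groups of normalized samples.
group_a = [[0.10 + 0.01 * i, 0.15, 0.20] for i in range(5)]
group_b = [[0.90 - 0.01 * i, 0.85, 0.80] for i in range(5)]
weights = train_kohonen(group_a + group_b)
winners_a = {best_matching_unit(weights, x) for x in group_a}
winners_b = {best_matching_unit(weights, x) for x in group_b}
print("winning nodes for group A:", sorted(winners_a))
print("winning nodes for group B:", sorted(winners_b))
```

Samples that share a winning node (or neighboring nodes) are treated as one cluster; in this toy run the two groups should end up on different nodes.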

Correlation Indicators Validation and ESCC Differentiation Prediction
In recent years, machine learning technology has developed rapidly and has shown outstanding performance in many fields. In order to verify the correlation between these thirteen indicators and the degree of differentiation of esophageal squamous cell carcinoma, the degree of differentiation is predicted based on 21 and 13 indicators, respectively. In this study, 10 different classification algorithms are used to predict the differentiation of esophageal squamous cell carcinoma. The ten classification algorithms used in this paper are SVM (Vadali et al., 2019), Quadratic Discriminant Analysis (QDA) (Siqueira et al., 2017), CART (Cheng et al., 2017), Linear Discriminant Analysis (LDA) (Liu et al., 2016), KNN (Suyundikov et al., 2015), Ensemble (Xiao et al., 2018), ELM (Sachnev et al., 2015), Particle Swarm Optimization-Support Vector Machine (PSO-SVM) (Jiang et al., 2010), Genetic Algorithm-Support Vector Machine (GA-SVM) (Tao et al., 2019), and ABC-SVM (Alshamlan et al., 2016). Thirteen and twenty-one indicators are used as input features, respectively, and the degrees of differentiation of esophageal squamous cell carcinoma are used as the outputs. The average accuracies of the 10-fold cross-validation of the 10 classification algorithms are shown in Table 5.
Cross-validation is used to test the accuracy of the algorithms. Ten-fold cross-validation is a commonly used method to test the classification performance of classifiers. Tests of different algorithms on large datasets have shown that 10 folds is an appropriate choice to obtain the best error estimate. The dataset is divided into 10 parts; in turn, nine of them are used for training and the remaining one for validation. The mean of the 10 results is used as an estimate of the accuracy of the algorithm.
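The 10-fold procedure described above can be sketched in a few lines of Python. The nearest-centroid classifier is only a stand-in chosen to keep the example self-contained (it is not one of the ten algorithms compared in this study), and the function names and toy data are hypothetical.

```python
import math
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(X, y, fit, predict, k=10, seed=0):
    """Average accuracy over k folds: train on k-1 parts, validate on the rest."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k


def fit_centroids(X, y):
    """Stand-in classifier: store the mean feature vector of each class."""
    grouped = {}
    for row, label in zip(X, y):
        grouped.setdefault(label, []).append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()}


def predict_centroid(model, row):
    """Assign the class whose centroid is nearest to the sample."""
    return min(model, key=lambda label: math.dist(model[label], row))


# Hypothetical, clearly separable toy data with two classes.
X = ([[0.10 + 0.01 * i, 0.2] for i in range(20)]
     + [[0.90 - 0.01 * i, 0.8] for i in range(20)])
y = [0] * 20 + [1] * 20
accuracy = cross_val_accuracy(X, y, fit_centroids, predict_centroid)
print("mean 10-fold accuracy:", accuracy)
```

Any classifier with the same `fit`/`predict` shape can be passed in, so the accuracies of different algorithms are estimated on identical folds.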
As shown in Table 5, the degree of differentiation of esophageal squamous cell carcinoma is predicted by 10 different classification algorithms in the development sets and validation sets, based on 21 and 13 indicators, respectively. The results show that ABC-SVM has a higher average 10-fold cross-validation accuracy than the other nine algorithms and is more efficient in training. In the development sets, the average 10-fold cross-validation accuracy of ABC-SVM in predicting the degree of differentiation is 81.5% with 13 indicators, but only 75.0% with 21 indicators. In the validation sets, the corresponding accuracy is 80.0% with 13 indicators, but only 76.0% with 21 indicators. Thus, the 13 indicators screened by the clustering algorithm have a higher correlation with the degree of tumor differentiation of esophageal squamous cell carcinoma. They not only improve the prediction accuracy of the differentiation degree, but also reduce the training time and improve the training efficiency of the classification algorithm.
The ABC algorithm is a widely used optimization algorithm. It can solve high-dimensional and complex problems, and it achieves good results in our study. In order to improve the classification efficiency of the SVM, the SVM parameters c and g are optimized by ABC. ABC-SVM not only has few control parameters, but can also reach the global optimum when solving complex high-dimensional problems. ABC-SVM also has some limitations. It runs slowly on large sample data. Moreover, SVM is sensitive to the choice of parameters and kernel functions.
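To illustrate how ABC can tune two parameters such as c and g, the sketch below runs a basic artificial bee colony loop (employed, onlooker, and scout phases) over a box of candidate values. It is a hedged illustration rather than the study's implementation: the colony size, the abandonment limit, the search range, and the quadratic `stand_in_error` objective are all assumptions. In an actual ABC-SVM, the objective would be the cross-validation error of an SVM trained with the candidate parameters.

```python
import random


def abc_minimize(objective, bounds, n_sources=10, limit=10, iterations=200, seed=0):
    """Minimize objective over a box with a basic artificial bee colony loop."""
    rng = random.Random(seed)
    dim = len(bounds)
    sources = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_sources)]
    costs = [objective(s) for s in sources]
    trials = [0] * n_sources
    best = {"cost": float("inf"), "x": None}

    def record(i):
        # Remember the best food source seen so far.
        if costs[i] < best["cost"]:
            best["cost"], best["x"] = costs[i], sources[i][:]

    def try_neighbor(i):
        # Perturb one coordinate of source i relative to another random source.
        j = rng.randrange(dim)
        k = rng.choice([s for s in range(n_sources) if s != i])
        candidate = sources[i][:]
        candidate[j] += rng.uniform(-1.0, 1.0) * (sources[i][j] - sources[k][j])
        lo, hi = bounds[j]
        candidate[j] = min(max(candidate[j], lo), hi)
        cost = objective(candidate)
        if cost < costs[i]:                      # greedy selection
            sources[i], costs[i], trials[i] = candidate, cost, 0
            record(i)
        else:
            trials[i] += 1

    for i in range(n_sources):
        record(i)
    for _ in range(iterations):
        for i in range(n_sources):               # employed bees
            try_neighbor(i)
        fitness = [1.0 / (1.0 + c) for c in costs]   # assumes costs >= 0
        for _ in range(n_sources):               # onlooker bees favor good sources
            try_neighbor(rng.choices(range(n_sources), weights=fitness)[0])
        for i in range(n_sources):               # scouts replace exhausted sources
            if trials[i] > limit:
                sources[i] = [rng.uniform(lo, hi) for lo, hi in bounds]
                costs[i] = objective(sources[i])
                trials[i] = 0
                record(i)
    return best["x"], best["cost"]


def stand_in_error(params):
    """Hypothetical smooth error surface over (log2 c, log2 g); not real SVM error."""
    log2_c, log2_g = params
    return (log2_c - 1.0) ** 2 + (log2_g + 2.0) ** 2


search_box = [(-5.0, 5.0), (-5.0, 5.0)]          # assumed search range
best_params, best_error = abc_minimize(stand_in_error, search_box)
print("best (log2 c, log2 g):", best_params, "error:", best_error)
```

Searching in log-space is a common convention for SVM parameters and is used here only as an assumption; the slow running time noted above arises because every candidate evaluation would require a full round of SVM training.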

Development of the Predictive Model for the Degree of Differentiation of ESCC
The data from the development sets are analyzed by using the multiple logistic regression approach. A multiple logistic regression model is a linear regression model with multiple independent variables. It is used to reveal the linear relationship between the dependent variable and multiple independent variables (Linfante et al., 2016; Sun et al., 2017). Its mathematical model can be formulated as
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \tag{8}$$
where $Y$ is the dependent variable, $\beta_0$ stands for the constant, $x_1, x_2, ..., x_p$ represent the independent variables, and $\beta_1, \beta_2, ..., \beta_p$ are the weighting coefficients of the corresponding independent variables. Thirteen indicators significantly associated with the degree of esophageal squamous cell carcinoma differentiation are used as inputs and the degree of differentiation as the output. The resulting model can be expressed as
$$\text{Model 1} = 6.59X_1 + 3.85X_2 + 4.69X_3 + 10.005X_4 + 11.74X_5 + 47.24X_6 + 5.51X_7 + 107.13X_9 + 5.571X_{11} - 9.66X_{12} - 13.01X_{13} \tag{9}$$
where $X_1$ represents the tumor site and $X_2$ represents the tumor length. $X_3$ is the tumor width and $X_4$ is the tumor thickness. $X_5$ represents the WBC count and $X_6$ represents the lymphocyte count. $X_7$ is the monocyte count and $X_9$ is the eosinophil count. $X_{11}$ represents the red blood cell count and $X_{12}$ represents the PT. $X_{13}$ represents the INR.
The receiver operating characteristic (ROC) curve is used in a wide range of applications. As a common analytical tool, the ROC curve is used not only to describe the discrimination accuracy of a prediction model, but also to find critical thresholds for classification (Mas, 2018; Obuchowski and Bullen, 2018; Sun et al., 2018; Luquefernandez et al., 2019). In this study, the ROC curve is used to test the predictive ability of the model. The ROC curve of Model 1 in the development sets is shown in Figure 3A. The ROC results for Model 1 in the development sets are shown in Table 6. The area under the curve (AUC) is 0.672, larger than 0.5, with P = 0.0007. It follows that Model 1 has some predictive value for the differentiation of esophageal squamous cell carcinoma.
Abbreviations: SVM, Support Vector Machine; QDA, Quadratic Discriminant Analysis; CART, Classification And Regression Tree; LDA, Linear Discriminant Analysis; KNN, K-Nearest Neighbor; Ensemble, Ensemble Bagged Tree; ELM, Extreme Learning Machine; PSO-SVM, Particle Swarm Optimization-Support Vector Machine; GA-SVM, Genetic Algorithm-Support Vector Machine; ABC-SVM, Artificial Bee Colony-Support Vector Machine.
Then, the Kohonen algorithm is used again to obtain 5 indicators that are significantly related to the degree of differentiation. In the Kohonen algorithm, N is set as 13 and m is given as 114. i is regarded as 114 and l is set as 13. The thirteen indicators are clustered by the Kohonen algorithm to obtain five relevant indicators, which are tumor thickness, monocyte count, eosinophil count, basophil count, and INR. Multiple logistic regression is used, with the five correlation indicators as inputs and the degrees of differentiation as outputs. Model 2 and Model 3 are obtained and can be expressed as
$$\text{Model 2} = 5.85X_1 + 3.66X_2 + 38.66X_3 + 8.32X_4 - 12.38X_5 \tag{10}$$
$$\text{Model 3} = 5.85X_1 + 38.67X_3 + 8.32X_4 - 12.38X_5 \tag{11}$$
where $X_1$ is the tumor thickness and $X_2$ is the monocyte count. $X_3$ represents the eosinophil count and $X_4$ represents the basophil count. $X_5$ is the INR.
ReliefF is a feature weighting algorithm that runs efficiently. It places no restriction on the data type and assigns higher weights to all features that are highly correlated with the category (Palmamendoza et al., 2018; Urbanowicz et al., 2018). Algorithm 2 presents the main procedures of the ReliefF algorithm. Model 4 and Model 5 can be expressed as
$$\text{Model 4} = 1.08X_1 + 1.5X_2 + 2.09X_3 + 0.56X_4 - 1.5X_5 \tag{12}$$
$$\text{Model 5} = 1.7X_1 + 1.5X_2 - 0.28X_3 - 0.92X_4 + 0.84X_5 \tag{13}$$
where $X_1$ is the tumor thickness and $X_2$ is the monocyte count. $X_3$ represents the eosinophil count and $X_4$ represents the basophil count. $X_5$ is the INR. The ROC results are shown in Table 6. The area under the curve (AUC) for Model 4 is 0.628, larger than 0.5, with P = 0.0142. The AUC for Model 5 is 0.608, larger than 0.5, with P = 0.0417. Therefore, Model 4 and Model 5 have some predictive value for the differentiation of esophageal squamous cell carcinoma.
Algorithm 2 | ReliefF algorithm.
In the sample sets of the same class as R, find the k nearest neighbors $H_j$ (j = 1, 2, ..., k) of R. In the sample sets of classes different from R, find the k nearest neighbors $M_j(C)$ of R;
5: for A = 1 to N (all indicators) do
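The neighbor search and weight update of ReliefF can be sketched as follows. This follows the commonly described form of the algorithm and is not necessarily the exact variant used in the study: using every sample once as the reference R (instead of random sampling), the range-normalized absolute difference, and the Manhattan distance for the neighbor search are assumptions, and the toy data and function name are hypothetical.

```python
def relieff(X, y, n_neighbors=3):
    """Return one ReliefF weight per feature (higher = more class-relevant)."""
    n, d = len(X), len(X[0])
    # Range of each feature, used to scale differences into [0, 1].
    spans = [(max(col) - min(col)) or 1.0 for col in zip(*X)]
    priors = {c: y.count(c) / n for c in set(y)}

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(a, b, j) for j in range(d))

    weights = [0.0] * d
    for i in range(n):                       # every sample serves once as R
        by_class = {}
        for m in range(n):
            if m != i:
                by_class.setdefault(y[m], []).append(m)
        for c, members in by_class.items():
            nearest = sorted(members, key=lambda m: distance(X[i], X[m]))[:n_neighbors]
            if c == y[i]:
                scale = -1.0                 # nearest hits H_j lower the weight
            else:
                # Nearest misses M_j(C) raise it, weighted by the class prior.
                scale = priors[c] / (1.0 - priors[y[i]])
            for m in nearest:
                for j in range(d):
                    weights[j] += scale * diff(X[i], X[m], j) / (n * len(nearest))
    return weights


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.8, 0.2], [0.9, 0.8], [1.0, 0.4]]
y = [0, 0, 0, 1, 1, 1]
feature_weights = relieff(X, y, n_neighbors=2)
print("ReliefF weights:", feature_weights)
```

A feature whose values differ little among nearest hits but strongly among nearest misses receives a large weight, which is why the separating feature outranks the noise feature in this toy run.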

Validation of the Predictive Model for the Degree of Differentiation of ESCC
To test the validity of the models, the five models obtained in the development sets are tested and evaluated in the validation sets. The ROC curves of Model 1, Model 2, Model 3, Model 4, and Model 5 are shown in Figures 4A-E, respectively. The ROC analysis results of the five models in the validation sets are shown in Table 7. The results show that the five models also have good predictive value in the validation sets. The AUC values of the five models in the validation sets are 0.753, 0.728, 0.744, 0.776, and 0.868. The P-values for the five models are less than 0.0001. Therefore, the five models have some application performance and potential predictive capability.
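As a brief illustration of how an AUC value is obtained from a model's scores, the sketch below computes the AUC as the fraction of (positive, negative) sample pairs that the score ranks correctly, counting ties as one half. It assumes a binary outcome coding (1 and 0), which the text does not specify; the function names, scores, and labels are hypothetical, while the Model 3 coefficients are taken from the text.

```python
def linear_score(coefficients, features):
    """Weighted sum of indicator values, as in the linear models above."""
    return sum(c * x for c, x in zip(coefficients, features))


def roc_auc(scores, labels):
    """AUC as the probability that a positive sample outscores a negative one."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical scores and binary labels, for illustration only.
example_auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print("AUC:", example_auc)

# Model 3 coefficients over (tumor thickness, eosinophil count, basophil count, INR);
# the input values are made up.
model_3 = [5.85, 38.67, 8.32, -12.38]
print("Model 3 score:", linear_score(model_3, [1.0, 0.0, 0.0, 1.0]))
```

An AUC of 0.5 corresponds to a score that ranks positive and negative samples no better than chance, which is why the reported values are compared against 0.5.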

Constructing New Features to Predict the Degree of Differentiation of ESCC
To predict the degree of esophageal squamous cell carcinoma differentiation more accurately, new features are constructed based on the five models obtained in this paper.
The predicted values of the five models are taken as input features and the degrees of differentiation as the outputs. The ABC-SVM algorithm is used to predict the degree of differentiation of esophageal squamous cell carcinoma. In the development sets, the average 10-fold cross-validation accuracy of ABC-SVM based on the five model features is 82%, whereas the average accuracy based on 21 and 13 indicators is only 75 and 81.5%, respectively. In the validation sets, the average 10-fold cross-validation accuracy of ABC-SVM based on the five model features is 86.5%, whereas the average accuracy based on 21 and 13 indicators is only 76 and 80%, respectively. As shown in Table 8, the prediction accuracy is improved by combining the features of the five models, reaching 82 and 86.5% in the development sets and validation sets, respectively. Based on the new features constructed from the five models, the operational efficiency of ABC-SVM is enhanced, and the prediction accuracy of esophageal squamous cell carcinoma differentiation is effectively improved.
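The feature construction step can be sketched as follows: each sample's indicators are passed through the five models, and the five resulting values form the new input vector for the classifier. The coefficients are those of Models 1 to 5 as given in the text; the dictionary keys, the function name, and the sample values are hypothetical, and any scaling or normalization of the indicators applied in the study is not specified and therefore not reproduced.

```python
# Coefficients of Models 1-5, keyed by hypothetical indicator names.
MODEL_1 = {"tumor_site": 6.59, "tumor_length": 3.85, "tumor_width": 4.69,
           "tumor_thickness": 10.005, "wbc": 11.74, "lymphocyte": 47.24,
           "monocyte": 5.51, "eosinophil": 107.13, "rbc": 5.571,
           "pt": -9.66, "inr": -13.01}
MODEL_2 = {"tumor_thickness": 5.85, "monocyte": 3.66, "eosinophil": 38.66,
           "basophil": 8.32, "inr": -12.38}
MODEL_3 = {"tumor_thickness": 5.85, "eosinophil": 38.67,
           "basophil": 8.32, "inr": -12.38}
MODEL_4 = {"tumor_thickness": 1.08, "monocyte": 1.5, "eosinophil": 2.09,
           "basophil": 0.56, "inr": -1.5}
MODEL_5 = {"tumor_thickness": 1.7, "monocyte": 1.5, "eosinophil": -0.28,
           "basophil": -0.92, "inr": 0.84}
MODELS = [MODEL_1, MODEL_2, MODEL_3, MODEL_4, MODEL_5]


def model_features(sample):
    """Map one sample's indicators to the five model outputs (the new features)."""
    return [sum(coef * sample[name] for name, coef in model.items())
            for model in MODELS]


# Hypothetical sample: every indicator is zero except tumor thickness.
indicator_names = set().union(*MODELS)
sample = {name: 0.0 for name in indicator_names}
sample["tumor_thickness"] = 1.0
new_features = model_features(sample)
print("five-model feature vector:", new_features)
```

The resulting five-dimensional vectors, one per patient, would then replace the 13 or 21 raw indicators as the classifier input.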

CONCLUSIONS
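In this paper, the Kohonen clustering algorithm, ABC-SVM, logistic regression, ReliefF, and ROC analysis are used to analyze and predict the tumor differentiation of esophageal squamous cell carcinoma. Thirteen indicators significantly associated with the degree of differentiation of esophageal squamous cell carcinoma are found, and five models with predictive value for the degree of differentiation are developed and validated. With the predicted values of the five models as input features, the 10-fold cross-validation accuracy of ABC-SVM in the development sets and validation sets is 82.0 and 86.5%, respectively. In this study, tumor differentiation in esophageal squamous cell carcinoma patients is effectively analyzed and predicted, which can assist physicians in their diagnostic decisions and support timely diagnosis and effective treatment of patients.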

DATA AVAILABILITY STATEMENT
The datasets presented in this article are not readily available because the data used in the study are private and confidential data. Requests to access the datasets should be directed to Junwei Sun, junweisun@yeah.net.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. The ethics committee waived the requirement of written informed consent for participation.

AUTHOR CONTRIBUTIONS
All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.