Machine Learning-Based Radiomics Predicting Tumor Grades and Expression of Multiple Pathologic Biomarkers in Gliomas

Background The grading and pathologic biomarkers of glioma have important guiding significance for individualized treatment. In clinical practice, tumor samples often must be obtained through invasive operations for pathological diagnosis. The present study aimed to use conventional machine learning algorithms to predict tumor grades and pathologic biomarkers from magnetic resonance imaging (MRI) data. Methods The present study retrospectively collected a dataset of 367 glioma patients who had pathological reports and underwent MRI scans between October 2013 and March 2019. Radiomic features were extracted from contrast-enhanced MRI images, and three frequently used machine-learning models, Logistic Regression (LR), Support Vector Machine (SVM), and Random Forests (RF), were built for four predictive tasks: (1) glioma grades, (2) Ki67 expression level, (3) GFAP expression level, and (4) S100 expression level in gliomas. Each sub-dataset was split into training and testing sets at a ratio of 4:1. The training sets were used for training and tuning the models, and the testing sets were used for evaluating them. The best classifier for each task was chosen according to the area under the curve (AUC) and accuracy. Results The RF algorithm was found to be stable and consistently performed better than LR and SVM for all tasks. The RF classifier for glioma grades achieved an AUC of 0.79 and an accuracy of 0.81. The RF classifier for Ki67 expression achieved an AUC of 0.85 and an accuracy of 0.80. The AUC and accuracy for the GFAP classifier were 0.72 and 0.81, and those for S100 expression were 0.60 and 0.91, respectively. Conclusion The machine-learning based radiomics approach can provide a non-invasive method for the preoperative prediction of glioma grades and expression levels of multiple pathologic biomarkers, with favorable predictive accuracy and stability.


INTRODUCTION
Gliomas are the most common brain tumors and are often classified as World Health Organization (WHO) grades I-IV, depending on the tumor cell type and the degree of abnormality (1,2). As the tumor grade increases, gliomas progress more aggressively (3). Treatment options and responses differ by glioma grade (4). Pathological findings are the premise of rational treatment. Usually, glioma grades are confirmed by pathological examination during surgery or biopsy (5). A subsequent immunohistochemistry (IHC) test then determines the molecular biomarkers of tumor tissues at the microscopic level. These pathologic biomarkers, typically proteins, are useful indicators for diagnosis, prognosis, or treatment response (6). However, obtaining such information for gliomas requires invasive approaches. Surgical decision making can be difficult and time-consuming for many patients. Patients who are not eligible for surgery or who seek nonsurgical treatment may have limited treatment options without pathological guidance. Therefore, non-invasive approaches that provide presurgical glioma grades and biomarker expression are valuable and preferred.
At present, medical imaging can differentiate the tumor phenotype and intra-tumor heterogeneity (7). Conventional magnetic resonance imaging (MRI) is routinely used in the diagnosis and management of glioma patients. T1-weighted contrast-enhanced MRI (T1C) is the current standard for initial brain tumor imaging (8). Radiomics can generate high-dimensional image features from intensity histogram, geometry, and texture analyses of the entire tumor volume (9). With the emergence of Artificial Intelligence (AI) technologies, advanced informatics tools have become accessible to facilitate machine learning (ML) based radiomics applications using image features as the data source (10). Radiomics is gaining ground in oncology and has the potential to accurately classify or predict tumor characteristics.
Radiomics approaches have been applied to the prediction of glioma grades and to differential diagnosis (11,12). Several studies have reached a prediction accuracy above 80% using popular ML models. The ML algorithms commonly used in radiomics include Logistic Regression (LR), Random Forests (RF), and Support Vector Machine (SVM), among others. Each ML method has its own advantages in classification. For example, LR fits the coefficients of the variables and predicts a logit transformation of the probability of belonging to one class or the other. SVM separates the classes by finding an optimal hyperplane. RF applies bootstrap aggregating to decision trees and improves classification performance.
Compared with tumor grading, making predictions at the molecular level is more challenging. Kickingereder et al. reported the association between established MRI features and cancer gene variations (EGFR amplification and CDKN2A loss), but failed to build a sufficient ML model to predict the molecular characteristics (13). In the clinic, pathologic biomarkers are tested more frequently than genetic markers. IDH1 is one important glioma biomarker, and IDH1 mutation along with 1p/19q codeletion is part of the molecular diagnosis in the updated 2016 WHO classification (14). Ki67, S100, and GFAP are also common protein targets for gliomas. IDH1, Ki67, and GFAP were once considered the golden triad of glioma IHC (15). Ki67 is highly correlated with proliferation, which may indicate the tumor grade and prognosis (16)(17)(18). S100 has been implicated in the regulation of cellular activities, such as metabolism, motility, and proliferation. Under the pathological conditions of tumor and inflammation, the concentration of the S100 protein increases to the micromole level, which stimulates microglia and astrocytes and increases the expression of pro-inflammatory cytokines (19)(20)(21)(22)(23). GFAP is the most widely used marker of astrocytes (24). Under conditions of injury (trauma or disease), the expression of GFAP in astrocytes rapidly increases (25). GFAP is often used to reveal the astrocytic lineage of glial cells and glial tumor cells, and plays a more significant role in tumor pathology than in the differential diagnosis of astrocytoma. Ki67, S100, and GFAP may not be reliable diagnostic biomarkers for gliomas, because their roles in gliomas are still under investigation and controversial findings have been observed in experiments (26). However, there is no doubt that these proteins can provide some insights into the tumor microenvironment.
It is therefore not surprising that most radiomics studies so far favor the prediction of IDH expression for molecular diagnosis (11,27), with a few reports on Ki67 (28). In order to expand the predictive scope of radiomics, the investigators aimed to assess the feasibility of predicting glioma grades and the pathologic biomarkers Ki67, S100, and GFAP in gliomas. The investigators believed that the combination of multiple biomarkers can increase the predictive power, and that the information obtained can help in understanding the underlying pathologic process in gliomas. The investigators designed the present retrospective study and extracted hundreds of radiomic features from the T1C images of 367 glioma patients. Three machine-learning-based models (LR, SVM, and RF) were built to perform the tasks: (1) classify the glioma grades, and (2) predict the expression levels of Ki67, S100, and GFAP. This study demonstrated that multiple pathologic biomarkers in gliomas can be estimated with clinically relevant certainty using common ML models on conventional MRI data and pathological records.

Study Cohort
The investigators retrospectively collected a data set of 420 glioma patients, who had pathological reports and MRI scans performed between October 2013 and March 2019, from the Second Xiangya Hospital of Central South University. The patients who met the following criteria were included: (i) a histopathological diagnosis of primary glioma based on the WHO classification, (ii) the availability of IHC profiles of biomarkers (S100, GFAP, and Ki67), (iii) preoperative MRI data of post-contrast axial T1-weighted (T1C), and (iv) age > 18 years old. Patients were excluded due to the following: (i) secondary gliomas or postoperative recurrence of gliomas, (ii) obvious artifacts in MRI. Ethics approval was obtained for the present study from the Ethics Committee of the Second Xiangya Hospital, Central South University.

Pathological Evaluation
Patient demographics (age and gender), and histopathologic diagnosis and IHC results were obtained from a surgical pathology report. On these reports, the diagnosis included a specific glioma type by cells (e.g., astrocytoma and oligodendrogliomas) and a given WHO grade (I-IV). The IHC results were presented in the list of glioma biomarkers (e.g., S100, GFAP, or Ki67) and their own expression profile in tumor cells. It is noteworthy that the list was not standard and varied upon the request or availability of the biomarkers at that time.
For example, few patients received an IDH1 test before 2017, but IDH1 tests became common after the 2016 WHO classification standard was published. Consequently, patients might have different sets of tested biomarkers, and the number of cases can differ for each biomarker. The IHC results depended on the scoring system used. The expression levels were usually evaluated by the staining intensity of positive cells, and points were assigned to describe these positive cells by count (e.g., 0 points as negative (−), 1 point as positive (+), 2 points as medium positive, and 3 points as high positive), percentage (e.g., 0 points as none, 1 point less than 5%, 2 points approximately 5-25%, and 3 points above 25%), or the appearance of a clear brown color (e.g., 1 point for light yellow). In the study, the glioma grades were classified as low-grade (WHO I-II, benign) and high-grade (WHO III-IV, malignant), and the expression levels of biomarkers were divided into two categories: low expression (scored less than 2 points) and high expression (scored 2 points or above).
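The binarization rule described above can be sketched as follows. This is a minimal illustration, assuming WHO grades are coded as the integers 1-4 and IHC scores as 0-3 points; the function names are hypothetical and are not taken from the study's code.

```python
def binarize_grade(who_grade: int) -> str:
    """Map a WHO grade (I-IV, coded here as 1-4) to the two study classes."""
    if who_grade not in (1, 2, 3, 4):
        raise ValueError(f"unexpected WHO grade: {who_grade}")
    # WHO I-II -> low-grade, WHO III-IV -> high-grade
    return "low-grade" if who_grade <= 2 else "high-grade"


def binarize_expression(ihc_points: int) -> str:
    """Map an IHC score (0-3 points) to low (<2) or high (>=2) expression."""
    if ihc_points not in (0, 1, 2, 3):
        raise ValueError(f"unexpected IHC score: {ihc_points}")
    return "low" if ihc_points < 2 else "high"


# Example: every possible score mapped to its class label.
grade_labels = {g: binarize_grade(g) for g in (1, 2, 3, 4)}
expression_labels = {p: binarize_expression(p) for p in (0, 1, 2, 3)}
```

Under this coding, a negative (−) and a positive (+) result both fall into the low-expression class, which is why the low-expression class dominates for some biomarkers.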

Imaging Post-processing and Radiomics Features Extraction
Magnetic resonance imaging scans were acquired from different scanners over time. The selected DICOM images were exported from the Picture Archiving and Communication System (PACS) to a local computer using the RadiAnt DICOM Viewer (Medixant, PL). In order to reduce the influence of different scanning parameters, post-processing and image registration were applied using the Advanced Normalization Tools (ANTs 2.1, PA). Then, the DICOM images were loaded into ITK-SNAP for segmentation and standardization (29). Two neuroradiologists (5 years of experience) drew the region of interest (ROI) around the tumor boundary on the T1C images. The neuroradiologists were blinded to the patient identification and diagnosis. Disagreements about the boundary were resolved by consensus. The ROI segmentations were resampled to match the dimensions of the original images, and both images were saved in .nrrd format as the input for feature extraction.
The PyRadiomics extractor was customized to calculate and extract the features (10). All built-in filters [wavelet, Laplacian of Gaussian (LoG), square, square root, logarithm, and exponential] were enabled on five image feature classes [first order statistics, shape descriptors, and texture features on the gray-level co-occurrence matrix (GLCM), gray-level run length matrix (GLRLM), and gray-level size zone matrix (GLSZM)]. Feature definitions and calculation algorithms are available in the PyRadiomics documentation.

Machine Learning
The feature importance analysis and the subsequent predictive ML methods were implemented using Python (version 3.7.0) with the machine-learning library scikit-learn (version 0.23.0) (30). All features were normalized through Min-Max scaling. Features with all-zero values were removed. Clinical data (age and gender) were added in constructing the final prediction models.
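The preprocessing described above can be illustrated with a short NumPy sketch. The feature matrix below is a toy example, the helper names are hypothetical, and handling constant columns by mapping them to zero is an assumption made here rather than a detail reported in the study.

```python
import numpy as np


def drop_all_zero_features(X):
    """Remove columns whose values are zero for every patient."""
    keep = np.any(X != 0, axis=0)
    return X[:, keep], keep


def min_max_scale(X):
    """Rescale each column to the [0, 1] range."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # assumption: constant columns map to 0
    return (X - col_min) / col_range


# Toy matrix: 3 patients x 3 features, the middle feature is all zero.
X = np.array([[1.0, 0.0, 10.0],
              [3.0, 0.0, 20.0],
              [5.0, 0.0, 40.0]])

X_kept, keep_mask = drop_all_zero_features(X)
X_scaled = min_max_scale(X_kept)
```

In a train/test setting, the minimum and range would normally be computed on the training set only and then applied to the testing set, so that no information from the testing set leaks into the scaling.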

Feature Importance
The feature importance analysis helped in understanding which features mattered, since a large number of radiomics features with high-dimensional data are difficult to interpret. Three approaches were used to identify the important features. First, chi-squared (chi2) tests were applied with the scikit-learn SelectKBest class to obtain a list of the top 15 features. Second, a heatmap of correlated features was plotted using the seaborn library to identify features highly correlated with the prediction targets (glioma grade and biomarker expression). Third, an RF classifier was fitted and its built-in feature importance was used to extract the top features.
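The first and third approaches can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data generated with make_classification, not the study's radiomic features; the sample size, feature count, and random seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a radiomic feature matrix (assumption).
X_raw, y = make_classification(n_samples=200, n_features=50,
                               n_informative=8, random_state=0)

# chi2 requires non-negative inputs, which Min-Max scaling guarantees.
X = MinMaxScaler().fit_transform(X_raw)

# Approach 1: top 15 features by chi-squared score.
selector = SelectKBest(score_func=chi2, k=15).fit(X, y)
chi2_top15 = np.argsort(selector.scores_)[::-1][:15]

# Approach 3: top 15 features by the RF built-in (impurity-based) importance.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
rf_top15 = np.argsort(rf.feature_importances_)[::-1][:15]

# Features ranked highly by both methods.
shared = sorted(set(chi2_top15.tolist()) & set(rf_top15.tolist()))
```

The two rankings need not agree: the chi-squared score evaluates each feature on its own, while the RF importance reflects how features are used jointly in the trees.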

Predictive Machine Learning Models
Three frequently used machine-learning models (LR, SVM, and RF) were built for four predictive tasks: (1) glioma grades, (2) Ki67 expression level, (3) GFAP expression level, and (4) S100 expression level in gliomas. Each sub-dataset was divided into training and testing sets at a ratio of 4:1 (train_size = 0.8, test_size = 0.2). Principal Component Analysis (PCA) was applied for dimensionality reduction, mapping n-dimensional features to k-dimensional features (n > k) and resulting in new orthogonal features. For the unbalanced data in different classes, the synthetic minority over-sampling technique (SMOTE) algorithm was used to oversample the minority class (31). On the training set, grid search with cross-validation was applied for hyperparameter tuning (RF and SVM), and k-fold validation was used for LR. For model selection, the accuracy score was compared with the result from the corresponding base model (default settings in scikit-learn). The testing set was used for the final model evaluation. The performance of the models was evaluated according to accuracy, the area under the curve (AUC) of the receiver operating characteristic (ROC), sensitivity, specificity, the positive predictive value (PPV), and the negative predictive value (NPV). According to the AUC and accuracy, the best classifier was chosen for each task.
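The workflow above (split, oversample, PCA, grid search, test evaluation) can be sketched for the RF model as follows. This is an illustration under several assumptions, not the study's implementation: the data are synthetic, the split is stratified, the parameter grid is arbitrary, and smote_like_oversample is a hypothetical, simplified stand-in written here for a full SMOTE implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X, y, minority_label, k=5, seed=0):
    """Simplified SMOTE-style oversampling for a binary problem: create
    synthetic minority points by interpolating between a minority sample
    and one of its k nearest minority neighbours until classes balance."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    if len(X_min) < 2:
        raise ValueError("need at least two minority samples")
    n_new = int(np.sum(y != minority_label)) - len(X_min)
    if n_new <= 0:
        return X, y
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop self
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    X_new = X_min[base] + gap * (X_min[picked] - X_min[base])
    y_new = np.full(n_new, minority_label)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])


# Synthetic, imbalanced stand-in for one sub-dataset (assumption).
X, y = make_classification(n_samples=300, n_features=60, n_informative=10,
                           weights=[0.85, 0.15], random_state=0)

# 4:1 split; stratification is an assumption, not stated in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, test_size=0.2, stratify=y, random_state=0)

# Oversample the minority class on the training set only.
X_bal, y_bal = smote_like_oversample(X_train, y_train, minority_label=1)

# Keep enough principal components to retain 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full").fit(X_bal)

# Grid search with 5-fold cross-validation (illustrative grid).
param_grid = {"n_estimators": [50, 100], "min_samples_leaf": [1, 2]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(pca.transform(X_bal), y_bal)

# Final evaluation on the untouched testing set.
X_test_pca = pca.transform(X_test)
test_auc = roc_auc_score(y_test, search.predict_proba(X_test_pca)[:, 1])
test_acc = accuracy_score(y_test, search.predict(X_test_pca))
```

One design caveat of this ordering: because oversampling happens before the cross-validation folds are formed, synthetic points derived from a sample can end up in a different fold than that sample, which tends to make the cross-validated score optimistic. The held-out testing set is not affected.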

Statistics
One-way ANOVA or a simple t-test was applied to test the differences among gender, age, glioma grade, and the expression levels of the biomarkers. Descriptive statistics were used to summarize the important features by filter and feature class. All tests used a significance level of 0.05.
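For a comparison between two groups, the two tests named above are equivalent: the one-way ANOVA F statistic equals the square of the equal-variance t statistic, and the P-values coincide. The sketch below illustrates this with SciPy on synthetic ages; the values are randomly generated for illustration and are not the study data, and only the group sizes (210 and 159) follow the cohort description.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic ages for two groups (assumption: normal, similar spread).
ages_group_a = rng.normal(47.0, 13.0, size=210)
ages_group_b = rng.normal(44.0, 13.0, size=159)

# One-way ANOVA and the equal-variance two-sample t-test.
f_stat, p_anova = stats.f_oneway(ages_group_a, ages_group_b)
t_stat, p_ttest = stats.ttest_ind(ages_group_a, ages_group_b)

# Degrees of freedom reported as F(df_between, df_within).
n_groups = 2
df_between = n_groups - 1
df_within = len(ages_group_a) + len(ages_group_b) - n_groups

significant = p_anova < 0.05  # significance level used in the study
```

With 369 patients in two groups, the degrees of freedom are 1 and 367, which is the form in which a two-group ANOVA result is reported.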

Subjects and Pathologic Biomarkers
A data set of preoperative MRI and surgical pathologic reports of 420 glioma patients was collected. A total of 51 patients were excluded for not meeting the inclusion criteria. Among these patients, 40 were under 18 years old, seven had quality issues in their MRI data, and four did not have an assigned WHO classification level in their records. The age of the enrolled 369 patients ranged within 18-75 years old (mean age: 45.63 ± 13.22 years old), and the cohort consisted of 210 males (age: 46.99 ± 13.24 years old) and 159 females (age: 43.84 ± 13.03 years old). The clinical characteristics of the patients and the distribution of the selected biomarkers across glioma grades are presented in Table 1.
There was a significant age difference between male and female patients, as determined by one-way ANOVA [F (1, 367) = 5.17, P < 0.05]. Furthermore, there were significant differences in age, gender, and tumor volume among glioma grades (WHO I-IV). Moreover, there were significant differences in glioma grade, tumor size, age, and gender for the Ki67 expression. However, there were no significant differences in age, gender, and glioma grade for S100 and GFAP expression. The t-test and one-way ANOVA results are shown in Table 2.

Features Importance
The investigators obtained the list of the top 15 important features based on the chi-squared scores between each non-negative feature and the glioma grade and the S100, GFAP, and Ki67 expression levels. The features and their scores are shown in Table 3. The scores ranged within 3.67-44.04. The mean score of the top important features was 9.30, with a standard deviation of 5.83. By image type, the most frequent top features were exponential (23), wavelet (22), square (6), square root (3), original (3), gradient (2), and lbp-2D (1). By feature class, the most frequent top features were glszm (27), glcm (9), glrlm (8), gldm (7), first order (7), and ngtdm (2). The heatmaps of the correlated features for glioma grade and the biomarkers Ki67, GFAP, and S100 are presented in Figure 1. The RF model built-in feature importance is presented in Figure 2.

Prediction Machine Learning Models
The performance of the 12 predictive models is presented in Table 4. The RF models performed slightly better than the other models. The comparisons by accuracy and the results are presented below. Figure 3 shows the ROC curves and AUC values for the RF classifiers on the test subsets.

Glioma Grades
The sub data set was randomly split into the training set of 276 cases and the test set of 93 cases. With a PCA retention of 0.95, the PCA process reduced the dimensions Frontiers in Oncology | www.frontiersin.org

S100 Expression
A total of 338 patients had S100 test results, which included 323 low expression levels (<2 points) and 15 high expression levels (≥2 points). The class distribution was 323:15. The data were split into a training set of 270 cases and a test set of 68 cases. After the SMOTE oversampling, the number of resampled training cases increased to 518. With a PCA retention of 0.95, the PCA process reduced the dimensions to 38 components, and these were used for the final prediction model for the S100 expression. After grid search with cross-validation (cv = 5) or k-fold validation (n_splits = 5), the selected classifiers included: (1) LR (penalty = "l2", C = 1.0), (2) SVM (C = 1, kernel = "rbf", and gamma = "auto"), and (3) RF (min_samples_leaf = 1, min_samples_split = 2, and n_estimators = 100). Among these classifiers, the RF classifier achieved the best prediction performance on the S100 expression, based on the measurements (AUC: 0.60, accuracy: 0.91, average-weighted sensitivity: 0.88, specificity: 0.91, and f1 score: 0.90). It is noteworthy that the average-weighted score computes the metric (e.g., f1) for each class and returns the average weighted by the proportion of each class in the dataset. For S100 low expression levels: accuracy (0.95), sensitivity (0.94), specificity (0.97), and f1 (0.95). For high expression levels: none of the four high expression cases was correctly predicted.
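The effect of class-weighted averaging on a highly imbalanced test set can be illustrated with a small worked example. The confusion counts below are hypothetical: only the test-set size (68) and the number of high-expression cases (four, none correctly predicted) follow the text, while the split of the remaining predictions is an assumption chosen for illustration.

```python
def per_class_scores(tp, fp, fn):
    """Return (precision, recall, f1) for one class, with 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical confusion counts: confusion[true_class][predicted_class].
confusion = {
    "low": {"low": 62, "high": 2},
    "high": {"low": 4, "high": 0},
}

classes = list(confusion)
support = {c: sum(confusion[c].values()) for c in classes}
total = sum(support.values())

scores = {}
for c in classes:
    tp = confusion[c][c]
    fn = support[c] - tp
    fp = sum(confusion[other][c] for other in classes if other != c)
    scores[c] = per_class_scores(tp, fp, fn)

# Weighted averages: each class contributes in proportion to its support.
weighted_recall = sum(support[c] / total * scores[c][1] for c in classes)
weighted_f1 = sum(support[c] / total * scores[c][2] for c in classes)
accuracy = sum(confusion[c][c] for c in classes) / total
```

In this example the weighted recall is identical to the overall accuracy, and both stay high even though the recall for the high-expression class is zero. Weighted averages therefore say little about the minority class, which is why the per-class figures are reported alongside them.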

GFAP Expression
A total of 367 patients had a GFAP test. Among these patients, there were 327 low expression levels and 40 high expression levels. The class distribution ratio was 327:40.

DISCUSSION
The machine-learning based radiomics approach was applied to predict glioma grades and to classify the expression levels of the pathologic biomarkers Ki67, GFAP, and S100 as low or high. The overall performance of the ML models was satisfactory. The RF algorithm was found to be stable and consistently performed better than LR and SVM. Feature importance varied with the predictive task, that is, with glioma grade or the specific protein expression. The most frequent important feature classes were texture and first order statistics. We selected LR, SVM, and RF as classifiers mainly for their popularity. LR, SVM, and RF classifiers can work on non-text data sets with fewer than 100K samples. Because it is not known in advance whether the data are linearly separable, using both the linear models (LR, SVM) and the non-linear model (RF) helps to compare their effects and to limit the impact of poor data. LR showed a higher AUC in the GFAP prediction model, but performed worst in the S100 prediction. Comparing the overall results from the three biomarker prediction models, the combination of PCA reduction and RF classification consistently performed best. This suggests a common ML pipeline that may be helpful in standardizing the prediction of multiple protein expressions.
More recently, researchers have demonstrated achievements of deep learning (DL) in image segmentation and glioma grade prediction (32)(33)(34)(35)(36)(37). Convolutional neural networks (CNNs) have started outperforming other methods on several high-profile image analysis projects. DL has advantages in computation, as high-performance graphics processing units (GPUs) support fast computing and reduce modeling time. As a kind of end-to-end learning, DL can automatically extract relevant features from images, and tasks such as raw data processing and classification can be completed automatically. However, DL is complex and requires thousands of images to start with; with a relatively small collection of images like ours, overfitting is more likely. The classic ML methods met our needs and suited the data. RF models performed well for predicting glioma grades and the pathologic biomarkers S100, Ki67, and GFAP.
As is known, the roles of these biomarkers can be complicated and controversial in laboratory experiments (26). In addition to predicting tumor phenotypes, radiomics might offer a new approach to evaluating biomarkers, since differences in their expression can be identified through the analysis of imaging features. The expression level of Ki67 was significantly correlated with the tumor grade and tumor volume, as well as with patient age and gender. One study reported that a high level of Ki67 expression was correlated with poor overall survival (OS) and progression free survival (PFS) (16). The accurate prediction of high-level Ki67 expression is more meaningful than that of low-level expression for indicating poor prognosis in glioma patients.
GFAP was widely expressed in gliomas. Merely four patients presented as GFAP negative. The majority of the patients (323 of 367, 88%) were GFAP positive (+). Together with the four patients scored (−), the 327 patients with low GFAP expression (89%) were distributed across all glioma grades, including low grade (132, 40%) and high grade (195, 60%). The minority of the patients (40 of 367, 11%) were GFAP medium positive (++) or high positive (+++), distributed in low grade (15, 37.5%) and high grade (25, 62.5%). In the literature, a high GFAP expression is likely to be found in low grade gliomas. The present result was confusing, in that both the high and low expression levels of GFAP were more correlated with high grade gliomas. This result may echo the view that GFAP is not a direct predictor of low grade gliomas (15,26). In the classification report of the RF_GFAP model, the accuracy of predicting low GFAP expression was high, whereas that of predicting high GFAP expression was much lower. The overall prediction performance might not be meaningful, since GFAP expression was low in about 89% of patients, so a model that always answered "low" would be correct about 89% of the time. The same problem was found in the predictive model of S100. These two models therefore need to be reconsidered. It must first be determined which expression class is more valuable. Then, as one solution, the ROC threshold can be tuned to increase the sensitivity of the favored class.
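Both points in the paragraph above, the majority-class baseline and threshold tuning on the ROC curve, can be sketched as follows. The class counts (327 low, 40 high) are taken from the text; the labels and predicted scores in the second half are hypothetical toy values, and threshold_for_sensitivity is an illustrative helper rather than part of the study's code.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Accuracy of a trivial model that always predicts the majority (low) class.
n_low, n_high = 327, 40
majority_baseline = n_low / (n_low + n_high)


def threshold_for_sensitivity(y_true, y_score, target_sensitivity):
    """Return (threshold, sensitivity, specificity) for the highest score
    threshold whose sensitivity reaches the target on the ROC curve."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score, drop_intermediate=False)
    idx = int(np.argmax(tpr >= target_sensitivity))  # first point reaching target
    return float(thresholds[idx]), float(tpr[idx]), float(1.0 - fpr[idx])


# Hypothetical toy data: 1 marks the favored (high-expression) class.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.45, 0.60, 0.40, 0.70])

# At the default threshold of 0.5 only one of the two positives is detected.
default_pred = (y_score >= 0.5).astype(int)
default_sensitivity = default_pred[y_true == 1].mean()

# Lower the threshold until every positive case is detected.
threshold, sensitivity, specificity = threshold_for_sensitivity(
    y_true, y_score, target_sensitivity=1.0)
tuned_pred = (y_score >= threshold).astype(int)
```

Lowering the threshold raises sensitivity for the favored class at the cost of specificity, so the target sensitivity has to be chosen with the clinical cost of false positives in mind. In practice the threshold would be selected on validation data rather than on the testing set.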
The interpretation of the predicted results is complex, but may be helpful for understanding the underlying molecular mechanisms. In addition, the investigators selected contrast-enhanced (CE) MRI images from several typical cases for demonstration, in which the different expression levels of biomarkers exhibited different imaging characteristics (Figure 4). In the case with high S100 expression (Figure 4A), the tumor exhibited an obvious rosette enhancement, no enhancement of internal necrotic components, and a few edema zones around it, and was diagnosed as glioblastoma (WHO grade IV). In the case with low S100 expression (Figure 4B), the tumor mass effect was obvious, but there was no obvious enhancement and the surrounding edema was not obvious; this tumor was diagnosed as astrocytoma (WHO grade II). In these cases, S100 expression and glioma grade moved in the same direction, a positive correlation that is contrary to many observations. The study conducted by Wang et al. showed that S100 is expressed in most gliomas, and that it is an important inducer of CCL2 (19). CCL2 participates in the transport of tumor-associated macrophages (TAM) in gliomas, which affects angiogenesis, invasion, local tumor recurrence, and immunosuppression. This may explain the relationship between the degree of tumor enhancement and the expression of S100 in the present cases.
There are some limitations in our study. First, we only used conventional MRI sequences with a default set of tumor features extracted by PyRadiomics. Advanced MRI sequences (e.g., DWI, DKI, MRS, and ASL) can reflect the microstructure and metabolic information of tumors. In future studies, we will further investigate the molecular phenotype of gliomas using a multimodal magnetic resonance scheme. Second, we only selected three common pathologic biomarkers for gliomas from a wide range of biomarkers that are either currently available or under investigation. We have to develop an evaluation plan for other glioma biomarkers and find candidates that can benefit from radiomics applications. Third, the imbalanced classes did not reflect the incidence of glioma in the real world, where glioblastoma is the most common subtype and grade I glioma is relatively rare in adults. We used the SMOTE algorithm to balance the data by oversampling the minority class, but the differences in data distribution cannot be ignored. In our experiments, the AUC changed only slightly before and after the use of SMOTE. A larger dataset from multiple sites is expected to improve the predictive performance, and the resulting classifiers can be more accurate and stable. Fourth, after PCA reduced the feature dimensions, fewer features remained, but the new features were difficult to interpret. Combining hierarchical clustering with PCA may help us select features more efficiently. At the current stage, a real-world application is out of our scope, but further prospective assessment is warranted. Using the results we obtained as a reference, we will extend the study to identify the best classifier algorithm and the best set of features to simplify the classification tasks. Standardized computation methods would greatly enhance the reproducibility of radiomics studies, and may also lead to standardized software solutions available in clinical practice.
In conclusion, the machine-learning based radiomics application provided a non-invasive approach for the prediction of glioma grades and expression levels of multiple pathologic biomarkers, with favorable predictive accuracy and stability. The study also demonstrated the potential of radiomics for pathological assessment and individualized cancer treatment.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Ethics Committee of the Second Xiangya Hospital of Central South University. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.