Machine-Learning-Derived Nomogram Based on 3D Radiomic Features and Clinical Factors Predicts Progression-Free Survival in Lung Adenocarcinoma

Background To establish a machine-learning-derived nomogram based on radiomic features and clinical factors to predict post-surgical 2-year progression-free survival (PFS) in patients with lung adenocarcinoma. Methods Patients with >2 years post-surgical prognosis results of lung adenocarcinoma were included in Hospital-1 for model training (n = 100) and internal validation (n = 50), and in Hospital-2 for external testing (n = 50). A total of 1,672 radiomic features were extracted from 3D segmented CT images. The Rad-score was established using random survival forest by accumulating and weighting the top-20 imaging features contributive to PFS. A nomogram for predicting PFS was established, which comprised the Rad-score and clinical factors highly relevant to PFS. Results In the training, internal validation, and external test groups, 69/100 (69%), 37/50 (74%) and 36/50 (72%) patients were progression-free at two years, respectively. According to the Rad-score, the integral of area under the curve (iAUC) for discriminating high and low risk of progression was 0.92 (95%CI: 0.77-1.0), 0.70 (0.41-0.98) and 0.90 (0.65-1.0), respectively. The C-index of Rad-score was 0.781 and 0.860 in the training and external test groups, higher than 0.707 and 0.606 for TNM stage, respectively. The nomogram integrating Rad-score and clinical factors (lung nodule type, cM stage and histological type) achieved a C-index of 0.845 and 0.837 to predict 2-year PFS, respectively, significantly higher than by only radiomic features (all p < 0.01). Conclusion The nomogram comprising CT-derived radiomic features and risk factors showed a high performance in predicting post-surgical 2-year PFS of patients with lung adenocarcinoma, which may help personalize the treatment decisions.


INTRODUCTION
Lung cancer is the most common malignant tumor worldwide, accounting for 11.6% of all cancers, and 18.4% of all cancer deaths (1). Population-based screening improved the detection of early-stage lung cancer (2). TNM stage, as determined by medical imaging, has been widely considered an important predictor of prognosis and is used to guide therapeutic decision-making (3). However, even patients with the same TNM stage may have different prognoses due to tumor heterogeneity (4). In lung adenocarcinoma, the primary histological type of lung cancer, post-surgical outcomes vary among patients, and post-surgical recurrence is frequent because most cases have mixed subtypes (5), increasing the importance of a personalized post-surgical follow-up theme.
Machine learning algorithms have been developed to analyze the high-dimensional features of tumor images, which provide better specificity than naked-eye observation (6). Radiomics is a subfield of machine learning, in which interpretable quantitative features are extracted from medical images to characterize tumor heterogeneity. Tumor radiomics research has made some progress, such as predicting survival rate after chemotherapy ( 7) and immunotherapy (8), analyzing the tumor microenvironment (9), and promoting individualized treatment of patients by providing accurate and effective decision support (10). Previous studies have associated radiomic features with multiple clinical endpoints, such as survival and drug response, in patients with non-small cell lung cancer (NSCLC) (11)(12)(13). However, the concordance index of the radiomic models for predicting the survival in lung cancer was not optimal (11). There is no consensus whether radiomics-based methods can effectively predict the post-surgical prognosis of patients with lung adenocarcinoma.
Selection of appropriate follow-up plans after resection of lung adenocarcinoma requires accurate prediction of postsurgical progression-free survival (PFS) according to tumor characteristics and histological subtypes. The purpose of this study was to train and test a machine-learning-derived radiomics approach to predict the PFS of post-surgical patients with lung adenocarcinoma, and establish a nomogram based on radiomic score and multiple clinical PFS-related factors.

Study Sample
We searched the electronic health records of two medical centers. In the first center (Hospital-1, Shanghai General Hospital -North [city center]), 100 patients admitted from July 2017 to June 2018 were randomly selected as the training cohort, and 50 patients from July 2018 to December 2018 were randomly selected as the internal validation cohort. In the second center (Hospital-2, Shanghai General Hospital -South [Songjiang new city]), 50 patients from July 2016 to December 2018 were randomly selected for external testing. The inclusion criteria were: 1) patients whose lung tumor tissue was completely resected by surgical operation; 2) lung adenocarcinoma diagnosed based on hematoxylin-eosin gross pathological specimen and immunohistochemical staining; 3) ≥ 2-year follow-up to obtain PFS results (or shorter in case of earlier progression); 4) with pre-surgical thin-slice (<1mm) contrastenhanced CT scanning; 5) ≤ 2 months interval between CT scan and surgical resection.
The collected baseline data were age, gender, smoking status (non-smoker or smoker), 8 th edition cTNM stage, lung nodule type (solid or subsolid), histological subtype, and post-surgical treatment. The cTNM stage was determined by whole-body positron emission tomography (PET)-CT or whole-body CT for the head, neck, chest and abdomen, except the lower extremities. The histological subtype of adenocarcinoma was defined by the 2015 World Health Organization classification of lung cancer (14). The 150 patients in Hospital-1 underwent molecular testing for EGFR mutations (expression of exon-18, -19, -20, and -21) using the human gene mutation detection kit (Aide Biomedical Technology).
The study endpoint was PFS, defined as the period from tumor resection to tumor recurrence observed by PET-CT or whole-body CT, or death for any reason. The PFS results were obtained from patient medical records, including chest CT scans, clinical outcomes, or death records.
The institutional review boards approved this retrospective study and waived the requirement for patient informed consent in the two hospitals. Figure 1 shows the patient selection flow diagram. Figure 2 shows the study workflow.

CT Acquisition
Various CT scanners of different type were used in clinical practice, including three CT systems (Somatom Force, Siemens Healthineers; Revolution and HD750, GE Healthcare) in Hospital-1, and two CT systems (Somatom Flash, Siemens Healthineers; Revolution, GE Healthcare) in Hospital-2. All study patients underwent contrast-enhanced chest CT scanning after injection of 60-80 mL contrast-media (Iopamiro 300, Bracco) into the antecubital vein at 3-4 mL/sec. The image slice thickness was 0.6 mm or 0.625 mm. Supplementary Table 1 details the acquisition protocol and reconstruction parameters.

Tumor Segmentation and Radiomic Feature Extraction
A dedicated software package (Radiomics v1.2.3, Siemens Healthineers) running on a research platform (SyngoVia VB10, Research Frontier, Siemens Healthineers) was used to perform three-dimensional tumor segmentation and feature extraction. This package was developed based on the PyRadiomics library (http://www.radiomics.io) and subjected to the Image Biomarker Standardization Initiative (15). One radiologist with 6 years of experience in thoracic imaging semi-automatically segmented the tumor lesions on CT images by finding the lesion and clicking on it, blinded to the prognostic results. Then the software automatically extracted a total of 1,672 radiomic features for each lesion, including first-order, size and shape, and texture features. Supplementary Table 2 describes these features.

Feature Selection
Radiomic features with good stability and repeatability can be selected as candidates for further feature selection. To select these candidate features, 50 patients were randomly chosen from the training cohort. Two radiologists with 20 and 6 years of experience independently segmented the tumor lesions of these 50 patients. Spearman's rank correlation coefficients between the 1,672 features measured by the two radiologists were calculated to filter unstable features (16). The radiomic features with a Spearman's r > 0.8 were considered stable and reproducible, and were chosen to build the predictive model (17).

Model Establishment
A machine-learning method, the random survival forest was used to establish the R-model and generate a Rad-score. Random survival forest provides high predictive accuracy with nonlinear regression. It is a robust algorithm with an automated feature screening process suitable for right-censored correlated complex survival datasets compared with the commonly used least-absolute shrinkage and selection operator (LASSO)-Cox regression (18). Random survival forest is suitable for integrating highdimensional features like radiomics features for survival analysis and risk stratification. The random survival forest model was recently used to identify risk factors and generate radiomic signatures for different diseases (19). The robustness of the Rmodel was validated by a five-fold cross-validation approach for tuning the optimal hyperparameters. Feature importance was evaluated by the magnitude of the log-rank test statistic by permutation (20). Accordingly, the Rad-score (range from 0 to 1) was computed and generated by the random survival forest model to represent the average of expected number of events across all trees in the established forest (21). A higher Rad-score represented a higher risk of progression or shorter PFS.
A Kaplan-Meier curve with log-rank test and univariate Cox proportional-hazards model was applied to select candidate influential factors (clinical factors and EGFR mutation status) and establish a C-model. Factors with a p-value < 0.1 were included in the multivariate Cox proportional-hazards regression to establish the C-model. Prognosis prediction based on TNM staging was evaluated.
To establish the Combi-model and avoid multicollinearity and potential correlation between the Rad-score and clinical factors, a multivariate Cox proportional-hazards model with stepwise selection using minimum Akaike information criterion was implemented to integrate the Rad-score and high-relevance clinical risk factors.

Solid Tumor vs. Subsolid Tumor
Subgroup stratified analysis was implemented to verify that the radiomics model can independently predict prognosis in patients with solid and subsolid tumors. The patients in each group were classified by Rad-score as high or low risk of disease progression. The prognosis difference between risk levels was assessed in each group by Kaplan-Meier survival analysis with the log-rank test.

Statistical Analysis
Clinical characteristics between datasets were compared using independent sample t-test or Wilcoxon test for continuous variables depending on the normality test, whereas Chi-square or Fisher's exact test was used for categorical variables.
The R-score's predictive ability was evaluated by Harrell's concordance index (C-index) in the training, internal validation, and external test cohorts. The risk stratification capability was assessed by using the Kaplan-Meier survival curve and log-rank test. The optimal R-score cut-off values were determined by X-tile software (version 3.6.1, Yale University) (22) in the training cohort and applied to the internal validation and external test cohorts.
The proportional-hazards assumption was first verified by the Schoenfeld residuals test to evaluate model performance. Time-dependent receiver operating characteristic (ROC) curves, the integral of the area under the curves (iAUC) at various timepoints and C-index were generated and calculated in the training and internal validation cohorts to compare the prognosis discriminative ability of the three models (R-, Cand Combi-). The goodness-fit of all the three models was illustrated by the calibration curve by calculating the actual and predicted probabilities of PFS at 2 year. Decision curve analysis (DCA) was used to evaluate and compare the three models incorporating clinical net benefits. Finally, a nomogram of the Combi-model was built to visually represent the final predictive model making it convenient for clinicians to identify new patients.

Establishment and Assessment of the Radiomics Model
The feature stability analysis yielded 597 radiomic features out of 1,672 with a Spearman's r > 0.8 (Supplementary Table 2), that were considered stable features indicating good interobserver consistency. From these stable features, the 20 features highly contributive to PFS were selected to create an R-score using a random survival forest algorithm (Supplementary Figure 3). The optimal R-score cut-off value to discriminate high and low progression risk was set to 3.684 based on the training cohort determined by X-tile (22), and the patients were subsequently stratified as high or low risk for disease progression (Supplementary Figure 4). Kaplan-Meier analysis showed that this cut-off value could discriminate high-and low-risk patients in the training (p<0.001), internal validationt (p=0.0014), and external test (p=0.045) cohorts.
In the training cohort, the R-model reached an average AUC of 0.82 for discriminating patients with high-and low-risk for disease progression ( Figure 3A), and an iAUC of 0.78 (95%CI: 0.67 to 0.87) and 0.92 (0.77 to 1.0) at one and two years, respectively ( Figure 3B). In the internal validation cohort, Rmodel reached an average AUC of 0.77, and iAUC of 0.86 (0.74 to 0.99) and 0.70 (0.41 to 0.98) at one and two years, respectively ( Figure 3C). In the external test cohort, the R-model showed an iAUC of 0.92 (0.70 to 0.98) and 0.90 (0.69 to 1.0) at one and two years, respectively ( Figure 3D).
For the training and external test cohorts, the C-index was 0.781 and 0.778, respectively. All the C-indexes were higher than 0.7, indicating good predictive performance. The calibration curves showed visually good agreement between the predicted and actual labels in the training, internal validation and external test cohorts (Supplementary Figure 5).

Establishment and Assessment of Clinical Factor Model
Kaplan-Meier survival analysis showed that lung nodule type and cM stage were significantly associated with PFS. The PFS was significantly shorter for patients with solid tumors than those with subsolid tumors (p=0.002), and patients with distant metastasis than those without (p=0.011). Supplementary Figure 1 shows the corresponding Kaplan-Meier curves.
The univariate Cox regression analysis showed that sex, lung nodule type (subsolid or solid), cT, cN, and cM stage were significantly associated with PFS (all p<0.1) (Supplementary Table 3), so they were included for multivariate regression analysis to establish the C-model. The C-index based on TNM stage was 0.707 and 0.606 for the training and external test cohorts, respectively (Supplementary Figure 6).
The global Schoenfeld test (p=0.645) showed that the Cmodel met the proportional hazard assumption requirement (Supplementary Figure 7). Figure 4A shows the forest plots of the C-model. The calibration curves showed visually good agreement between the predicted and actual labels in the training and validation cohorts (Supplementary Figure 8). The Cindexes were 0.755 and 0.739 for the training and internal validation cohorts, respectively. In the multivariate regression analysis, lung nodule type (HR=4.23, p=0.009) and cM stage (HR=4.57, p<0.001) were significantly correlated with PFS, thus were included in the Combi-model.
Schoenfeld individual tests also showed the genetic factors model adhered to the proportional hazard assumption requirement (Supplementary Figure 9). The C-indexes of EGFR mutations were 0.612 and 0.528 for the training and validation cohorts, respectively. Because of these insignificant results, the genetic factor was not integrated into the Combi-model.

Establishment and Assessment of the Combined Model
The Combi-model and nomogram ( Figure 5A) were established based on Rad-score and the clinically relevant factors (cM stage, lung nodule type and histological type). A multivariate Cox regression model was used to evaluate the significance of Radscore and these clinical factors in predicting PFS. Forest plots ( Figure 4B) of the Combi-model illustrated that Rad-score (HR=0.034, p=0.001), cM stage (HR=3.822, p=0.002) and histological type (HR=0.191, p=0.002) were independent factors for PFS. Schoenfeld individual tests showed that the Combi-model fitted the proportional hazard assumption requirement (Supplementary Figure 10). The calibration curve showed that the calibration effect of the training and validation cohorts was very good (Supplementary Figure 11). The Cindexes of the Combi-model for predicting PFS in the training and external test cohorts were 0.845 and 0.837, respectively.
Finally, a nomogram combining Rad-score and clinical factors (cM stage, lung nodule type and histological type) was established ( Figure 5A). The calibration curves showed good consistency between the predicted PFS probability and the actual result at one and two years after surgery ( Figures 5B, C).

Comparison among the R-, C-, and Combi-models
The pairwise comparison of C-indexes among the R-, C-, and Combi-model is shown in Supplementary Table 4. The Combimodel performed better than the R-and C-models. In the training cohort, the C-index of the Combi-model was 0.845, higher than the R-(0.781) and C-models (0.755). In the external test cohort, the Cindex for the Combi-model was 0.837, significantly higher than the R-(0.778) and C-models (0.739) (all p<0.01). DCA was used to compare and visualize the clinical net benefits of the three models ( Figure 5D) and showed that the Combi-model gained more clinical net benefits than the other two models.

Solid Tumor vs. Subsolid Tumor
In the solid and subsolid tumor groups, Kaplan-Meier survival analysis showed that the Rad-score significantly discriminated the patients into high and low risk of progression (p=0.0045 and p<0.001, respectively) (Supplementary Figure 12).

DISCUSSION
In this study, we established machine-learning-derived radiomics models to predict PFS in patients with lung adenocarcinoma. The C-index of the radiomics model (Rad-score) reached 0.781 and 0.860 in the training and external test cohorts, respectively, higher than the TNM staging C-indexes of 0.707 and 0.606. The established nomogram combined Rad-score, cM stage, lung nodule type and histological subtype, and achieved a high Cindex of 0.845 and 0.837 in the training and external test cohorts, respectively. According to several previous reports, the C-index of radiomic models for predicting post-surgical prognosis of lung cancer was between 0.60 and 0.67, and increased to 0.72 when combined with clinical and genomic characteristics (12,23,24).
The predictive value of radiomics for the prognosis of lung cancer is controversial. Some studies showed that radiomics had great potential in evaluating the prognosis of lung cancer (25,26). Choe et al. reported that the radiomic features of contrastenhanced CT had a similar ability to assess prognosis in a clinicopathological model of patients with adenocarcinoma after pneumonectomy (27). Wang et al. showed that radiomic features could successfully stratify patients into low-and highrisk groups of post-surgical recurrence (28). In contrast, Li et al. showed no significant correlation between radiomic features and PFS (29), and no additional benefit for adding clinical factors to the prediction model. Botta et al. found that the combined radiomic and clinical factor model did not improve the performance when compared with a stand-alone radiomic or clinical factor model (30). Although researchers have extracted many radiomic features in oncology imaging (27,29,31), the clinical relevance of radiomic features to predict lung cancer survival remains a major question. This study confirmed the predictive value and achieved a high C-index of 0.860 in the external test cohort. One strength of this study is the availability of an external test cohort. An independent and external test cohort contributes to evaluate the generalizability of predictive models. Some methodological issues have been raised in prior studies on radiomics, including the heterogeneity in quantitative data dependent on different ac quisition p rotocols and reconstruction parameters (32). It has been widely accepted that a comprehensive evaluation is necessary for a clinical biomarker, i.e., internal and external validation (33). Many researchers, such as Choe et al., believed that their risk prediction model needed further external verification to confirm its value for survival stratification (27). Another strength is the 3D lesion segmentation with 1,672 radiomic features. Most previous radiomics-based studies only performed 2D lesion segmentation. In our study, 3D lesion segmentation was the basis for numerous high-dimensional radiomic features, that maximize the potential information underlying the images and thus improved model performance.
The 2-year PFS used in this study is a relevant and widelyapplied survival threshold for patients with lung adenocarcinoma (34,35). Hosny et al. evaluated the application of deep learning in predicting 2-year overall survival rate of NSCLC patients based on CT data (36), and suggested the prognosis-predicting advantage for post-surgical patients. There were other studies investigating 2-   (38). They reported the median time to recurrence was 17.5 months (range: 1.3-75.3 months). Thus, the investigation of 2-year PFS is clinically suitable for determining tumor recurrence, or progression-free survival for resected lung cancer.
The prediction of patient survival based on CT images is commonly used in clinical trials and medical practice. A wellestablished and -validated prognostic model based on radiomics can make the prediction more objective and accurate than traditional qualitative and quantitative methods (12). Some studies have shown that the radiomics method was more predictive than conventional clinical measurements to noninvasively describe tumor phenotypes (11,39). Thus, the clinical usage of imaging biomarkers may benefit the clinical practice because they are noninvasive, reproducible, low-cost, and do not require human input (40). At present, there is no consensus on the strategy for post-surgical follow-up interval or follow-up examination method. For example, the National Comprehensive Cancer Network guideline recommends that for patients with stage I-II NSCLC, a medical history collection, chest CT and physical examination should be conducted every 6 months for 2 to 3 years after surgery, and then once a year for history collection, chest CT and physical examination (41). However, the phase III randomized controlled clinical trial (IFCT-0302) was different (42). In the first two years, the patients were followed up every six months, then yearly until the fifth year. In each follow-up examination, patients underwent clinical examination, chest and abdominal CT scanning and bronchoscopy examination. Watanabe et al. studied the postsurgical recurrence pattern of NSCLC (18), and found that the peak of risk curve was at 6-8 months after surgery. The next significant peak occurred at the end of the second year. For women, the peak occurred at 22-24 months after surgery, which was about 16 months later than that for men. Their results suggested that post-surgical recurrence time of lung cancer is variable. Therefore, developing an individualized follow-up strategy for the patients with lung adenocarcinoma is essential in clinical practice. A more frequent follow-up schedule is necessary for patients with high post-surgical progression risk. We investigated the prognosis of solid and subsolid tumors (pure ground glass or part-solid attenuation), and found that subsolid tumors were associated with high PFS. Wang et al. reported that the overall 5-year survival of patients with stage I lung cancer was 83%, most of which were subsolid tumors (28). Although the image features differ between solid and subsolid tumors, in this study, regardless of tumor density type, the radiomics model demonstrated its ability to discriminate patients into high and low progression risk group (p<0.01). One reason for the model's success is that the model selected the most relevant image features to predict PFS for both kinds of tumors, thus improving its generalizability.
We did not observe a correlation between EGFR mutations and PFS, so the genetic factor was not included in the Combi-model. Song et al. used radiomics to predict PFS in patients with stage IV EGFR mutant NSCLC, who received tyrosine kinase inhibitors therapy (7). The predictive value of EGFR mutation for survival prediction remains controversial (43). Some authors have found that EGFR mutation is a significant predictor of overall survival and relapse-free survival in surgically resected adenocarcinoma (44,45), while others showed that EGFR mutation was insignificant for postsurgical survival (46). The differences between these studies may be due to the heterogeneous nature of lung cancer and the differences between study populations.
There were several limitations to this study. First, although the clinical data were from two large medical centers, more external test cohorts from different hospitals and regions are necessary to improve the robustness of these models. Second, the sample size of this study is not large. Although an external test helps to enhance the generalizability of the conclusions of this study, increasing the sample size would strengthen the robustness of models. Third, we only investigated patients with adenocarcinoma. Other types of lung cancer, such as squamous cell carcinoma, would show different tumor behaviors, which needs further investigation. Fourth, the study samples of this study were collected from the real-world clinical practice, in which a variety of postoperative radiotherapy, chemotherapy and targeted therapy were used. In order to evaluate the value of radiomics in predicting the prognosis of different treatment regimens, it is necessary to conduct well-controlled studies on specific treatment methods.

CONCLUSION
The machine-learning-derived radiomics model surpassed only TNM stage in predicting post-surgical PFS in patients with lung adenocarcinoma. The established nomogram comprising CTderived radiomic features and clinical risk factors demonstrated high performance in predicting post-surgical PFS, which was verified in the external test cohort. Our study shows that the high-dimensional information underlying CT images can be used to evaluate the post-surgical prognosis of lung adenocarcinoma, which may help clinicians to identify highrisk patients, make individualized treatment decisions, plan appropriate adjuvant therapeutic strategies, and personalize the follow-up scheme.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by IRB of Shanghai General Hospital. The ethics committee waived the requirement of written informed consent for participation. Written informed consent was not obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.

AUTHOR CONTRIBUTIONS
Conception and design: XX. Administrative support: XX. Provision of study materials or patients: GL, ZX, and XX. Collection and assembly of data: GL, ZX, and XX. Data analysis and interpretation: GL, ZX, and XX. All authors contributed to the article and approved the submitted version.