Development and validation of a web-based predictive model for preoperative diagnosis of localized colorectal cancer and colorectal adenoma

Background Localized colorectal cancer (LCC) has obscure clinical signs, which make it difficult to distinguish from colorectal adenoma (CA). This study aimed to develop and validate a web-based predictive model for preoperative diagnosis of LCC and CA. Methods We conducted a retrospective study that included data from 500 patients with LCC and 980 patients with CA who were admitted to Dongyang People’s Hospital between November 2012 and June 2022. Patients were randomly divided into the training (n=1036) and validation (n=444) cohorts. Univariate logistic regression, least absolute shrinkage and selection operator regression, and multivariate logistic regression were used to select the variables for the predictive model. The area under the curve (AUC), calibration curve, decision curve analysis (DCA), and clinical impact curve (CIC) were used to evaluate the performance of the model. Results The web-based predictive model was developed, including nine independent risk factors: age, sex, drinking history, white blood cell count, lymphocyte count, red blood cell distribution width, albumin, carcinoembryonic antigen, and fecal occult blood test. The AUC of the prediction model in the training and validation cohorts was 0.910 (0.892–0.929) and 0.894 (0.862–0.925), respectively. The calibration curve showed good consistency between the outcome predicted by the model and the actual diagnosis. DCA and CIC showed that the predictive model had good clinical application value. Conclusion This study is the first to develop a web-based preoperative prediction model that can discriminate LCC from CA and can be used to quantitatively assess the risks and benefits in clinical practice.


Introduction
Colorectal cancer (CRC) is the second leading cause of cancer-related death (1). However, for patients with localized colorectal cancer (LCC), the 5-year survival rate after timely surgical treatment can reach 90% (2). Generally, endoscopy predicts potential malignant tumors based on the size and shape of colorectal tumors and ultimately guides tumor treatment (3). LCC and colorectal adenoma (CA) are local lesions that require different treatment approaches. Patients with LCC should undergo laparoscopic or open surgery as soon as it is practicable (4) and may require chemotherapy and adjuvant radiation before surgery (5). In contrast, patients with CA can be treated using selective endoscopic removal based on their preference (6). However, LCC has obscure clinical signs and is difficult to distinguish from CA; differentiation depends on biopsy and pathological evaluation (7).
Pathological diagnosis is the gold standard for differentiating between benign and malignant colorectal tumors. However, an endoscopic biopsy is an invasive examination that can lead to complications, such as bleeding, perforation, and infection; thus, its use is limited by the patient's willingness and compliance (8). Due to the advancements in molecular diagnostic technology, DNA (9) and microRNA (10) are now being used as CRC biomarkers. However, they are expensive and have unstable diagnostic performance, which limits their use in clinical settings. Therefore, fecal occult blood test (FOBT) (11) and serum carcinoembryonic antigen (CEA) (12) detection are preferred as CRC biomarkers because of their accessibility and affordability. However, for CRC, particularly in the early stage, a single detection biomarker has limited sensitivity and a high probability of misdiagnosis. To select the appropriate treatment method and reduce the rate of misdiagnosis of LCC, it is crucial to create a diagnostic prediction model with excellent diagnostic performance using readily accessible and affordable markers.
Previous studies have proposed many potential predictive biomarkers for CRC preoperative diagnosis models, such as platelet-related parameters (13), red blood cell distribution width (RDW) (14), and hemoglobin (15). However, some of those studies had a small sample size (16) and almost all of them included patients with advanced CRC (17, 18). To our knowledge, patients with advanced CRC present with more pronounced clinical symptoms; as such, less sensitive indicators have an exaggerated role in the prediction model, resulting in reduced diagnostic performance when the model is applied to LCC.
In this single-center retrospective study, we used clinical and laboratory data of patients from 2012 to 2022 to develop and validate the first web-based predictive model for preoperative diagnosis of LCC and CA.

Materials and methods
The design and reporting of this study were guided by transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) (19).

Study population
Demographic and clinical data of 2070 patients were collected from the medical records at the Dongyang People's Hospital between November 2012 and June 2022. Adult patients (≥ 18 years) with histopathologically confirmed CRC or CA were eligible for inclusion. The exclusion criteria were (1) preoperative anti-tumor treatment, (2) advanced CRC (Tumor-Node-Metastasis [TNM] stage III-IV) according to the eighth edition of the American Joint Committee on Cancer TNM staging system, (3) other primary malignant tumors, and (4) a clinical data loss rate >10%. According to the inclusion and exclusion criteria, 500 patients with LCC and 980 patients with CA were finally included in this study (Figure 1).
The study was approved by the ethics committee of Dongyang People's Hospital. The data analysis was anonymous. Informed consent was waived since this is a retrospective study and the diagnosis as well as treatment of the patients were not affected.

Data collection
Preoperative data of eligible patients were extracted, including age, sex, routine blood parameters, FOBT, serum albumin, serum glucose, and tumor biomarkers. Routine blood parameters included white blood cell (WBC), neutrophil, and lymphocyte counts, as well as hemoglobin, RDW, platelet count, and mean platelet volume. Selected tumor biomarkers were CEA and carbohydrate antigen 19-9. FOBT-positive patients were defined as those with a positive immunochemical test and guaiac-based test ≥ 1+. When the proportion of missing values in a variable was less than 10%, the missing data were filled using multiple imputation, performed separately in each of the two cohorts (20). Outliers in continuous variables, defined as values outside the 1st-99th percentile range, were winsorized at 1% on both sides (21); outliers in categorical variables were identified by manual review. In addition, data on smoking, drinking, diabetes, and hypertension statuses of the eligible patients were collected. A history of smoking or drinking was noted if smoking or the use of alcohol was reported in a patient's medical record. Data from the training cohort were used for the development of the prediction model. Univariable logistic regression analysis was used preliminarily to screen candidate variables. The least absolute shrinkage and selection operator (LASSO) regression analysis was then used to remove collinear independent variables to prevent overfitting (22). Subsequently, indicators with non-zero coefficients were included in a multivariable logistic regression to complete the final selection of variables. Variables with statistical significance (P<0.05) were included in the prediction model. The prediction probability of the model was calculated using the following formula:
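For a multivariable logistic regression, the predicted probability takes the standard form P = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk))), where b0 is the intercept and b1..bk are the coefficients of the selected variables. The following Python sketch evaluates that general form only. The coefficient values, the intercept, and the three-variable patient record are hypothetical placeholders chosen for illustration; they are not the fitted values of the nine-variable model in this study.

```python
import math


def predicted_probability(intercept, coefficients, values):
    """Logistic probability: 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    linear_predictor = intercept + sum(
        coefficients[name] * values[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical placeholder coefficients, NOT the fitted values from the study.
example_intercept = -6.0
example_coefficients = {"age": 0.05, "cea": 0.30, "fobt_positive": 1.50}

# Hypothetical patient: 65 years old, CEA of 4.0, positive FOBT (coded as 1).
example_patient = {"age": 65, "cea": 4.0, "fobt_positive": 1}

p = predicted_probability(example_intercept, example_coefficients, example_patient)
print(f"Predicted LCC probability: {p:.3f}")
```

With the fitted intercept and coefficients substituted, the same function would reproduce the probability reported by the web interface for a given set of inputs.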

Statistical analysis
A web-based prediction model was developed based on the "DynNom" and "rsconnect" packages in R software (23). The prediction probability is calculated automatically after the value of each variable is entered.
The performance of the model in the training and validation cohorts was assessed. After data were normalized, the discriminative ability was expressed using the area under the receiver operating characteristic (ROC) curve (AUC). The Youden index was used to determine the best cut-off value, and the corresponding sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (PLR), and negative likelihood ratio (NLR) were calculated. To examine the consistency between the actual risk of LCC and the probability predicted using the new model, calibration curves were plotted and the Hosmer-Lemeshow goodness of fit test was performed. Decision curve analysis (DCA) and the clinical impact curve (CIC) were used to demonstrate the clinical utility of the prediction model (24).
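The cut-off selection and the derived metrics can be illustrated with a short Python sketch. It assumes that a predicted probability at or above the cut-off is called LCC, tries every observed probability as a candidate cut-off, and keeps the one that maximizes the Youden index (sensitivity + specificity - 1). The ten labelled probabilities are a hypothetical toy data set, not study data, and the function names are illustrative.

```python
def confusion_counts(labels, probabilities, cutoff):
    """Count TP, FP, TN, FN when probability >= cutoff is called positive (LCC)."""
    tp = fp = tn = fn = 0
    for label, probability in zip(labels, probabilities):
        if probability >= cutoff:
            if label == 1:
                tp += 1
            else:
                fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def diagnostic_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy, PPV, NPV, PLR, NLR, and Youden index."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
        "plr": sensitivity / (1 - specificity) if specificity < 1 else float("inf"),
        "nlr": (1 - sensitivity) / specificity if specificity > 0 else float("inf"),
        "youden": sensitivity + specificity - 1,
    }


def best_youden_cutoff(labels, probabilities):
    """Return the observed probability that maximizes the Youden index."""
    best_cutoff, best_index = None, -1.0
    for cutoff in sorted(set(probabilities)):
        counts = confusion_counts(labels, probabilities, cutoff)
        index = diagnostic_metrics(*counts)["youden"]
        if index > best_index:
            best_cutoff, best_index = cutoff, index
    return best_cutoff, best_index


# Hypothetical toy data (1 = LCC, 0 = CA); not taken from the study.
toy_labels = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
toy_probabilities = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.60, 0.70, 0.90]

cutoff, youden = best_youden_cutoff(toy_labels, toy_probabilities)
metrics = diagnostic_metrics(*confusion_counts(toy_labels, toy_probabilities, cutoff))
print(cutoff, round(youden, 3), round(metrics["ppv"], 3))
```

In practice the ROC analysis is usually delegated to a statistics package; the explicit loop is kept here only to make the definition of each metric visible.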

Patient characteristics
Demographic and clinical data were extracted from medical records of 2070 patients between November 2012 and June 2022. After data were excluded based on our criteria, records of 1480 patients (LCC: n=500 and CA: n=980) were finally included in this study (Figure 1). Among LCC patients, five (1.0%), 161 (32.2%), and 334 (66.8%) patients had TNM stage 0, I, and II disease, respectively. The patients were then randomly divided into the training (n=1036) and validation (n=444) cohorts at a ratio of 7:3. There was no statistical difference between the training and validation cohorts in each variable (Table 1).
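The cohort sizes follow directly from the 7:3 ratio, since 0.7 × 1480 = 1036. A minimal Python sketch of such a random split is given below; the seed, the integer patient identifiers, and the function name are assumptions made for illustration, and the study does not report how its randomization was implemented.

```python
import random


def split_cohort(patient_ids, training_fraction=0.7, seed=0):
    """Shuffle patient identifiers and split them into training and validation sets."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split reproducible
    rng.shuffle(ids)
    n_training = round(len(ids) * training_fraction)
    return ids[:n_training], ids[n_training:]


# 1480 included patients, represented here by hypothetical integer identifiers.
training, validation = split_cohort(range(1480))
print(len(training), len(validation))  # prints "1036 444"
```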

Variable selection and development of a web-based predictive model
As shown in Table 2, in the training cohort, there were significant differences in 16 clinical parameters between LCC and CA patients.
FIGURE 1 Flow chart of the study.
The web-based dynamic prediction model developed using the selected variables can be used through the following link: https://ly11219.shinyapps.io/dynnomapp/. The interface of this webpage is shown in Figure 2. Figure 2A displays the input interface of the nomogram, in which users can adjust the value of each item. Figure 2B shows a graphical summary of the LCC probability and 95% confidence interval (CI) predicted for the three patients according to the nomogram. The page also provides a numerical summary, shown in Figure 2C.

Evaluation of the performance of the prediction model
The AUC in the training and validation cohorts was 0.910 (0.892-0.929) and 0.894 (0.862-0.925), respectively (Figure 3A). There was no significant difference in the diagnostic performance of the prediction model between the two cohorts (P=0.379). The optimal cut-off value of the probability of the prediction model was 26.41%. The sensitivity, specificity, PPV, NPV, PLR, and NLR of the model are presented in Table 4. The results of the Hosmer-Lemeshow goodness of fit test showed good consistency between the outcome predicted by the model and the actual diagnosis, which was reflected in both the training and validation cohorts (Figures 3B, C). When drawing the DCA to reflect the advantages of the new model, we added a comparison between the new model and the two common CRC screening indicators, CEA and FOBT. In the training and validation cohorts, the threshold probability was between 0.05 and 1.00. The performance of the prediction model was better than that of the CEA, FOBT, and two extreme cases (treat-none and treat-all) as shown in Figures 4A, B. For example, when the risk threshold was set to 0.3 (i.e., a patient with an LCC probability >30% would receive further treatment), the net benefit of the new prediction model in the training cohort was 0.25, which was higher than that of the FOBT (0.20), CEA (0.10), treat-all (0.05), and treat-none (0.00). The CIC is shown in Figures 4C, D. For instance, when the risk threshold was set to 0.4, almost 400 out of every 1000 persons in the training cohort were deemed at high risk, and approximately 350 of them were diagnosed with LCC.
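The net benefit values compared in the DCA follow the standard definition: net benefit = TP/n - (FP/n) × pt/(1 - pt), where pt is the risk threshold. The Python sketch below applies that definition to the two reference strategies. It assumes that the LCC fraction in the training cohort is close to the overall fraction of 500/1480, which the text does not state explicitly; under that assumption the treat-all net benefit at a threshold of 0.3 is about 0.054, in line with the reported value of 0.05. The counts used for the model itself are hypothetical and serve only to show the calculation.

```python
def net_benefit(true_positives, false_positives, n, threshold):
    """Net benefit = TP/n - (FP/n) * pt / (1 - pt) at risk threshold pt."""
    weight = threshold / (1 - threshold)
    return true_positives / n - (false_positives / n) * weight


def treat_all_net_benefit(prevalence, threshold):
    """Treat-all strategy: every case is a true positive, every non-case a false positive."""
    return net_benefit(prevalence, 1 - prevalence, 1, threshold)


threshold = 0.3

# Assumption: training-cohort prevalence approximates the overall 500 of 1480.
prevalence = 500 / 1480
nb_all = treat_all_net_benefit(prevalence, threshold)

# Treat-none flags nobody, so there are no true or false positives.
nb_none = net_benefit(0, 0, 1480, threshold)

# Hypothetical model counts per 1000 patients; NOT taken from the study.
nb_model = net_benefit(300, 100, 1000, threshold)

print(round(nb_all, 3), nb_none, round(nb_model, 3))
```

A model is clinically useful at a given threshold only if its net benefit exceeds both reference strategies, which is the comparison made in Figures 4A, B.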

Discussion
CA is a benign tumor in the colorectal region, of which only 5% will eventually advance to CRC, and the overall tumor progression is slow (25). Preoperative differentiation between LCC and CA helps to reduce unnecessary treatment and promote the early detection of CRC. In most cases, the differentiation between LCC and CA patients depends on invasive colonoscopy (26). In this study, we found that age, sex, drinking history, WBC, lymphocyte count, RDW, albumin, CEA, and FOBT were independent predictors of LCC in patients, and successfully developed a web-based prediction model. Based on the evaluation of calibration and validation, we believe that our prediction model has good discrimination performance and clinical application value.
To predict LCC and CA, a wide range of variables were considered in this model when selecting preoperative markers. Age and sex play a significant role in the diagnosis of many tumors, including CRC (27) and lung cancer (28). In this study, being female and older were considered risk factors for LCC. Previous studies reported smoking and drinking as risk factors for CRC, and these were associated with a poor prognosis (29-32). The current study also showed that individuals with a drinking history were at a higher risk of being diagnosed with LCC, but there was no association between LCC and smoking. Routine blood cell parameters are often used as inflammatory markers to reflect the patient's inflammatory immune status. Various studies have shown that higher neutrophil-lymphocyte and platelet-lymphocyte ratios or a high systemic immune inflammatory index (platelet count × neutrophil count/lymphocyte count) were associated with a higher tumor stage, worse differentiation level, and worse prognosis of CRC (33-36). Our prediction model used single blood cell parameters, such as WBC, lymphocyte count, or RDW, instead of the ratio between parameters, which reduces the steps of numerical conversion and simplifies the calculation process. Serum albumin not only reflects the nutritional status of patients but also has a negative correlation with the inflammatory reaction in vivo. It is an independent predictor of the prognosis of CRC (37). In this study, patients with lower albumin levels had a higher probability of LCC diagnosis. CEA and FOBT are currently widely used non-invasive markers for screening CRC (12, 38). Both played a significant role in this prediction model, which had better diagnostic performance than either marker used individually.
Previous studies attempted to use molecular detection for early diagnosis of CRC; however, this is expensive and not comparable to colonoscopy or fecal immunochemical tests (39-41). Some studies have used markers of the systemic inflammatory response as diagnostic tools for CRC (18, 42). However, the models from these studies have limited diagnostic performance in individuals with early CRC because a large number of patients with advanced CRC were included. Additionally, the static nomograph model requires the manual calculation of the prediction probability corresponding to the total score, which is less intuitive. This study included nine easily available and inexpensive preoperative variables and developed a web-based prediction model. As a preoperative prediction tool, the model is not only easy to popularize and use, but also has a high prediction accuracy and good discrimination characteristics.
Our study has some limitations. First, since this is a single-center retrospective study, it is necessary to validate our model with data from external prospective studies. Second, neither the diagnostic information of patients with advanced CRC nor the prognosis was included in this study. The model is therefore mainly a tool for early screening of LCC and CA. Third, other indicators with potential predictive value, such as gene expression and coagulation markers, were not included due to restrictions posed by retrospective data as well as the feasibility of sample collection and costs. To validate the accuracy of our findings, a multicenter prospective investigation is required.

Conclusion
A web-based preoperative prediction model incorporating nine preoperative variables was developed. The model can directly and quantitatively assess the risks and benefits in clinical practice and has strong performance in recognizing LCC and CA.

Statistical analyses were performed using Stata 14.0 (Stata Corp LP, College Station, TX, USA) and R software version 4.1.0. Continuous variables are expressed as means and standard deviations or medians and interquartile ranges as appropriate. The Wilcoxon rank-sum test or Student's t-test was used to assess between-group differences. Categorical variables were compared using the chi-square test and are presented as counts (percentages).

FIGURE 2 Web interface for distinguishing localized colorectal cancer (LCC) from colorectal adenoma (CA). (A) The area where the value of each item can be adjusted by the user. (B) Graphical summary of the LCC probability and 95% confidence interval (CI) predicted by the prediction model. (C) Numerical summary of LCC probability and 95% CI.

FIGURE 3 Evaluation of the performance of the prediction model to distinguish localized colorectal cancer (LCC) and colorectal adenoma (CA). (A) The receiver operating characteristic (ROC) curve of the prediction model in the training and validation cohorts. (B) The calibration curve of the prediction model in the training cohort. (C) The calibration curve of the prediction model in the validation cohort.

TABLE 1
Preoperative baseline characteristics of patients in the training and validation cohorts.

TABLE 2
Univariate logistic analysis and LASSO regression analysis in the training cohort.
CI, confidence interval; OR, odds ratio; LASSO, least absolute shrinkage and selection operator.

TABLE 3
Multivariate logistic analysis in the training cohort.

CI, confidence interval; OR, odds ratio.

TABLE 4
Predictive performance of the models used to estimate the risk of LCC.