A multi-variable predictive warning model for cervical cancer using clinical and SNPs data

Introduction Cervical cancer is the fourth most common cancer among women worldwide. Early detection and intervention are essential. This study aims to construct an early predictive warning model for cervical cancer and precancerous lesions using clinical data and single nucleotide polymorphisms (SNPs). Methods Clinical data and germline SNPs were collected from 472 participants. Univariate logistic regression, least absolute shrinkage and selection operator (LASSO) regression, and stepwise regression were performed to screen variables. Logistic regression (LR), support vector machine (SVM), random forest (RF), decision tree (DT), extreme gradient boosting (XGBoost) and neural network (NN) were applied to establish models. The receiver operating characteristic (ROC) curve was used to compare the models' performance. The clinical utility of the models was assessed using decision curve analysis (DCA). Results The LR model, which included 6 SNPs and 2 clinical variables as independent risk factors for cervical carcinogenesis, was ultimately chosen as the optimal model. The DCA showed that the LR model had good clinical applicability. Discussion The predictive model estimates cervical cancer risk from clinical and SNP data, aiding in planning timely interventions. It provides a transparent tool for refining clinical decisions in cervical cancer management.


Introduction
In terms of incidence (6.5%) and mortality (7.7%), cervical cancer is the fourth most common cancer in women worldwide (1). The incidence (5.2%) and mortality (5.3%) of cervical cancer in China are much higher than in developed countries (2). Cervical cancer develops through a complicated interaction between elements influencing the virus's carcinogenic potential and host characteristics associated with susceptibility to chronic infection and tumor formation (3). Although persistent infection with high-risk types of human papillomavirus (hrHPV) plays a critical role in the development of cervical cancer, this alone cannot explain the malignancy (4). In addition to factors that may affect human papillomavirus (HPV) infection, such as high-risk sex, sexually transmitted diseases, preterm births, multiple births, use of oral hormonal contraceptives and smoking (5,6), numerous studies have demonstrated an association between cervical carcinogenesis and single nucleotide polymorphisms (SNPs), genetic factors that may affect gene expression or protein function. The effectiveness of the immune response to HPV antigens may be altered by genetic variations in human leukocyte antigen (HLA) molecules, which may retard the progression of cervical cancer (7). For instance, the HLA class II DRB1*1302 allele protects against the advancement of low-grade squamous intraepithelial lesion (LSIL) into grade 3 cervical intraepithelial neoplasia (CIN3) (8). Furthermore, cervical cancer has been associated with specific SNPs crucial for DNA repair, apoptosis, and cell metabolism (9-11).
In addition to health education and HPV vaccination as primary prevention measures, the clinical screening and diagnosis of cervical cancer is primarily based on the three-step procedure (hrHPV test, Papanicolaou test, and colposcopy). Regular hrHPV and Papanicolaou tests are recommended for early detection of cervical lesions in at-risk populations, and those with abnormal test results are referred to colposcopy to receive timely and appropriate treatment. To increase the risk awareness of individuals in the preclinical stage of cervical cancer, we developed predictive warning models for the clinical diagnosis of cervical cancer and precancerous lesions that combine clinical and mutational features. We subsequently validated and evaluated these models, considering them as potential risk indicators and supplementary diagnostic tools. This approach may contribute to the development of a cost-effective screening test for early cervical cancer detection, ultimately benefiting public health.

Study population
The 474 subjects were recruited from the cervical specialist outpatient and inpatient departments of the Department of Obstetrics and Gynecology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology. There were 211 subjects in the patient group and 263 subjects in the control group. Inclusion criteria for the patient group were as follows: (i) age range of 18-75 years, (ii) Han Chinese ethnicity, (iii) no prior surgery, radiotherapy, chemotherapy, or other related treatment for cervical cancer or precancerous lesions, (iv) no previous personal history of other tumors, (v) pathology confirmed as CIN2, CIN3 or squamous carcinoma. All subjects in the patient group had their postoperative pathological results tracked, and the highest-grade lesion was used as the final diagnosis. Inclusion criteria for the control group were as follows: (i) age range of 18-75 years, (ii) Han Chinese ethnicity, (iii) no prior history of cervical precancerous lesions or cancer, (iv) no previous family history of other tumors, (v) confirmed normal cervical findings through HPV and cytology screening at our hospital, or via cervical biopsy reviewed by two or more pathologists at our institution or through external biopsy review by a panel of two or more pathologists at our hospital. Exclusion criteria were as follows: subjects who did not meet the aforementioned inclusion criteria, and subjects who were pregnant, lactating or not sexually active.

Data collection and sample collection
Information on education level, history of cervical surgery, HPV infection, delivery number, menarche, menopause, dysmenorrhea, history of sex life, and demographic characteristics was obtained for all participants through an on-site questionnaire survey. Some questions were not answered because of patients' privacy considerations; these missing values are addressed later in the data preprocessing. Four milliliters of peripheral venous blood were collected from each participant. Genomic DNA was extracted from the buffy coat of the blood samples using the QIAamp DNA Mini Kit (Cat. 51306, QIAGEN). The selection of SNPs was based on preliminary experimental results and previously published literature (12). Following target region sequencing, the final set of 59 SNPs was determined using criteria such as a trend-test p-value <0.1 and the exclusion of linked loci (Supplementary Table S1). To construct the library for sequencing, the genomic DNA of each sample was randomly sheared into fragments (around 250 bp to 300 bp) and adaptors were ligated to both ends of these fragments. After purification, the library was amplified using LM-PCR and hybridized with the SureSelect Biotinylated RNA Library (BAITS) for fragment enrichment. These enriched fragments were amplified again by LM-PCR. After the quality check, the DNA library was sequenced using the Illumina HiSeq 2000 platform.
Data quality control included the removal of adaptor-contaminated reads and low-quality reads (defined as N proportion ≥ 10% or ≥ 50% of bases with Q ≤ 5). The clean data were required to have Q20 > 90% and Q30 > 85%.
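The low-quality read criteria above can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline used in the study: the function names `passes_qc` and `q_fraction` and the representation of a read as a base string plus a list of Phred scores are assumptions, and adaptor removal is not covered.

```python
def passes_qc(bases, quals, max_n_frac=0.10, low_q=5, max_low_q_frac=0.50):
    """Return True if a read survives both low-quality filters.

    A read is discarded when its proportion of N bases is >= 10%
    or when >= 50% of its bases have a quality score Q <= 5.
    """
    n = len(bases)
    if n == 0:
        return False
    n_frac = bases.upper().count("N") / n
    low_frac = sum(q <= low_q for q in quals) / n
    return n_frac < max_n_frac and low_frac < max_low_q_frac


def q_fraction(quals_per_read, threshold):
    """Fraction of all bases with quality >= threshold (e.g. Q20 or Q30)."""
    all_q = [q for quals in quals_per_read for q in quals]
    return sum(q >= threshold for q in all_q) / len(all_q)
```

With clean reads in hand, `q_fraction(reads, 20) > 0.90` and `q_fraction(reads, 30) > 0.85` would express the Q20 and Q30 requirements stated above.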

Data preprocessing
Among the clinical data acquired from the on-site questionnaire, variables with missing data rates of >15% and samples with missing data rates of >50% were removed. We imputed the remaining missing data using the k-nearest neighbor (k-NN) imputation algorithm. SNPs with a mutation rate below 3% or above 97% were excluded. By combining the clinical information with the SNPs, we obtained a final dataset of 9 clinical factors and 47 mutation variables for 464 samples.
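The filtering thresholds described above can be illustrated with a short Python sketch. Several points are assumptions made for illustration: the data layout (a list of dictionaries with `None` marking a missing answer), the order of operations (variables filtered before samples), the definition of mutation rate as the fraction of samples carrying the variant, and the function names. The k-NN imputation step is not reproduced here.

```python
def missing_rate(values):
    """Fraction of entries that are missing (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def filter_clinical(rows, var_max=0.15, sample_max=0.50):
    """Drop variables with >15% missing, then samples with >50% missing.

    rows: list of dicts mapping variable name -> value or None.
    """
    variables = list(rows[0].keys())
    kept_vars = [v for v in variables
                 if missing_rate([r[v] for r in rows]) <= var_max]
    kept_rows = [{v: r[v] for v in kept_vars} for r in rows
                 if missing_rate([r[v] for v in kept_vars]) <= sample_max]
    return kept_vars, kept_rows


def filter_snps(genotypes, low=0.03, high=0.97):
    """Keep SNPs whose mutation rate lies within [3%, 97%].

    genotypes: dict of SNP id -> list of 0/1 flags (1 = variant carried).
    """
    kept = {}
    for snp, calls in genotypes.items():
        rate = sum(calls) / len(calls)
        if low <= rate <= high:
            kept[snp] = calls
    return kept
```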

Model building and statistical analysis
Figure 1 displays the comprehensive flowchart. After running each variable through univariate logistic regression, the variables with p-value <0.05 were selected for subsequent analysis. The least absolute shrinkage and selection operator (LASSO) regression with ten-fold cross-validation was applied for dimensionality reduction. After further selection through backward stepwise regression, the retained features were modeled using logistic regression (LR), support vector machines (SVM), random forests (RF), decision trees (DT), eXtreme Gradient Boosting (XGBoost) and neural network (NN). The initial cohort of 464 samples was randomly split at a ratio of 7:3 into a training set of 325 samples and a test set of 139 samples. The training set underwent internal 10-fold cross-validation, with nine folds used for model building and one fold reserved for validation. Performance metrics for both the training and validation folds were averaged over the cross-validation iterations. The nine folds used for training maintained a sufficient event size (145 events) relative to the 14 variables included, as recommended by the 10-EPV (ten events per variable) guideline (13,14). The optimal model was selected based on these averaged metrics. Subsequently, the entire training set was used for final model training, while the test set served as an independent validation set. The model's performance was evaluated based on the area under the curve (AUC), accuracy, sensitivity, and specificity. The clinical utility of the logistic regression model was judged by computing the net benefit in the training and validation cohorts and plotting the decision curve analysis (DCA). R software (version 4.1.1) was used for all statistical analyses. The R packages "simputation," "glmnet," "caret," "pROC," "broom," "forestplot," "grid," "magrittr," "tinytex," "checkmate," "rmda," "e1071," "randomForest," "rpart," "rpart.plot," "dplyr," "xgboost," "tidyverse" and "neuralnet" were used in this investigation. Statistical significance was defined as a two-sided p-value <0.05.
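The partitioning arithmetic in this section (7:3 split, 10 folds, events per variable) can be checked with a minimal Python sketch. The study itself used R; the function names, the random seed, and the simple interleaved fold assignment below are assumptions chosen for illustration and do not reproduce the authors' actual split.

```python
import random


def split_indices(n, train_frac=0.7, seed=0):
    """Randomly split sample indices 0..n-1 into training and test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seed is arbitrary, for reproducibility
    n_train = round(n * train_frac)
    return idx[:n_train], idx[n_train:]


def k_folds(indices, k=10):
    """Partition indices into k folds of near-equal size."""
    return [indices[i::k] for i in range(k)]


def events_per_variable(n_events, n_variables):
    """EPV ratio used to judge whether the sample size supports the model."""
    return n_events / n_variables


train, test = split_indices(464)
folds = k_folds(train, k=10)
epv = events_per_variable(145, 14)  # 145 events and 14 variables, as reported
```

For 464 samples this gives 325 training and 139 test indices, matching the reported sizes, and an EPV slightly above the guideline value of 10.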

Basic characteristics of the study subjects
The demographic details and clinical features of the study population are shown in Table 1. Regarding age, BMI, education level, history of prior cervical surgery, HPV infection, number of births, number of live births, number of transvaginal births, age of menarche, duration of menstruation, and dysmenorrhea, there were no statistically significant differences between the training set and validation set in either the patient or the control group (p-value >0.05), indicating that the two sets were comparable.

Variable selection
Nine clinical variables (education level, history of prior cervical surgery, HPV infection, number of births, number of live births, number of transvaginal births, age of menarche, duration of menstruation, and dysmenorrhea) showed statistically significant differences in the univariate logistic regression analysis (Table 2). After data preprocessing, LASSO regression (Figure 2) was used to filter variables among the 47 mutant loci and the 9 clinical variables indicated above. Fifteen variables (6 clinical variables + 9 SNPs) with nonzero coefficients were obtained at lambda = 0.040. Stepwise regression was then used to reduce these to 14 variables (5 clinical factors + 9 SNPs), further minimizing the variable set and improving clinical utility.

Model construction and validation
Six models were developed using 10-fold cross-validation. Supplementary Table S2 displays the average evaluation metrics, demonstrating that the LR, XGBoost, and NN models outperformed the others.
Based on the 14 variables acquired above, we created a multivariate logistic regression model using the imputation approach, and the Akaike information criterion (AIC) was 363.82. Placing the cut point at the maximum "Youden index" (0.288, which gave the best cutoff point with both high sensitivity and specificity), we divided samples into two groups (control, patient) according to the yhat value to calculate the confusion matrix. SVM has four widely used kernel functions: the linear function, polynomial function, sigmoid function, and radial basis function. By comparing the prediction accuracy of the four kernel functions in the e1071 package, the best SVM model, the radial basis function (rbf) kernel model, was obtained at a cost of 1 with an accuracy of 0.645.
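The Youden index used to place the cut point is J = sensitivity + specificity - 1, maximized over candidate thresholds. The following Python sketch shows the computation on hypothetical labels and predicted probabilities; the function name `youden_cutoff` and the rule "predict patient when score >= cutoff" are assumptions, and the study's own computation was done in R.

```python
def youden_cutoff(y_true, y_score):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    y_true: list of 0/1 labels (1 = patient); y_score: predicted probabilities.
    A sample is classified as patient when its score is >= the cutoff.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    best_j, best_cut = -1.0, None
    for cut in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= cut)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < cut)
        j = tp / pos + tn / neg - 1
        if j > best_j:  # strict '>' keeps the lowest cutoff among ties
            best_j, best_cut = j, cut
    return best_cut, best_j


# Hypothetical toy data, not taken from the study.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.6, 0.4, 0.7, 0.9]
cutoff, j = youden_cutoff(labels, scores)
```

Once the cutoff is fixed, the confusion matrix follows directly by comparing each predicted probability against it.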
The error of the RF model is smallest when the number of decision trees is 16. As the number of decision trees rises, the model's error steadily declines. After ranking the variable importance of the RF model, the three most important factors were HPV, dysmenorrhea, and history of cervical surgery (Supplementary Figure S1).
For the DT model, 0.011 was the ideal cp value. The input variables for the DT model were HPV and history of cervical surgery, and the final number size was 3 based on the matching ordering of significant characteristics. Supplementary Figure S2 displays the decision tree model and outcomes. In the training dataset, the model correctly classified 71.8% of the samples.
In comparison, the XGBoost model and the NN model exhibited superior performance, closely trailing the predictive capacity of the LR model, achieving an AUC of 0.80. rs3741378, rs2274933 and dysmenorrhea were assessed as the top 3 important variables in the XGBoost model, while HPV, rs148927246 and rs141000672 ranked as the top 3 in the NN model.
The discriminability of the six models in the training and validation sets was assessed using ROC curve analysis, and the AUCs were established (Figures 3A,B). We chose the logistic regression model as our final model for its relatively high performance, simplicity, and interpretability, making it practical for real-world applications (Table 3).
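For reference, the AUC used to compare the models equals the probability that a randomly chosen patient receives a higher predicted score than a randomly chosen control, with ties counted as one half. A minimal Python sketch of this rank-based definition follows; the function name `auc` and the toy data are hypothetical, and the study computed its AUCs in R.

```python
def auc(y_true, y_score):
    """Rank-based AUC: P(score of a positive > score of a negative).

    Ties between a positive and a negative score contribute 0.5.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the study.
toy_labels = [0, 0, 0, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.6, 0.4, 0.7, 0.9]
toy_auc = auc(toy_labels, toy_scores)
```

This pairwise form is quadratic in the sample size, which is acceptable for cohorts of a few hundred samples such as the one described here.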

Evaluation of the LR model
The DCA (Figures 3C,D) indicated good performance of the LR model in terms of clinical application. The forest plot for the LR model appears in Figure 4, rs141000672, rs2302694, rs77689370,

Discussion
This research aimed to uncover factors associated with cervical cancer and precancerous lesions, considering both external factors (such as an individual's education level, menstrual history, and marital status) and internal factors related to genetic susceptibility. We analyzed mutations in 59 specific genetic markers (SNPs) and integrated these findings into our modeling approach. We explored several modeling methods and found that the LR model, combining multiple factors, showed strong predictive ability together with simplicity and interpretability, making it our preferred choice for predicting cervical cancer development.
Several clinical diagnostic prediction models for cervical cancer have been developed recently. The study by Van den Helder et al. showcased the application of hrHPV DNA testing and DNA methylation analysis in urine for detecting CIN2, CIN3, and cervical cancer. Their findings, with an AUC of 0.84 along with high sensitivity and specificity, provide a promising perspective on detecting cervical cancer and precancerous lesions (15). Furthermore, an advanced Stacking-Integrated Machine Learning (SIML) model was developed to identify high-risk individuals for cervical cancer. This model achieved an AUC of 0.877, with a sensitivity of 81.8% and specificity of 81.9%, demonstrating its potential for accurate risk assessment based on demographic, behavioral, and clinical factors (16). Fu et al. asserted that a colposcopy-based multi-image deep learning model that incorporates the results of both an HPV test and a cytology test would produce results with higher sensitivity and specificity than the cytology-HPV diagnostic model or the colposcopy-based multi-image deep learning model applied independently (17). Another study successfully stratified high-grade cervical lesions using sequencing and machine learning as a valuable addition to the current comprehensive triage method (18). However, none of the aforementioned studies integrated germline mutations detected through SNP analysis as a pivotal aspect of early warning for cervical cancer. The inclusion of these genetic susceptibility factors is the distinguishing feature of our model.
According to previous research, HPV is unquestionably the most significant risk factor for cervical carcinogenesis. Cervical precancerous lesions and invasive carcinoma require persistent high-risk HPV infection of cervical basal epithelial cells that have the capacity to divide and differentiate, as well as integration of viral DNA into the host genome (19,20). As a result, high-risk HPV screening is especially relevant as a primary screening method for cervical cancer. The LR model considers a prior history of cervical surgery protective, which makes sense, given that this operation eliminates the anatomical components most likely to develop cervical cancer.
In the LR model, 6 SNPs in 5 genes were independently associated with cervical carcinogenesis, with each SNP exerting a slight effect. We found previously published evidence associating several of these SNPs with cancer. HSPG2 (rs141000672) encodes the perlecan protein, and SNPs of HSPG2 (rs12034979, rs6697265, rs6680566, and rs878949) had previously been identified as potential risk factors for the advancement of cervical lesions caused by infection with HPV types 16, 18, and 52 (21). In our model, HSPG2 is thought to be a protective factor against cervical cancer. Multiple studies have shown that small structural differences in HSPG2 across diseases have antagonistic effects on tumor formation and metastasis; intact perlecan promotes the development of a vascular supply that supports tumor cell proliferation and the development of a variety of cancers, whereas bioactive perlecan fragments inhibit tumor development by targeting its vascular supply (22,23). LRP2 (rs2302694) mutations were detected in a range of cancers, with the highest mutation rates in melanoma (28.18%), uterine sarcoma (17.99%), and lung squamous cell carcinoma (16.32%) (24). A pan-cancer study found that LRP2 mutations were linked with increased immune cell infiltration, immune checkpoint gene expression, and significant enrichment of immune-related signaling pathways, as well as a better prognosis, compared to individuals who did not have LRP2 mutations (24). Depletion of LAMA5 (rs148927246, rs2274933) in lymph node stromal cells controls immunological responses to T cell migration and function, encourages branching angiogenesis, and modifies Notch signaling, which facilitates colorectal cancer spread to the liver (25,26). This raises the possibility that it could affect cervical cellular immunity and angiogenesis and encourage cervical carcinogenesis. HPV infection type is connected with polymorphisms in HLA-DRB1 (rs77689370), which ultimately affects how quickly high-risk HPV-infected cervical lesions develop into invasive cervical cancer (27). The lectican family member NCAN (rs2228600) is primarily expressed in neural tissue. The carcinogenesis and malignancy of neuroblastoma (NB) are influenced by NCAN, which also promotes the growth and invasion of glioma cells (28,29).
Among the six warning models, the decision tree model is clinician-friendly and has high clinical tractability for making decisions from root to leaf nodes, but it exhibited instability in generalization. NN showed competitive performance but slightly lower accuracy than LR. On the other hand, the diagnostic performance of the RF, SVM and XGB models was mediocre, and their "black box" characteristics limited clinical interpretability. By comparison, the multivariate LR model is more interpretable and can assist physicians in anticipating the occurrence of cervical cancer. The 8 variables in the LR model, comprising 2 clinical features and 6 SNPs, were statistically significant risk factors. Advanced techniques like XGBoost and neural networks also demonstrated strong predictive capabilities. Their accuracy and AUC suggest an ability to handle complex relationships in data, making them promising models.
All in all, our model is positioned as a valuable complement to established screening methods like the ThinPrep Cytology Test (TCT) and HPV testing, and is especially beneficial for individuals with positive HPV results alongside normal TCT findings. The potential lies in enhancing both sensitivity and specificity in clinical screening, facilitating the identification of high-risk individuals who might otherwise be overlooked. To ascertain its precise application in clinical practice, rigorous empirical research and thorough clinical validation are imperative. Despite these promising aspects, the present study has certain limitations. Primarily, the retrospective collection of clinical variables through self-reported data in a case-control setup may introduce measurement errors and recall biases, potentially affecting the predictive accuracy of the data. Additionally, the study's limited sample size significantly reduces statistical power within both the training and validation cohorts. Despite this limitation, we used robust internal validation procedures, including 10-fold cross-validation, to mitigate potential biases and enhance the reliability of our findings within the scope of available resources.
In summary, our model incorporating both clinical features and SNPs contributes valuable insights toward predicting cervical carcinoma. While it shows promising predictive ability, further refinement and validation are essential to ascertain its full clinical utility.

FIGURE 1 Flow diagram of the whole research.

FIGURE 2 The dimensionality reduction of 9 clinical features and 59 mutation features by LASSO. (A) Selection of the tuning parameter (λ) via 10-fold cross-validation based on minimum criteria. Binomial deviances from the LASSO regression cross-validation procedure were plotted as a function of log(λ). The optimal λ value of 0.014 was selected. (B) LASSO coefficient profiles of the 68 variables. As the value of λ increased, the degree of model compression increased, more coefficients shrank to zero, and the selection of important variables became stricter.
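The role of the tuning parameter λ in LASSO can be illustrated with the soft-thresholding operator, which is the per-coefficient update inside coordinate-descent LASSO solvers and gives the exact LASSO solution in the special case of an orthonormal design. The Python sketch below uses hypothetical coefficient values and is not the cross-validated fit reported in this study; the function names are assumptions.

```python
def soft_threshold(beta, lam):
    """Shrink a coefficient toward zero by lam; set it to 0 if |beta| <= lam."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0


def count_nonzero(betas, lam):
    """Number of coefficients that remain nonzero at penalty lam."""
    return sum(soft_threshold(b, lam) != 0.0 for b in betas)


# Hypothetical unpenalized coefficients, chosen only for illustration.
example_betas = [0.5, -0.03, 0.02, -0.2]
kept_small_lambda = count_nonzero(example_betas, 0.014)
kept_large_lambda = count_nonzero(example_betas, 0.040)
```

In this toy example a larger penalty leaves fewer nonzero coefficients, which is the behavior traced by a coefficient profile plot.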

FIGURE 3 The receiver operating characteristic (ROC) curves and decision curve analysis (DCA). ROC for the (A) training sets and (B) validation sets. Based on the area under the decision curve, DCA was used to assess the clinical utility of the logistic model. The area for the (C) training set and (D) validation set is greater than that of the "treat all" (gray) or "no treatment" (black) strategy. This indicates that the logistic model has good utility in clinical decision making.
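Decision curve analysis plots net benefit against the threshold probability pt, where net benefit = TP/n - (FP/n) × pt/(1 - pt). The Python sketch below computes this quantity for a model and for the "treat all" reference strategy ("no treatment" has net benefit 0 by definition). The function names and toy data are hypothetical; the study produced its curves in R.

```python
def net_benefit(y_true, y_prob, pt):
    """Net benefit of treating everyone whose predicted risk is >= pt."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= pt and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= pt and y == 0)
    return tp / n - fp / n * pt / (1 - pt)


def net_benefit_treat_all(y_true, pt):
    """Net benefit of the 'treat all' strategy at threshold pt."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * pt / (1 - pt)


# Hypothetical toy data, not taken from the study.
dca_labels = [0, 0, 0, 1, 1, 1]
dca_probs = [0.1, 0.2, 0.6, 0.4, 0.7, 0.9]
nb_model = net_benefit(dca_labels, dca_probs, 0.5)
nb_all = net_benefit_treat_all(dca_labels, 0.5)
```

Evaluating both functions over a grid of pt values between 0 and 1 (exclusive) yields the curves that a decision curve plot compares.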

FIGURE 4 Forest plots based on the p-value and HR values (95% CI) of the 14 variables of the LR model. HR, hazard ratio; CI, confidence interval.

TABLE 1
Demographic and clinical characteristics of patients between training and validation cohorts.
Unless otherwise indicated, values are number of patients with percentage in parentheses. a Values are the mean with standard deviation in parentheses. b BMI (body mass index) = weight (kg)/height² (m²). rs2274933 and HPV infection were identified as risk factors.

TABLE 2
Variate screening using univariate logistic regression.

TABLE 3
Performance of four models for predicting the occurrence of cervical cancer.