REVIEW article

Front. Oncol., 14 April 2025

Sec. Cancer Epidemiology and Prevention

Volume 15 - 2025 | https://doi.org/10.3389/fonc.2025.1555247

Methodological and reporting quality of machine learning studies on cancer diagnosis, treatment, and prognosis

  • 1. Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, United States

  • 2. Telehealth Unit, Universidad Nacional Mayor de San Marcos, Lima, Peru


Abstract

This study aimed to evaluate the quality and transparency of reporting in studies applying machine learning (ML) in oncology, focusing on adherence to the Consolidated Reporting Guidelines for Prognostic and Diagnostic Machine Learning Models (CREMLS) and the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD+AI), and on risk of bias as assessed by the Prediction Model Risk of Bias Assessment Tool (PROBAST). The literature search included primary studies published between February 1, 2024, and January 31, 2025, that developed or tested ML models for cancer diagnosis, treatment, or prognosis. To reflect the current state of the rapidly evolving landscape of ML applications in oncology, the fifteen most recent articles in each category were selected for evaluation. Two independent reviewers screened studies and extracted data on study characteristics, reporting quality (CREMLS and TRIPOD+AI), risk of bias (PROBAST), and ML performance metrics. The most frequently studied cancer types were breast cancer (n=7/45; 15.6%), lung cancer (n=7/45; 15.6%), and liver cancer (n=5/45; 11.1%). The findings indicate several deficiencies in reporting quality, as assessed by CREMLS and TRIPOD+AI, relating primarily to sample size calculation, reporting on data quality, strategies for handling outliers, documentation of ML model predictors, access to training or validation data, and reporting on model performance heterogeneity. The methodological quality assessment using PROBAST revealed that 89% of the included studies exhibited a low overall risk of bias, and all studies showed low concern regarding applicability. Among the AI models identified as the best-performing, Random Forest (RF) and XGBoost were the most frequently reported, each in 17.8% of the studies (n=8). Additionally, our study outlines the specific areas where reporting is deficient, providing researchers with guidance to improve reporting quality in these sections and, consequently, reduce the risk of bias in their studies.

1 Introduction

Cancer is one of the leading causes of disease burden and mortality worldwide. Early detection is crucial for improving clinical outcomes, yet approximately 50% of cancers are diagnosed at an advanced stage (1). Artificial Intelligence (AI), particularly machine learning (ML) models, has shown great potential in cancer diagnosis, prognosis, and treatment, standing out for its high accuracy, sensitivity, and ability to integrate into clinical workflows (2). This progress has led to a significant increase in studies and publications dedicated to the development and evaluation of these models in recent years (3).

The quality of reporting in these studies is critical, as clear and comprehensive documentation allows for the validation of results and their replication in different contexts (2). However, growing concerns have emerged regarding the completeness and accuracy of reports in ML-based research. Indeed, a meta-review of fifty systematic reviews revealed that most reports of primary diagnostic accuracy studies were incomplete, limiting their utility and reproducibility (4). Furthermore, systematic reviews on treatment response in cancer patients have identified high heterogeneity and reporting issues (5, 6).

In response to these deficiencies, specific guidelines, such as the Consolidated Reporting Guidelines for Prognostic and Diagnostic Machine Learning Models (CREMLS) and the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD+AI), have been developed to standardize the key aspects that should be included in ML study reports (7, 8). Nevertheless, the widespread adoption of these tools depends on their dissemination and proper training, which may delay their integration into recent scientific publications (9). In addition to ensuring the quality of reporting, assessing the risk of bias in studies using AI models for cancer patients is essential. The Prediction Model Risk of Bias Assessment Tool (PROBAST) is a tool designed to evaluate the risk of bias across four domains: participants, predictors, outcomes, and analysis (10).

Our study evaluated PubMed publications that applied ML for the diagnosis, treatment, and prognosis of cancer patients. We assessed reporting quality (CREMLS and TRIPOD+AI), risk of bias (PROBAST), and ML performance metrics.

2 Materials and methods

2.1 Eligibility criteria

We included primary studies that applied AI techniques to predict cancer diagnosis, treatment, and prognosis in patients. Studies were eligible regardless of participants’ age, sex, race, ethnicity, or other sociodemographic or clinical characteristics. Articles written in any language and employing observational or experimental study designs were considered. Reviews, viewpoints, and conference papers were excluded. Eligible studies could address a first occurrence of a specific cancer, recurrence of a specific cancer, groups of cancers, or metastases.

2.2 Search strategy and sources

Our search strategy combined terms related to oncology, AI, prognosis, diagnosis, and treatment. The search was conducted in PubMed and restricted to studies published between February 1, 2024, and January 31, 2025. Results were sorted by “most recent,” and the 15 most recent studies in each of the three categories (prognosis, diagnosis, and treatment) were selected, yielding a total of 45 articles. The most recent articles were chosen to reflect the current state of the rapidly evolving landscape of AI applications in oncology. Details of the search strategy are provided in Supplementary Material 1.

2.3 Selection process

The selection process was conducted using Rayyan (11). Two reviewers independently screened the title and abstract of each record identified to determine if the inclusion criteria were met. Subsequently, both reviewers independently evaluated the full text of records that passed the title and abstract screening. In cases of disagreement, a third reviewer made the final decision regarding inclusion.

2.4 Data collection process and data items

Data extraction was performed using a standardized Excel form. Two independent reviewers conducted the data collection process, and a third reviewer made the final decision in cases of disagreement. Each reviewer independently collected information using this form, which captured details such as the name of the first author, year of publication, article title, patient characteristics, type of AI tested or developed, clinical contexts, reported outcomes (AI performance metrics), reporting quality (CREMLS and TRIPOD+AI) (7, 8), and risk of bias (PROBAST) (10).

The extraction process was conducted in duplicate, and the level of agreement between independent evaluators was assessed in the first five included studies. The extracted information was highly consistent, achieving an agreement level above 80%.
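The article does not publish its agreement calculation; as a minimal, hypothetical sketch, percent agreement between two reviewers over a set of checklist items can be computed as follows (the item ratings shown are invented for illustration):

```python
def percent_agreement(reviewer_a, reviewer_b):
    """Share of items on which two reviewers gave the same rating."""
    assert len(reviewer_a) == len(reviewer_b)
    matches = sum(a == b for a, b in zip(reviewer_a, reviewer_b))
    return matches / len(reviewer_a)

# Hypothetical ratings for ten checklist items ("Y" = Yes, "N" = No/Not reported)
a = ["Y", "Y", "N", "Y", "N", "Y", "Y", "N", "Y", "Y"]
b = ["Y", "Y", "N", "Y", "Y", "Y", "Y", "N", "Y", "Y"]
print(percent_agreement(a, b))  # 0.9, above the 80% threshold used in the study
```

Chance-corrected statistics such as Cohen's kappa are stricter alternatives, but simple percent agreement matches the "above 80%" criterion reported here.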

2.5 Synthesis methods

2.5.1 Report quality

The TRIPOD+AI checklist consists of 27 key items designed to ensure the quality and transparency of reporting in prediction model studies (8). This updated version of TRIPOD extends the recommendations to include models developed using ML methods. The primary objective of TRIPOD+AI is to promote clear, comprehensive, and reproducible reporting in studies that develop or validate prediction models in healthcare. TRIPOD+AI results were categorized using the following response options: Yes, No/Not reported, or Not applicable.

The CREMLS guideline was developed to enhance the transparency and quality of reporting in ML studies for diagnostic and prognostic applications in healthcare (7). It comprises 37 items organized into five categories: study details, data, methodology, model evaluation, and model explainability. CREMLS results were assessed using the following response options: Yes, No/Not reported, or Not applicable.

Since our objective was to assess the reporting quality of studies using ML for diagnostic, prognostic, and treatment outcomes in cancer patients, no studies were excluded based on a negative evaluation of reporting quality.

2.5.2 Risk of bias

The PROBAST tool assessed the risk of bias (RoB) in studies on prediction models for diagnosis, prognosis, and treatment. PROBAST provides a structured framework for evaluating the internal and external validity of studies by identifying potential biases in participant selection, predictor and outcome measurement, analytical strategies, and data presentation (10).

The response options in PROBAST were: low RoB/low concern regarding applicability (+), high RoB/high concern regarding applicability (-), and unclear RoB/unclear concern regarding applicability (?).

2.5.3 AI performance metrics

The study presented the main metrics used to evaluate the performance of each AI model included in the analyzed articles across different phases (i.e., training, testing, validation). These metrics included sensitivity (also known as recall or true positive rate), specificity (true negative rate), overall accuracy (probability of correct classification), positive predictive value (precision), F1 score, and the area under the ROC curve (AUC).
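For readers unfamiliar with these metrics, the definitions above can be sketched in a short, self-contained Python example (the data below are invented; the study itself only extracted metrics reported by the included articles):

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix-based metrics for binary labels (1 = positive class)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    sens = tp / (tp + fn)          # sensitivity / recall / true positive rate
    spec = tn / (tn + fp)          # specificity / true negative rate
    acc = (tp + tn) / len(y_true)  # overall accuracy
    ppv = tp / (tp + fp)           # positive predictive value / precision
    f1 = 2 * ppv * sens / (ppv + sens)
    return sens, spec, acc, ppv, f1

def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative (Mann-Whitney formulation; ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: 6 patients, 3 with the outcome
sens, spec, acc, ppv, f1 = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
# each metric equals 2/3 for this toy example
a = auc([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.35, 0.2, 0.6])  # 8/9
```

In practice the included studies computed these with standard libraries; the sketch only makes the definitions concrete.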

2.6 Statistical analysis

All statistical analyses were performed using R. Our study employed a descriptive approach based on percentages and absolute frequencies of CREMLS, TRIPOD+AI, and PROBAST scores. No inferential approaches focused on hypothesis testing were used.
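The descriptive approach amounts to tabulating absolute and relative frequencies of each response option per checklist item. A minimal sketch (in Python, although the authors used R; the ratings shown are hypothetical) is:

```python
from collections import Counter

def item_compliance(ratings):
    """Absolute frequency and percentage of each response option for one item."""
    counts = Counter(ratings)
    n = len(ratings)
    return {opt: (c, round(100 * c / n, 1)) for opt, c in counts.items()}

# Hypothetical CREMLS ratings for one item across the 45 included studies
ratings = ["Yes"] * 23 + ["No/Not reported"] * 20 + ["Not applicable"] * 2
print(item_compliance(ratings))
# e.g. "Yes" -> (23, 51.1)
```

Repeating this per item yields the percentages reported in the Results.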

3 Results

3.1 Study selection

Our study searched PubMed for ML models related to diagnosis, treatment, and prognosis that met the inclusion criteria. To standardize selection, we included only the first 15 articles for each category: prognosis, diagnosis, and treatment. The screening process, which involved title and abstract review followed by full-text assessment, is shown in Figure 1.

Figure 1

3.2 Characteristics of the included studies

The first authors of the included studies were affiliated with institutions from 16 different countries. China had the highest number of publications (n=27/45; 60%), followed by Egypt (n=3/45; 6.7%). The most frequently studied cancer types were breast cancer (n=7/45; 15.6%), lung cancer (n=7/45; 15.6%), and liver cancer (n=5/45; 11.1%). More than half of the studies did not specify the source of funding (n=25/45; 55.6%). The individual characteristics of each study are summarized in Supplementary Table 1. The list of included articles is presented in Supplementary Table 2.

3.3 Report quality by CREMLS

Our study found that, in the study design section, all evaluated studies adequately reported the medical/clinical task of interest, research question, overall study design, and intended use of the ML model. However, only 51% of the studies provided information on whether existing model performance benchmarks for this task were considered.

In the data section, all studies reported information on methods of data collection and data characteristics. However, a large proportion did not report details on sample size calculation (98%), known quality issues with the data (69%), or bias introduced due to the data collection method used (62%).

In the methodology section, all studies reported the rationale for selecting the ML algorithm, and 98% described the method used to evaluate model performance during training. However, no studies reported strategies for handling outliers, and the majority did not provide information on strategies for model pre-training (92%) or data augmentation (79%).

In the evaluation section, nearly all studies reported some type of performance metrics used to evaluate the model (98%) and the results of internal validation (98%). However, no studies provided information on characteristics relevant for detecting data shift and drift, and 93% did not report or discuss the cost or consequences of errors.
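Detecting data shift or drift, which no included study reported on, can be as simple as comparing the distribution of a predictor between development and deployment data. A hedged sketch of one common approach, the two-sample Kolmogorov-Smirnov statistic (values below are invented; the included studies did not publish such checks), is:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between the
    empirical CDFs of two samples (a larger gap suggests stronger shift)."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical lab value whose distribution shifts after model deployment
training = [4.1, 4.5, 4.8, 5.0, 5.2, 5.5, 5.9]
deployment = [5.8, 6.1, 6.4, 6.7, 7.0, 7.2, 7.5]
print(ks_statistic(training, deployment))  # 6/7, a large gap indicating shift
```

Reporting such characteristics would let readers judge whether a model's performance is likely to hold on new data.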

In the explainability and transparency section, 82% of the studies reported the most important features and their relationship to the outcomes. Compliance with different criteria for each individual study is detailed in Figure 2.

Figure 2

3.4 Report quality by TRIPOD+AI

The included studies demonstrated high compliance with key reporting standards for predictive modeling. All studies (100%) clearly identified their nature as predictive modeling investigations, specified their objectives, and provided detailed descriptions of data sources and participant eligibility (Figure 3). They also precisely defined predictors and outcomes, justifying their selection and evaluation methodology. Data preparation and model validation were likewise well documented, with comprehensive descriptions of analytical methods, including model type, hyperparameter tuning, and performance metrics. Furthermore, all studies addressed the interpretation of results and discussed limitations, biases, and generalizability. Ethical considerations were reported in most cases, with 96% of studies indicating approval from an ethics committee or an equivalent review process. The flow of participants and differences between development and evaluation datasets were also consistently documented.

Figure 3

Despite strong adherence to several TRIPOD+AI criteria, significant reporting gaps were identified. Most studies (98%) did not report potential health inequalities among sociodemographic groups or describe the demographic characteristics and qualifications of outcome and predictor assessors. Additionally, 98% of studies failed to indicate where the study protocol could be accessed, and 96% did not provide information on study registration or describe methods to address model equity or justify their approach. The availability of analytical code was also limited, with 93% of studies not providing details on access. A critical issue was the complete absence of patient and public involvement in study development (100%). Furthermore, 93% of studies did not compare development and evaluation data for key predictors, while 96% did not report heterogeneity in model performance. Lastly, 98% of studies did not present results on model updating, and 91% did not specify the level of user interaction or expertise required for model implementation.

3.5 Risk of bias by PROBAST

The methodological quality assessment using PROBAST revealed that 89% of the included studies exhibited a low overall risk of bias, suggesting adequate internal validity in most evaluated studies (see Table 1). Additionally, all studies demonstrated a low risk of bias in terms of applicability, indicating that the assessed predictive models are potentially transferable to real-world clinical settings.

Table 1

| First author (year) | ROB: Participants | ROB: Predictors | ROB: Outcome | ROB: Analysis | Applicability: Participants | Applicability: Predictors | Applicability: Outcome | Overall ROB | Overall applicability |
|---|---|---|---|---|---|---|---|---|---|
| Dong (2025) | Low | Unclear | Low | Low | Low | Unclear | Low | Low | Low |
| Liang & Luo (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Zhang et al. (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Noman (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Zhou (2025) | Low | Unclear | Low | Low | Low | Unclear | Low | Low | Low |
| Liu (2025) | Low | Low | Low | Unclear | Low | Low | Low | Low | Low |
| Franco-Moreno (2025) | Low | Unclear | Low | Unclear | Low | Unclear | Low | Unclear | Low |
| Li (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Hamano (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Yin (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Xiao (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Liu (2025) | Low | Unclear | Low | Low | Low | Unclear | Low | Low | Low |
| Su (2025) | Low | Unclear | Low | Low | Low | Unclear | Low | Low | Low |
| Alshwayyat (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Liu (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Maleki (2025) | Low | Unclear | Low | Low | Low | Unclear | Low | Low | Low |
| Park (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| EL kati (2024) | Unclear | Low | Low | Unclear | Low | Low | Low | Unclear | Low |
| Wu (2025) | Low | Low | Low | High | Low | Low | Low | High | Low |
| Shan (2025) | Low | Low | Low | High | Low | Low | Low | High | Low |
| Ni (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Rafiepoor (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Shehta (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Chiu (2025) | Low | Low | Low | Unclear | Low | Low | Low | Low | Low |
| Wang (2025) | Low | Unclear | Low | High | Low | Unclear | Low | High | Low |
| Natha (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Gui (2024) | Low | Unclear | Low | Low | Low | Unclear | Low | Low | Low |
| Şahin (2025) | Low | Unclear | Low | Low | Low | Unclear | Low | Low | Low |
| Chen (2025) | Low | Unclear | Low | Low | Low | Unclear | Low | Low | Low |
| Zhong (2025) | Low | Unclear | Low | Low | Low | Unclear | Low | Low | Low |
| Pan (2025) | Low | Low | Low | Unclear | Low | Low | Low | Low | Low |
| Wang (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Wang (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Nashat (2025) | Low | Low | Low | Unclear | Low | Low | Low | Low | Low |
| Zhou (2024) | Low | Low | Low | Unclear | Low | Low | Low | Low | Low |
| Torok (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Bozcuk (2025) | Low | Unclear | Low | Low | Low | Unclear | Low | Low | Low |
| Chen (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Pan & Wang (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Chufal (2025) | Low | Low | Low | Unclear | Low | Low | Low | Low | Low |
| Miyamoto (2025) | Low | Unclear | Low | Low | Low | Unclear | Low | Low | Low |
| Grosu (2025) | Low | Low | Unclear | Low | Low | Low | Unclear | Low | Low |
| Huang (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Zhang (2025) | Low | Low | Low | Low | Low | Low | Low | Low | Low |
| Ramasamy (2024) | Low | Low | Low | Low | Low | Low | Low | Low | Low |

Risk of Bias by PROBAST.

ROB, risk of bias; Low, indicates low ROB/low concern regarding applicability; High, indicates high ROB/high concern regarding applicability; Unclear, indicates unclear ROB/unclear concern regarding applicability.

At the domain level, the risk of bias in the participant and outcome domains was minimal, with 98% of the studies showing a low risk of bias in these areas. Regarding applicability, 100% of the studies reported low concern for the applicability of participants, while 98% indicated low concern for the applicability of outcomes.

However, the predictor domain showed an unclear risk of bias in 29% of the studies, both in terms of bias risk assessment and applicability. Furthermore, in the analysis domain, only 76% of the studies exhibited a low risk of bias, suggesting that a significant proportion may have employed methodological practices that compromise the validity of their findings. Issues such as handling of missing data, predictor selection, and correction for overfitting require further attention to enhance the robustness of predictive models in future studies.

3.6 AI performance metrics

In terms of the AI model family used, 73.3% of the studies employed supervised ML models (n = 33), 24.4% used deep learning models (n = 11), and one study (2.2%) used reinforcement learning. Regarding the specific AI models identified as the best-performing, Random Forest (RF) and XGBoost were the most frequently reported, each in 17.8% of the studies (n = 8). The performance metrics for each study are presented in Table 2.

Table 2

| First author (year) | Type | Type of data | Type of AI tested or developed | AI models used | Best model | AI metrics |
|---|---|---|---|---|---|---|
| Dong (2025) | Prognosis | Cervical screening data, including hrHPV full genotyping, cytology results, and gynecological examination findings. | Supervised Machine Learning | XGBoost, Support Vector Machine (SVM), Random Forest (RF), Naïve Bayes (NB) | XGBoost | AUROC: 0.989 (maximum), 0.781 (minimum); Brier Score: 0.118 (maximum), 0.01 (minimum) |
| Liang & Luo (2025) | Prognosis | Structured data | Supervised Machine Learning | XGBoost, MLP, KNN, Random Forest (RF), Logistic Regression, AJCC Staging | XGBoost | AUC (Training: 0.95, Validation: 0.78) |
| Zhang et al. (2025) | Prognosis | Structured data (clinical, lab results) | Supervised Machine Learning | Logistic Regression, KNN, RF, SVM, XGBoost, LightGBM | Logistic Regression | Accuracy: 0.765 (LR), Precision: 0.750 (LR), Recall: 0.710 (LR), F1-Score: 0.695 (LR), AUC: 0.812 (LR) |
| Noman (2025) | Prognosis | Clinical and pathological features (tumor grade, ER status, HER2, lymph node involvement, tumor size, survival status). | Supervised Machine Learning | LightGBM, XGBoost, Random Forest, Support Vector Machine (SVM), Neural Networks (NN), K-Nearest Neighbors (KNN) | LightGBM (for recurrence), XGBoost (for recurrence type differentiation) | C-index: 0.837 (survival analysis), AUC: 92% (LightGBM for recurrence), Accuracy: 86% (XGBoost for recurrence type classification) |
| Zhou (2025) | Prognosis | Clinical, demographic, psychological, and health-related data. | Supervised Machine Learning | Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), Extreme Gradient Boosting (XGBoost), Gradient Boosted Trees (GBDT), Multi-Layer Perceptron (MLP) | Random Forest (RF) | AUC: 0.925, Accuracy: 84.7%, Sensitivity: 90.5%, Specificity: 80.9%, Brier Score: 0.107 |
| Liu (2025) | Prognosis | Pathomic features extracted from whole slide imaging (WSI) histopathology. | Deep Learning, Weakly Supervised Learning | DenseNet121, ResNet50, Inception_v3, XGBoost, Random Forest, LightGBM, Cox Regression | XGBoost (for metastasis prediction) | AUC: 0.981 (training), 0.804 (testing); C-index: 0.892 (training), 0.710 (testing) |
| Franco-Moreno (2025) | Prognosis | Clinical and laboratory data (D-dimer, hemoglobin, cholesterol, platelets, leukocyte count, etc.) | Supervised Machine Learning | CatBoost, XGBoost, LightGBM | CatBoost | AUC-ROC: 0.86 (95% CI: 0.83–0.87), Sensitivity: 62%, Specificity: 94%, PPV: 75%, NPV: 93% |
| Li (2025) | Prognosis | Clinical, pathological, and molecular data (tumor stage, lymph nodes, neural invasion, Ki67, biomarkers) | Supervised Machine Learning | Random Survival Forest (RSF), Extreme Gradient Boosting (XGBoost), Decision Survival Tree (DST) | Random Survival Forest (RSF) | C-index: 0.968 (training), 0.936 (validation); AUC: 0.968 (training), 0.936 (validation); Brier Score: 0.067 |
| Hamano (2025) | Prognosis | Clinical and laboratory data (heart rate, respiratory rate, leukocyte count, albumin, C-reactive protein, etc.) | Supervised Machine Learning | Fractional Polynomial (FP) Regression, Kernel Fisher Discriminant Analysis (KFDA), Kernel Support Vector Machine (KSVM), XGBoost | Kernel Support Vector Machine (KSVM) | AUC-ROC: 0.834 |
| Yin (2025) | Prognosis | Computed tomography (CT) images and clinical data (tumor characteristics, biomarkers). | Deep Learning | ResNet50 + Multilayer Perceptron (MLP) | ResNet50 + MLP | AUC-ROC: 0.96 (training), 0.87 (internal testing), 0.85 (external validation) |
| Xiao (2025) | Prognosis | Clinical and pathological data (tumor markers, blood tests, histopathological features). | Supervised Machine Learning | Cox proportional hazards model, Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost) | RFE-XGBoost | AUC-ROC: 0.89 (95% CI: 0.82–0.94), Accuracy: 0.83 (95% CI: 0.76–0.90), F1 score: 0.81 (95% CI: 0.72–0.88), Brier Score: 0.13 (95% CI: 0.09–0.17) |
| Liu (2025) | Prognosis | Clinical and laboratory data (blood tests, tumor characteristics, surgical variables). | Supervised Machine Learning | Extreme Gradient Boosting (XGBoost), Random Forest (RF), Support Vector Machine (SVM), k-nearest neighbor (KNN) | XGBoost | AUC-ROC: 0.939 (training), 0.876 (validation), Accuracy: 0.8908 |
| Su (2025) | Prognosis | CT-based radiomics features and genomic data (gene expression analysis from TCGA). | Supervised Machine Learning | Cox Regression, LASSO, Random Survival Forest (RSF) | Radiomics-based Cox Regression Model | AUC-ROC: 0.899 (1-year), 0.906 (3-year), 0.869 (5-year) in the test set; C-index: 0.819 (training), 0.892 (validation), 0.851 (test) |
| Alshwayyat (2025) | Prognosis | Clinical and pathological data (tumor characteristics, treatment modalities, ER/PR status). | Supervised Machine Learning | Random Forest (RF), Gradient Boosting Classifier (GBC), Logistic Regression (LR), K-Nearest Neighbors (KNN), Multilayer Perceptron (MLP) | Random Forest (RF) | AUC-ROC: 0.743, Accuracy: 70.78%, Sensitivity: 94.52% |
| Liu (2025) | Prognosis | Clinical and demographic data (tumor characteristics, metastasis, lymph node involvement, treatment history). | Supervised Machine Learning | XGBoost, SHapley Additive exPlanations (SHAP) tool for model interpretation. | XGBoost | AUC-ROC: 0.813 (1-year), 0.738 (3-year), 0.733 (5-year) in training; 0.781, 0.785, 0.775 in validation. |
| Maleki (2025) | Diagnosis | Clinical and laboratory data (paraneoplastic autoantibody panels from serum and CSF) | Supervised Machine Learning | Naive Bayes | Naive Bayes | AUC-ROC: 0.9795, Sensitivity: 85.71%, Specificity: 100%, Accuracy: 97.14%, Brier Score: 0.04 |
| Park (2025) | Diagnosis | Electronic medical records (EMR) including medications, laboratory data, and clinical procedures. | Supervised Machine Learning | Random Forest, Lasso Logistic Regression, AdaBoost, Decision Tree, Naïve Bayes, Multilayer Perceptron | Random Forest | AUROC: 0.88 (training), 0.88 (validation); Sensitivity: 62%; Specificity: 94% |
| EL kati (2024) | Diagnosis | Clinical and imaging data from WBCD and WDBC. | Deep Learning | Deep Neural Network (DNN) with three hidden layers, optimized with Gradient Multi-Verse Optimizer (GMVO). | GMVO-optimized DNN | Accuracy: 93.5% (WBCD), 96.73% (WDBC); Precision: 88.06% (WBCD), 93.38% (WDBC); Specificity: 93.06% (WBCD), 95.83% (WDBC); Sensitivity: 95.64% (WBCD), 98.25% (WDBC). |
| Wu (2025) | Diagnosis | MRI-based radiomics and deep learning features. | Deep Learning | Deep Transfer Learning (DTL) models: vgg19, GoogLeNet, Inception_v3, Vision Transformer (ViT). | vgg19-combined fusion model | AUC-ROC: 0.990 (internal test), 0.988 (external test); Accuracy: 0.935 (internal test), 0.875 (external test); F1-score: 0.937 (internal test), 0.885 (external test). |
| Shan (2025) | Diagnosis | Multiphasic contrast CT images (arterial, venous, and delayed phases). | Deep Learning | 3D-ResUNet (coarse-to-fine segmentation network) | 3D-ResUNet | Dice score: 0.8819; Precision: 0.905 (tumors >20 mm); Recall: 0.9728 (tumors >20 mm); F1-score: 0.9377 (tumors >20 mm) |
| Ni (2025) | Diagnosis | LDCT images, liquid biopsy (CACs), clinical variables (age, extra-thoracic cancer history, gender). | Supervised Machine Learning | Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR), Light Gradient Boosting (LGB), Least Absolute Shrinkage and Selection Operator (LASSO). | Random Forest (RF) | AUC-ROC: 0.99 (validation cohort); Sensitivity: 92%; Specificity: 97%. |
| Rafiepoor (2025) | Diagnosis | MiRNA expression data from GEO datasets. | Supervised Machine Learning | Support Vector Machine (SVM), Boosted Trees (BT), Bootstrap Forest (BF), K-Nearest Neighbors (KNN), Neural Networks (NeB), LASSO Regression, Nominal Logistic, Naïve Bayes (NB). | Support Vector Machine (SVM) | AUC-ROC: 0.94, Sensitivity: 91% |
| Shehta (2025) | Diagnosis | Medical imaging (microscopic blood smear images) | Deep Learning | ResNetRS50, RegNetX016, AlexNet, Convnext, EfficientNet, Inception_V3, Xception, VGG19 | ResNetRS50 | Accuracy: 97%, Recall: 99%, F1-score: 98% |
| Chiu (2025) | Diagnosis | Dermoscopic images | Deep Learning | Swin Transformer, Vision Transformer (ViT), EfficientNetB5, EfficientNetV2B2, ResNet50, VGG16 | Ensemble model (Swin Transformer + Vision Transformer + EfficientNetB5) | CSMUH dataset: Accuracy 97.31% (test), 99.77% (training). ISIC dataset: Accuracy 85.38% (test), 95.86% (training). False negatives reduced from 124 to 45 (ISIC dataset) and eliminated (CSMUH dataset). |
| Wang (2025) | Diagnosis | NBI endoscopic images and clinical data. | Supervised Machine Learning | Random Forest (RF), Support Vector Machine (SVM), Decision Tree (DT) | Random Forest (RF) | Accuracy: 96%, Precision: 90%, Recall: 100%, AUC: 0.97 |
| Natha (2025) | Diagnosis | Dermoscopic images | Supervised Machine Learning | Random Forest (RF), Multi-layer Perceptron Neural Network (MLPN), Support Vector Machine (SVM) | Max Voting ensemble method | Accuracy: 94.70%, Precision: 93.23%, Recall: 92.54%, F1-score: 94.20% |
| Gui (2024) | Diagnosis | DCE-MRI images | Deep Learning | Faster R-CNN, Mask R-CNN, YOLOv9 | Improved Faster R-CNN (with ROI aligning, FPN, and additional convolutional layers) | mAP: 0.752, Sensitivity: 0.950, False positive rate: 0.133 |
| Şahin (2025) | Diagnosis | CT images | Deep Learning | You Only Look Once (YOLOv8), YOLOv5, YOLOv7 | YOLOv8 | Accuracy: 95.35%, Precision: 0.972, Recall: 0.919, Sensitivity: 94.74%, Specificity: 95.83% |
| Chen (2025) | Diagnosis | RNA-seq gene expression data from granulosa cells. | Supervised Machine Learning | LASSO Regression, Support Vector Machine Recursive Feature Elimination (SVM-RFE), XGBoost | XGBoost | AUC-ROC: 0.875 (validation cohort); AUC-ROC: 0.795 (SVM) |
| Zhong (2025) | Diagnosis | Label-free multiphoton imaging (MPM) data, histopathology features, clinical information. | Supervised Machine Learning | Decision Tree (DT), Multi-Layer Perceptron (MLP), Random Forest (RF), Support Vector Machine (SVM) | Multi-Layer Perceptron (MLP) - Two-stage MINT model | AUC-ROC: 0.92 (stage 1), 1.00 (stage 2); Accuracy: 89.36% (external validation) |
| Pan (2025) | Treatment | MRI-based radiomics and clinical data (tumor morphology, enhancement characteristics, and diffusion-weighted imaging). | Deep Learning | Logistic Regression, Support Vector Machine (SVM), Bayes, K-Nearest Neighbors (KNN), Decision Tree, Random Forest | Bayes Model integrating MRI radiomics and clinical data | AUC-ROC: 0.829, Sensitivity: 76.9%, Specificity: 83.3% |
| Wang (2025) | Treatment | Clinical, serological, and ultrasound imaging data. | Supervised Machine Learning | Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), XGBoost, LightGBM, Logistic Regression, K-Nearest Neighbors (KNN), Multilayer Perceptron (MLP), Gradient Boosting Tree (GBT), Backpropagation Neural Network (BPNN), LASSO, Stepwise Regression. | SVM + RF | AUC-ROC: 0.972 (training), 0.922 (internal validation), 0.78 (external validation); Sensitivity: 85.52%, Specificity: 97.73%. |
| Wang (2025) | Treatment | Transcriptomic data (RNA expression) and qPCR analysis. | Supervised Machine Learning | Random-effects meta-analysis, Forward-search optimization | Meta-analysis optimized gene signature (LAMC2, TSPAN1, MYO1E, MYOF, SULF1) | AUC-ROC: 0.99 (training), 0.89 (external validation), 0.83 (peripheral blood validation) |
| Nashat (2025) | Treatment | Contrast-enhanced computed tomography (CECT) imaging and clinical features. | Supervised Machine Learning | Support Vector Machine (SVM), Decision Tree, K-Nearest Neighbor, Logistic Regression, Multilayer Perceptron, Random Forest | Support Vector Machine (SVM) | Accuracy: 95.24% (volumetric response), 84.13% (histologic response); Sensitivity: 95.65% (volumetric), 89.47% (histologic); Specificity: 94.12% (volumetric), 88% (histologic) |
| Zhou (2024) | Treatment | Blood biomarkers (C-reactive protein, neutrophil count, lactate dehydrogenase, alanine transaminase). | Supervised Machine Learning | Support Vector Machine (SVM), XGBoost, Random Forest (RF), Decision Tree (DT), Gradient-Boosted Machine (GBM), Generalized Linear Models (GLM), Least Absolute Shrinkage and Selection Operator (LASSO) | Support Vector Machine (SVM) | AUC-ROC: 0.908 (training), 0.666 (validation cohort 1), 0.776 (validation cohort 2) |
| Torok (2025) | Treatment | Serum N-glycome analysis using capillary electrophoresis with laser-induced fluorescence detection (CGE-LIF). | Supervised Machine Learning | Quadratic Discriminant Analysis (QDA), Random Forest, XGBoost, Neural Network, Support Vector Classifier (SVC) | Quadratic Discriminant Analysis (QDA) | AUC-ROC: 0.829 (regression), 0.8295 (progression), 0.841 (stationary) |
| Bozcuk (2025) | Treatment | Clinical and mutational data (age, ECOG score, mutation type, metastases, smoking status, blood biomarkers) | Reinforcement Learning | Deep Q-Network (DQN), Extra Trees Classifier (ETC) | Deep Q-Network (DQN) | AUC-ROC: 0.80 (DQN), 0.73 (ETC) |
| Chen (2025) | Treatment | Multi-omics data (scRNA-seq, bulk transcriptomics, GWAS) | Supervised Machine Learning | Random Survival Forest (RSF), Bayesian Deconvolution, Gene Set Variation Analysis (GSVA) | Multi-omics-driven Machine Learning Signature (MOMLS) | AUC-ROC: 0.875 (validation cohort); Kaplan-Meier survival analysis showed significant prognostic differences. |
| Pan & Wang (2025) | Treatment | Clinical data (immune cell profiling, HBV DNA levels, antiviral treatment status). | Supervised Machine Learning | LASSO Regression, Random Forest (RSF), XGBoost | Random Forest (RSF) | AUC-ROC: 0.961 (RSF), 0.903 (XGBoost), 0.864 (LASSO) |
| Chufal (2025) | Treatment | Clinical data and DIBH waveform parameters (breath-hold amplitude, duration, consistency). | Supervised Machine Learning | Gradient Boosting, LightGBM, XGBoost, CatBoost, Logistic Regression, Random Forest, K-Nearest Neighbors, Support Vector Machines, Naive Bayes | Uncalibrated Gradient Boosting Ensemble | AUC-ROC: 0.803, Recall: 0.526 |
| Miyamoto (2025) | Treatment | CT radiomics features (texture analysis, intensity statistics, shape features). | Supervised Machine Learning | Random Forest (RF), Boruta Algorithm | Random Forest (RF) | AUC-ROC: 0.94 (training), 0.87 (validation) |
| Grosu (2025) | Treatment | CT colonography images, radiomics features. | Supervised Machine Learning | Random Forest (RF) | Random Forest (RF) | Accuracy: 84% (AI-assisted), Sensitivity: 85%, Specificity: 82%, Fleiss’ kappa: 0.92 |
| Huang (2025) | Treatment | Clinical and radiotherapy data (fraction size, cumulative dose, interruptions), pathological risk factors. | Supervised Machine Learning | eXtreme Gradient Boosting (XGBoost), Cox Regression | XGBoost | Accuracy: 57.89%, Sensitivity: 57.14%, Positive Predictive Value: 44.44%, AUC-ROC: 0.58 |
| Zhang (2025) | Treatment | Clinical and demographic data (tumor stage, histological grade, treatment history). | Deep Learning | Balanced Individual Treatment Effect for Survival data (BITES), Cox Mixtures with Heterogeneous Effects (CMHE), DeepSurv, Cox Proportional Hazards (CPH), Random Survival Forest (RSF) | Balanced Individual Treatment Effect for Survival data (BITES) | IPTW-adjusted HR: 0.84 (CRT vs. surgery + RT/CRT); 0.77 (surgery + CRT vs. surgery + RT) |
| Ramasamy (2024) | Treatment | CT radiomics features, clinical data (performance status, treatment site, smoking history, SBRT dose) | Supervised Machine Learning | Multi-Layer Perceptron (MLP), Support Vector Classification (SVC), Random Forest, Adaptive Boosting (AdaBoost) | Multi-Layer Perceptron (MLP) with ADASYN sampling | AUC-ROC: 0.94 (combined model), 0.95 (clinical model for early-stage patients) |

AI performance metrics by study.

4 Discussion

Our review included 45 studies that employed ML models for diagnostic, prognostic, and treatment outcomes in cancer patients. The findings indicate several deficiencies in reporting quality, as assessed by CREMLS and TRIPOD+AI. These deficiencies primarily relate to sample size calculation, reporting on data quality, strategies for handling outliers, documentation of ML model predictors, access to training or validation data, and reporting on model performance heterogeneity.

These reporting limitations align with the PROBAST risk of bias assessment, which identified unclear risk of bias in the predictors and analysis domains in approximately one in four included studies. Therefore, it is plausible to conclude that reporting quality is most limited in sections related to methodology, analysis, and elements that support study reproducibility (e.g., code and datasets). These limitations may introduce a higher risk of bias.

Our study aligns with previous research identifying significant issues in the quality of reporting in studies employing AI techniques to develop clinical prediction models, including the presence of “spin” practices and poor reporting standards (12). Moreover, another systematic review concluded that most studies are at high risk of bias, citing issues such as small sample sizes, inadequate handling of missing data, and lack of external validation (13). These problems are not limited to predictive models for cancer but also extend to other disciplines within clinical diagnostics, for instance, cardiology or mental health (14, 15).

The potential of AI models to predict cancer diagnoses, prognosis, and treatment is substantial, representing an opportunity to enhance healthcare access (16). While some studies have demonstrated that the diagnostic accuracy of AI models for detecting neoplasms is remarkably high, particularly in cases such as upper gastrointestinal tract cancers (17), the methodological quality of these studies remains low, with a high risk of selection bias. Additionally, other review studies evaluating the use of AI for predicting treatment and prognostic outcomes in cancer patients have also identified reporting issues and highly restricted access to data necessary for study replication (18, 19). Regarding performance metrics, the evidence highlights the need to standardize the reporting of metrics in studies due to variability in the parameters used and the inconsistent reporting of performance outcomes (20, 21).
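To make the idea of standardized metric reporting concrete, the sketch below computes a fixed panel of metrics (AUC-ROC via the Mann-Whitney statistic, sensitivity, specificity, and positive predictive value) from predicted scores. The function name and the particular metric panel are illustrative choices for this review, not a standard proposed by the cited studies.

```python
def standardized_report(y_true, y_score, threshold=0.5):
    """Return a fixed panel of performance metrics so results are
    directly comparable across studies."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    # AUC-ROC as the Mann-Whitney probability that a random positive
    # case scores higher than a random negative case (ties count 0.5).
    auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))
    y_pred = [int(s >= threshold) for s in y_score]
    tp = sum(p == 1 and y == 1 for p, y in zip(y_pred, y_true))
    tn = sum(p == 0 and y == 0 for p, y in zip(y_pred, y_true))
    fp = sum(p == 1 and y == 0 for p, y in zip(y_pred, y_true))
    fn = sum(p == 0 and y == 1 for p, y in zip(y_pred, y_true))
    return {
        "auc_roc": auc,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }
```

Reporting such a panel in full, with the decision threshold stated, would remove much of the between-study variability in which metrics are disclosed.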

Currently, several initiatives aim to promote the responsible use of AI models in oncology patients, such as the American Society of Clinical Oncology’s transparency principles for AI. These principles seek to enhance transparency in AI models within oncology, potentially improving their real-world application and supporting clinical decision-making (22). However, a recent review published in a journal of this scientific society found that adherence to these principles remains limited, highlighting the need to promote their implementation (23). Nevertheless, this issue does not appear to be exclusive to oncology. A systematic review identified risks of bias and reporting quality issues in ML models across various healthcare contexts, not just oncology (13). Therefore, poor reporting quality seems to be a broader issue affecting health-related disciplines in general.

Incorporating cost-related information into AI model evaluations is essential for their real-world applicability. Understanding both direct and indirect costs would facilitate the integration of these models into clinical practice by providing a clearer assessment of the financial requirements for infrastructure, staff training, and implementation logistics (24). One of the key reporting limitations identified in the included studies was the lack of information on costs and the consequences of classification errors. This may be because most studies focus primarily on accuracy and effectiveness while overlooking the associated costs, despite evidence suggesting that AI can reduce the costs of cancer diagnosis and treatment (25). Additionally, assessing the economic impact of classification errors could help balance model performance with cost-effectiveness, ensuring that AI-driven tools provide both clinical and economic value. Future studies should integrate cost analyses alongside model performance metrics to enhance the feasibility and adoption of AI in oncology.
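To illustrate how the economic impact of classification errors could be weighed alongside accuracy, the sketch below computes an expected per-case misclassification cost from confusion-matrix counts. The cost values and scenario are hypothetical placeholders, not estimates drawn from the included studies or the cited literature.

```python
def expected_cost(tp, fp, tn, fn, cost_fp, cost_fn):
    """Average misclassification cost per case, given confusion-matrix
    counts and per-error costs (correct classifications assumed cost-free)."""
    n = tp + fp + tn + fn
    return (fp * cost_fp + fn * cost_fn) / n

# Hypothetical scenario: a missed cancer (false negative) is far costlier
# than an unnecessary work-up (false positive).
COST_FP, COST_FN = 100, 5000
model_a = expected_cost(80, 10, 100, 10, COST_FP, COST_FN)  # higher accuracy
model_b = expected_cost(88, 30, 80, 2, COST_FP, COST_FN)    # higher recall
# model_b trades extra false positives for fewer false negatives and can
# have a lower expected cost despite lower overall accuracy.
```

Under these assumed costs, the higher-recall model is the economically preferable one, which is precisely the kind of trade-off that accuracy-only reporting obscures.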

The analysis of AI model data must be transparent, as improper handling of missing data, outliers, participant imbalance, or dimensionality reduction can introduce biases into the model. For instance, internal biases and errors during model training may lead to misclassifications, potentially resulting in biased clinical decision-making if the model is implemented in real-world settings (26). Therefore, it is essential that studies of AI models ensure transparency throughout this process by sharing the data, code, and any other materials needed for replication (22).
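As a minimal example of leakage-safe, transparent preprocessing, the sketch below fits a median imputer on the training split only and then applies those learned medians unchanged to new data, so that no information from the test set leaks into model development. The helper names are hypothetical, and this is a generic illustration rather than the procedure of any included study.

```python
from statistics import median

def fit_imputer(train_rows):
    """Learn per-column medians from the TRAINING split only."""
    cols = list(zip(*train_rows))
    return [median(v for v in col if v is not None) for col in cols]

def transform(rows, medians):
    """Fill missing values (None) with the training medians; applying
    the same fitted medians to validation/test data avoids leakage."""
    return [[m if v is None else v for v, m in zip(row, medians)]
            for row in rows]

# Fit on training data, then reuse the same medians for new cases.
medians = fit_imputer([[1, None], [3, 4], [5, 6]])
imputed_test = transform([[None, 2]], medians)
```

Publishing this kind of preprocessing code alongside the model is exactly the transparency step that allows reviewers to verify how missing data and outliers were handled.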

The main strength of our study lies in the detailed assessment of reporting quality using the CREMLS checklist and TRIPOD+AI checklist, and the risk of bias evaluation with PROBAST. However, an important limitation of our study is that it is not a systematic review encompassing all available evidence from multiple databases, resulting in limited representativeness.

5 Conclusion

Our study, using CREMLS and TRIPOD+AI, identified that AI models in oncology exhibit reporting limitations, particularly in the methodology sections related to predictors and the analysis plan. These findings align with an unclear risk of bias in one out of every four included studies, as indicated by PROBAST in the Predictors and Analysis domains. Additionally, our study outlines the specific areas where reporting is deficient, providing researchers with guidance to improve reporting quality in these sections and, consequently, reduce the risk of bias in their studies.

Statements

Author contributions

AS: Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Project administration, Supervision, Validation, Writing – review & editing. DV-Z: Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft. CMR-R: Data curation, Investigation, Writing – review & editing. SE-A: Conceptualization, Data curation, Investigation, Methodology, Visualization, Writing – review & editing. JF: Conceptualization, Data curation, Formal Analysis, Funding acquisition, Investigation, Methodology, Resources, Software, Supervision, Validation, Writing – original draft, Writing – review & editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This study has been in part supported by the contract HT9425-24-1-0264-P00001 from the Congressionally Directed Medical Research Program.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fonc.2025.1555247/full#supplementary-material

Supplementary Material 1

Search strategy.

Supplementary Table 1

Characteristics of the included studies.

Supplementary Table 2

List of records included.

Abbreviations

AI, Artificial Intelligence; ML, Machine Learning; TRIPOD+AI, Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis – Artificial Intelligence extension; CREMLS, Checklist for REporting Machine Learning Studies; PROBAST, Prediction Model Risk of Bias Assessment Tool.

References

  • 1

    Crosby D, Bhatia S, Brindle KM, Coussens LM, Dive C, Emberton M, et al. Early detection of cancer. Science. (2022) 375:eaay9040. doi: 10.1126/science.aay9040

  • 2

    Maiter A, Salehi M, Swift AJ, Alabed S. How should studies using AI be reported? Lessons from a systematic review in cardiac MRI. Front Radiol. (2023) 3:1112841. doi: 10.3389/fradi.2023.1112841

  • 3

    Maulana FI, Adi PDP, Lestari D, Ariyanto S, Purnomo A. The scientific progress and prospects of artificial intelligence for cancer detection: A bibliometric analysis. In: 2023 International Seminar on Intelligent Technology and Its Applications (ISITIA). (2023). pp. 638–42. doi: 10.1109/ISITIA59021.2023.10221162

  • 4

    Jayakumar S, Sounderajah V, Normahani P, Harling L, Markar SR, Ashrafian H, et al. Quality assessment standards in artificial intelligence diagnostic accuracy systematic reviews: a meta-research study. NPJ Digit Med. (2022) 5:11. doi: 10.1038/s41746-021-00544-y

  • 5

    Gurumurthy G, Gurumurthy J, Gurumurthy S. Machine learning in paediatric haematological malignancies: a systematic review of prognosis, toxicity and treatment response models. Pediatr Res. (2024). doi: 10.1038/s41390-024-03494-9

  • 6

    Moharrami M, Azimian Zavareh P, Watson E, Singhal S, Johnson AEW, Hosni A, et al. Prognosing post-treatment outcomes of head and neck cancer using structured data and machine learning: A systematic review. PloS One. (2024) 19:e0307531. doi: 10.1371/journal.pone.0307531

  • 7

    El Emam K, Leung TI, Malin B, Klement W, Eysenbach G. Consolidated reporting guidelines for prognostic and diagnostic machine learning models (CREMLS). J Med Internet Res. (2024) 26:e52508. doi: 10.2196/52508

  • 8

    Collins GS, Moons KGM, Dhiman P, Riley RD, Beam AL, Van Calster B, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. (2024) 385:e078378. doi: 10.1136/bmj-2023-078378

  • 9

    Simera I, Moher D, Hirst A, Hoey J, Schulz KF, Altman DG. Transparent and accurate reporting increases reliability, utility, and impact of your research: reporting guidelines and the EQUATOR Network. BMC Med. (2010) 8:24. doi: 10.1186/1741-7015-8-24

  • 10

    Wolff RF, Moons KGM, Riley RD, Whiting PF, Westwood M, Collins GS, et al. PROBAST: A tool to assess the risk of bias and applicability of prediction model studies. Ann Intern Med. (2019) 170:51–8. doi: 10.7326/M18-1376

  • 11

    Ouzzani M, Hammady H, Fedorowicz Z, Elmagarmid A. Rayyan: a web and mobile app for systematic reviews. Systematic Rev. (2016) 5. doi: 10.1186/s13643-016-0384-4

  • 12

    Andaur Navarro CL, Damen JAA, Takada T, Nijman SWJ, Dhiman P, Ma J, et al. Systematic review finds “spin” practices and poor reporting standards in studies on machine learning-based prediction models. J Clin Epidemiol. (2023) 158:99–110. doi: 10.1016/j.jclinepi.2023.03.024

  • 13

    Andaur Navarro CL, Damen JAA, Takada T, Nijman SWJ, Dhiman P, Ma J, et al. Risk of bias in studies on prediction models developed using supervised machine learning techniques: systematic review. BMJ. (2021) 375:n2281. doi: 10.1136/bmj.n2281

  • 14

    Cai Y, Cai Y-Q, Tang L-Y, Wang Y-H, Gong M, Jing T-C, et al. Artificial intelligence in the risk prediction models of cardiovascular disease and development of an independent validation screening tool: a systematic review. BMC Med. (2024) 22:56. doi: 10.1186/s12916-024-03273-7

  • 15

    Chen Z, Liu X, Yang Q, Wang Y-J, Miao K, Gong Z, et al. Evaluation of risk of bias in neuroimaging-based artificial intelligence models for psychiatric diagnosis: A systematic review. JAMA Netw Open. (2023) 6:e231671. doi: 10.1001/jamanetworkopen.2023.1671

  • 16

    Kapoor DU, Saini PK, Sharma N, Singh A, Prajapati BG, Elossaily GM, et al. AI illuminates paths in oral cancer: transformative insights, diagnostic precision, and personalized strategies. EXCLI J. (2024) 23:1091–116. doi: 10.17179/excli2024-7253

  • 17

    Arribas J, Antonelli G, Frazzoni L, Fuccio L, Ebigbo A, van der Sommen F, et al. Standalone performance of artificial intelligence for upper GI neoplasia: a meta-analysis. Gut. (2020) 70:1458–68. doi: 10.1136/gutjnl-2020-321922

  • 18

    Corti C, Cobanaj M, Marian F, Dee EC, Lloyd MR, Marcu S, et al. Artificial intelligence for prediction of treatment outcomes in breast cancer: Systematic review of design, reporting standards, and bias. Cancer Treat Rev. (2022) 108:102410. doi: 10.1016/j.ctrv.2022.102410

  • 19

    Dhiman P, Ma J, Navarro CA, Speich B, Bullock G, Damen JA, et al. Reporting of prognostic clinical prediction models based on machine learning methods in oncology needs to be improved. J Clin Epidemiol. (2021) 138:60–72. doi: 10.1016/j.jclinepi.2021.06.024

  • 20

    Kanan M, Alharbi H, Alotaibi N, Almasuood L, Aljoaid S, Alharbi T, et al. AI-driven models for diagnosing and predicting outcomes in lung cancer: A systematic review and meta-analysis. Cancers (Basel). (2024) 16:674. doi: 10.3390/cancers16030674

  • 21

    Kumar Y, Gupta S, Singla R, Hu Y-C. A systematic review of artificial intelligence techniques in cancer prediction and diagnosis. Arch Comput Methods Eng. (2022) 29:2043–70. doi: 10.1007/s11831-021-09648-w

  • 22

    American Society of Clinical Oncology. Principles for the responsible use of artificial intelligence in oncology (2024). Available online at: https://society.asco.org/sites/new-www.asco.org/files/ASCO-AI-Principles-2024.pdf (Accessed October 4, 2024).

  • 23

    Smiley A, Reategui-Rivera CM, Villarreal-Zegarra D, Escobar-Agreda S, Finkelstein J. Exploring artificial intelligence biases in predictive models for cancer diagnosis. Cancers (Basel). (2025) 17:407. doi: 10.3390/cancers17030407

  • 24

    Wolff J, Pauling J, Keck A, Baumbach J. The economic impact of artificial intelligence in health care: systematic review. J Med Internet Res. (2020) 22:e16866. doi: 10.2196/16866

  • 25

    Kacew AJ, Strohbehn GW, Saulsberry L, Laiteerapong N, Cipriani NA, Kather JN, et al. Artificial intelligence can cut costs while maintaining accuracy in colorectal cancer genotyping. Front Oncol. (2021) 11:630953. doi: 10.3389/fonc.2021.630953

  • 26

    Corti C, Cobanaj M, Dee EC, Criscitiello C, Tolaney SM, Celi LA, et al. Artificial intelligence in cancer research and precision medicine: Applications, limitations and priorities to drive transformation in the delivery of equitable and unbiased care. Cancer Treat Rev. (2023) 112:102498. doi: 10.1016/j.ctrv.2022.102498

Keywords

cancer, artificial intelligence, diagnosis, prognosis, therapy

Citation

Smiley A, Villarreal-Zegarra D, Reategui-Rivera CM, Escobar-Agreda S and Finkelstein J (2025) Methodological and reporting quality of machine learning studies on cancer diagnosis, treatment, and prognosis. Front. Oncol. 15:1555247. doi: 10.3389/fonc.2025.1555247

Received

03 January 2025

Accepted

18 March 2025

Published

14 April 2025

Volume

15 - 2025

Edited by

Sharon R. Pine, University of Colorado Anschutz Medical Campus, United States

Reviewed by

Devesh U. Kapoor, Gujarat Technological University, India

Miaomiao Yang, Yantai Yuhuangding Hospital, China

Copyright

*Correspondence: Joseph Finkelstein,
