Methodological quality of radiomic-based prognostic studies in gastric cancer: a cross-sectional study

Background: Machine learning radiomics models are increasingly being used to predict gastric cancer prognosis. However, the methodological quality of these models has not been evaluated. Therefore, this study aimed to evaluate the methodological quality of radiomics studies predicting the prognosis of gastric cancer and to summarize their methodological characteristics and performance. Methods: The PubMed and Embase databases were searched for radiomics studies predicting the prognosis of gastric cancer published in the last 5 years. The characteristics of the studies and the performance of the models were extracted from the eligible full texts. The methodological quality, reporting completeness, and risk of bias of the included studies were evaluated using the RQS, TRIPOD, and PROBAST, respectively. The discrimination ability scores of the models were also compared. Results: Out of 283 identified records, 22 studies met the inclusion criteria. The study endpoints included survival time, treatment response, and recurrence, with reported discrimination ranging between 0.610 and 0.878 in the validation datasets. The mean overall RQS value was 15.32 ± 3.20 (range: 9 to 21). The mean number of adhered items on the 35-item TRIPOD checklist was 20.45 ± 1.83. PROBAST showed that all included studies were at high risk of bias. Conclusion: The current methodological quality of gastric cancer radiomics studies is insufficient. Prospective, multicenter, and rigorously designed studies with large and adequately justified samples are required to improve the quality of radiomics models for gastric cancer prediction. Study registration: This protocol was prospectively registered in the Open Science Framework Registry (https://osf.io/ja52b).


Introduction
Gastric cancer (GC) is the fifth most common cancer and the fourth most common cause of cancer death worldwide (1). Systemic chemotherapy, radiotherapy, surgery, immunotherapy, and targeted therapy have all been shown to be viable treatment options for GC (2)(3)(4)(5). However, due to the heterogeneous nature of GC and its high rate of recurrence and metastasis, current diagnostic techniques and treatment modalities for GC are not yet satisfactory. Current standard treatment strategies often lead to over-treatment with unnecessary toxicity, or to under-treatment in cases of tumor progression. Therefore, there is an urgent need to develop tools that can clarify the treatment response and prognosis of GC patients before surgery.
Radiomics involves the extraction of quantitative metrics (radiomics features) from medical images. These data can be used on their own or combined with demographic, histological, genomic, or proteomic data to build models that address clinical problems (6). Its main workflow (Figure 1) includes data acquisition and curation, region-of-interest segmentation, feature extraction, analysis, and model creation (7). Radiomics is increasingly being used to predict clinical outcomes, particularly in GC (8). However, although numerous studies have evaluated the accuracy of radiomics models in predicting treatment response in GC, the methodological quality of these studies has not been evaluated.
Several tools have been developed to assess the methodological quality of radiomics studies, including the Radiomics Quality Score (RQS) (9), the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) checklist (10), and the Prediction Risk Of Bias Assessment Tool (PROBAST) (11). The RQS is a standardized assessment tool commonly used to evaluate the scientific integrity and clinical relevance of radiomics oncology studies (12,13). The TRIPOD tool consists of a checklist designed to evaluate the transparency and completeness of predictive modeling research reports. This tool has been used to evaluate the reporting of numerous oncology radiomics studies (14,15). PROBAST was developed to assess the risk of bias and thereby provide a comprehensive evaluation of the methodological quality of primary studies that report predictive model development, validation, or updating (11,16).
Therefore, this cross-sectional study of the literature aimed to use the RQS, TRIPOD and PROBAST to assess the methodological quality of prognostic radiomics studies related to GC.

Eligibility criteria
This study was conducted following the PRISMA guidelines (Supplementary Material 1) (17). Due to the rapid advancement in machine learning and radiomics in recent years, only peer-reviewed studies published in the last 5 years (between September 2017 and September 2022) were included in this study. Furthermore, only studies evaluating the prognosis of primary GC in humans based on radiomics features, either handcrafted or extracted by deep learning, from clinical images, including computed tomography (CT), magnetic resonance (MR), and positron emission tomography/computed tomography (PET/CT), were included.
Radiomics studies used for diagnostic purposes or to evaluate the degree of differentiation within the tumor were excluded. Studies using models based on non-radiomics features (e.g., standardized uptake values (SUV), clinical parameters, dosimetric parameters, and gene expression data) and those that did not predict prognosis directly were also excluded. In addition, case reports, systematic reviews, conference abstracts, editorials, and expert opinion papers were excluded.

Search methods
The initial literature search was performed using the PubMed and Embase electronic databases on 11 September 2022. Since radiomics studies do not involve randomized controlled trials (RCTs), the Cochrane Central database was not searched. Medical Subject Headings (MeSH) and Emtree terms related to GC, radiomics, artificial intelligence, deep learning, and prognosis were used to perform the search. The search strategy is described in more detail in Supplementary Material 2.

Selection process
Two researchers (T.J. and Z.Z.) searched the PubMed and Embase databases to identify relevant articles. The titles and abstracts of the identified studies were screened independently by the two researchers to confirm eligibility. Any disagreements in the selection of the studies were resolved via discussion until a consensus was reached. A third researcher (X.L.) was consulted if no consensus was reached. The full texts of the eligible studies were then obtained through an institutional journal subscription and examined independently by the same two researchers (T.J. and Z.Z.) for eligibility based on the criteria described above. The articles that met all the eligibility criteria were included for data extraction and methodological evaluation.

Data extraction
Data extraction from the included publications was performed independently by two researchers (T.J. and Z.Z.). The extracted information comprised general information and methodological characteristics of the studies, including author, year, research design (prospective or retrospective), the number of collaborating institutes, outcome measures, sample size, the radiomics feature extraction method employed (deep learning or handcrafted), the number of features retained in the final model, any additional non-radiomics features used for model development, the performance metrics used to assess the model, and the calibration results (if provided).

Analysis of the methodological quality
Two researchers (T.J. and Z.Z.) evaluated the methodological quality independently using the RQS, TRIPOD, and PROBAST. Any disagreements were resolved by consulting a third researcher (X.L.).
The RQS proposed by Lambin et al. (9) is based on the steps used to construct a radiomics model and consists of 16 items across 6 domains. The RQS ranges from -8 to 36. The TRIPOD checklist (10) can be used alongside the RQS to assess the reporting completeness of the included studies (18). This tool consists of 22 main criteria with 37 items. Items 21 and 22 were not evaluated in this study because they assess the supplementary and funding information, respectively. Based on the TRIPOD criteria, the prediction models were classified as development only (type 1a), development and validation using resampling (type 1b), random split-sample validation (type 2a), non-random split-sample validation (type 2b), validation using separate data (type 3), or validation only (type 4). To assess the risk of bias and applicability of the included studies, PROBAST (16) was employed, which includes 20 signaling questions distributed among 4 domains (participants, predictors, outcome, and analysis).

Statistical analysis
The RQS for each item and the total RQS were presented as mean ± standard deviation (SD). When an item obtained a score of at least 1, it was described as basic adherence. The basic adherence rate was calculated as the percentage of studies with basic adherence. When the score obtained for an item was higher than the average score, it was considered an ideal score. The percentage of ideal scores was defined as the number of studies obtaining an ideal score divided by the total number of studies. The basic adherence rate for TRIPOD was calculated using the same method. TRIPOD item 5c (if done) and validation items 10c, 10e, 12, 13c, 17, and 19a were excluded from the calculation of the overall adherence rate. The results of PROBAST were summarized as percentages and presented in a visual plot. Signaling question 4.9, "Do predictors and their assigned weights in the final model correspond to the results from the reported multivariable analysis?", was not included as it only applies to regression-based studies. The analyses were conducted using R version 4.2.1.
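The basic adherence rate and the mean ± SD summary described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' R code; the function name `basic_adherence_rate` and all scores are hypothetical, and the use of the sample SD is an assumption, since the text does not say which SD formula was applied.

```python
from statistics import mean, stdev

def basic_adherence_rate(item_scores):
    """Percentage of studies that scored at least 1 on a single item."""
    adherent = sum(1 for score in item_scores if score >= 1)
    return 100.0 * adherent / len(item_scores)

# Hypothetical scores of five studies on one item (not study data).
scores = [0, 1, 2, 0, 1]
rate = basic_adherence_rate(scores)

# Hypothetical total scores summarized as mean ± SD.
# stdev() is the sample SD; this choice is an assumption.
totals = [12, 15, 18, 14, 16]
summary = f"{mean(totals):.2f} ± {stdev(totals):.2f}"
```

Here three of the five hypothetical studies reach a score of at least 1, so the rate is 60%.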

Literature search results
Figure 2 illustrates the PRISMA process used to conduct the study. The initial online database search yielded 305 records, of which 205 were retrieved from PubMed and the rest from Embase. After removing the 22 duplicates, 283 studies remained for further screening. The screening of the titles and abstracts identified 28 eligible studies. Six of these studies were excluded after evaluating the full text, and a total of 22 studies were finally included in this study.
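The record counts in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported in the text; the variable names are illustrative, and the two derived counts (records from Embase and records excluded at title/abstract screening) are implied by the reported figures rather than stated by the authors.

```python
# Counts reported in the text.
identified = 305
from_pubmed = 205
duplicates = 22
eligible_after_screening = 28
excluded_full_text = 6

# Derived counts (implied by the reported figures, not stated directly).
from_embase = identified - from_pubmed
screened = identified - duplicates
excluded_at_screening = screened - eligible_after_screening
included = eligible_after_screening - excluded_full_text

# The flow is internally consistent with the reported totals.
assert screened == 283
assert included == 22
```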

Basic and methodological characteristics of the included studies
The basic and methodological characteristics of all included studies are summarized in Table 1. All included studies were retrospective. Only 8 were multi-institutional, of which 6 included patients from 2 different institutions and 2 included patients from 3 different institutions. Notably, almost all of the studies were conducted by researchers in China. Some studies did not mention the specific histological type of gastric cancer, while others (5/22) targeted gastric adenocarcinoma. The sample size of the included studies ranged from 30 to 2320.
The treatments involved were divided into two types: surgery and medication. Surgery included partial or total gastrectomy with or without D2 lymphadenectomy. Medication included neoadjuvant or adjuvant chemotherapy with specific regimens such as SOX (S-1 plus oxaliplatin), XELOX (capecitabine plus oxaliplatin), and FOLFIRI (folinic acid, fluorouracil, and irinotecan)/FOLFOX (folinic acid, fluorouracil, and oxaliplatin). One study investigated the impact of PD-1 inhibitors on the prognosis of gastric cancer (31).
Different study endpoints were reported in the studies. These were broadly divided into prognosis, treatment response, and other. Prognosis was reported in 18 studies and was artificially classified as poor or good (37) based on overall survival (OS), progression-free survival (PFS), or disease-free survival (DFS)/recurrence-free survival (RFS). The pathological treatment response was reported in 3 studies. This category included tumor regression grade (TRG) after neoadjuvant chemotherapy, complete response (CR), partial response (PR), stable disease (SD), and progressive disease (PD). The other category included lymphovascular invasion (LVI) (19), early recurrence (40), and peritoneal recurrence (25). The model by Liang et al. was used to predict both prognosis and treatment response (31).
The discriminatory performance of the prognostic prediction models was assessed on the training and validation datasets using either the concordance index (C-index) or the area under the curve (AUC). For the training cohort, the C-index ranged from 0.654 (25) to 0.880 (37), and the AUC ranged from 0.722 (30) to 0.965 (26). For the validation cohort, the C-index ranged between 0.610 (25) and 0.810 (20), and the AUC ranged between 0.744 (32) and 0.878 (33).
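To make the discrimination metric concrete, the following is a minimal sketch of Harrell's concordance index for right-censored survival data. It is an illustration only, not the implementation used by any of the included studies; the function name and the toy cohort are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs that are ranked correctly.

    A pair is comparable when the patient with the shorter follow-up
    time experienced the event. The pair is concordant when that patient
    also has the higher predicted risk; tied risks count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical toy cohort: follow-up (months), event flag, predicted risk.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(times, events, risks)
```

In this toy cohort, 4 of the 5 comparable pairs are ranked correctly, giving a C-index of 0.8. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking; for a binary outcome without censoring, the same pairwise idea underlies the AUC.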

Assessment of the methodological quality of the studies based on RQS
Based on the steps involved in constructing a radiomics model, the RQS assesses the quality of radiomics studies across 16 items in 6 key domains. These 6 domains are protocol quality and stability in image and segmentation, feature selection and validation, model performance, biologic/clinical validation and utility, high level of evidence, and open science and data (details in Supplementary Table S1). The overall mean RQS value was 15.32 ± 3.20 (range 9 to 21), which is 42.55% of the maximum score of 36. Of the 6 domains, domain 5 had the lowest score, at 0. Domain 2 achieved the highest mean ideal score (72.16%) of the six domains. Table 2 shows the basic adherence rate to the 16 RQS criteria for the 6 domains. The overall basic adherence rate for the RQS was 59%.
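The conversion of an RQS total into a percentage of the maximum score can be sketched as below. The function name `rqs_percent` is hypothetical. Note that the rounded mean of 15.32 yields about 42.56%, so the reported 42.55% was presumably computed from the unrounded mean; that explanation is an assumption.

```python
RQS_MIN, RQS_MAX = -8, 36  # score range stated in the Methods

def rqs_percent(score):
    """Express an RQS total as a percentage of the maximum score of 36."""
    if not RQS_MIN <= score <= RQS_MAX:
        raise ValueError("score outside the RQS range")
    return 100.0 * score / RQS_MAX

mean_pct = rqs_percent(15.32)  # reported mean RQS
low_pct = rqs_percent(9)       # lowest reported total
high_pct = rqs_percent(21)     # highest reported total
```

On this scale, the reported range of 9 to 21 corresponds to 25% to roughly 58% of the maximum score.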

Analysis of reporting completeness based on the TRIPOD checklist
To increase the transparency of research reports on predictive modeling, the TRIPOD statement provides a checklist covering 5 areas: title and abstract, introduction, methods, results, and discussion. The reporting completeness of the included studies according to the TRIPOD checklist is summarized in Table 3 and Supplementary Table S2. After excluding both the "if done" item 5c and the validation items 10c, 10e, 12, 13c, 17, and 19a from the numerator and denominator, the mean number of adhered items on the 35-item TRIPOD checklist was 20.45 ± 1.83, and the adherence rate was 73.05% ± 6.53%. Figure 3 shows the AUC/C-index and RQS reported by the included studies classified by TRIPOD. The different TRIPOD classifications are illustrated using different colors. The studies with higher RQS values had a better TRIPOD classification [usually 2a (19,32) or 3 (22,23,36,38,39)]. Furthermore, these studies also had a higher AUC or C-index, ranging from 0.760 (36) to 0.892 (22).
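The relation between the mean number of adhered items and the adherence rate can be illustrated as follows. The sketch assumes that the denominator is the 35 assessed items minus the 7 excluded items, that is, 28; this is an inference from the reported figures rather than something the text states outright. Because the inputs are the rounded values, the results differ from the reported 73.05% ± 6.53% in the second decimal place.

```python
ITEMS_ASSESSED = 35  # 37 TRIPOD items minus items 21 and 22
EXCLUDED = ["5c", "10c", "10e", "12", "13c", "17", "19a"]

# Assumed denominator after removing the excluded items.
denominator = ITEMS_ASSESSED - len(EXCLUDED)

mean_items, sd_items = 20.45, 1.83  # reported mean ± SD of adhered items
rate = 100.0 * mean_items / denominator
rate_sd = 100.0 * sd_items / denominator
```

Under this assumption the denominator is 28 items, and the computed rate and SD come out at about 73.0% and 6.5%.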
FIGURE 3 The AUC/C-index and RQS reported by the included studies classified by TRIPOD.
In the PROBAST assessment, all studies had at least one high-risk domain, with the participants and analysis domains being the most frequent. Therefore, all studies were ultimately rated as high risk of bias. The four domains and the overall risk of bias of the included studies are visualized in Figure 4.
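The step from domain-level ratings to an overall judgement can be sketched with the commonly described PROBAST rule: low overall only if every domain is low, high if any domain is high, and unclear otherwise. Whether the authors applied exactly this rule is an assumption; the function name and the example ratings are hypothetical.

```python
DOMAINS = ("participants", "predictors", "outcome", "analysis")
RATINGS = ("low", "high", "unclear")

def overall_risk_of_bias(ratings):
    """Combine the four PROBAST domain ratings into an overall rating."""
    values = [ratings[domain] for domain in DOMAINS]
    if any(value not in RATINGS for value in values):
        raise ValueError("ratings must be 'low', 'high', or 'unclear'")
    if "high" in values:
        return "high"      # one high-risk domain is enough
    if "unclear" in values:
        return "unclear"   # no high domain, at least one unclear
    return "low"           # all four domains are low

# Hypothetical study with a single high-risk domain.
example = {
    "participants": "low",
    "predictors": "unclear",
    "outcome": "low",
    "analysis": "high",
}
overall = overall_risk_of_bias(example)
```

Under this rule, a single high-risk domain is sufficient for an overall high rating, which is why every included study ends up at high risk of bias.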

Discussion
This study aimed to assess the methodological characteristics and quality of radiomics studies predicting the prognosis of patients with GC published in the last 5 years, using the RQS, TRIPOD, and PROBAST. All included studies were retrospective, which may have introduced inaccuracies in prognosis-focused follow-up. Furthermore, the included studies mainly focused on the prognosis of GC patients after gastrectomy, with only a few studies evaluating the prognosis after adjuvant chemotherapy, neoadjuvant chemotherapy, or PD-1 inhibitor therapy. Additionally, the sample size of most studies was insufficient for building stable predictive models, and the sample size was not reasonably estimated in advance. Clinical factors were incorporated in almost all currently available radiomics prognostic models for GC, with some models also incorporating genetic factors (26,32) or immunohistochemistry (31). The integration of radiomics with clinical and genetic features has been shown to improve the predictive performance of prognostic models (41). However, most studies did not perform external validation, potentially limiting the generalizability of the models. The lack of standardized practices for analyzing radiomics models, limited data sharing between institutions, and the lack of automated segmentation are currently limiting the adoption of these models in prospective clinical studies (42). Thus, further prospective multicenter studies with larger and adequately powered samples are necessary to improve the quality and generalizability of prognostic radiomics models for GC.
Upon evaluating the radiomics prognostic prediction models for GC using the RQS, our study revealed a lack of scientific quality in the current models, particularly in domain 1, domain 5, and domain 6. Notably, none of the included studies conducted a phantom study on all scanners, despite previous research showing that the variability of radiomic feature values calculated on CT images from different CT scanners can be comparable to the variability of these features found in CT images of other tumors (43). Consequently, future radiomic studies in gastric cancer should consider and minimize the impact of differences between scanners. Additionally, none of the included studies met the high level of evidence criteria, as all were retrospective and lacked cost-effectiveness analyses. Our analysis also showed low scores in biologic correlations (9.09%) and open science/data (14.77%), which are similar to the limitations observed in radiomics models used for other purposes (44)(45)(46).
Upon assessment of reporting completeness using the TRIPOD checklist, the included studies showed poor basic adherence rates, particularly for items such as blinding of outcome assessment, explaining how the sample size was arrived at, handling of missing data, and presenting the full prediction model. These results are consistent with previous reviews of radiomics and oncology studies that also used TRIPOD (45)(46)(47). Therefore, there is a pressing need to address these aspects to ensure that the reporting of prognostic GC radiomics prediction models is more transparent, complete, and standardized. It should be noted, however, that the current TRIPOD checklist is mainly focused on regression-based predictive modeling approaches, limiting its applicability to artificial intelligence and machine learning research, which typically does not rely on regression analysis. To address this limitation, a new version of the TRIPOD statement for machine learning is currently in development (48).
The evaluation of the included studies using PROBAST revealed that all of them were at high risk of bias. Contributing factors included a failure to use blinding when obtaining predictors, a lack of reasonable sample size estimation in advance, and improper handling of participants with missing data. Most radiomics studies examining other diseases were also found to be at high risk of bias. A systematic review of radiomic prognostic prediction models for breast cancer showed that 95.7% of the included studies were at high risk of bias (49). Similarly, a systematic review of radiomic prognostic prediction studies for non-small cell lung cancer found that all of the included studies were at high risk of bias (50). Furthermore, these reviews also identified the participants and analysis domains as the primary sources of bias.
This study has some limitations that have to be acknowledged. Due to the small number of existing studies and the wide range of mathematical tools used to assess the performance of the models, it was not possible to perform a quantitative meta-analysis. In addition, several items on the RQS and TRIPOD tools could not be assessed as they do not apply to prognostic radiomics models. It is also important to acknowledge that some items on the RQS are over-idealistic and difficult to achieve in practice (51). Furthermore, the TRIPOD checklist was designed to facilitate the reporting of radiomics studies and not to assess their methodological quality (52). Finally, although we did our best to use objective criteria, independent raters, and consensus discussions to evaluate the methodological quality of the radiomics studies, there may still be some unavoidable bias in our evaluation. We searched for studies worldwide and found that most were published from China, which may introduce geographical bias and limit the generalizability of our findings.
Our findings indicate that the current methodological quality of radiomics studies for prognosis prediction in GC is insufficient. Therefore, prospective, multicenter, and rigorously designed studies with larger and adequately justified sample sizes are required to improve the generalizability of the models. Future radiomics studies should also include phantom studies on the scanners, more biological correlations, and open science/data.
FIGURE 4 Risk of bias of included studies.

FIGURE 1 "Flowchart of application of AI in radiology for GI cancers.", by Azadeh Tabari, licensed under CC BY 4.0.

FIGURE 2 Flowchart of the literature search and study selection (PRISMA 2020).

TABLE 1
Basic and methodological characteristics of the included studies.

TABLE 2
Radiomics quality score according to the six key domains.

TABLE 3
Adherence to individual TRIPOD items in radiomics studies.
Item: number of adhering studies, n (%)
10d. Statistical analysis methods: specify all measures used to assess model performance and, if relevant, to compare multiple models (discrimination and calibration): 20 (90.91)
11. Risk groups: provide details on how risk groups were created, if done (yes or no, n = 14): 14 (63.64)

TABLE 4
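PROBAST signaling questions in 22 included studies.
Signaling question 4.9, "Do predictors and their assigned weights in the final model correspond to the results from the reported multivariable analysis?", was not included as it applies to regression-based studies. PROBAST, prediction risk of bias assessment tool; NA, not applicable to external validation.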