The Diagnostic Performance of Machine Learning-Based Radiomics of DCE-MRI in Predicting Axillary Lymph Node Metastasis in Breast Cancer: A Meta-Analysis

Objective The aim of this study was to perform a meta-analysis to evaluate the diagnostic performance of machine learning (ML)-based radiomics of dynamic contrast-enhanced (DCE) magnetic resonance imaging (MRI) in predicting axillary lymph node metastasis (ALNM) and sentinel lymph node metastasis (SLNM) in breast cancer. Methods English and Chinese databases were searched for original studies. The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) and Radiomics Quality Score (RQS) were used to assess the methodological quality of the included studies. The pooled sensitivity, specificity, diagnostic odds ratio (DOR), and area under the curve (AUC) were used to summarize the diagnostic accuracy. Spearman's correlation coefficient and subgroup analysis were used to investigate the cause of the heterogeneity. Results Thirteen studies (1618 participants) were included in this meta-analysis. The pooled sensitivity, specificity, DOR, and AUC with 95% confidence intervals were 0.82 (0.75, 0.87), 0.83 (0.74, 0.89), 21.56 (10.60, 43.85), and 0.89 (0.86, 0.91), respectively. The meta-analysis showed significant heterogeneity among the included studies. There was no threshold effect in the test. The subgroup analysis showed that ML, 3.0 T, an area of interest comprising the ALN, manual segmentation, and the ALN and combined sentinel lymph node (SLN) and ALN groups slightly improved diagnostic performance compared with deep learning, 1.5 T, an area of interest comprising the breast tumor, semiautomatic segmentation, and the SLN group, respectively. Conclusions ML-based radiomics of DCE-MRI has the potential to predict ALNM and SLNM accurately. Heterogeneity among the included studies of ALNM and SLNM diagnosis is a major limitation.


INTRODUCTION
Axillary lymph node metastasis (ALNM) is common in breast cancer patients and determines the clinical stage, treatment plan, surgical procedure and patient outcome (1,2). Currently, the axillary lymph node (ALN) status of patients with breast cancer is diagnosed by sentinel lymph node biopsy (SLNB) and axillary lymph node dissection (ALND). However, these procedures are not risk-free and can potentially lead to implantation metastasis (3). Therefore, it is essential to explore a noninvasive approach for assessing ALNM to reduce the incidence of surgical complications and improve the patient's quality of life.
Dynamic contrast-enhanced (DCE) magnetic resonance imaging (MRI) has generally been well accepted and routinely used for breast cancer staging (4,5). For predicting ALNM, previous studies of DCE-MRI have primarily focused on node size, cortical thickness, disappearance of lymph parenchyma, and enhancement patterns (6). Unfortunately, early diagnosis of ALNM through DCE-MRI is not yet ideal since it is limited by subjective factors, such as the radiologist's experience and knowledge level. Additionally, subtle changes, such as cell density, morphology, and microtissue structure, in ALNM might not be apparent to the naked eye (7,8).
In recent years, radiomics and machine learning (ML) models have become increasingly popular for analyzing diagnostic images (9,10). The ability of radiomics analysis to extract a large number of quantitative features from images gives it excellent potential for evaluating ALNM in breast cancer patients (11-15).
However, because of the small sample sizes of previous studies, statistical research has been limited, and research results have also varied from study to study. Thus, it is necessary to perform a meta-analysis to further evaluate the diagnostic performance of ML-based radiomics of DCE-MRI in predicting ALNM and SLNM in breast cancer.

MATERIALS AND METHODS
We conducted and reported this meta-analysis based on the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines (16).

Literature Search
The PubMed, Embase, Web of Science, and Cochrane Library databases and four Chinese databases [VIP, CNKI, Wanfang and Chinese BioMedical Literature Databases (CBM)] were searched independently by two observers to identify studies. The search was performed on June 23, 2021, without a start date limit. The study search was conducted using the following keywords: "magnetic resonance imaging", "MRI", "MRI scans", "breast cancer", "breast carcinoma", "metastasis", "machine learning", "radiomics" and "lymph node". MeSH terms and variations of each term were used. Moreover, we restricted the studies to those published in English or Chinese and performed a manual search of the related articles' reference lists to identify other articles that might meet the inclusion criteria. EndNote software, version X9, was used to manage all records. Disagreements were discussed and resolved to reach a consensus.

Study Selection
The titles and abstracts of potentially relevant studies were screened for appropriateness by two reviewers (Z-J and Z-L). Inconsistencies were discussed by the reviewers, and consensus was reached.
Studies were selected according to the following criteria: (a) they were original research studies; (b) they enrolled patients with breast cancer whose ALNM or SLNM was confirmed by biopsy or histopathology; (c) they applied ML-based radiomics of DCE-MRI to classify ALNM or SLNM; and (d) they provided sufficient data to reconstruct the 2×2 contingency table for estimating the sensitivity and specificity of the diagnosis.
Studies were excluded if they were: (a) reviews, editorials, abstracts, animal studies, or conference presentations; or (b) multiple reports published for the same population (in this case, the publication with the most details was included in this meta-analysis).

Data Extraction
Relevant data were extracted from each study, including the first author, publication year, sample size, magnetic field strength, information about the radiomics and ML pipeline, data sources and reference standards, detailed information on lesion segmentation, contrast agents, and DCE phases. For each study, the true-positive (TP), false-positive (FP), false-negative (FN), and true-negative (TN) values were extracted, and a 2×2 contingency table was created.
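When a study reports only sensitivity, specificity, and group sizes, the 2×2 table can be back-calculated. The following is a minimal sketch of one way to do this, not necessarily the procedure used here; the function name and the example counts are hypothetical.

```python
def reconstruct_2x2(sensitivity, specificity, n_positive, n_negative):
    """Back-calculate (TP, FP, FN, TN) from reported accuracy figures.

    Assumes the reference standard classified n_positive patients as
    metastatic and n_negative as non-metastatic. Counts are rounded to
    the nearest integer, so published rounding can shift a cell by one.
    """
    tp = round(sensitivity * n_positive)
    fn = n_positive - tp
    tn = round(specificity * n_negative)
    fp = n_negative - tn
    return tp, fp, fn, tn


# Hypothetical study: 40 node-positive and 60 node-negative patients.
tp, fp, fn, tn = reconstruct_2x2(0.80, 0.90, n_positive=40, n_negative=60)
print(tp, fp, fn, tn)
```

Because the cells are rounded, it is worth checking that the reconstructed table reproduces the reported sensitivity and specificity to the published precision.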

Data Quality Assessment
The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) and Radiomics Quality Score (RQS) were used to assess the methodological quality of the included studies and the risk of bias at the study level, respectively (17,18). RQS items comprise: (a) image acquisition; (b) radiomics feature extraction; (c) data modeling; (d) model validation; and (e) data sharing. Each of the 16 items (Table 1) of the RQS is rated, giving a total score ranging from −8 to 36, with −8 defined as 0% and 36 defined as 100% (18).
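The conversion from a total RQS to a percentage can be sketched as below. This assumes the commonly used convention in which totals at or below zero count as 0% and positive totals are divided by the maximum of 36; the function name is hypothetical. Under this assumption a total of 15 points corresponds to about 41.7%.

```python
def rqs_percent(score, max_score=36, min_score=-8):
    """Convert a total RQS into a percentage of the maximum score.

    Assumption: totals <= 0 are reported as 0%, and positive totals
    are scaled by max_score (36 points = 100%).
    """
    if not min_score <= score <= max_score:
        raise ValueError("RQS total must lie between -8 and 36")
    return max(score, 0) / max_score * 100


for total in (-8, 5, 15, 36):
    print(total, round(rqs_percent(total), 1))
```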
The QUADAS-2 tool consists of: (a) patient selection; (b) index test; (c) reference standard; and (d) flow and timing.
Two independent reviewers (L-LC and Z-L) conducted the quality assessment, and disagreements were discussed with a third reviewer (T-M) to reach a consensus.

Statistical Analysis
This meta-analysis was conducted via Stata software, version 16.0, Review Manager software, version 5.3, and the Open Metaanalyst software tool. The predictive accuracy was quantified using pooled sensitivity, specificity, diagnostic odds ratio (DOR), positive likelihood ratio (PLR) and negative likelihood ratio (NLR) with 95% confidence intervals (CIs). The summary receiver operating characteristic curve (SROC) and area under the curve (AUC) were used to summarize the diagnostic accuracy.
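For a single study, the accuracy measures listed above follow directly from the 2×2 table. The sketch below illustrates the per-study definitions only; it is not the pooling model used by the software packages named above. The 0.5 continuity correction for zero cells is a common convention adopted here as an assumption, and the example counts are hypothetical.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Per-study sensitivity, specificity, PLR, NLR and DOR."""
    # Assumption: add 0.5 to every cell if any cell is zero, so that
    # the ratios below stay finite.
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    plr = sensitivity / (1 - specificity)   # positive likelihood ratio
    nlr = (1 - sensitivity) / specificity   # negative likelihood ratio
    dor = (tp * tn) / (fp * fn)             # equals PLR / NLR
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "PLR": plr,
        "NLR": nlr,
        "DOR": dor,
    }


# Hypothetical table: TP=32, FP=6, FN=8, TN=54.
for name, value in diagnostic_metrics(32, 6, 8, 54).items():
    print(f"{name}: {value:.3f}")
```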
Cochran's Q and I² were calculated to estimate the heterogeneity among the studies included in this meta-analysis. I² values of 0 to 25%, 25 to 50%, 50 to 75% and >75% represent very low, low, medium and high heterogeneity, respectively. Studies were pooled and effect sizes were estimated using a random-effects model, which accounts for heterogeneity by estimating the distribution of true effects across studies (19). If there was obvious heterogeneity, Spearman's correlation coefficient between the logit of sensitivity and the logit of specificity was used to assess the threshold effect. Subgroup analysis was performed to further investigate the cause of the heterogeneity, using study-level covariates that could contribute to it. In addition, sensitivity analysis was performed by eliminating the included studies one by one. Deeks' funnel plot asymmetry test, based on the effective sample size, was used to estimate publication bias (20).
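The relationship between Q and I² can be illustrated with a short sketch. It uses the standard inverse-variance definition of Cochran's Q and I² = max(0, (Q − df)/Q); the function names, the handling of values that fall exactly on a category boundary, and the example numbers are assumptions made for illustration.

```python
def cochran_q(effects, variances):
    """Cochran's Q for study effects with inverse-variance weights."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))


def i_squared(q, n_studies):
    """I² (%) from Q and the number of studies, truncated at zero."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100


def heterogeneity_label(i2):
    """Map I² (%) to the categories used in the text."""
    if i2 <= 25:
        return "very low"
    if i2 <= 50:
        return "low"
    if i2 <= 75:
        return "medium"
    return "high"


# Hypothetical example: Q = 60 across 13 studies.
i2 = i_squared(60.0, 13)
print(round(i2, 1), heterogeneity_label(i2))
```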

Clinical Utility
A Fagan plot was used to assess clinical utility; it gives the posttest probability (P post) of ALNM for a given pretest probability (P pre, suspicion of ALNM) (21).
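The calculation behind a Fagan plot is Bayes' theorem in odds form: posttest odds equal pretest odds multiplied by the likelihood ratio. A minimal sketch follows; the function name and the example values are hypothetical and chosen only for illustration.

```python
def posttest_probability(pretest_prob, likelihood_ratio):
    """Posttest probability from a pretest probability and an LR."""
    if not 0 < pretest_prob < 1:
        raise ValueError("pretest probability must be strictly between 0 and 1")
    if likelihood_ratio < 0:
        raise ValueError("likelihood ratio must be non-negative")
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)


# Hypothetical example: pretest probability 30%, PLR 6, NLR 0.2.
print(f"positive test: {posttest_probability(0.30, 6.0):.2f}")
print(f"negative test: {posttest_probability(0.30, 0.2):.2f}")
```

Rounded likelihood ratios give slightly different posttest probabilities from those computed with unrounded pooled estimates.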

Literature Search
The complete literature search flowchart is presented in Figure 1.

Data Quality Assessment
The 13 studies achieved a mean RQS of 11.38, a median of 13, and a range of 5 to 15. The mean RQS proportion was 13.9%, with a maximum of 41.7%. Table 1 summarizes the mean scores for each dimension, and Table S1 (Supplementary Materials) shows the total RQS and the individual item scores for each study. None of the included articles employed prospective validation, and only one study evaluated the cost-effectiveness of radiomics (25). No studies publicly shared segmentation, functionality, or code. Generally, the data quality was considered acceptable, and the details of the risk of bias and applicability concerns of the included studies are presented in Figure 2.

Exploration of Heterogeneity
There was significant heterogeneity in sensitivity (I² = 80.6%) and specificity (I² = 89.57%). As shown in Figure 4, the diagnostic threshold analysis showed no threshold effect: Spearman's correlation coefficient was 0.181, and the P value was 0.553.
Subgroup analysis was also performed by comparing studies with different variables. Table 4 shows the results of the subgroup analysis.
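A threshold-effect check of this kind can be sketched in plain Python as follows. The sketch uses the common convention of correlating logit(sensitivity) with logit(1 − specificity), which is an assumption here; correlating with logit(specificity) instead flips the sign of the coefficient. The function names and study values are hypothetical, and no P value is computed.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def average_ranks(values):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def threshold_spearman(sensitivities, specificities):
    """Spearman's rho between logit(sens) and logit(1 - spec)."""
    # logit is monotone, so the ranks match those of the raw values;
    # the transform is kept to mirror the description in the text.
    x = [logit(s) for s in sensitivities]
    y = [logit(1 - s) for s in specificities]
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical studies in which sensitivity rises as specificity falls.
rho = threshold_spearman([0.70, 0.80, 0.90], [0.90, 0.80, 0.70])
print(round(rho, 3))
```

A strong positive coefficient under this convention would suggest a threshold effect; a weak, non-significant one would not.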

Sensitivity Analyses
There were no significant changes when the included studies were eliminated one by one. The results of the sensitivity analyses for each study are shown in Table S3 (Supplementary Materials).

Clinical Utility
Using an ML-based radiomics DCE-MRI model would increase the posttest probability from 20% to 54% with a PLR of 5 when the test result was positive and would reduce the posttest probability to 5% with an NLR of 0.22 when the test result was negative (Figure 6).

DISCUSSION
In our meta-analysis, radiomics DCE-MRI showed promising results for ALNM characterization, with a pooled sensitivity, specificity, and AUC of 0.82, 0.83, and 0.89, respectively. This finding indicates that this approach could be considered an effective and accurate tool for ALNM and SLNM prediction.
In the present study, we found obvious heterogeneity between the studies. Heterogeneity can be caused by many factors, e.g., threshold effect, different magnetic field strengths, segmentation, etc. In this meta-analysis, the threshold effect was not the source of heterogeneity because Spearman's coefficient was not significant. Therefore, subgroup analysis was used to determine the source of heterogeneity. Our results demonstrated that studies using 3.0 T MR had better diagnostic performance than studies using 1.5 T MR. This result is not surprising: since high magnetic fields can improve image resolution, they can help to improve diagnostic accuracy. Another subgroup analysis showed that studies employing ML performed slightly better than those employing deep learning. Deep learning has greater potential for very large datasets with thousands or even millions of instances. In this setting, datasets usually consist of hundreds of patients at most, for which conventional ML is better suited than deep learning. Similar findings have been previously reported for ML in other applications (9,10,30). However, only two of the included studies employed deep learning, so future studies employing deep learning are needed to confirm this conclusion. ROIs including the ALN area showed better diagnostic performance than ROIs including the breast tumor area. While an ROI of the ALN is useful to evaluate ALN status, it suffers from some limitations: the breast surface coil is mainly concentrated on the breast area, so some positive lymph nodes might be located at the edge of the coil, and some might not even be in the imaging range (31). Studies that focus on the breast tumor itself could help to avoid these limitations. Studies with SLNB or ALND as the gold standard had sensitivity and specificity equivalent to those of the ALND group. The reason may be that, for patients with a negative SLN, SLNB is an effective and accurate approach.
The sensitivity for predicting SLNM was lower than that for predicting ALNM or both kinds of LNs combined. Therefore, for SLNM, the diagnostic performance of this imaging tool might not be satisfactory, as concluded in this meta-analysis. Further studies should investigate how to improve the sensitivity for SLNM. Although studies in which ROIs are manually drawn by radiologists might be more prone to error and user variability, their predictions were still good compared with those of studies using semiautomatic segmentation. However, manual segmentation is time consuming, tedious, and prone to error. In the future, it would be ideal to develop a reliable and validated automatic method. Our results showed that the logistic regression (LR) algorithm had a higher DOR than the support vector machine (SVM). Generally, LR and SVM algorithms are both suitable for model construction with small sample sizes and binary variables. However, for ML-based DCE-MRI radiomics in predicting ALNM, our meta-analysis results favor the LR algorithm. We also found that studies using Siemens MR equipment had higher diagnostic performance than those using GE equipment. This suggests that different MR equipment may affect diagnostic performance. Therefore, prospective studies comparing the two types of MR equipment are necessary to explore the diagnostic performance of ML-based DCE-MRI radiomics in predicting ALNM and SLNM. In addition, different DCE phases and cross-validation with different numbers of folds could lead to unknown biases. Moreover, other unmentioned differences between studies might contribute to the heterogeneity.
A previous meta-analysis (32) including 3 studies of DCE-MRI (n=187) reported that the mean sensitivity and specificity were 0.88 and 0.73, respectively. Another study (6) included 7 studies using DCE-MRI and reported that the median sensitivity was 0.60 (range 0.333-0.97) (31). Our findings showed higher sensitivity than studies that included DCE-MRI. Conventional DCE-MRI assessment includes only morphology and a few quantitative parameters. However, radiomics could provide many new quantitative imaging markers and help to characterize heterogeneous tumor lesions (33). This method could provide more valuable information to help radiologists improve detection, diagnosis, staging, and prediction power.

Limitations
All of the methodological issues followed the Cochrane handbook (34), but there are still some limitations that must be discussed. First, a relatively small number of studies met the selection criteria. The second limitation was the significant heterogeneity, which is an issue similar to that in other meta-analyses of diagnostic accuracy using ML based on radiomics (9,10,30).
Furthermore, study characteristics, such as different ROIs, DCE phases, and reference standards, could lead to heterogeneity. Therefore, we employed subgroup analysis to explore its sources.
Third, while there were some uncertainties in the QUADAS-2 assessment, the overall quality of the studies was sufficient for analysis. Thus, this uncertain risk might not have had a significant impact on the outcomes.
Fourth, 3 of the 13 studies had an RQS below 20%. The mean RQS obtained by analyzing the articles reviewed in this study was 11.1 (30.1%), indicating moderate overall quality. The most important points were the type of study, biological relevance tests and discussion, validation, comparison with the gold standard, potential clinical utility, economic analysis and open scientific data (Table 1 and Table S1). Fifth, in most studies, the lymph nodes assessed by MR were not specifically matched with histological findings in a node-to-node manner, which is a difficult problem to solve in clinical practice. It is also inevitable that very small lesions may be missed on DCE-MRI. Sixth, some studies used SLNB as the reference standard, which may have introduced some false negatives. Finally, in this meta-analysis, the PLR, NLR and posttest probability were moderate, which would limit the recommendation of their integration into clinical practice.

Future
To improve the clinical applicability of future studies utilizing ML-based radiomics for ALNM, several points should be addressed.
First, external validation is usually not performed, which should be seen as a major limitation in this field of study. Therefore, it is advisable to verify the accuracy of these models further. When reporting ML-based radiomics, it is crucial to follow quality guidelines that include external validation. Second, future studies should also consider expanding datasets from multiple centers to overcome imbalances caused by oversampling small samples and to improve classifier performance. Third, variation in the radiomics workflow might introduce bias. There are significant variations in the number of features selected, the risk of overfitting and redundancy, and the preprocessing steps (such as manual segmentation), which reduce reproducibility. In addition, the different DCE phases should be considered. Therefore, it is necessary to build better standards for radiomics and ML studies, covering image acquisition, segmentation, feature engineering, statistical analysis and report format, to achieve reproducibility and facilitate radiomics research (35). Finally, ALNM and SLNM prediction models should be constructed by combining MR radiomics with DCE quantitative parameters and clinical characteristics to achieve more precise predictions and improve clinical utility.

CONCLUSION
Our results indicated that ML-based DCE-MRI radiomics shows good diagnostic performance in predicting ALNM and SLNM in breast cancer, with high sensitivity and specificity. Nevertheless, due to the heterogeneity of the included studies, caution should be taken when applying the results.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.