
ORIGINAL RESEARCH article

Front. Endocrinol., 26 January 2026

Sec. Thyroid Endocrinology

Volume 16 - 2025 | https://doi.org/10.3389/fendo.2025.1711029

From data to decision: an interpretable machine learning model for optimizing RAI therapy in Graves’ hyperthyroidism

Lu Lu, Xiaojuan Wei, Yan Chen, Dongyun Meng, Shaozhou Mo, Zeyong Sun, Fengyang Song, Kehua Liao*, Wentan Huang*
  • Department of Nuclear Medicine, People’s Hospital of Guangxi Zhuang Autonomous Region, Nanning, Guangxi Zhuang Autonomous Region, China

Objective: Radioactive iodine (RAI) therapy is a cornerstone treatment for Graves’ hyperthyroidism (GH), yet failure rates remain significant due to the complexity of individual patient responses. Traditional fixed-dose or simple calculated-dose methods often fail to account for non-linear interactions among clinical features.

Methods: We retrospectively analyzed data from 1,292 GH patients who received initial RAI therapy between June 2018 and July 2024. Comprehensive pre-treatment clinical, laboratory, and imaging data, including age, gender, FT4, 3-hour radioactive iodine uptake (RAIU 3h), thyroid weight, and thyrotropin receptor antibodies (TRAb), were collected. Stepwise regression with the Akaike Information Criterion (AIC) was employed for feature selection, identifying nine optimal predictors. Six machine learning algorithms were compared, with performance evaluated using AUC, Brier score, and Decision Curve Analysis (DCA). SHapley Additive exPlanations (SHAP) analysis provided model interpretability.

Results: The final cohort, comprising 1,292 patients (61.3% female, median age 37 years), achieved a 75.8% remission rate. Nine significant variables were identified as optimal predictors: gender, age, history of antithyroid drug use, disease course over 2 years, total iodine dose (TID), free thyroxine (FT4), RAIU 3h, thyroid weight, and TRAb. Among the algorithms tested, the Random Forest (RF) model demonstrated superior performance, achieving an AUC of 0.950 on the independent test set and a Brier score of 0.067, indicating excellent discrimination and calibration. SHAP analysis confirmed RAIU 3h, FT4, age, and thyroid weight as the most influential features, providing clinical transparency.

Conclusion: The developed interpretable machine learning framework offers a precise, personalized tool for predicting RAI outcomes, potentially guiding optimized dosing strategies to reduce treatment failure.

1 Introduction

Graves’ disease (GD) is a common autoimmune thyroid disorder primarily causing hyperthyroidism, known as GD hyperthyroidism (GH) (1). This condition results from antibodies stimulating thyroid-stimulating hormone (TSH) receptors, leading to excess thyroid hormone production (2). GH is the most prevalent form of hyperthyroidism, characterized by increased nervous, circulatory, digestive, and metabolic activity (3). In China, its incidence ranges from 1.1% to 1.6% (4). Currently, hyperthyroidism is primarily managed through three clinical approaches: antithyroid drugs (ATD), radioactive iodine therapy (RAI), and surgical thyroidectomy. Given the invasive nature of thyroidectomy, it is associated with potential postoperative complications, including hypothyroidism, recurrent laryngeal nerve injury, and hypocalcemia (5), which limits its use. RAI is a widely accepted, non-invasive treatment in Western countries like the United States due to its high cure rates, low recurrence, short treatment duration, simplicity, safety, minimal side effects, and low cost (6). It is often the first choice or a key alternative for GH patients, especially those with adverse reactions to ATD, poor drug efficacy, frequent relapses, long disease duration, surgical contraindications or risks, liver damage, leukopenia, thrombocytopenia, atrial fibrillation, or periodic skeletal muscle paralysis (5).

RAI for GH has been used for over 70 years, but determining the optimal thyroid absorption dose remains a challenge. Traditional dosing often involves a fixed range of 185–555 MBq (5–15 mCi), which overlooks individual patient differences, potentially leading to suboptimal outcomes or side effects. The dosing formula used is [Z × thyroid size (g) × 100]/24-hour iodine uptake rate (RAIU), where Z is the planned Bq or μCi per gram of thyroid tissue, ranging from 3.7 to 7.4 MBq (100-200 μCi) (7). Accurate thyroid size measurement is essential, but there is debate over whether current methods, like microdose RAIU, accurately reflect high-dose I-131 dynamics and gland radiation sensitivity (5, 8). Clinically, the calculated dosage results are often considered in conjunction with the patient’s disease condition to determine the final dosage. For example, in cases of prolonged illness, a hard thyroid gland, or patients who have not recovered after initial treatment, an appropriate increase in dosage may be warranted, while for patients with a short illness duration or those who have relapsed after surgery, a reduction in dosage may be appropriate.
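As a worked illustration of the calculated-dose formula above, the following minimal Python sketch evaluates it for one hypothetical patient. The function name and the example inputs (Z = 100 μCi/g, a 40 g gland, a 24-hour RAIU of 60%) are assumptions chosen for illustration and are not values taken from this study.

```python
MBQ_PER_MCI = 37.0  # unit conversion: 1 mCi = 37 MBq


def calculated_dose_uci(z_uci_per_g: float, thyroid_g: float, raiu_24h_pct: float) -> float:
    """Dose (uCi) = [Z x thyroid size (g) x 100] / 24-hour RAIU (%)."""
    if not 0 < raiu_24h_pct <= 100:
        raise ValueError("24-hour RAIU must be a percentage in (0, 100]")
    return z_uci_per_g * thyroid_g * 100.0 / raiu_24h_pct


# Hypothetical patient: Z = 100 uCi/g, 40 g thyroid, 60% uptake at 24 hours.
dose_uci = calculated_dose_uci(100.0, 40.0, 60.0)
dose_mci = dose_uci / 1000.0
dose_mbq = dose_mci * MBQ_PER_MCI
print(f"{dose_uci:.0f} uCi = {dose_mci:.2f} mCi = {dose_mbq:.0f} MBq")
```

For these inputs the result is roughly 6.7 mCi (about 247 MBq), which lies inside the 185–555 MBq fixed-dose range quoted above.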

I-131 dosage for GH treatment is determined using either empirical fixed-dose methods or formula-based calculations. The fixed-dose approach applies the same dosage to all patients, ignoring thyroid size and iodine uptake, and lacks scientific validity (9). In contrast, formula-based methods consider these factors, offering more scientific grounding. However, they rely on accurate thyroid size and weight data, which current methods like radionuclide imaging, ultrasound, and palpation fail to provide precisely (10). Measurement uncertainty leads to inconsistent treatment outcomes, underscoring the need for further research. Personalized I-131 dosing for GH treatment is essential due to individual differences in absorption and metabolism. Customizing doses based on factors like thyroid size and iodine uptake improves effectiveness, minimizes side effects, and reduces radiation exposure. This approach prevents I-131 overuse, cutting drug waste, costs, and secondary treatments. It also shortens hospital stays and treatment cycles, easing demand on protective wards and improving resource efficiency, ultimately allowing more patients to be served. Adopting personalized I-131 dosing for GH improves treatment outcomes, safety, and cost-efficiency, while advancing precision medicine and meeting societal demands for quality healthcare. This approach promises broader future applications and can help address the shortage of protective wards in China’s nuclear medicine sector, offering both economic and social benefits.

The rapid advancement of computational technology has markedly contributed to the expanding field of research dedicated to the application of machine learning (ML) algorithms in the analysis of medical data (11–13). ML enables the processing of large-scale medical datasets, facilitating more precise analyses that enhance clinical decision-making (11). Recent studies consistently highlight the significant advantages of ML in disease prediction, diagnosis, and treatment evaluation (14–16). In a notable development, Moon et al. introduced an ML classifier named OncoNPC, which leverages multi-center targeted sequencing data from 36,445 known primary cancer samples to predict the primary cancer type in cases of cancer of unknown primary (CUP). OncoNPC initially validated the existence of shared genetic and prognostic features between CUP and known cancer types. Its classification capabilities hold the potential to inform and guide clinical decision-making processes (17). Attia et al. have developed an expedited ML methodology for the detection of atrial fibrillation in patients during sinus rhythm, employing standard 10-second 12-lead electrocardiograms. This model exhibited enhanced efficacy in identifying potential atrial fibrillation in patients with cryptogenic stroke (ESUS), outperforming traditional screening techniques such as B-type natriuretic peptide levels and the CHA2DS2-VASc score. As a result, this advancement provides an innovative, cost-effective, and non-invasive tool for atrial fibrillation screening and the management of patients with ESUS (18). While ML has shown promise in medical prognosis, few studies have compared advanced ensemble methods with classical statistical models in predicting RAI outcomes. This study aims to fill this gap by developing an interpretable ML framework, validating it against classical approaches, and identifying key predictors for non-remission.

2 Patients and methods

2.1 Study subjects

This study encompassed a cohort of 1,711 patients who received treatment for GH at the Nuclear Medicine Department of the People’s Hospital of Guangxi Zhuang Autonomous Region between June 2018 and July 2024. The diagnosis of hyperthyroidism was established in accordance with the 2016 American Thyroid Association Guidelines for the Diagnosis and Management of Hyperthyroidism and Other Causes of Thyrotoxicosis (5). All participants underwent their initial administration of RAI therapy. The inclusion criteria were defined as follows: (1) a confirmed diagnosis of GH; (2) discontinuation of ATD for at least five days; (3) initial administration of RAI; (4) commitment to consistent follow-up for one year post-treatment; and (5) availability of comprehensive diagnostic and treatment records. The exclusion criteria encompassed: (1) thyroid weight exceeding 80 grams, as determined by a 99mTcO4- thyroid SPECT scan; these patients were excluded to keep the study population homogeneous, because large goiters often necessitate surgical intervention or unique dosimetric protocols, which could introduce selection bias; (2) pregnant or lactating women; (3) individuals with a history of thyroid surgery; (4) patients unable to adhere to regular follow-up schedules; (5) patients diagnosed with granulocyte deficiency and/or liver failure; and (6) individuals with a history of malignancies.

2.2 Assessment of therapeutic efficacy

An initial assessment of therapeutic efficacy was conducted for all patients 4 to 8 weeks following RAI therapy. Subsequently, thyroid function was assessed every 4 to 8 weeks for a period of up to twelve months, or until the patient developed hypothyroidism and attained a stable condition following thyroid hormone replacement therapy. The effectiveness of RAI treatment was classified based on follow-up outcomes as follows: (1) Complete remission or clinical cure: Follow-up extending beyond six months with full resolution of hyperthyroid symptoms and normalization of serum free thyroxine (FT4) levels; (2) Hypothyroidism: Presence of hypothyroid symptoms and signs, with serum FT4 levels below the normal range and elevated thyroid-stimulating hormone (TSH) levels; (3) Partial remission: Reduction in hyperthyroid symptoms, partial resolution of signs, and decreased serum FT4 levels without normalization; (4) Ineffective: No significant improvement in, or a worsening of, hyperthyroid symptoms and signs, with no reduction in serum FT3 and FT4 concentrations. Outcomes of complete remission or clinical cure and hypothyroidism were classified as “Remission” (remission group), whereas partial remission and ineffective responses were categorized as “Non-Remission” (non-remission group).
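The grouping of the four follow-up categories into a binary outcome can be written as a small lookup. This is a minimal sketch: the category strings, the function name, and the coding of non-remission as 1 are assumptions made for illustration, since the text does not state how the label was encoded.

```python
# Assumed coding for illustration: 1 = Non-Remission, 0 = Remission.
REMISSION_OUTCOMES = {"complete remission", "hypothyroidism"}
NON_REMISSION_OUTCOMES = {"partial remission", "ineffective"}


def to_binary_label(outcome: str) -> int:
    """Collapse the four follow-up categories into the two study groups."""
    key = outcome.strip().lower()
    if key in REMISSION_OUTCOMES:
        return 0
    if key in NON_REMISSION_OUTCOMES:
        return 1
    raise ValueError(f"unknown outcome category: {outcome!r}")


follow_up = ["Hypothyroidism", "Partial remission", "Complete remission", "Ineffective"]
labels = [to_binary_label(o) for o in follow_up]
print(labels)
```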

2.3 Data collection

Demographic variables:

Age: Recorded in years at the time of initial 131I therapy.

Gender: Documented as male or female, with a code of “1” representing male and a code of “2” denoting female.

Clinical parameters:

(1) Thyroid hormones and TPOAb: These were measured by chemiluminescence on the UniCel DxI 800 Access Immunoassay System. Reference ranges were: TSH, 0.56-5.91 μIU/mL; T3, 0.92-5.91 nmol/L; T4, 69.71-163.95 nmol/L; FT3, 3.53-7.37 pmol/L; FT4, 7.98-16.02 pmol/L; TPOAb, <9.0 IU/mL.

(2) TRAb: Measured using the UniCel DxI 800 Access Immunoassay System, with a reference range of 0-1.75 IU/L.

(3) Evaluation of RAIU: This study evaluated thyroid iodine uptake rates using I-131, provided by Nanning Atomic High-throughput Isotope Co., Ltd. Prior to the evaluation, patients were instructed to refrain from consuming iodine-containing foods and medications for a period of 2 to 4 weeks. On the day of the assessment, patients ingested sodium I-131, with doses ranging from 2 to 10 μCi, while in a fasting state in the morning. Following ingestion, patients continued fasting for an additional 2 hours. Radioactivity measurements of the thyroid region were subsequently conducted at 3 hours and 24 hours post-administration using the NM-6110 thyroid function measuring instrument. The effective half-life (Teff) was determined from the sequential I-131 uptake measurements. Teff is defined as the time required for the I-131 activity within the thyroid gland to decrease to 50% of its initial value, accounting for the combined effects of physical decay and biological clearance.

(5) Thyroid weight: After intravenous injection of 99mTcO4- (2-5 mCi), thyroid imaging was performed 15–20 minutes later. The patient was positioned supine with a pillow under the shoulder and neck to hyperextend the neck and fully expose the thyroid. Images were collected using the Discovery NM/CT 670, equipped with a low-energy general collimator, a matrix size of 256×256, an energy peak of 140 keV, a window width of ±10%, and a collection count of 300k. The region of interest (ROI) was delineated in the blue-purple interface of the thyroid color image using Xeleris post-processing software to obtain the thyroid area, height, and weight.

Treatment-Related factors:

History of ATD Therapy: The variable “ATD” represents the history of ATD usage, where a code of “0” signifies no prior use and a code of “1” indicates a positive history of use.

Administered 131I Dosage: This refers to the prescribed dose of 131I in millicuries (mCi), as well as the iodine dose per gram of thyroid tissue (IDPG) measured in megabecquerels per gram (MBq/g). The variable “IDPG” categorizes the iodine dose per gram of thyroid tissue, with a code of “1” denoting small doses (70-90 μCi/g) and a code of “2” indicating large doses (91-120 μCi/g).

Course of Disease: This refers to the duration of GH prior to 131I therapy. The variable “Disease_course” is defined by the length of the illness, with a code of “0” indicating a duration of two years or less, and a code of “1” denoting a duration exceeding two years.

2.4 RAI treatment dose

The procedure and its associated precautions were comprehensively communicated to all patients, with particular emphasis placed on the necessity of adhering to a low-iodine diet and avoiding medications containing iodide for a duration of 7 to 14 days prior to treatment. Furthermore, ATD were required to be discontinued at least one week before the administration of 131I therapy.

Our hospital employs a calculated dosage method to determine the I-131 treatment dose, administered using a fully automated 131I dispensing machine, based on the formula:

I-131 treatment dose (μCi) = [Dose per gram of thyroid tissue (μCi/g) × Thyroid weight (g)] / 24h thyroid uptake rate of I-131 (%)

According to their clinical condition, three expert nuclear medicine physicians prescribed the IDPG for each patient, generally between 70-120 μCi/g.

2.5 Feature selection

To ensure data quality, we utilized the MissForest algorithm to impute minor missing values (missing rate< 5%). This non-parametric imputation method, based on RF, facilitates the estimation of missing values in mixed-type data (19). Continuous variables were retained in their original form to preserve information granularity, addressing limitations associated with categorical conversion. To ascertain the most salient predictive factors and mitigate data dimensionality, we utilized a stepwise regression analysis employing a bidirectional elimination strategy, integrating both forward selection and backward elimination techniques. This method iteratively refines the model by incorporating variables that substantially enhance model fit while discarding those deemed statistically insignificant or redundant. The selection process was directed by the minimization of the Akaike Information Criterion (AIC), which optimizes the trade-off between model fit and complexity. The algorithm commenced with an initial model and persisted in the iterative procedure until the AIC score attained its minimum, indicating no further potential for improvement.
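The bidirectional search described above can be sketched generically. The code below is a minimal illustration, not the authors' implementation: it assumes the search starts from the empty model, and `toy_aic` is a hypothetical stand-in for fitting a logistic regression on a variable subset and returning AIC = 2k − 2 ln(L). The variable names and log-likelihood gains are invented for the example.

```python
from typing import Callable, FrozenSet, Iterable


def stepwise_aic(candidates: Iterable[str],
                 aic: Callable[[FrozenSet[str]], float]) -> FrozenSet[str]:
    """Bidirectional stepwise selection: at each step, try every single-variable
    addition and removal, take the move with the lowest AIC, and stop when no
    move lowers the AIC further."""
    pool = frozenset(candidates)
    current: FrozenSet[str] = frozenset()  # assumption: start from the empty model
    best = aic(current)
    while True:
        moves = [current | {v} for v in pool - current]   # forward steps
        moves += [current - {v} for v in current]         # backward steps
        if not moves:
            return current
        trial = min(moves, key=aic)
        trial_aic = aic(trial)
        if trial_aic >= best:
            return current
        current, best = trial, trial_aic


# Hypothetical log-likelihood gain contributed by each variable (toy numbers).
TOY_GAIN = {"age": 5.0, "FT4": 3.0, "noise": 0.2}


def toy_aic(subset: FrozenSet[str]) -> float:
    k = len(subset) + 1                                   # +1 for the intercept
    log_lik = -100.0 + sum(TOY_GAIN[v] for v in subset)
    return 2 * k - 2 * log_lik


selected = stepwise_aic(TOY_GAIN, toy_aic)
print(sorted(selected))
```

In this toy setting a variable is retained only if it raises the log-likelihood by more than 1, which is exactly the penalty AIC charges per extra parameter, so the uninformative variable is left out.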

2.6 Model development, evaluation, and interpretation

In this study, we investigated six distinct ML algorithms to ascertain the optimal classifier for the dataset: XGBoost Classifier (XGB), Logistic Regression (LR), LightGBM Classifier (LGBM), Random Forest Classifier (RF), AdaBoost Classifier, and Decision Tree Classifier (DT). To enhance model performance and mitigate overfitting, a 5-fold cross-validation approach was implemented during the training phase. Hyperparameter optimization was performed using GridSearchCV to identify the optimal parameter configurations for each algorithm. The primary metric for model evaluation was the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC). Additional evaluation metrics included accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), F1-score, and Cohen’s Kappa. The optimal threshold value was determined through ROC curve analysis. The DeLong test was utilized to statistically compare the AUCs across different models.
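To make the primary metric concrete, the sketch below computes the AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, and picks a cut-off by maximizing Youden's J (sensitivity + specificity − 1). The use of Youden's J is an assumption, since the text says only that the threshold came from ROC analysis, and the labels and scores are toy values.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    tied pairs count one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1,
    predicting positive when score >= threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy data: 1 marks the positive class.
y = [0, 0, 0, 0, 1, 0, 1, 1]
s = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
auc = roc_auc(y, s)
threshold, j_stat = youden_threshold(y, s)
print(f"AUC = {auc:.3f}, threshold = {threshold}, J = {j_stat:.2f}")
```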

We constructed a SHAP (SHapley Additive exPlanations) summary plot, also known as a beeswarm plot, to illustrate the overall significance and directional impact of the features. In this visualization, features are ordered according to the sum of absolute SHAP values across all samples. Each point represents an individual sample, with color coding indicating the feature value (red for high values, blue for low values), and the position along the x-axis reflecting the feature’s impact on the model’s output (where positive SHAP values suggest an increased probability of the positive class, i.e., Label=1). Additionally, SHAP dependence plots were employed to assess the marginal effects of specific features, such as FT4 and TRAb, on the predicted outcomes, facilitating the identification of potential non-linear relationships and interaction effects among variables. To further illustrate the model’s clinical applicability in individual cases, we utilized SHAP force plots, or waterfall plots, to analyze specific samples from the test set. These plots decompose the prediction for an individual patient, demonstrating how each feature contributes to either elevating (positive force) or reducing (negative force) the prediction relative to the baseline. The overall study workflow is summarized in Figure 1.
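The additive decomposition behind these plots can be illustrated with an exact Shapley computation on a tiny model. This is a sketch under stated assumptions: the three-feature scoring function, its coefficients, the patient values, and the single reference ("baseline") patient are all hypothetical, and features outside a coalition are simply set to the baseline value. SHAP libraries use faster, model-specific algorithms for tree ensembles rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    names = list(x)
    n = len(names)

    def value(coalition):
        mixed = {f: (x[f] if f in coalition else baseline[f]) for f in names}
        return model(mixed)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {f}) - value(members))
        phi[f] = total
    return phi


def toy_model(p):
    # Hypothetical score with one interaction term; not the study's Random Forest.
    return 0.2 + 0.004 * p["raiu_3h"] - 0.003 * p["ft4"] + 0.00005 * p["raiu_3h"] * p["thyroid_g"]


patient = {"raiu_3h": 75.0, "ft4": 40.0, "thyroid_g": 60.0}
reference = {"raiu_3h": 50.0, "ft4": 50.0, "thyroid_g": 40.0}
phi = shapley_values(toy_model, patient, reference)
base_value = toy_model(reference)
prediction = toy_model(patient)
print(phi, base_value, prediction)
```

The contributions sum to the difference between the patient's prediction and the base value, which is the property a force or waterfall plot visualizes.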

Figure 1
Flowchart illustrating a six-step process: (1) Patient Enrollment of 1,711 patients from 2018-2024. (2) Data Collection includes clinical and biochemical data. (3) Feature Selection through stepwise regression. (4) Model Development with six machine learning classifiers. (5) Model Validation using ROC AUC curve and SHAP analysis plot. (6) Clinical Application for personalized prediction and treatment guidance.

Figure 1. Study overview.

2.7 Statistical analysis

Statistical analyses and data processing were executed using R software (version 4.2.3) and Python (version 3.11.4). Continuous variables were first assessed for normality; those following a normal distribution were expressed as mean ± standard deviation (SD) and compared using the independent samples t-test. Conversely, non-normally distributed data were presented as the median with interquartile range [M (P25, P75)], with group comparisons conducted via the Mann-Whitney U test. Categorical data were summarized as frequencies and percentages [n (%)], and differences between groups were analyzed using the Chi-square test or Fisher’s exact test. All statistical tests were two-sided, and a P-value of less than 0.05 was considered statistically significant.

3 Results

3.1 Patient characteristics

In this study, an initial cohort of 1,711 patients was enrolled. However, 419 patients were subsequently excluded due to various factors: thyroid weight exceeding 80 grams (312 patients), non-compliance with follow-up protocols (71 patients), a history of malignancies (24 patients), and prior thyroid surgery (12 patients). Consequently, the final cohort consisted of 1,292 patients, who were stratified into a training cohort and a testing cohort in a 7:3 ratio (Figure 2). Within the training cohort, the median age was 37.00 years (interquartile range [IQR]: 29.00-43.00), with 369 males (40.82%) and 535 females (59.18%). Prior to receiving RAI, the median FT3 level was 22.09 pmol/L (IQR: 15.02-30.69), and the median FT4 level was 53.95 pmol/L (IQR: 40.82-63.69). In the testing cohort, the median age was 37.00 years (IQR: 30.00-43.00), comprising 150 males (38.66%) and 238 females (61.34%). The median FT3 level prior to RAI was 21.75 pmol/L (IQR: 14.24-30.13), and the median FT4 level was 53.19 pmol/L (IQR: 38.27-62.02). In the training cohort, 75.77% of patients (685 out of 904) achieved a cure, whereas 24.23% (219 out of 904) did not. Similarly, in the testing cohort, 74.74% of patients (290 out of 388) were cured, while 25.26% (98 out of 388) were not. Further information is available in Table 1.
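The cohort sizes follow from a 7:3 partition of the 1,292 eligible patients. The sketch below shows one simple way to produce such a partition by shuffling patient indices; the seed, the function name, and the plain (non-stratified) shuffle are assumptions for illustration, as the exact splitting procedure is not described in the text.

```python
import random


def split_indices(n, train_fraction=0.7, seed=42):
    """Shuffle indices 0..n-1 reproducibly and cut them at the training fraction."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # seed is an arbitrary illustrative choice
    n_train = round(n * train_fraction)
    return idx[:n_train], idx[n_train:]


eligible = 1711 - (312 + 71 + 24 + 12)   # enrolled minus the four exclusion groups
train_idx, test_idx = split_indices(eligible)
print(eligible, len(train_idx), len(test_idx))
```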

Figure 2
Flowchart of patient enrollment and study grouping. Initial enrollment of 1,711 patients. 419 were excluded for reasons like thyroid weight over 80 grams, non-compliance, history of malignancy, and previous surgery. Final cohort of 1,292 was randomized in a 7:3 ratio into training (904) and testing (388) cohorts.

Figure 2. Patient enrollment flowchart.

Table 1

Table 1. Clinical baseline characteristics of the training set and testing set.

3.2 Feature selection and multivariate analysis

Based on the stepwise regression analysis using the AIC criterion, nine significant variables were identified as the optimal predictor subset from the initial candidate features. These selected features included gender, age, history of ATD usage, disease course over 2 years, TID, FT4, RAIU 3h, thyroid weight, and TRAb.

The multivariate logistic regression results for these selected features are detailed in Table 2. The analysis revealed significant associations between these clinical parameters and the outcome. Specifically, disease course over 2 years (OR = 2.315, 95% CI: 1.401 - 3.823, P = 0.001) and history of ATD usage (OR = 2.187, 95% CI: 1.415 - 3.382, P<0.001) were identified as strong independent risk factors. Additionally, age, TID, RAIU 3h, thyroid weight, and TRAb were positively associated with the outcome (all OR > 1). Conversely, gender (OR = 0.523, 95% CI: 0.349 - 0.785, P = 0.002) and FT4 (OR = 0.971, 95% CI: 0.959 - 0.983, P<0.001) demonstrated a negative association, serving as protective factors or negative predictors in the model. These nine features were subsequently utilized as the input variables for the development of ML models.
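The reported odds ratios, confidence intervals, and P-values are linked through the logistic coefficient: OR = exp(β) and the 95% CI is exp(β ± 1.96·SE). The sketch below inverts these relations for the disease-course result. It assumes Wald-type intervals and a normal reference distribution, which the text does not state, and the function name is illustrative.

```python
from math import erfc, log, sqrt


def wald_from_or(or_value, ci_low, ci_high, z_crit=1.96):
    """Recover beta, its standard error, the Wald z and the two-sided p-value
    from an odds ratio and its 95% confidence interval."""
    beta = log(or_value)
    se = (log(ci_high) - log(ci_low)) / (2 * z_crit)
    z = beta / se
    p = erfc(abs(z) / sqrt(2))   # two-sided p-value under a standard normal
    return beta, se, z, p


# Disease course over 2 years: OR = 2.315, 95% CI 1.401 to 3.823 (reported P = 0.001).
beta, se, z, p = wald_from_or(2.315, 1.401, 3.823)
print(f"beta = {beta:.3f}, SE = {se:.3f}, z = {z:.2f}, p = {p:.4f}")
```

Under these assumptions the recovered P-value rounds to the reported 0.001, so the three reported numbers are mutually consistent.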

Table 2

Table 2. Stepwise regression using forward and backward methods.

3.3 RF: The best comprehensive discrimination and calibration capabilities

To identify the optimal ML algorithm for our predictive task, we conducted a comprehensive performance comparison among six models: XGB, LR, LGBM, RF, AdaBoost, and DT (refer to Supplementary Tables S1, S2). As illustrated in Figure 3, the RF model demonstrated superior performance across multiple evaluation metrics in the validation cohort.

Figure 3
Six-part image showing performance metrics for different machine learning models. Panel A and B: ROC curves for training and validation. Panel C: Forest plot of AUC scores. Panel D: Calibration curve. Panel E: PR curve for validation. Panel F: Decision curve for validation. Models include XGBoost, logistic, LightGBM, RandomForest, AdaBoost, and DecisionTree, each represented by different colored lines.

Figure 3. Comprehensive assessment of predictive performance, calibration, and clinical utility. (A, B) ROC analysis demonstrates the discriminatory ability of the models. The Random Forest model achieved the highest AUC in both the validation (A) and training (B) sets. (C) Comparison of AUC distributions among the six models. (D) Calibration plots showing the agreement between predicted risks and actual outcomes. Models closer to the diagonal line indicate better calibration. (E) Precision-Recall curves evaluating model performance, particularly focusing on the trade-off between precision and recall. (F) Decision curves showing the net benefit of using the models across a range of threshold probabilities compared to default strategies.

In terms of discrimination ability, the RF model achieved the highest AUC. As shown in the validation ROC curves (Figure 3A), the RF model attained an AUC of 0.908 (SD = 0.039), outperforming the second-best model, XGB (AUC = 0.892, SD = 0.031), and significantly surpassing the other algorithms such as AdaBoost (AUC = 0.777) and DT (AUC = 0.755). The training set results (Figure 3B) further confirmed the robust learning capacity of the RF model, with an AUC of 0.993. The forest plot of AUC scores (Figure 3C) visually summarizes these findings, highlighting the RF model’s leading position with the highest mean AUC and stable confidence intervals.

Furthermore, the Precision-Recall (PR) curve (Figure 3E), which is particularly informative when the outcome classes are imbalanced, showed that the RF model yielded the highest Average Precision (AP) of 0.815 (SD = 0.051). This indicates that the RF model maintained high precision even at varying levels of recall, superior to XGB (AP = 0.788) and LR (AP = 0.582).

Model calibration was assessed to evaluate the agreement between predicted probabilities and observed outcomes. The calibration curve (Figure 3D) revealed that the RF model (green line) aligned most closely with the ideal diagonal line. Quantitatively, the RF model achieved the lowest Brier score of 0.101 (SD = 0.011), indicating the minimal mean squared error in probability predictions compared to XGB (0.104) and LR (0.153).
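For reference, the Brier score is the mean squared difference between each predicted probability and the observed 0/1 outcome, so lower is better. The minimal sketch below uses toy predictions, and it also computes the score of an uninformative model that always predicts the event rate, using the 219 of 904 non-cured training patients reported earlier as the assumed event rate.

```python
def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


# Toy example with four patients.
toy = brier_score([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.1])

# Always predicting the event rate r gives a Brier score of r * (1 - r).
event_rate = 219 / 904
reference = event_rate * (1 - event_rate)
print(f"toy = {toy:.4f}, constant-rate reference = {reference:.3f}")
```

A score well below the constant-rate reference indicates that the predicted probabilities carry real information about individual outcomes.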

Finally, Decision Curve Analysis (DCA) was employed to estimate the clinical utility of the models (Figure 3F). The DCA showed that the RF model provided the highest net benefit across a wide range of threshold probabilities compared to the other models and the “treat-all” or “treat-none” strategies. This suggests that, among the strategies compared, basing decisions on the RF model would yield the greatest net benefit.
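Net benefit, the quantity plotted in a decision curve, weighs true positives against false positives at a chosen threshold probability pt: NB = TP/N − (FP/N) × pt/(1 − pt). The sketch below evaluates it for a model and for the "treat-all" strategy on toy data; the function names and the five example patients are assumptions for illustration.

```python
def net_benefit(y_true, probs, threshold):
    """Net benefit of acting on patients whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit when every patient is treated as positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data; "treat-none" has a net benefit of 0 by definition.
y = [1, 1, 0, 0, 0]
p = [0.9, 0.6, 0.4, 0.2, 0.1]
nb_model = net_benefit(y, p, threshold=0.5)
nb_all = net_benefit_treat_all(y, threshold=0.5)
print(f"model = {nb_model:.2f}, treat-all = {nb_all:.2f}, treat-none = 0.00")
```

Repeating the calculation over a grid of thresholds and plotting the three curves gives the decision curve.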

In conclusion, considering discrimination, precision, calibration, and clinical utility, the RF model exhibited the best comprehensive capabilities and was selected as the final predictive model for this study.

3.4 RF model evaluation and validation

To rigorously assess the generalization ability and robustness of the selected RF model, we performed a multi-dimensional evaluation using training, validation, and independent test sets (refer to Supplementary Table S3).

First, the discrimination performance was evaluated using ROC curves across the different datasets. As shown in Figure 4A, the RF model achieved a near-perfect performance in the training set with a mean AUC of 0.995 (SD = 0.003). Crucially, this high level of discrimination was well-maintained in the validation set (Figure 4B, AUC = 0.924, SD = 0.036) and the independent test set (Figure 4C, AUC = 0.950). The consistency of high AUC scores across these datasets suggests that the model effectively learned the underlying patterns without suffering from significant overfitting.

Figure 4
The image is a set of eight different plots evaluating the performance of a model. Panels A, B, and C show ROC curves for training, validation, and test sets, respectively, indicating high sensitivity and specificity. Panel D presents a calibration plot with a reliability curve for the Random Forest model. Panel E displays a test decision curve with different treatment strategies. Panels F and G are confusion matrices with predicted vs. actual values. Panel H is a KS statistic plot for the test set, showing the performance of class 0 and class 1 with KS statistics marked.

Figure 4. Assessment of model performance, calibration, and clinical utility. (A–C) ROC curves for the (A) training, (B) validation, and (C) independent test cohorts. The Area Under the Curve (AUC) is displayed in the bottom right of each panel. (D) Calibration curve showing the relationship between predicted probabilities and observed frequencies. The dotted line represents perfect calibration. (E) Decision Curve Analysis (DCA) for the test set, illustrating the clinical net benefit of the model. (F, G) Confusion matrices summarizing the prediction results (True/False Positives and Negatives) for the (F) training and (G) test datasets. (H) KS statistic plot demonstrating the maximum separation between the two class distributions (Class 0 and Class 1).

The calibration of the model was further examined to ensure the reliability of the predicted probabilities. The calibration plot (Figure 4D) demonstrated excellent agreement between the predicted probabilities and the observed outcome frequencies, with the curve closely following the ideal 45-degree diagonal. The model achieved a low Brier score of 0.067 (95% CI: 0.053–0.081), indicating high accuracy in probability estimation.

To provide a detailed view of classification accuracy, confusion matrices were generated. Figure 4F displays the model’s performance on the training data, showing perfect classification with 674 true negatives and 230 true positives. In the test set (Figure 4G), the model continued to perform well, correctly identifying 299 true negatives and 51 true positives, with minimal false positives (n=2) and false negatives (n=36).
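The summary metrics listed in Section 2.6 follow directly from these four counts. The sketch below recomputes them from the test-set confusion matrix reported above (TN = 299, FP = 2, FN = 36, TP = 51). The function name is illustrative, and the resulting values are derived here from the counts rather than quoted from the article.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Standard classification metrics from a 2x2 confusion matrix."""
    n = tn + fp + fn + tp
    accuracy = (tp + tn) / n
    # Agreement expected by chance, used for Cohen's kappa.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "kappa": (accuracy - expected) / (1 - expected),
    }


metrics = confusion_metrics(tn=299, fp=2, fn=36, tp=51)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, specificity is about 0.99 and sensitivity about 0.59, so at this cut-off the errors are mostly missed positives rather than false alarms.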

Furthermore, the Kolmogorov-Smirnov (KS) statistic was calculated to evaluate the model’s ability to separate positive and negative samples. As illustrated in Figure 4H, the RF model yielded a high KS statistic of 0.824 at a threshold of 0.370, confirming a significant distinction between the two class distributions.
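The KS statistic is the largest vertical gap between the cumulative score distributions of the two classes, and the threshold is the score at which that gap occurs. The following minimal sketch computes both on toy scores; the data and function name are assumptions for illustration.

```python
def ks_statistic(y_true, scores):
    """Maximum gap between the empirical CDFs of scores for class 0 and class 1,
    returned together with the score at which the gap is largest."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    best_gap, best_t = 0.0, None
    for t in sorted(set(scores)):
        cdf_pos = sum(1 for s in pos if s <= t) / len(pos)
        cdf_neg = sum(1 for s in neg if s <= t) / len(neg)
        gap = abs(cdf_neg - cdf_pos)
        if gap > best_gap:
            best_gap, best_t = gap, t
    return best_gap, best_t


# Toy predicted probabilities: class 0 mostly low, class 1 mostly high.
y = [0, 0, 0, 1, 0, 1, 1]
s = [0.1, 0.2, 0.3, 0.5, 0.6, 0.7, 0.9]
ks, ks_threshold = ks_statistic(y, s)
print(f"KS = {ks:.2f} at score {ks_threshold}")
```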

Finally, the clinical utility of the model was validated using DCA on the test set (Figure 4E). The decision curve showed that the RF model (red line) provided a higher net benefit than the “treat-all” or “treat-none” strategies across a wide range of threshold probabilities, underscoring its potential value in clinical decision-making.

3.5 Model interpretability and SHAP analysis

To overcome the “black box” nature of ML algorithms and provide clinical transparency, we employed SHAP analysis to elucidate the contribution of each feature to the RF model’s predictions.

The global importance of features is illustrated in the bar plot (Figure 5B), which ranks variables based on the mean absolute SHAP values. RAIU 3h was identified as the most influential predictor, followed by FT4, age, and thyroid weight. Other variables, such as TID, TRAb, disease course over 2 years, gender, and the history of ATD usage, showed relatively lower but non-negligible contributions to the model output.

Figure 5. SHAP analysis for model explanation. (A) SHAP summary plot showing the impact of each feature on the model output. Points are colored by feature value (red for high, blue for low). (B) Bar plot of global feature importance ranked by the mean absolute SHAP value. (C–F) Individual force plots for local interpretability. The base value represents the average model output, and f(x) represents the predicted probability for the specific patient. Features in red contribute positively to the prediction, while features in blue contribute negatively. Cases shown include (C) low probability, (D, E) high probability, and (F) moderate probability.

The SHAP summary plot (Figure 5A) further visualizes the directionality and distribution of feature impacts. Each dot represents a sample, with color indicating the feature value (red for high, blue for low). For the top predictor, RAIU 3h, higher values (red dots) were predominantly associated with positive SHAP values, indicating a positive correlation with the predicted probability of the outcome. Similarly, higher age and thyroid weight generally contributed to an increased risk prediction. Conversely, for FT4, higher values (red dots) were clustered on the negative side of the SHAP axis, suggesting that elevated FT4 levels tend to lower the predicted probability, whereas lower FT4 levels contribute to a higher risk score.

To demonstrate how the model arrives at decisions for individual patients, SHAP force plots were generated for representative cases (Figures 5C–F). In these plots, red arrows indicate features that increase the prediction value, while blue arrows indicate features that decrease it.

Figure 5C shows a patient with a very low predicted probability (f(x)=0.01). The low risk score was primarily driven by the combined negative contributions (blue arrows) of RAIU 3h (58.47%), FT4 (56.29 pmol/L), and thyroid weight (34.27 g).

In contrast, Figure 5D illustrates a high-risk case (f(x)=0.74). Here, elevated RAIU 3h (75.85%) and thyroid weight (76.36 g) acted as strong positive drivers (red arrows), pushing the prediction towards a higher probability.

Figure 5E presents another high-risk patient (f(x)=0.63), where RAIU 3h (84.62%) again played a dominant role in increasing the model output. Figure 5F depicts a case with a moderate probability (f(x)=0.40). This prediction resulted from a balance between conflicting factors: while FT4 (78.26 pmol/L) and thyroid weight (62.3 g) pushed the score higher (red), RAIU 3h (78.7%) and age (29.0) exerted a downward influence (blue). These visualizations confirm that the RF model relies on clinically relevant markers, primarily RAIU 3h and FT4, to stratify patients effectively.
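
The force plots rest on the additivity property of SHAP values: the prediction for one patient equals the base value plus the sum of that patient's feature contributions. A minimal sketch of this bookkeeping follows, assuming the explanation is expressed on the same scale as the model output. All numbers and feature names are hypothetical and do not correspond to any patient shown in Figure 5.

```python
def reconstruct_prediction(base_value, contributions):
    """SHAP additivity: prediction = base value + sum of feature contributions."""
    return base_value + sum(contributions.values())


# Hypothetical base value and per-feature SHAP contributions for one patient.
# Invented numbers for illustration only; not taken from Figure 5.
base_value = 0.25
contributions = {"feature_a": 0.20, "feature_b": -0.10, "feature_c": 0.05}

fx = reconstruct_prediction(base_value, contributions)
pushes_up = [k for k, v in contributions.items() if v > 0]    # drawn in red
pushes_down = [k for k, v in contributions.items() if v < 0]  # drawn in blue
print(f"f(x) = {fx:.2f}; up: {pushes_up}; down: {pushes_down}")
```

A moderate prediction such as the one in Figure 5F arises when positive and negative contributions of similar size largely cancel.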

4 Discussion

This study developed and validated an RF model to predict the efficacy of initial RAI therapy in patients with GH, utilizing a comprehensive set of clinical, laboratory, and imaging parameters. The RF model demonstrated superior performance in discrimination, calibration, and clinical utility compared to other prominent ML algorithms, achieving an AUC of 0.950 on an independent test set. Crucially, the application of SHAP analysis provided insight into the model’s decision-making process, highlighting the most influential features and their directional impact on individual predictions, thereby addressing the “black box” challenge inherent in complex ML models (20, 21).

RAI remains a cornerstone in the management of GH due to its high cure rates, non-invasive nature, and cost-effectiveness (22). However, determining the optimal RAI dose has historically been a significant challenge, with traditional fixed-dose regimens often failing to account for individual patient variability, leading to suboptimal outcomes or side effects (23). Our study reinforces the necessity of personalized dosing by identifying key patient-specific factors influencing treatment success. The achieved high predictive accuracy of the RF model underscores the potential of ML to overcome these traditional limitations by integrating multifactorial clinical data to inform tailored treatment strategies. This approach aligns with the growing emphasis on precision medicine in nuclear medicine, where diagnosis and treatment are increasingly molecularly targeted and individualized (24).

Our study identified RAIU 3h as the most influential predictor, with higher values generally associated with an increased likelihood of the predicted outcome (remission/non-remission, depending on the context of the model output). This is highly consistent with previous research, which recognizes RAIU, especially early uptake values, as a crucial indicator of the thyroid gland’s iodine-concentrating ability and radiosensitivity (25, 26). Studies have shown that low 24-hour RAIU implies a high cure rate, whereas high 24-hour RAIU indicates a high failure rate (26). Similarly, a high percentage uptake at 24 hours after a test dose of 131I administration has been identified as an influential predictor (25). A significantly elevated early uptake (3h) often indicates rapid iodine turnover, suggesting a hyper-functioning gland where the retention time of radioiodine may be insufficient to deliver the therapeutic absorbed dose. This aligns with the kinetic theory that rapid turnover reduces the effective half-life of I-131, thereby increasing the risk of treatment failure (Non-Remission).

The study confirmed thyroid weight as a significant positive predictor, indicating that larger thyroid glands are associated with a higher predicted probability of the outcome (potentially non-remission given the context of factors influencing treatment failure). This corroborates extensive literature demonstrating that larger thyroid gland size or weight is a well-established negative predictor for successful RAI therapy (27–29). Larger glands often correlate with increased radioresistance and higher rates of treatment failure due to the difficulty in delivering a sufficient and uniform radiation dose throughout the tissue. Our SHAP analysis specifically validated thyroid weight as a significant contributor, with higher values generally increasing the risk prediction. In previous research, the incidence of hypothyroidism in patients with non-palpable goiter was higher than in those with medium or large goiter, further supporting the influence of thyroid size (30).

Our model identified FT4 as a negative predictor, meaning higher FT4 levels were associated with a lower predicted probability of successful outcome. The SHAP summary plot visually reinforced this by showing that higher FT4 values contributed negatively to the predicted probability. This aligns with previous studies indicating that higher FT4 concentrations at presentation may correlate with a poorer response to RAI therapy or a higher likelihood of treatment failure. For instance, a study noted that successfully treated GD patients had a lower FT4 at presentation (25). Additionally, a negative association has been found between FT4 levels and disease remission after therapy discontinuation. However, some studies have noted no significant correlation between plasma P-Selectin levels and serum FT4 levels, suggesting that the direct interpretation of FT4’s role can be complex and may involve interactions with other factors (31).

Elevated TRAb levels were positively associated with the outcome in our multivariate analysis, suggesting that higher TRAb titers might predict a lower success rate or higher non-remission risk. This agrees with earlier research, which has linked high TRAb titers to reduced RAI efficacy and an increased risk of treatment failure (29). High TRAb levels suggest ongoing autoimmune activity that may counteract the therapeutic effect of RAI. One study found that the mean TRAb index of the hyperthyroid group was significantly higher than that of the euthyroid group, and that the TRAb index had a significant effect on the rate of hyperthyroidism at 3 months or later (32). Another study showed that the level of TRAb in the non-remission group was higher than that in the remission group. The present study supports TRAb as an important prognostic marker, quantitatively integrated into the ML framework.

Our multivariate analysis found gender (male) to be a negative predictor, suggesting male patients may have a higher risk of non-remission compared to females. While some studies suggest no significant difference in cure rates between females and males, or efficacy not dependent on gender, others indicate varied associations (30, 33). Conversely, another study found that male gender was associated with treatment failure and was a main risk factor for early hypothyroidism (28, 29). These varied associations highlight the need for further investigation into gender-specific biological or clinical factors influencing RAI outcomes, with this study contributing to that nuanced understanding.

Age was found to be a positive predictor, implying that older age was associated with a higher predicted probability of the outcome (potentially remission). This finding is consistent with some prior research indicating that older patients may respond more favorably to RAI therapy (34). Successfully treated GD patients were younger than unsuccessfully treated ones in one study, while another found age to be a relative factor influencing RAI treatment efficacy, with older patients less likely to achieve clinical improvement (35). The effect of patient age as an influential predictor has been reported in other studies as well.

TID was positively associated with the outcome, which is broadly supported by studies demonstrating that higher RAI doses can increase success rates and achieve earlier treatment success. This aligns with the increasing clinical consensus that individualized dosing strategies, rather than fixed-dose regimens, lead to better patient outcomes (36, 37). The present study’s formula-based calculation method for RAI dose determination aims to personalize treatment by considering factors like thyroid size and iodine uptake, moving beyond the limitations of empirical fixed-dose approaches. This personalized approach is crucial because traditional fixed-dose methods often overlook individual patient differences, potentially leading to suboptimal outcomes or side effects.

The comprehensive comparison of six ML algorithms demonstrated the superior overall performance of the RF model. With an AUC of 0.908 in the validation cohort and 0.950 in the independent test set, along with excellent Average Precision, a low Brier score, and significant net benefit in DCA, the RF model exhibited robust discrimination and calibration capabilities. This aligns with previous research highlighting RF’s advantages in medical prediction due to its ability to handle high-dimensional data, complex interactions, and robustness against overfitting (38–40). Specifically in thyroid disease prediction, RF has consistently shown high accuracy and stable performance compared to other algorithms (41–43). A significant contribution of this study is the integration of SHAP analysis to provide interpretability for the RF model. The SHAP summary and dependence plots globally revealed the relative importance and directional impact of each feature, confirming that the model leverages clinically meaningful variables such as RAIU 3h, FT4, age, and thyroid weight. Furthermore, individual SHAP force plots illustrated how these features cumulatively drive predictions for specific patients, offering a transparent view of the model’s reasoning. This interpretability is crucial for fostering clinician trust and facilitating the clinical translation of AI-driven tools, as it allows healthcare professionals to understand why a particular prediction is made, rather than just what the prediction is (20, 21).
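
For readers less familiar with the two headline metrics, the sketch below shows their standard definitions in plain Python: the Brier score as the mean squared difference between predicted probability and observed outcome, and the AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The function names are hypothetical and the four-sample dataset is a toy example invented for illustration, not study data.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0/1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true, y_score):
    """AUC as the fraction of positive/negative pairs ranked correctly
    (ties count as half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities, invented for illustration only.
labels = [0, 0, 1, 1]
probs = [0.10, 0.40, 0.35, 0.80]
toy_brier = brier_score(labels, probs)
toy_auc = auc(labels, probs)
print(f"Brier = {toy_brier:.4f}, AUC = {toy_auc:.2f}")
```

Lower Brier scores indicate better-calibrated and sharper probabilities, while AUC depends only on how the cases are ranked, so the two metrics capture complementary aspects of performance.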

The developed interpretable RF model holds significant clinical utility for personalizing RAI therapy in GH patients. By accurately predicting the efficacy of initial RAI treatment, clinicians can make more informed decisions regarding patient selection, dose adjustment, and follow-up strategies. For instance, patients identified as high-risk for treatment failure by the model, perhaps due to large goiter size and high TRAb titers, could be considered for higher initial RAI doses or alternative treatments, potentially reducing the need for repeat RAI administrations or extended ATD use (29, 44, 45). Conversely, patients predicted to respond well might receive optimized lower doses, reducing radiation exposure while maintaining efficacy. This personalized approach aligns with the goal of precision medicine, enhancing therapeutic success, and minimizing side effects such as post-RAI hypothyroidism or the exacerbation of Graves’ orbitopathy (46). Furthermore, by reducing treatment failures and subsequent interventions, our model can contribute to resource optimization within healthcare systems, particularly in nuclear medicine departments. Improved efficacy predictions can lead to more efficient scheduling, reduced drug waste, shorter hospital stays, and better allocation of protective ward resources, which is especially pertinent in regions facing such constraints (47).

Despite the promising results, this study has several limitations. First, the exclusion of patients with massive goiters (>80g) limits the model’s generalizability to this specific subgroup. Second, as a single-center retrospective study, it carries inherent risks of selection bias and may limit the generalizability of the findings to diverse patient populations and healthcare settings. Although an independent test set was used for validation, external validation with multicenter, prospective cohorts is essential to confirm the robustness and applicability of the model across different demographics and clinical practices. Third, while a comprehensive set of clinical features was included, other potential confounders not collected in our dataset might influence RAI outcomes. Future studies could explore the integration of additional multimodal data, such as genomic markers, advanced imaging features (radiomics beyond thyroid weight), or dynamic physiological data, using advanced data fusion techniques to build even more sophisticated and accurate predictive models. Fourth, our outcome classification into “Remission” versus “Non-Remission” is a binary simplification of a more complex clinical spectrum, which includes distinct states like euthyroidism and hypothyroidism. Future research could investigate multi-class classification models to predict these specific outcomes more granularly, offering more nuanced guidance for post-treatment management. Finally, future efforts should focus on transitioning these predictive models into real-time clinical decision support systems. These systems could dynamically update predictions based on evolving patient data, provide actionable recommendations at the point of care, and facilitate shared decision-making between clinicians and patients. This transition would also necessitate rigorous evaluation of the ethical implications, data security, and patient privacy within such AI-driven healthcare applications.

In conclusion, our study successfully developed an interpretable RF model that accurately predicts the efficacy of initial RAI therapy in GH patients. By identifying key predictive factors and providing transparent explanations through SHAP analysis, this model represents a significant step towards personalized medicine, promising improved clinical outcomes, enhanced resource utilization, and a more scientific approach to managing GH.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.

Ethics statement

The studies involving humans were approved by People’s Hospital of Guangxi Zhuang Autonomous Region (KY-GZR-2025-035). The studies were conducted in accordance with the local legislation and institutional requirements. The ethics committee/institutional review board waived the requirement of written informed consent for participation from the participants or the participants’ legal guardians/next of kin because written informed consent was not required due to the retrospective nature of the study.

Author contributions

LL: Conceptualization, Data curation, Funding acquisition, Investigation, Project administration, Writing – original draft. XW: Data curation, Supervision, Validation, Visualization, Writing – review & editing. YC: Supervision, Validation, Visualization, Writing – review & editing. DM: Data curation, Supervision, Validation, Visualization, Writing – review & editing. SM: Data curation, Supervision, Validation, Visualization, Writing – review & editing. ZS: Data curation, Supervision, Visualization, Writing – review & editing. FS: Data curation, Supervision, Validation, Visualization, Writing – review & editing. KL: Conceptualization, Data curation, Investigation, Resources, Software, Supervision, Validation, Visualization, Writing – review & editing. WH: Conceptualization, Data curation, Formal analysis, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This research was supported by the self-funded scientific research project of the Western Medicine category of the Health Commission of Guangxi Zhuang Autonomous Region (Funding No.: Z-A20250125).

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fendo.2025.1711029/full#supplementary-material

References

1. Smith TJ and Hegedüs L. Graves’ Disease. N Engl J Med. (2016) 375:1552–65. doi: 10.1056/NEJMra1510030

2. McDermott MT. Hyperthyroidism. Ann Intern Med. (2020) 172:ITC49–64. doi: 10.7326/AITC202004070

3. Brent GA. Clinical practice. Graves’ disease. N Engl J Med. (2008) 358:2594–605. doi: 10.1056/NEJMcp0801880

4. Wang J and Qin L. Radioiodine therapy versus antithyroid drugs in Graves’ disease: a meta-analysis of randomized controlled trials. Br J Radiol. (2016). doi: 10.1259/bjr.20160418

5. Ross DS, Burch HB, Cooper DS, Greenlee MC, Laurberg P, Maia AL, et al. 2016 American Thyroid Association guidelines for diagnosis and management of hyperthyroidism and other causes of thyrotoxicosis. Thyroid. (2016) 26:1343–421. doi: 10.1089/thy.2016.0229

6. Szczepanek-Parulska E and Płaczkiewicz-Jankowska E. Hyperthyroidism in Graves’ disease: practical advice on the choice of treatment and indications for radioiodine use. Pol Arch Intern Med. (2024) 134:16763. doi: 10.20452/pamw.16763

7. Leslie WD, Ward L, Salamon EA, Ludwig S, Rowe RC, and Cowden EA. A randomized comparison of radioiodine doses in Graves’ hyperthyroidism. J Clin Endocrinol Metab. (2003) 88:978–83. doi: 10.1210/jc.2002-020805

8. Tamatea JAU, Conaglen JV, and Elston MS. Response to radioiodine therapy for thyrotoxicosis: disparate outcomes for an indigenous population. Int J Endocrinol. (2016) 2016:7863867. doi: 10.1155/2016/7863867

9. de Rooij A, Vandenbroucke JP, Smit JWA, Stokkel MP, and Dekkers OM. Clinical outcomes after estimated versus calculated activity of radioiodine for the treatment of hyperthyroidism: systematic review and meta-analysis. Eur J Endocrinol. (2009) 161:771–7. doi: 10.1530/EJE-09-0286

10. Jarløv AE, Nygaard B, Hegedüs L, Hartling SG, Hansen JM, and Karstrup S. Observer variation in the clinical and laboratory evaluation of patients with thyroid dysfunction and goiter. Thyroid. (1998) 8:393–8. doi: 10.1089/thy.1998.8.393

11. Ngiam KY and Khor IW. Big data and machine learning algorithms for health-care delivery. Lancet Oncol. (2019) 20:e262–73. doi: 10.1016/S1470-2045(19)30149-4

12. Daidone M, Ferrantelli S, and Tuttolomondo A. Machine learning applications in stroke medicine: advancements, challenges, and future prospectives. Neural Regener Res. (2024) 19:769–73. doi: 10.4103/1673-5374.382228

13. Merkin A, Krishnamurthi R, and Medvedev ON. Machine learning, artificial intelligence and the prediction of dementia. Curr Opin Psychiatry. (2022) 35:123–9. doi: 10.1097/YCO.0000000000000768

14. Al Bataineh A and Manacek S. MLP-PSO hybrid algorithm for heart disease prediction. J Pers Med. (2022) 12:1208. doi: 10.3390/jpm12081208

15. McElroy SJ and Lueschow SR. State of the art review on machine learning and artificial intelligence in the study of neonatal necrotizing enterocolitis. Front Pediatr. (2023) 11:1182597. doi: 10.3389/fped.2023.1182597

16. Bouqentar MA, Terrada O, Hamida S, Chahhou M, El Hami A, and Zine-Dine K. Early heart disease prediction using feature engineering and machine learning algorithms. Heliyon. (2024) 10:e38731. doi: 10.1016/j.heliyon.2024.e38731

17. Moon I, LoPiccolo J, Baca SC, Sholl LM, Kehl KL, and Gusev A. Machine learning for genetics-based classification and treatment response prediction in cancer of unknown primary. Nat Med. (2023) 29:2057–67. doi: 10.1038/s41591-023-02482-6

18. Attia ZI, Noseworthy PA, Lopez-Jimenez F, Asirvatham SJ, Deshmukh AJ, Gersh BJ, et al. An artificial intelligence-enabled ECG algorithm for the identification of patients with atrial fibrillation during sinus rhythm: a retrospective analysis of outcome prediction. Lancet. (2019) 394:861–7. doi: 10.1016/S0140-6736(19)31721-0

19. Stekhoven DJ and Bühlmann P. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics. (2012). doi: 10.1093/bioinformatics/btr597

20. Salih AM, Raisi-Estabragh Z, Galazzo IB, Radeva P, Menegaz G, and Lekadir K. A perspective on explainable artificial intelligence methods: SHAP and LIME. Advanced Intelligent Syst. (2025) 7:2400304. doi: 10.1002/aisy.202400304

21. Reddy GP and Kumar YVP. "Explainable AI (XAI): Explained," in: 2023 IEEE Open Conference of Electrical, Electronic and Information Sciences (eStream), Vilnius, Lithuania, (2023), pp. 1–6. doi: 10.1109/eStream59056.2023.10134984

22. Lutterman SL, Zwaveling-Soonawala N, Verberne HJ, Verburgh CA, and van Trotsenburg ASP. The efficacy and short- and long-term side effects of radioactive iodine treatment in pediatric Graves’ disease: a systematic review. Eur Thyroid J. (2021) 10:353–63. doi: 10.1159/000517174

23. Bonnema SJ and Hegedüs L. Radioiodine therapy in benign thyroid diseases: effects, side effects, and factors affecting therapeutic outcome. Endocr Rev. (2012) 33:920–80. doi: 10.1210/er.2012-1030

24. Duan H, Iagaru A, and Aparici CM. Radiotheranostics - precision medicine in nuclear medicine and molecular imaging. Nanotheranostics. (2022) 6:103–17. doi: 10.7150/ntno.64141

25. Shestakova GV, Efimov AS, Strongin LG, Ryabikov AN, Koroleva SV, and Kasimova EA. Predictors of the results of radioiodine therapy of Graves’ disease. Clin Exp Thyroidology. (2010) 6:48. doi: 10.14341/ket20106348-53

26. Yang D, Xue J, Ma W, Liu F, Fan Y, and Rong J. Prognostic factor analysis in 325 patients with Graves’ disease treated with radioiodine therapy. Nucl Med Commun. (2018) 39:16–21. doi: 10.1097/MNM.0000000000000770

27. Feng W, Shi H, Yang Y, Xu W, and Wu H. Predictive factors for the efficacy of radioactive iodine treatment of Graves’ disease. Int J Endocrinol. (2024) 2024:7535093. doi: 10.1155/2024/7535093

28. Hu R-T, Liu D-S, and Li B. Predictive factors for early hypothyroidism following the radioactive iodine therapy in Graves’ disease patients. BMC Endocr Disord. (2020) 20:76. doi: 10.1186/s12902-020-00557-w

29. Tay WL, Chng CL, Tien CS, Loke KY, Lee KO, and Lee J. High thyroid stimulating receptor antibody titre and large goitre size at first-time radioactive iodine treatment are associated with treatment failure in Graves’ disease. Ann Acad Med Singap. (2019) 48:181–7. doi: 10.47102/annals-acadmedsg.V48N6p181

30. Erem C, Kandemir N, Hacihasanoglu A, Ersoz HO, Ukinc K, and Kocak M. Radioiodine treatment of hyperthyroidism. Endocr. (2004) 25:55–60. doi: 10.1385/ENDO:25:1:55

31. Anagnostis P, Adamidou F, Polyzos SA, Koliakos G, Mavroudi A, and Kita M. Predictors of long-term remission in patients with Graves’ disease: a single center experience. Endocrine. (2013) 44:448–53. doi: 10.1007/s12020-013-9895-0

32. Kaise K, Kaise N, Yoshida K, Itagaki Y, Kiso Y, Sayama N, et al. Thyrotropin receptor antibody activities significantly correlate with the outcome of radioiodine (131I) therapy for hyperthyroid Graves’ disease. Endocrinol Japon. (1991) 38:429–33. doi: 10.1507/endocrj1954.38.429

33. Wafa A, Wahba H, El-Hadaad H, El-Sharawy S, and El-Refaie K. Predictors of recurrent thyrotoxicosis in a cohort of Egyptian thyrotoxic patients treated with radioactive iodine. Egypt J Obes Diabetes Endocrinol. (2018) 4:1. doi: 10.4103/ejode.ejode_5_18

34. Šfiligoj D, Gaberšcek S, Mekjavic PJ, Zaletel K, Pirnat E, and Hojker S. Factors influencing the success of radioiodine therapy in patients with Graves’ disease. Nucl Med Commun. (2015) 36:560. doi: 10.1097/MNM.0000000000000285

35. Hu Y, Liu S, Xiong X, Liu Y, Li C, Zhang Q, et al. Effects of metabolic and organ function factors on the efficacy of radioactive iodine therapy for hyperthyroidism. Front Endocrinol. (2025) 16:1568699. doi: 10.3389/fendo.2025.1568699

36. Cheah SK, Aljenaee K, Muhammad N, et al. Outcomes following fixed dose radioactive iodine therapy (RAI) in hyperthyroid patients with grave’s disease and toxic nodular disease. Endocrinol Metab Int J. (2016) 3(6):174–6. doi: 10.15406/emij.2016.03.00070

37. Martin NM, Patel M, Nijher GMK, Netherton WA, Meeran K, Druce MR, et al. Adjuvant lithium improves the efficacy of radioactive iodine treatment in Graves’ and toxic nodular disease. Clin Endocrinol. (2012) 77:621–7. doi: 10.1111/j.1365-2265.2012.04385.x

38. Speiser JL, Miller ME, Tooze J, and Ip E. A comparison of random forest variable selection methods for classification prediction modeling. Expert Syst Appl. (2019) 134:93–101. doi: 10.1016/j.eswa.2019.05.028

39. Xu C, Wang J, Zheng T, Dai Z, Hong Y, and Wang J. Prediction of prognosis and survival of patients with gastric cancer by a weighted improved random forest model: an application of machine learning in medicine. Arch Med Sci. (2021) 18:1208–20. doi: 10.5114/aoms/135594

40. Sekhar C, Minal M, and Madhu E. Multimodal choice modeling using random forest decision trees. IJTTE. (2016) 6:356–67. doi: 10.7708/ijtte.2016.6(3).10

41. Lee K-S and Park H. Machine learning on thyroid disease: a review. Front Biosci (Landmark Ed). (2022) 27:101. doi: 10.31083/j.fbl2703101

42. Duggal P and Shukla S. "Prediction of thyroid disorders using advanced machine learning techniques," in: 2020 10th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, (2020), pp. 670–5. doi: 10.1109/Confluence47617.2020.9058102

43. Abbad Ur Rehman H, Lin C-Y, Mushtaq Z, and Su S-F. Performance analysis of machine learning algorithms for thyroid disease. Arab J Sci Eng. (2021) 46:9437–49. doi: 10.1007/s13369-020-05206-x

44. De Jong JAF, Verkooijen HM, Valk GD, Zelissen PMJ, and de Keizer B. High failure rates after 131I therapy in Graves hyperthyroidism patients with large thyroid volumes, high iodine uptake, and high iodine turnover. Clin Nucl Med. (2013) 38:401–6. doi: 10.1097/RLU.0b013e3182817c78

45. El-Kareem MA, Derwish WA, and Moustafa HM. Response rate and factors affecting the outcome of a fixed dose of RAI-131 therapy in Graves’ disease: a 10-year Egyptian experience. Nucl Med Commun. (2014) 35:900–7. doi: 10.1097/MNM.0000000000000152

46. Orsini F, Traino A, Grosso M, Gurioli D, Faggioni L, Caramella D, et al. Personalization of radioiodine treatment for Graves’ disease: a prospective, randomized study with a novel method for calculating the optimal 131I-iodide activity based on target reduction of thyroid mass. Q J Nucl Med Mol Imaging. (2012) 56(6):496–502.

47. Park H, Kim HI, Park J, Park SY, Kim TH, Chung JH, et al. The success rate of radioactive iodine therapy for Graves’ disease in iodine-replete area and affecting factors: a single-center study. Nucl Med Commun. (2020) 41:212. doi: 10.1097/MNM.0000000000001138

Keywords: explainable AI, Graves’ disease, machine learning, precision medicine, radioiodine therapy, treatment outcome prediction

Citation: Lu L, Wei X, Chen Y, Meng D, Mo S, Sun Z, Song F, Liao K and Huang W (2026) From data to decision: an interpretable machine learning model for optimizing RAI therapy in Graves’ hyperthyroidism. Front. Endocrinol. 16:1711029. doi: 10.3389/fendo.2025.1711029

Received: 23 September 2025; Revised: 08 December 2025; Accepted: 29 December 2025; Published: 26 January 2026.

Edited by:

Giulia Lanzolla, University of Cagliari, Italy

Reviewed by:

Haiyang Zhang, Shanghai Jiaotong University School of Medicine, China
Jayadharshini P., Anna University, India

Copyright © 2026 Lu, Wei, Chen, Meng, Mo, Sun, Song, Liao and Huang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Kehua Liao; Wentan Huang

These authors have contributed equally to this work
