- 1PET-CT Center, The First Affiliated Hospital of Xi’an Jiaotong University, Xi’an, China
- 2Department of Network Information, The First Affiliated Hospital of Xi’an Jiaotong University, Xi’an, China
- 3School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, China
- 4Data Center, Hangzhou First People’s Hospital, Hangzhou, China
Background: According to the 2021 WHO classification of tumors of the central nervous system, isocitrate dehydrogenase (IDH) status serves as an independent prognostic biomarker and is closely associated with tumor diagnosis and treatment response. At present, the determination of IDH status still relies on invasive surgical procedures.
Method: A total of 345 patients with pathologically confirmed gliomas diagnosed at the First Affiliated Hospital of Xi’an Jiaotong University between October 2019 and October 2024 were retrospectively included, comprising 148 (42.9%) IDH-wildtype and 197 (57.1%) IDH-mutant cases. An additional 495 glioma patients were obtained from the public TCIA dataset. Patients were randomly split into training, validation, and test cohorts in a 6:2:2 ratio. A Hierarchical Attention-Based Multiple Instance Learning (HAB-MIL) framework was developed, integrating auxiliary positional encoding into feature maps to capture spatially specific information and generate refined 3D lesion representations. Model performance was evaluated using five-fold cross-validation, with receiver operating characteristic (ROC) curves, area under the curve (AUC), sensitivity, and specificity as assessment metrics.
Result: HAB-MIL achieved competitive performance, with AUCs of 0.917 and 0.892 on the glioma datasets from TCIA and the First Affiliated Hospital of Xi’an Jiaotong University, respectively. Our results are comparable to state-of-the-art methods on the TCIA dataset and demonstrate that multiple instance learning holds great potential for IDH prediction.
Conclusion: The proposed HAB-MIL achieved IDH classification based on conventional preoperative MRI images, eliminating the need for pixel-level annotations and significantly reducing the annotation burden for doctors.
1 Introduction
Gliomas, the most prevalent malignant tumors of the central nervous system, account for approximately 80% of all intracranial tumors (1, 2). According to the WHO 2021 classification of central nervous system tumors, isocitrate dehydrogenase (IDH) status is considered a critical biomarker for the diagnosis and treatment of glioma and plays a vital role in prognosis and treatment strategies (3, 4). Previous studies have shown that patients with IDH-wildtype gliomas generally have a poorer prognosis and exhibit less sensitivity to therapeutic targeting of IDH mutations with vorasidenib than those with IDH-mutant gliomas (5–7). At present, determination of IDH status primarily depends on surgical sampling followed by genetic sequencing. However, these approaches have several limitations: lesions that are deeply situated or located near eloquent brain regions may be difficult or impossible to sample (8, 9). Consequently, accurate and noninvasive prediction of IDH status has become an urgent clinical need.
Recent advances in artificial intelligence, particularly in deep learning, have enabled noninvasive assessment of IDH status from MRI (10–12). Deep learning models can automatically learn complex patterns directly from raw images, minimizing the need for manual feature extraction and extensive domain expertise. In contrast to traditional tissue biopsy, this provides a safe, efficient, and repeatable approach to preoperative evaluation and long-term follow-up (13, 14). However, deep learning largely depends on access to high-quality annotated datasets (15), and data labeling is labor-intensive and requires the expertise of highly trained specialists.
To overcome these challenges, weakly supervised learning approaches—such as Multiple Instance Learning (MIL)—have gained considerable attention in medical image analysis (16–18). In the MIL framework, an image is regarded as a “bag” consisting of multiple “instances,” such as image patches or slices. For binary classification tasks, a bag is labeled as positive if at least one instance is positive. By relying on bag-level rather than instance-level labels, MIL is well suited for medical imaging tasks that lack fine-grained annotations (19).
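The bag-labeling rule described above can be sketched in a few lines. This is a toy illustration of the standard MIL assumption only (the labels and the `bag_label` helper are ours, not part of the proposed framework):

```python
def bag_label(instance_labels):
    """Standard MIL assumption: a bag is positive iff at least one instance is positive."""
    return int(any(y == 1 for y in instance_labels))

negative_bag = [0, 0, 0, 0]   # e.g. slices containing no lesion
positive_bag = [0, 0, 1, 0]   # one slice contains a lesion

print(bag_label(negative_bag))  # 0
print(bag_label(positive_bag))  # 1
```

Note that only the bag label is supervised; which instance made the bag positive remains unobserved, which is exactly why attention-based instance weighting is useful.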
Although MIL approaches, such as Attention-based Multiple Instance Learning (AB-MIL) and Clustering-constrained Attention Multiple Instance Learning (CLAM), have achieved remarkable progress in instance-level feature aggregation, they often overlook the spatial positional relationships among instances within the original image. In AB-MIL, the attention weights are derived from a fixed parameter matrix shared across all samples. This global static attention mechanism struggles to adapt to the diverse morphological variations of lesions and tends to overemphasize irrelevant background or noisy regions (20, 21). Similarly, CLAM, despite introducing a clustering constraint to improve instance-level discriminability, still relies on a single static attention mechanism and therefore lacks the capability to dynamically adjust attention distributions in response to varying feature patterns across different samples (22, 23).
To address these limitations, we propose a MIL network named Hierarchical Attention-Based Multiple Instance Learning (HAB-MIL). The model includes a Dynamic Gated Attention (DGA) module that integrates a learnable singular value decomposition (LSVD) framework. This framework adaptively decomposes and reweights the instance feature matrix, allowing dynamic modulation of the key components within the attention map. It enables the network to capture latent inter-instance dependencies and adjust the attention distribution in a data-adaptive manner.
2 Method
2.1 Patient
This study included patients who were preoperatively diagnosed with glioma and underwent surgical treatment at the Department of Neurosurgery, the First Affiliated Hospital of Xi’an Jiaotong University, between October 2019 and October 2024. A total of 506 patients with histopathologically confirmed glioma and available genetic testing were initially screened according to the 2021 WHO classification of tumors of the central nervous system, 5th edition. After applying the inclusion and exclusion criteria, 345 patients who met all requirements were ultimately enrolled in the study cohort, as shown in Figure 1. The IDH status of all patients was determined by DNA sequencing.
Inclusion Criteria:
1. Patients aged 18–80 years with a single intracranial lesion confirmed as glioma by postoperative pathology, with complete molecular genetic testing results.
2. Mini-Mental State Examination (MMSE) score between 28 and 30, indicating no significant cognitive impairment and sufficient ability to cooperate during MRI scanning.
3. No MRI contraindications, such as metallic implants, cardiac pacemakers, or severe claustrophobia.
4. No prior history of craniotomy or radiotherapy.
Exclusion Criteria:
1. Poor imaging quality or severe artifacts resulting in data unsuitable for analysis.
2. Presence of infectious, structural, or metabolic brain diseases that could interfere with image interpretation.
3. Incomplete or indeterminate pathological and molecular genetic testing results.
All patients underwent preoperative cranial MRI examinations and provided written informed consent prior to imaging. The study was approved by the Ethics Committee of the First Affiliated Hospital of Xi’an Jiaotong University and was registered at ClinicalTrials.gov (registration number: NCT05019196).
2.2 Method
Given a complete MRI volume $X$, our goal is to predict the image-level label $Y$ by analyzing the features extracted from discriminative patches $\{x_1, x_2, \ldots, x_K\}$. To achieve this, we designed a two-stage framework for the classification of IDH status, as illustrated in Figure 2. In the first stage, we encode the spatial features of the tumor. In the second stage, we introduce a novel HAB-MIL network, which comprises multiple convolutional neural network layers and an attention mechanism module. This module aggregates the features of candidate instances into the image-level prediction by recalibrating the importance coefficient of each instance.
Figure 2. Overview of the hierarchical attention-based multiple instance learning for predicting isocitrate dehydrogenase status in glioma.
2.3 Deep instance generation
In MIL, the data is organized into “bags,” each containing multiple “instances.” The label assigned to a bag depends on the instances it contains. For a binary classification task, a bag is labeled positive if at least one instance within it is positive; otherwise, it is labeled negative. In our weakly supervised classification problem, the dataset consists of $N$ MRI volumes $\{X_1, \ldots, X_N\}$, where each volume $X_i$ has a label $Y_i \in \{0, 1\}$, and the objective is to train a model to predict these labels. This relationship is formally expressed as (Equation 1):

$$Y = \begin{cases} 0, & \text{if } \sum_{k=1}^{K} y_k = 0 \\ 1, & \text{otherwise} \end{cases}$$
where $Y$ is the bag label, $y_k \in \{0, 1\}$ is the label of instance $k$, and $K$ is the number of instances in each bag. In our work, this assumption indicates that an MRI is from a glioma patient if it involves at least one lesion. Based on this assumption, the empirical loss is formulated as (Equation 2):

$$\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(\hat{Y}_{\theta}(X_i), Y_i\big)$$
where $\hat{Y}_{\theta}$ represents the labeling function induced by an MIL scoring function $S_{\theta}$, and $\mathcal{L}$ can be any loss function. The MIL process can be mathematically represented as follows (Equation 3):

$$\hat{Y}(X) = g\Big(\sigma\big(f(x_1), f(x_2), \ldots, f(x_K)\big)\Big)$$
where $f$ is the feature extraction function, $\sigma$ is the instance-level aggregation operator, and $g$ predicts the bag-level label. The model processes three MRI modalities (T1w, T2w, and FLAIR), represented as a 3D tensor $X \in \mathbb{R}^{B \times C \times D \times H \times W}$, where $B$ is the batch size, $C$ is the number of channels, and $D \times H \times W$ is the spatial resolution. The final layer of the 3D fully convolutional network outputs a series of 3D feature maps whose shape is given by the height, width, depth, and feature dimensions, respectively. The feature extraction process consists of multiple 3D convolutional layers, batch normalization, ReLU activation, max pooling, and dropout layers. Finally, the feature map $F$ is formulated as (Equation 4):

$$F = \mathrm{ReLU}\big(\mathrm{BN}(W_c * X)\big)$$
where $W_c$ denotes the 3D convolution kernels (with 32 output channels) and $*$ indicates the 3D convolution operation. Batch Normalization (BN) is used to stabilize training, and ReLU is the activation function.
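The conv–BN–ReLU step in Equation 4 can be sketched as follows. This is a minimal single-channel NumPy illustration (function names and the 3×3×3 toy kernel are ours, not the paper's PyTorch implementation, which uses multiple channels, pooling, and dropout):

```python
import numpy as np

def conv3d_single(volume, kernel):
    """Naive 'valid' 3D convolution of one channel (illustration only)."""
    D, H, W = volume.shape
    d, h, w = kernel.shape
    out = np.zeros((D - d + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+d, j:j+h, k:k+w] * kernel)
    return out

def batch_norm(x, eps=1e-5):
    """Normalize activations to zero mean / unit variance, as BN does at inference."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

relu = lambda x: np.maximum(x, 0.0)

rng = np.random.default_rng(0)
vol = rng.normal(size=(8, 8, 8))       # toy single-channel MRI sub-volume
kernel = rng.normal(size=(3, 3, 3))    # one toy convolution kernel
feat = relu(batch_norm(conv3d_single(vol, kernel)))
print(feat.shape)  # (6, 6, 6)
```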
2.4 Collateral location encoding
The spatial location of a tumor often provides critical diagnostic cues (24, 25); we therefore propose Collateral Location Encoding (CLE), as illustrated in Figure 2. Let $p = (x, y, z)$ represent the voxel coordinates of the image. Since $p$ typically takes continuous values, employing a smooth nonlinear transformation like GELU allows for a more natural modeling of the relationships between coordinates, unlike ReLU, which introduces abrupt discontinuities. The location feature map for each voxel position is calculated as (Equation 5):

$$E(p) = \mathrm{GELU}(W_p\, p + b_p)$$
where $W_p$ and $b_p$ are the learnable weights and biases of our neural positional encoding layers. The learned features and location encoding from the previous feature extraction stage are first normalized to benefit the training and backpropagation process. The normalized features are then concatenated to form the input of this phase, which can be denoted as (Equation 6):

$$Z = \mathrm{Concat}\big(\hat{F}, \hat{E}\big)$$
where $\mathrm{Concat}$ denotes the concatenation operation, and $\hat{F}$ and $\hat{E}$ refer to the normalized versions of $F$ and $E$. Then, using the self-attention mechanism, the importance of the features can be calculated by (Equations 7, 8):

$$A = \mathrm{Softmax}(W_a Z + b_a)$$
$$Z' = A \odot Z$$
where $W_a$ and $b_a$ denote the learnable weight and bias parameters, and $Z$ and $Z'$ represent the input and output of the adaptive feature importance weighting mechanism based on the attention module. The Softmax operation is applied to normalize the attention scores, resulting in an attention weight matrix that maps each weight to the interval $(0, 1)$ and guarantees that the mapped attention weights sum to 1. $\odot$ denotes element-wise multiplication. Through this process, the input features are weighted and assigned different importance, enabling the model to emphasize critical features and thereby enhance overall performance.
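The CLE pipeline above (GELU positional encoding, normalization, concatenation, and Softmax re-weighting) can be sketched end to end. This is a minimal NumPy sketch with toy dimensions of our choosing, not the paper's implementation:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
coords = rng.uniform(size=(5, 3))            # (x, y, z) voxel coordinates per instance
W_p, b_p = rng.normal(size=(3, 4)), np.zeros(4)
pos_enc = gelu(coords @ W_p + b_p)           # learned positional encoding E(p)

feats = rng.normal(size=(5, 6))              # instance features F from the CNN
norm = lambda x: (x - x.mean(0)) / (x.std(0) + 1e-5)
fused = np.concatenate([norm(feats), norm(pos_enc)], axis=1)   # Z = Concat(F_hat, E_hat)

W_a = rng.normal(size=(10,))
attn = softmax(fused @ W_a)                  # one weight per instance, sums to 1
weighted = fused * attn[:, None]             # element-wise re-weighting Z' = A ⊙ Z
print(attn.sum())  # 1.0
```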
2.5 Dynamic gated attention
Attention mechanisms are well known for dynamically highlighting important parts of input data by assigning different weights, allowing the model to focus on the most critical information for the task at hand (26–28). To enhance the accuracy of learning attention weights, we propose Dynamic Gated Attention (DGA), which achieves superior performance. As illustrated in Figure 2, the DGA module integrates two different processing methods.
The first component is an attention representation that uses Tanh as the activation function. With an output range of $(-1, 1)$, Tanh provides both positive and negative activation signals, allowing the model to capture complex relationship features. In addition, a dropout function is incorporated to mitigate overfitting. The attention weights are defined as (Equation 9):

$$A_{\tanh} = \mathrm{Dropout}\big(\tanh(W_t H)\big)$$
The second component utilizes a different attention representation. As illustrated in Figure 3, we propose using Learnable Singular Value Decomposition (LSVD) to reduce the dimensionality of the feature map (29), followed by a Sigmoid activation function, which creates a “switch-like” effect, indicating whether the model should attend to or ignore a particular feature. The algorithm flowchart is shown in Algorithm 1.
Algorithm 1. Learnable Singular Value Decomposition.
Fact 1. The feature map $F$ can be decomposed as $F = U S V^{\top}$, where $U$ is an orthogonal matrix whose columns are the left singular vectors of $F$, $V$ is an orthogonal matrix formed by the right singular vectors of $F$, and $S$ is a diagonal matrix consisting of the singular values of $F$.
Proof of Fact 1: According to the Singular Value Decomposition (SVD), any matrix admits a factorization of the form $\sum_{i} \sigma_i u_i v_i^{\top}$, where $\sigma_1 \ge \cdots \ge \sigma_r > 0$ are the singular values and $r$ is the rank of the matrix. Hence, for the feature map $F$, the decomposition is as shown in (Equation 10):

$$F = \sum_{i=1}^{r} \sigma_i u_i v_i^{\top}$$
Here, $u_i$ and $v_i$ are the singular vectors, and $r$ is the number of nonzero singular values. For clarity, we denote the left singular matrix as $U = [u_1, \ldots, u_r]$, the diagonal matrix as $S = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$, and the right singular matrix as $V = [v_1, \ldots, v_r]$, yielding $F = U S V^{\top}$.
The remaining challenge is to determine $U$, $S$, and $V$. A natural approach is to leverage neural networks to learn these latent representations. Specifically, we assume $S$ is a trainable diagonal matrix defined as (Equation 11):

$$S = \mathrm{diag}(s_1, s_2, \ldots, s_r)$$
where $s_1, \ldots, s_r$ are trainable parameters in the neural network, which are initialized prior to training and subsequently optimized via backpropagation.
Then, we utilize two learnable linear layers, denoted $\phi(\cdot)$, to fit $U$ and $V$, as shown in (Equations 12, 13):

$$U = \phi(F)$$
$$V = \phi(F^{\top})$$
Among these, $\phi$ consists of projection layers that transform the input matrix into matrices with $r$ columns. In our implementation, we share the same set of two linear layers for $U$ and $V$ by employing a common set of linear transformation parameters $W_1$ and $W_2$. The transformation is defined as (Equation 14):

$$\phi(M) = \mathrm{ReLU}(M W_1)\, W_2$$
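The idea of a learnable SVD-style factorization can be sketched numerically. The sketch below uses random matrices in place of trained weights (all names, ranks, and initializations are illustrative assumptions, not the authors' LSVD module); it shows only the structural point that a learned $U \, \mathrm{diag}(s) \, V^{\top}$ product yields a rank-limited re-weighting of the feature matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 8, 6, 3                      # toy feature map size n x m, learnable rank r

F = rng.normal(size=(n, m))            # instance feature matrix

# Trainable parameters (randomly initialised here; learned by backprop in practice)
sigma = np.abs(rng.normal(size=r))     # learnable "singular values" (diagonal of S)
W_u = rng.normal(size=(n, r)) / np.sqrt(n)   # projection producing the U-like factor
W_v = rng.normal(size=(m, r)) / np.sqrt(m)   # projection producing the V-like factor

U = F @ W_v                            # (n, r): left factor learned from F
V = F.T @ W_u                          # (m, r): right factor learned from F
F_lowrank = U @ np.diag(sigma) @ V.T   # (n, m) reconstruction of rank at most r

print(F_lowrank.shape)                 # (8, 6)
```

Because the middle diagonal has only $r$ entries, the reconstruction is constrained to rank at most $r$, which is the dimensionality-reduction effect the DGA gate relies on.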
Formally, we denote by $H = \{h_1, \ldots, h_K\}$ a bag of deep instances; the attention-based MIL pooling is then defined as shown in (Equations 15, 16):

$$z = \sum_{k=1}^{K} a_k h_k$$
$$a_k = \frac{\exp\big\{w^{\top}\big(\tanh(V_a h_k) \odot \mathrm{sigm}(U_a h_k)\big)\big\}}{\sum_{j=1}^{K} \exp\big\{w^{\top}\big(\tanh(V_a h_j) \odot \mathrm{sigm}(U_a h_j)\big)\big\}}$$
where $a_k$ denotes the attention weight of instance $k$, computed from the Tanh and Sigmoid attention branches with learnable parameters $w$, $V_a$, and $U_a$.
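The gated attention pooling described in this section can be sketched as follows. This NumPy sketch follows the general gated-attention MIL formulation (a Tanh branch modulated by a Sigmoid "switch"); the toy dimensions and random parameters are ours:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
K, d, h = 5, 10, 8                     # instances per bag, feature dim, attention dim
H = rng.normal(size=(K, d))            # bag of deep instance embeddings

V_a, U_a = rng.normal(size=(d, h)), rng.normal(size=(d, h))
w = rng.normal(size=(h,))

gate = 1 / (1 + np.exp(-(H @ U_a)))    # Sigmoid branch: "attend or ignore" switch
a = softmax((np.tanh(H @ V_a) * gate) @ w)   # gated attention weights, sum to 1
z = a @ H                              # bag-level representation (Equation 15)
print(z.shape)  # (10,)
```

The bag vector `z` is then fed to the bag-level classifier; instances with near-zero gate values contribute almost nothing to `z`.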
3 Experiment
3.1 Dataset
In this study, two distinct datasets were used for model development and evaluation. The same features were extracted from both datasets, and neither dataset specifies whether IDH-mutant cases harbor IDH1 or IDH2 mutations.
In-house data set: The first data set was acquired from the First Affiliated Hospital of Xi’an Jiaotong University and comprised 789 MRI samples from 344 subjects (incorporating FLAIR, T1w, and T2w sequences; all occurrences of “T1w” refer to T1 without contrast injection, and “T1c” refers to T1 with contrast injection), as summarized in Table 1. All patients underwent surgical resection followed by genetic sequencing to confirm IDH status. For patients diagnosed before 2021, we performed a centralized histopathological review. Two board-certified neuropathologists, who specialize in brain tumor pathology, re-examined the archival hematoxylin–eosin and immunohistochemistry slides (including IDH1 R132H, ATRX, p53, and other markers when available) and reviewed the original molecular pathology reports. Based on these materials, tumors were reclassified according to the 2021 World Health Organization Classification of Tumors of the Central Nervous System (5th edition), using all available molecular markers. Importantly, no additional molecular testing (such as TERT, EGFR, or chromosomal analyses) was performed because of the retrospective design and limited archival tissue in some cases (30, 31).
Public data set: The second data set was obtained from The Cancer Imaging Archive (TCIA), which included 1485 MRI samples from 495 subjects (395 wild-type cases and 100 mutant cases). Each subject in the TCIA dataset provided FLAIR, T1w, and T2w modalities, together with documented IDH status. Both datasets underwent identical pre-processing and post-processing pipelines. The diagnosis of glioma for all patients was performed according to the 2021 WHO classification of tumors of the central nervous system, 5th edition (32). Specifically, MRI volumes were preserved in DICOM format and standardized to a uniform spatial dimension. Standard intensity normalization and artifact removal procedures were applied, followed by data augmentation strategies such as color jitter and random affine transformations to mitigate overfitting. This unified preprocessing approach allowed for direct comparisons between the two datasets and a robust evaluation of the proposed HAB-MIL framework.
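The intensity normalization and flip-style augmentation steps described above can be sketched as follows. This is a minimal NumPy illustration (the function names are ours; the paper's pipeline additionally uses color jitter and random affine transformations):

```python
import numpy as np

rng = np.random.default_rng(0)

def zscore(volume):
    """Standard intensity normalization applied before training."""
    return (volume - volume.mean()) / (volume.std() + 1e-8)

def random_flip(volume, p=0.5):
    """Simple augmentation: flip the volume left-right with probability p."""
    return volume[..., ::-1].copy() if rng.random() < p else volume

vol = rng.uniform(10, 100, size=(16, 16, 16))   # toy MRI sub-volume
vol_n = zscore(vol)                              # zero mean, unit variance
vol_a = random_flip(vol_n)                       # same shape, possibly mirrored
print(vol_a.shape)  # (16, 16, 16)
```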
3.2 Implementation details
The framework was implemented using PyTorch 2.5.1, and all models were trained on an NVIDIA RTX 3090 GPU with CUDA 11.8. We use 3D convolutional neural networks as the deep instance generator, with the output shape chosen according to cross-validation. The number of training epochs T was set to 100 and the batch size to 2. Data augmentation strategies included color jitter and random affine transformations. The network was trained using the Adam optimizer with an initial learning rate of 1e-4 and a weight decay of 1e-5. In all experiments, 60% of the subjects were used for training, 20% for model selection and hyperparameter tuning, and the remaining 20% for testing. The dataset was further augmented with random flipping, random affine transformation, intensity scaling, and color jitter, as was done to train the segmentation network. No scans from the same subject were included in both the training and testing samples, ensuring data independence and avoiding potential information leakage. Five-fold cross-validation was performed within the training set, where the validation portion in each fold was used for model selection and hyperparameter verification. Each experiment was repeated five times to ensure fair comparisons. Evaluation metrics included accuracy (Acc), area under the curve (AUC), sensitivity (Sens), specificity (Spec), and ROC curves.
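The subject-level 60/20/20 split described above (no subject shared between sets, to avoid leakage when one subject has several scans) can be sketched as follows. The helper name and subject IDs are illustrative, not from the study code:

```python
import random

def subject_level_split(subject_ids, ratios=(0.6, 0.2, 0.2), seed=42):
    """Split by subject (not by scan) so no subject appears in two sets."""
    ids = sorted(set(subject_ids))               # deduplicate repeated scans
    random.Random(seed).shuffle(ids)             # reproducible shuffle
    n = len(ids)
    n_train, n_val = int(n * ratios[0]), int(n * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

subjects = [f"sub-{i:03d}" for i in range(100)]
train, val, test = subject_level_split(subjects)
print(len(train), len(val), len(test))  # 60 20 20
```

Splitting on subject IDs first, then gathering each subject's scans, guarantees the independence property stated above even when subjects contribute multiple MRI samples.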
4 Result
4.1 Ablation experiment
A series of ablation studies on the TCIA dataset was carried out to evaluate the effectiveness of the two modules (CLE and DGA) in HAB-MIL. Initially, only the 3D convolution and MIL modules were used (denoted as Ablation-1) to evaluate the performance of the backbone. In the backbone, features can still be extracted from three concatenated conventional magnetic resonance sequences for the prediction of IDH. Next, only the CLE module was added to the backbone (denoted as Ablation-2), using positional encoding as a feature to train the model. Then, only the DGA module was added to the backbone (denoted as Ablation-3). The results of the ablation experiments are presented in Table 2. Compared to the backbone, the incorporation of tumor location encoding improved the accuracy of IDH prediction by 21.8%. This suggests that positional encoding enables our HAB-MIL to learn location-specific features, which aids in predicting consistent differences across all image regions. By adding the DGA module and selecting key instances, the model’s accuracy was further improved by 24%. Finally, combining all three modules yields the best prediction performance. Therefore, the CLE module can be used to explore the spatial relationships of pixels, and the DGA module helps select key instances, improving IDH prediction.
Table 2. Classification results in the TCIA dataset of different modules in terms of AUC, ACC, SEN, SPE.
Table 3 lists the classification results of various methods for the prediction of IDH. Among the methods compared, Sinusoidal Positional Encoding (SPE) can be used as a baseline to measure the contribution of positional encoding. We first ablate the effect of our CLE in HAB-MIL. As can be seen, SPE, which simply concatenates the sinusoidal positional encoding of the pixel locations (x, y) directly into the encoder stage, leads to a slight performance improvement of about 2% in most metrics compared to the model without it (denoted as “w/o SPE”). Bello et al. proposed neural positional encoding (NPE) for depth estimation (33). However, our HAB-MIL shows substantial performance improvements in all metrics by incorporating our CLE. Specifically, the AUC increased by 28.8% and the ACC by 27.1% compared to the model w/o SPE. This improvement can be attributed to the learnable auxiliary positional encoding, which allows the model to more flexibly capture and leverage position-specific feature information. From the comparison between these methods, it can be observed that CLE is more helpful for IDH prediction, as it incorporates information that accurately describes the location and morphological features of the tumor and peritumoral edema.
Table 3. Comparison results among CLE module and different location coding functions in terms of AUC, ACC, SEN, SPE.
To evaluate the effectiveness and feasibility of the DGA module within HAB-MIL, we considered three attention weighting methods: max pooling, average pooling, and the AB-MIL approach. These methods were compared against our proposed HAB-MIL model. Specifically, the max-pooling-based and average-pooling-based methods follow the traditional MIL assumption, where the final subject-level prediction is obtained from the most significant instance or the average of all instances within a bag, respectively. The attention-pooling-based method leverages an attention mechanism to assign weights to each instance embedding, facilitating learning at the bag level. In contrast, HAB-MIL uses a dynamic attention mechanism to selectively focus on the most relevant instances, effectively minimizing the inclusion of irrelevant information. As shown in Table 4, our proposed HAB-MIL achieved the best overall performance, attaining an accuracy of 90.4%. Max pooling demonstrated the lowest performance among all methods, with an AUC of 69.3%, which is expected as it considers only the single most prominent instance for prediction. The mean-based approach, while accounting for all instances, introduces a significant amount of irrelevant information, leading to less satisfactory results. Moreover, although plain attention pooling improved AUC and ACC by 5.3% and 8.9%, respectively, at the instance level compared to traditional methods, our proposed HAB-MIL achieves superior performance in terms of accuracy, recall, precision, and F1, indicating the effectiveness of the DGA module.
Table 4. Comparison results among DGA module and different weight functions in terms of AUC, ACC, SEN, SPE.
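The three pooling operators compared above behave quite differently on the same bag of instance scores. A minimal NumPy sketch (toy scores of our choosing, not results from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
scores = rng.normal(size=5)            # per-instance scores from a bag of 5 instances

max_pool = float(scores.max())         # only the single most prominent instance
mean_pool = float(scores.mean())       # every instance, including irrelevant ones
attn = softmax(scores)                 # attention pooling: data-dependent weights
attn_pool = float(attn @ scores)       # weighted toward high-scoring instances

# Attention pooling interpolates between the two extremes:
print(mean_pool <= attn_pool <= max_pool)  # True
```

Because the softmax weights increase with the scores, the attention-pooled value always lies between the mean and the max, which is one way to see why attention pooling balances the two traditional MIL operators.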
4.2 Different combination of MRI sequences
We designed experiments to demonstrate the necessity of using different sequences simultaneously in the model. As shown in Tables 5 and 6, removing the T1w or T2w sequence from the input of the model results in a decrease in IDH prediction performance. Figure 4 shows the ROC curves of different sequences in the TCIA dataset. This indicates that the T1w and T2w sequences are crucial for the IDH prediction task, as they guide the network to focus more on the tumor region and extract features that are highly relevant for gliomas.
Table 5. Classification results in the TCIA dataset of three different sequences in terms of AUC, ACC, SEN, SPE.
Table 6. Classification results in the in-house dataset of three different sequences in terms of AUC, ACC, SEN, SPE.
4.3 Comparison with the state-of-the-art algorithms
To validate the proposed HAB-MIL algorithm for IDH status classification more effectively, we compared it with several state-of-the-art IDH prediction methods, including radiomics approaches (34–36), MDL (37), MultiGeneNet (38), FAD (39), SGPNet (40), PS-Net (41), MFEF Net (24), MTTU Net (42), MTS-UNET (43), and GLISP (44). The results of the different methods are shown in Table 7. By comparing the proposed method with other IDH prediction networks, the following results can be observed.
He et al., Bumes et al., and Kandalgaonkar et al. applied radiomics-based approaches in their studies. In terms of data acquisition, Bumes et al. utilized magnetic resonance spectroscopy (MRS), while He et al. employed contrast-enhanced T1-weighted imaging (T1C). Although multimodal data were incorporated, insufficient feature standardization or inter-modality registration may have hindered the effective integration of information, resulting in amplified noise and reduced model performance. Moreover, conventional radiomics methods are limited to extracting shallow, handcrafted features, which may fail to capture the complex pathological heterogeneity underlying gliomas.
Firstly, GLISP model for patch-level prediction employs a lightweight CNN architecture. While this design ensures computational efficiency, it may fall short in capturing deeper and more complex pathological features compared to more advanced models. Secondly, the prediction accuracy of MultiGeneNet is higher than that of FAD. This is because MultiGeneNet is designed to simultaneously predict multiple key genetic mutations. It employs a shared feature extraction backbone with separate classification branches for each task, enabling effective feature sharing while preserving task-specific distinctions—ultimately improving overall performance. However, the model processes patch samples from whole-slide images without explicitly modeling intra-tumoral heterogeneity, which may result in the loss of important regional variations within the tumor.
The MTS-UNET architecture lacks an explicit positional encoding mechanism, which may hinder its ability to capture spatial context—such as the relative positioning of lesions—particularly in scenarios involving high tumor heterogeneity. The poor performance of MDL may be due to the loss of information on intra-tumoral heterogeneity and the variation in tumor size between patients. The model searches for the largest tumor bounding box and then crops the input image to a fixed large size without incorporating the tumor mask information. This approach can introduce irrelevant background information into the model, particularly when the tumor is small.
Although the MFEF Net model incorporates advanced modules such as segmentation-guided feature extraction, asymmetric amplification, and dual-attention feature fusion, the asymmetric amplification module may amplify not only pathological features but also noise. Additionally, the feature spaces of the T2w and FLAIR sequences may differ significantly; calculating their differences directly can introduce inconsistencies, affecting the stability of feature representation and prediction performance, and resulting in suboptimal outcomes. MTTU Net achieved good results using a CNN-transformer encoder, but it employs an uncertainty-aware pseudo-label selection method to generate and select pseudo-labels. If the initial model generates a significant number of incorrect pseudo-labels, these errors may propagate and amplify with each iteration, causing the model to gradually deviate from the correct target and fall into a negative feedback loop.
Hence, the proposed HAB-MIL model is designed to address the limitations of these approaches by explicitly accounting for the high heterogeneity of gliomas. It integrates positional encoding during the feature extraction phase and leverages the DGA module to emphasize the significance of key instances. As a result, it achieves superior performance in IDH status prediction compared to all other evaluated methods.
4.4 Interpretative visualization
Figure 5 illustrates how our proposed HAB-MIL model localizes discriminative regions for the prediction of IDH status. In particular, three columns of images are shown for each subject: (1) the original image of each patient, (2) key patches identified by HAB-MIL, and (3) Grad-CAM visualizations produced by HAB-MIL. According to Figure 5, the key patches in IDH wild-type tumors are more widely distributed, encompassing both the tumor core and surrounding regions. In contrast, key patches in IDH-mutant tumors are predominantly confined to the tumor core or adjacent areas, indicating a more localized distribution and reduced invasiveness.
Figure 5. Interpretability of the HAB-MIL model. The first column is FLAIR slices extracted from different patients. Meanwhile, the second column is key patches from HAB-MIL, whereas the third column is their corresponding Grad-CAMs with the HAB-MIL.
The boundaries of IDH wild-type tumors appear relatively indistinct, and some tumor regions show low contrast with surrounding tissues, suggesting a tendency toward invasive growth. In comparison, IDH-mutant tumors show well-defined boundaries, concentrated lesions, and more prominent high signal in the tumor core, indicating a denser internal structure. These observations highlight significant differences in morphological characteristics, invasiveness, and growth patterns between the two tumor types, offering valuable imaging-based insights to guide treatment decisions and prognosis assessment.
5 Discussion
Despite the significant success of deep neural networks in IDH prediction, they still encounter challenges related to weak interpretability and low credibility. In clinical practice, classification of IDH status is crucial for guiding subsequent treatment strategies and predicting patient prognosis. Currently, the only way to determine IDH status is to obtain pathological tissue by open surgery. Consequently, if we can describe the imaging distinctions between IDH-wildtype and IDH-mutant gliomas and integrate these characteristics into the diagnostic model, the interpretability of the model would be enhanced and its classification performance could be further improved. Following this strategy, a novel Hierarchical Attention-Based Multiple Instance Learning (HAB-MIL) framework is proposed. To effectively capture tumor location information, auxiliary positional encoding was employed to encode imaging features, which were then concatenated with the feature maps derived from a deep instance-level feature extractor, as illustrated in Figure 2. Tumor location, serving as a non-invasive biomarker, not only significantly enhances the accuracy of the classification model but also provides valuable insights for preoperative prediction of IDH status. Additionally, we developed a weakly supervised classification network, which substantially reduces annotation costs and better accommodates the spatial heterogeneity of gliomas. This approach facilitates the automatic identification of key regional features closely associated with IDH prediction, thereby improving the model’s accuracy, generalization performance, and clinical interpretability.
Table 2 presents the results of the ablation study on different modules. As shown in the table, the model’s performance improves to a certain extent with the addition of either module compared to using the backbone alone. Table 3 presents the ablation study results for different positional encoding methods. Testing with various positional encodings led to slight variations in the performance of the proposed algorithm, demonstrating the flexibility and effectiveness of GELU. Table 4 compares the performance of different attention mechanisms. The dynamic attention mechanism used in this study improves the precision of feature selection through a gating mechanism and dropout regularization, while also enhancing the model’s generalization capability. This approach is particularly well suited to few-shot learning or multiple instance learning scenarios.
Tables 5 and 6 present the results of different modality combinations on the TCIA dataset and the in-house dataset, respectively. The performance on the TCIA dataset is slightly better than that on the in-house dataset, likely because public datasets typically undergo strict quality control prior to release, whereas the quality of in-house data may be affected by factors such as the acquisition environment, equipment variability, or human error. In both datasets, the combination of three modalities consistently yields better performance, which can be attributed to the varying sensitivities of different modalities to tissue structures, lesions, and fluids; integrating multiple modalities therefore provides more comprehensive and informative imaging data. Additionally, the network shows increased robustness to missing modalities when the T1, T2, or FLAIR sequence is set to zero during training. Performance decreases only slightly when only a subset of the T1w, T2w, and FLAIR scans is provided compared with all three sequences as input. This is especially useful because not all three MRI modalities are available for every patient in the Xi’an Jiaotong University Hospital dataset; in this way, an accurate prediction can still be obtained for these patients. The very high specificity and slightly lower sensitivity suggest that prediction inaccuracies stem from parts of the surrounding edema that are not detected by the network.
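The modality-zeroing strategy used during training can be sketched as a simple channel-level dropout over the stacked MRI sequences. This is a hedged illustration, not the authors’ training code: the drop probability, channel ordering, and the guarantee of keeping at least one modality are assumptions.

```python
import numpy as np

def modality_dropout(volume, p=0.3, rng=None):
    """Randomly zero whole MRI modality channels during training.

    volume: (n_modalities, D, H, W) stacked T1w/T2w/FLAIR volumes; at least
    one modality is always kept so the input is never entirely empty.
    """
    rng = rng or np.random.default_rng()
    n = volume.shape[0]
    keep = rng.random(n) >= p                   # True = modality survives
    if not keep.any():                          # guarantee one surviving modality
        keep[rng.integers(n)] = True
    return volume * keep[:, None, None, None]   # broadcast mask over D, H, W

rng = np.random.default_rng(2)
vol = np.ones((3, 4, 4, 4))                     # toy T1w/T2w/FLAIR stack
out = modality_dropout(vol, p=0.5, rng=rng)
print(out.shape)
```

Training with inputs perturbed this way forces the network to make its prediction from whichever sequences remain, which is the property exploited at inference time for patients missing one of the three modalities.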
An in-depth analysis of Figure 4 and the attention weights learned by the model reveals a consistent anatomical distribution pattern of IDH-mutant gliomas. Specifically, gliomas located in the thalamus and cerebellum are predominantly IDH wild-type, whereas those in the insular cortex are more likely to harbor IDH mutations. This observation is consistent with previous radiological and histopathological studies, further validating the reliability of our model. Compared with existing methods, our approach offers three distinct advantages. First, HAB-MIL accurately identifies subtle imaging differences between IDH wild-type and IDH-mutant gliomas through key instance selection, helping clinicians pinpoint the regions most informative for distinguishing IDH status. Second, the process of identifying key instances is intuitive and straightforward to implement. Finally, the clinical application of this model requires only routine MRI sequences (T1w, T2w, and FLAIR) as input, without the need for contrast agents, thereby reducing potential risks in practical use. Furthermore, the model processes the entire MRI image directly, requiring only skull-stripping and registration, without manual tumor delineation by radiologists. This design minimizes reliance on specialized expertise and facilitates broader clinical implementation. This study still has several limitations. First, the HAB-MIL model has not yet been validated in large-scale, multicenter clinical settings; therefore, its feasibility and stability in real-world clinical practice remain to be confirmed. Second, some cases in this study lacked key molecular markers such as TERT promoter mutation, EGFR amplification, and chromosome 7 gain/chromosome 10 loss (+7/−10).
6 Conclusion
In this study, we proposed HAB-MIL, a framework for predicting IDH status using only routinely acquired preoperative MRI. Compared with conventional clinical workflows that require labor-intensive, slice-by-slice tumor annotation, HAB-MIL employs a weakly supervised learning strategy that relies solely on case-level labels, eliminating the need for detailed lesion delineation. This approach significantly reduces annotation time and manual effort while preserving model accuracy. Moreover, the model requires only routine MRI sequences such as T1-weighted, T2-weighted, and FLAIR images, without the need for contrast agents, thereby minimizing potential procedural risks and reducing the overall financial burden on patients.
In addition, future work will focus on further refining and expanding this study. First, we plan to validate the HAB-MIL model in large-scale, multicenter clinical settings to evaluate its feasibility and stability in real-world clinical practice. Second, we will collect more diverse datasets encompassing cases acquired from different imaging devices and scanning protocols, as well as additional modalities such as MR perfusion and DTI, to further enhance the model’s generalization and robustness. Finally, we will conduct comparative analyses with other noninvasive prediction methods to comprehensively evaluate the model’s performance.
Data availability statement
The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding author.
Ethics statement
The studies involving humans were approved by The Ethics Committee of the First Affiliated Hospital of Xi’an Jiaotong University. The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required from the participants or the participants’ legal guardians/next of kin in accordance with the national legislation and institutional requirements.
Author contributions
QX: Formal analysis, Software, Validation, Visualization, Writing – original draft. YHS: Software, Writing – review & editing. YL: Data curation, Writing – review & editing. YS: Data curation, Writing – review & editing. HW: Methodology, Software, Writing – review & editing. FW: Methodology, Writing – review & editing. RW: Resources, Supervision, Writing – review & editing. BC: Resources, Supervision, Writing – review & editing. MZ: Resources, Supervision, Writing – review & editing. CN: Funding acquisition, Resources, Writing – review & editing.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This study was funded by the Shaanxi Provincial Key Industrial Innovation Chain Project (No. 2024SF-ZDCYL02-10) and the 2024 Research Project on Clinical Applications of Medical Artificial Intelligence (No. YLXX24AIA021).
Conflict of interest
The authors declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
1. Schaff LR and Mellinghoff IK. Glioblastoma and other primary brain malignancies in adults: A review. JAMA. (2023) 329:574. doi: 10.1001/jama.2023.0023
2. Weller M, Wen PY, Chang SM, Dirven L, Lim M, Monje M, et al. Glioma. Nat Rev Dis Primers. (2024) 10:33. doi: 10.1038/s41572-024-00516-y
3. Singh S, Dey D, Barik D, Mohapatra I, Kim S, Sharma M, et al. Glioblastoma at the crossroads: current understanding and future therapeutic horizons. Sig Transduct Target Ther. (2025) 10:213. doi: 10.1038/s41392-025-02299-4
4. Carosi F, Broseghini E, Fabbri L, Corradi G, Gili R, Forte V, et al. Targeting isocitrate dehydrogenase (IDH) in solid tumors: current evidence and future perspectives. Cancers. (2024) 16:2752. doi: 10.3390/cancers16152752
5. Rudà R, Bruno F, Ius T, Silvani A, Minniti G, Pace A, et al. IDH wild-type grade 2 diffuse astrocytomas: prognostic factors and impact of treatments within molecular subgroups. Neuro-Oncology. (2022) 24:809–20. doi: 10.1093/neuonc/noab239
6. Ramos-Fresnedo A, Pullen MW, Perez-Vega C, Domingo RA, Akinduro OO, Almeida JP, et al. The survival outcomes of molecular glioblastoma IDH-wildtype: a multicenter study. J Neurooncol. (2022) 157:177–85. doi: 10.1007/s11060-022-03960-6
7. Solomou G, Finch A, Asghar A, and Bardella C. Mutant IDH in gliomas: role in cancer and treatment options. Cancers. (2023) 15:2883. doi: 10.3390/cancers15112883
8. Miller JJ, Gonzalez Castro LN, McBrayer S, Weller M, Cloughesy T, Portnow J, et al. Isocitrate dehydrogenase (IDH) mutant gliomas: A Society for Neuro-Oncology (SNO) consensus review on diagnosis, management, and future directions. Neuro-Oncology. (2023) 25:4–25. doi: 10.1093/neuonc/noac207
9. Yu D, Zhong Q, Xiao Y, Feng Z, Tang F, Feng S, et al. Combination of MRI-based prediction and CRISPR/Cas12a-based detection for IDH genotyping in glioma. NPJ Precis Onc. (2024) 8:140. doi: 10.1038/s41698-024-00632-8
10. Jeon YH, Choi KS, Lee KH, Jeong SY, Lee JY, Ham T, et al. Deep learning-based quantification of T2-FLAIR mismatch sign: extending IDH mutation prediction in adult-type diffuse lower-grade glioma. Eur Radiol. (2025) 35:5193–202. doi: 10.1007/s00330-025-11475-7
11. Bangalore Yogananda CG, Truong NCD, Wagner BC, Xi Y, Bowerman J, Reddy DD, et al. Bridging the clinical gap: Confidence informed IDH prediction in brain gliomas using MRI and deep learning. Neuro-Oncol Adv. (2025) 7:vdaf142. doi: 10.1093/noajnl/vdaf142
12. Negro A, Gemini L, Tortora M, Pace G, Iaccarino R, Marchese M, et al. VASARI 2.0: a new updated MRI VASARI lexicon to predict grading and IDH status in brain glioma. Front Oncol. (2024) 14:1449982. doi: 10.3389/fonc.2024.1449982
13. Jopek MA, Pastuszak K, Cygert S, Best MG, Wurdinger T, Jassem J, et al. Deep learning-based, multiclass approach to cancer classification on liquid biopsy data. IEEE J Transl Eng Health Med. (2024) 12:306–13. doi: 10.1109/JTEHM.2024.3360865
14. Elwenspoek MMC, Sheppard AL, McInnes MDF, Merriel SWD, Rowe EWJ, Bryant RJ, et al. Comparison of multiparametric magnetic resonance imaging and targeted biopsy with systematic biopsy alone for the diagnosis of prostate cancer: A systematic review and meta-analysis. JAMA Netw Open. (2019) 2:e198427. doi: 10.1001/jamanetworkopen.2019.8427
15. Gong Y, Liu G, Xue Y, Li R, and Meng L. A survey on dataset quality in machine learning. Inf Softw Technol. (2023) 162:107268. doi: 10.1016/j.infsof.2023.107268
16. Gadermayr M and Tschuchnig M. Multiple instance learning for digital pathology: A review of the state-of-the-art, limitations & future potential. Comput Med Imaging Graphics. (2024) 112:102337. doi: 10.1016/j.compmedimag.2024.102337
17. Li Z, Wang Y, Zhu Y, Xu J, Wei J, Xie J, et al. Modality-based attention and dual-stream multiple instance convolutional neural network for predicting microvascular invasion of hepatocellular carcinoma. Front Oncol. (2023) 13:1195110. doi: 10.3389/fonc.2023.1195110
18. Mahadevkar SV, Khemani B, Patil S, Kotecha K, Vora DR, Abraham A, et al. A review on machine learning styles in computer vision—Techniques and future directions. IEEE Access. (2022) 10:107293–329. doi: 10.1109/ACCESS.2022.3209825
19. Liang X, Li X, Li F, Jiang J, Dong Q, Wang W, et al. MedFILIP: medical fine-grained language-image pre-training. IEEE J BioMed Health Inform. (2025) 29:3587–97. doi: 10.1109/JBHI.2025.3528196
20. Shi X, Xing F, Xie Y, Zhang Z, Cui L, and Yang L. Loss-based attention for deep multiple instance learning. AAAI. (2020) 34:5742–9. doi: 10.1609/aaai.v34i04.6030
21. Konstantinov AV and Utkin LV. Multi-attention multiple instance learning. Neural Comput Applic. (2022) 34:14029–51. doi: 10.1007/s00521-022-07259-5
22. Kim D, Lee J, Jung M, Yim K, Hwang G, Yoon H, et al. Whole slide image-level classification of malignant effusion cytology using clustering-constrained attention multiple instance learning. Lung Cancer. (2025) 204:108552. doi: 10.1016/j.lungcan.2025.108552
23. Lu MY, Williamson DFK, Chen TY, Chen RJ, Barbieri M, and Mahmood F. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat BioMed Eng. (2021) 5:555–70. doi: 10.1038/s41551-020-00682-w
24. Zhang J, Cao J, Tang F, Xie T, Feng Q, and Huang M. Multi-level feature exploration and fusion network for prediction of IDH status in gliomas from MRI. IEEE J BioMed Health Inform. (2024) 28:42–53. doi: 10.1109/JBHI.2023.3279433
25. Rani V, Kumar M, Gupta A, Sachdeva M, Mittal A, and Kumar K. Self-supervised learning for medical image analysis: a comprehensive review. Evolving Syst. (2024) 15:1607–33. doi: 10.1007/s12530-024-09581-w
26. Dominguez-Morales JP, Duran-Lopez L, Marini N, Vicente-Diaz S, Linares-Barranco A, Atzori M, et al. A systematic comparison of deep learning methods for Gleason grading and scoring. Med Image Anal. (2024) 95:103191. doi: 10.1016/j.media.2024.103191
27. Ding Y, Zhao L, Yuan L, and Wen X. (2022). Deep multi-instance learning with adaptive recurrent pooling for medical image classification, in: 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Las Vegas, NV, USA. pp. 3335–42. IEEE. doi: 10.1109/BIBM55620.2022.9995191
28. Cheplygina V, de Bruijne M, and Pluim JPW. Not-so-supervised: A survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. Med Image Anal. (2019) 54:280–96. doi: 10.1016/j.media.2019.03.009
29. Liang X, Han L, Zhang X, Li X, Sun Y, Tong T, et al. Singular value decomposition based under-sampling pattern optimization for MRI reconstruction. Med Phys. (2025) 52:e17860. doi: 10.1002/mp.17860
30. Musigmann M, Bilgin M, Bilgin SS, Krähling H, Heindel W, and Mannil M. Completely non-invasive prediction of IDH mutation status based on preoperative native CT images. Sci Rep. (2024) 14:26763. doi: 10.1038/s41598-024-77789-6
31. Hosseini S, Hosseini E, Hajianfar G, Shiri I, Servaes S, Rosa-Neto P, et al. MRI-based radiomics combined with deep learning for distinguishing IDH-mutant WHO grade 4 astrocytomas from IDH-wild-type glioblastomas. Cancers. (2023) 15:951. doi: 10.3390/cancers15030951
32. Calabrese E, Villanueva-Meyer JE, Rudie JD, Rauschecker AM, Baid U, Bakas S, et al. The University of California San Francisco preoperative diffuse glioma MRI dataset. Radiol: Artif Intell. (2022) 4:e220058. doi: 10.1148/ryai.220058
33. Bello JLG and Kim M. (2021). PLADE-net: towards pixel-level accuracy for self-supervised single-view depth estimation with neural positional encoding and distilled matting loss, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA. pp. 6847–56. IEEE. doi: 10.1109/CVPR46437.2021.00678
34. He A, Wang P, Zhu A, Liu Y, Chen J, and Liu L. Predicting IDH mutation status in low-grade gliomas based on optimal radiomic features combined with multi-sequence magnetic resonance imaging. Diagnostics. (2022) 12:2995. doi: 10.3390/diagnostics12122995
35. Bumes E, Fellner C, Fellner FA, Fleischanderl K, Häckl M, Lenz S, et al. Validation study for non-invasive prediction of IDH mutation status in patients with glioma using in vivo 1H-magnetic resonance spectroscopy and machine learning. Cancers. (2022) 14:2762. doi: 10.3390/cancers14112762
36. Kandalgaonkar P, Sahu A, Saju AC, Joshi A, Mahajan A, Thakur M, et al. Predicting IDH subtype of grade 4 astrocytoma and glioblastoma from tumor radiomic patterns extracted from multiparametric magnetic resonance images using a machine learning approach. Front Oncol. (2022) 12:879376. doi: 10.3389/fonc.2022.879376
37. Wu X, Zhang S, Zhang Z, He Z, Xu Z, Wang W, et al. Biologically interpretable multi-task deep learning pipeline predicts molecular alterations, grade, and prognosis in glioma patients. NPJ Precis Onc. (2024) 8:181. doi: 10.1038/s41698-024-00670-2
38. Liu X, Hu W, Diao S, Abera DE, Racoceanu D, and Qin W. Multi-scale feature fusion for prediction of IDH1 mutations in glioma histopathological images. Comput Methods Programs Biomed. (2024) 248:108116. doi: 10.1016/j.cmpb.2024.108116
39. Choi YS, Bae S, Chang JH, Kang S-G, Kim SH, Kim J, et al. Fully automated hybrid approach to predict the IDH mutation status of gliomas via deep learning and radiomics. Neuro-Oncology. (2021) 23:304–13. doi: 10.1093/neuonc/noaa177
40. Wang Y, Wang Y, Guo C, Zhang S, and Yang L. SGPNet: A three-dimensional multitask residual framework for segmentation and IDH genotype prediction of gliomas. Comput Intell Neurosci. (2021) 2021:5520281. doi: 10.1155/2021/5520281
41. van der Voort SR, Incekara F, Wijnenga MMJ, Kapsas G, Gahrmann R, Schouten JW, et al. Combined molecular subtyping, grading, and segmentation of glioma using multi-task deep learning. Neuro-Oncology. (2023) 25:279–89. doi: 10.1093/neuonc/noac166
42. Cheng J, Liu J, Kuang H, and Wang J. A fully automated multimodal MRI-based multi-task learning for glioma segmentation and IDH genotyping. IEEE Trans Med Imaging. (2022) 41:1520–32. doi: 10.1109/TMI.2022.3142321
43. Farahani S, Hejazi M, Di Ieva A, Fatemizadeh E, and Liu S. Towards a multimodal MRI-based foundation model for multi-level feature exploration in segmentation, molecular subtyping, and grading of glioma. (2025). doi: 10.48550/arXiv.2503.06828
Keywords: dynamic gated attention, glioma, IDH, location encoding, Multiple Instance Learning
Citation: Xie Q, Sun Y, Liang Y, Shang Y, Wang H, Wang F, Wei R, Chen B, Zhang M and Niu C (2025) Predicting isocitrate dehydrogenase status in glioma using hierarchical attention-based deep 3D multiple instance learning. Front. Oncol. 15:1665690. doi: 10.3389/fonc.2025.1665690
Received: 16 July 2025; Revised: 02 December 2025; Accepted: 03 December 2025;
Published: 18 December 2025.
Edited by:
Francesco Bruno, University and City of Health and Science Hospital, Italy
Reviewed by:
Giulia Berzero, IRCCS Ospedale San Raffaele, Italy
Yu Liang, Harbin University of Science and Technology, China
Copyright © 2025 Xie, Sun, Liang, Shang, Wang, Wang, Wei, Chen, Zhang and Niu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Chen Niu, niuchen.xjtu@xjtu.edu.cn