Predicting breast cancer treatment response and prognosis using AI-based image classification

Wang, Bingyi; Chen, Shu; Li, Wei

doi:10.3389/fonc.2025.1619994

ORIGINAL RESEARCH article

Front. Oncol., 21 October 2025

Sec. Breast Cancer

Volume 15 - 2025 | https://doi.org/10.3389/fonc.2025.1619994

This article is part of the Research TopicAI-Powered Insights: Predicting Treatment Response and Prognosis in Breast CancerView all 15 articles

Predicting breast cancer treatment response and prognosis using AI-based image classification

Bingyi Wang¹

Shu Chen²

Wei Li^3*

¹Department of Radiation Oncology,Clinical Oncology School of Fujian Medical University, Fujian Cancer Hospital, NHC Key Laboratory of Cancer Metabolism, Fuzhou, China
²Department of Gastric Surgery, Clinical Oncology School of Fujian Medical University, Fujian Cancer Hospital, NHC Key Laboratory of Cancer Metabolism, Fuzhou, China
³Medical School, Yangzhou University, Yangzhou, China

Introduction: Accurate prediction of treatment response and prognosis in breast cancer patients is critical to advance personalized medicine and optimize therapeutic decision-making. Within the context of AI-enabled healthcare, there remains a pressing need to develop robust, interpretable models that can account for the temporal complexity and heterogeneity inherent in longitudinal patient data.

Methods: This study proposes a novel framework designed to model patient-specific treatment trajectories using a dynamics-aware, deep sequence learning architecture. Aligned with the core themes of computational prognostics and precision therapy, our method addresses the challenges posed by variable patient responses, missing clinical records, and complex pharmacological interactions. Existing approaches, including conventional supervised learning and static classification models, often fall short in capturing the underlying temporal dependencies, multimodal data fusion, and counterfactual reasoning necessary for real-world clinical deployment. These limitations hinder generalizability, especially in scenarios where treatment outcomes are delayed or weakly annotated. In contrast, our approach integrates recurrent modeling, attention mechanisms, and uncertainty quantification to better capture the evolving nature of patient health trajectories. Moreover, we incorporate domain-informed regularization techniques and causal inference modules to improve interpretability and clinical relevance.

Results and Discussion: By learning temporal dynamics in a personalized manner, the proposed model enhances predictive performance while remaining sensitive to patient-specific variations and therapeutic regimens. Through extensive validation on real-world breast cancer cohorts, we demonstrate that our framework not only outperforms existing baselines but also provides actionable insights that can inform adaptive treatment planning and risk stratification.

1 Introduction

Breast cancer continues to be a primary contributor to cancer-associated illness and death among women on a global scale. Accurate prediction of treatment response and patient prognosis is essential to improving therapeutic strategies and clinical outcomes (1). Traditionally, such predictions have relied heavily on histopathological examination, molecular subtyping, and clinical staging; however, these approaches are often limited by inter-observer variability and incomplete capture of tumor heterogeneity. With the advent of digital pathology and the availability of high-resolution whole-slide images (WSIs), artificial intelligence (AI) offers a transformative opportunity (2). Not only can AI-driven image classification systems process vast amounts of image data with high consistency, but they can also uncover complex patterns that may not be perceptible to human experts. Moreover, these techniques enhance predictive accuracy by integrating morphological cues with computational precision, enabling clinicians to tailor treatments based on a more robust risk stratification (3). Therefore, developing AI-based models for image classification is not only necessary for optimizing individualized breast cancer therapy but also critical in advancing precision oncology.

Early computational strategies for analyzing histopathological images relied on predefined morphological descriptors and diagnostic protocols (4). These systems extracted interpretable characteristics—such as nucleus size, texture, and spatial arrangement—from tissue samples to support rule-based classification or grading (5). While these approaches aligned with traditional pathology workflows and offered transparency, they were limited in flexibility and struggled to capture the subtle and variable visual features present in large-scale WSIs. In particular, their performance was susceptible to staining inconsistencies, tumor heterogeneity, and variability across datasets (6).

As digital pathology advanced, researchers introduced more adaptable models capable of recognizing patterns directly from labeled examples (7). These methods employed classification algorithms trained on manually extracted features, allowing systems to differentiate tumor subtypes or predict outcomes with improved accuracy (8). Approaches such as support vector machines and ensemble classifiers demonstrated practical utility in medium-sized datasets and well-curated research cohorts. However, they still relied on handcrafted feature extraction pipelines, which imposed constraints on scalability and made it difficult to generalize findings across institutions or patient populations (9).

Recent innovations have led to end-to-end learning frameworks that automatically derive predictive representations from raw pathology images (10). Deep neural networks—particularly convolutional architectures and attention-based models—have enabled a patch-level analysis of WSIs, learning discriminative features that correspond to prognostic markers (11). These systems support the integration of contextual information and facilitate downstream tasks such as survival analysis, molecular subtype inference, and therapy response prediction (12). Despite achieving state-of-the-art performance, challenges remain in interpretability, computational demand, and the need for annotated training data. As a response, the development of explainable and resource-efficient architectures is gaining momentum, aiming to balance clinical reliability with the scalability of deep learning in pathology (13).

In clinical oncology, various biochemical parameters are routinely used for early tumor detection and monitoring. Radenkovic et al. highlighted the diagnostic significance of matrix metalloproteinases (MMP-2 and MMP-9) in basal-like breast cancer, reflecting their association with tumor invasiveness and progression (14). Another study by Radenkovic et al. emphasized the role of oxidative stress-related enzymes such as lactate dehydrogenase (LDH), catalase, and superoxide dismutase (SOD) in tumor tissues, showing that their expression levels correspond with mammographic findings and tumor characteristics (15). Jurisic et al. further discussed the clinical relevance of LDH as a tumor biomarker, summarizing its biochemical behavior and potential in oncological diagnostics (16). In addition to biochemical assessment, morphological analysis remains crucial. The study by Radenkovic et al. demonstrated that correlating mammographic images with histopathological findings in HER2-positive breast cancer provides deeper diagnostic insights, emphasizing the need for integrated diagnostic approaches (17).

While prior studies have demonstrated significant progress in applying deep learning to cancer diagnostics, several challenges remain unaddressed. Traditional symbolic systems often lack flexibility, machine learning approaches are highly feature-dependent, and deep learning models—though powerful—frequently suffer from a lack of interpretability, limiting their adoption in clinical workflows. To address these limitations, we propose a novel hybrid approach that leverages the interpretability of symbolic reasoning with the scalability of deep learning. Our method incorporates a modular AI architecture that integrates pathology-informed feature extraction with transformer-based visual encoders and an attention-guided prognosis predictor. By combining domain knowledge with data-driven inference, this system not only enhances accuracy but also enables interpretability through visual attention maps and feature attribution techniques. Our approach is designed to operate across different clinical settings and cancer subtypes, promoting generalizability and robustness. This hybrid methodology aims to bridge the gap between accuracy and trustworthiness in clinical AI applications, ultimately supporting oncologists in devising personalized treatment regimens and improving patient outcomes.

The main contributions of this work are as follows:

● We propose a novel dual-module framework that integrates symbolic feature extraction with deep visual embeddings, enabling interpretable and accurate prediction of breast cancer treatment response.

● Our method supports multiple clinical scenarios and subtypes by employing a flexible architecture that generalizes across histopathology datasets with minimal performance degradation.

● Experimental results on benchmark datasets demonstrate a significant improvement in prediction accuracy (up to 12% gain) over existing methods while maintaining interpretability through integrated attention maps.

2 Related work

2.1 Deep learning for histopathology analysis

A central research direction in predicting breast cancer treatment response using AI involves deep learning techniques applied to histopathological images (18). Histopathology, particularly hematoxylin and eosin (H&E)-stained slides, remains a gold standard in cancer diagnosis and is widely accessible. Convolutional neural networks (CNNs) have demonstrated notable performance in tasks such as tumor classification, segmentation, and grading (19). Pioneering works like that of Coudray et al. (20) on lung cancer laid the foundation for similar approaches in breast cancer (21). In this domain, deep learning models are trained on large annotated image datasets to recognize morphological features that correlate with treatment outcomes or overall prognosis. A significant body of literature has explored the application of CNNs to distinguish between different breast cancer subtypes, such as invasive ductal carcinoma versus lobular carcinoma, and to predict molecular markers HER2, ER, and PR status (22). Models such as ResNet and DenseNet have been adapted and fine-tuned to extract both low-level texture features and high-level morphological patterns. Moreover, multiple instance learning (MIL) frameworks have been employed to account for the weakly labeled nature of whole slide images, where only slide-level labels are available without pixel-level annotations (23). Another key development is the integration of patch-level analysis and whole-slide-level aggregation using attention mechanisms or transformer-based architectures. These models enable the network to focus on diagnostically relevant regions, thereby improving prediction accuracy and interpretability—for example, attention-based MIL has been shown to provide heatmaps highlighting tumor-infiltrating lymphocytes or necrotic regions, both of which are relevant to prognosis and treatment response (24). Datasets such as CAMELYON16, TCGA, and BACH provide valuable benchmarks for model training and evaluation. However, the heterogeneity of breast cancer tissue and staining protocols across institutions remains a challenge (25). Domain adaptation and self-supervised learning have been proposed to mitigate the performance drop in cross-domain applications. The literature increasingly emphasizes the need for model robustness, generalizability, and clinical interpretability, including the use of saliency maps and feature attribution methods to explain predictions.

2.2 Radiomics and multimodal integration

Radiomics, which involves extracting quantitative features from medical imaging modalities like mammography, MRI, and ultrasound, represents another prominent research direction (26). AI-driven radiomics aims to uncover imaging biomarkers that predict therapeutic response or long-term outcomes. Unlike traditional image interpretation by radiologists, radiomics involves high-throughput feature extraction, including shape, texture, and intensity statistics, which are then correlated with clinical endpoints using machine learning models (27). Recent studies have shown that radiomic features from dynamic contrast-enhanced MRI (DCE-MRI) can predict neoadjuvant chemotherapy (NAC) response with significant accuracy—for instance, early changes in tumor heterogeneity and vascularity have been linked to treatment sensitivity (28). Deep learning has further enhanced radiomics by replacing handcrafted feature engineering with learned representations from raw imaging data. Autoencoders and 3D CNNs have been utilized to capture spatial and temporal patterns in longitudinal imaging (29). The integration of radiomics with clinical, pathological, and genomic data represents a growing trend. Multimodal models leveraging tabular clinical data, histopathological images, and radiomics features have been proposed using fusion networks, often based on transformers or graph neural networks (GNNs) (30). These models aim to holistically characterize the tumor microenvironment and host response, leading to improved predictive performance over unimodal approaches (31). The challenges include the harmonization of imaging protocols across scanners and institutions, limited availability of annotated longitudinal datasets, and the interpretability of deep radiomics models (32). Federated learning has been suggested as a solution to the data privacy and sharing issues that hinder multi-institutional collaborations. Furthermore, explainability techniques are being actively developed to identify which imaging phenotypes contribute most to the predicted outcomes (33).

2.3 AI for personalized treatment planning

A critical area of research lies in the use of AI for personalizing breast cancer treatment by predicting individual responses to therapy. Traditional treatment planning relies heavily on standardized clinical guidelines, which may not capture the complex biological heterogeneity of breast cancer (34). AI systems offer a data-driven alternative, enabling precision oncology through personalized predictions based on image-derived biomarkers and patient-specific characteristics. Predictive models for treatment response focus on various therapeutic regimens, including chemotherapy, hormone therapy, and targeted therapies (35). By analyzing pre-treatment imaging and pathology data, AI can stratify patients into likely responders and non-responders (36). This allows clinicians to modify or escalate treatment strategies proactively, avoiding unnecessary toxicity and improving outcomes. Notable research efforts include the use of longitudinal imaging to model tumor evolution and response trajectories using recurrent neural networks or temporal convolutional networks (37). Moreover, prognosis prediction involves estimating survival outcomes such as disease-free survival (DFS) and overall survival (OS). AI models have been trained to predict these endpoints using features derived from imaging and pathology, often in conjunction with clinical staging and genetic information (38). Kaplan–Meier analysis and Cox proportional hazards modeling are commonly used for evaluation, while AI models often optimize metrics such as concordance index or time-dependent AUC. Another promising direction involves reinforcement learning (RL) to dynamically recommend treatment strategies (39). RL agents can be trained on retrospective datasets to learn policies that maximize long-term patient outcomes under various treatment sequences. This paradigm shift from static prediction to dynamic decision-making is still in its early stages but holds significant potential (40). Current limitations include the scarcity of prospective validation studies, the black-box nature of many AI models, and regulatory challenges in clinical deployment. There is also a growing emphasis on incorporating patient preferences and quality-of-life metrics into AI-assisted treatment planning (41). Collaborative efforts among oncologists, data scientists, and regulatory bodies are essential to translate these advances into routine clinical practice.

3 Method

3.1 Overview

In this section, we introduce our proposed framework designed to model and predict treatment response across varying biomedical and clinical contexts. The capability to accurately forecast an individual’s response to a therapeutic intervention is critical for enabling personalized medicine and optimizing treatment protocols. Our approach draws inspiration from recent advancements in sequence modeling, dynamics imitation, and representation learning, with specific tailoring to the domain of treatment outcome forecasting.

The “Method” section is organized into three key components, each addressing a specific methodological challenge. In Section 3.2, we formulate the problem of treatment response modeling as a structured prediction task within a dynamic system, where patient trajectories under treatment are viewed as stochastic processes. We provide rigorous mathematical formalization, including state space definitions, temporal dependency modeling, and symbolic abstractions of treatment-response interactions. This foundational formulation establishes a backbone for the learning problem and guides subsequent model design. In Section 3.3, we introduce our novel model, ResponseNet, which is a dynamics-aware, multi-level sequence learner tailored to capture both short-term physiological reactions and long-term outcome trends. ResponseNet incorporates heterogeneous data sources, including patient histories, treatment regimens, and clinical measurements, via a deep reparameterization approach. It is designed to imitate the progression of patient states post-treatment, drawing conceptual parallels with generative adversarial imitation learning frameworks adapted from natural video forecasting. The architectural design allows the model to retain interpretability while maintaining strong predictive power across varying temporal granularities. Section 3.4 details our adaptive knowledge infusion strategy, a principled mechanism for injecting domain knowledge into the learning process. This strategy leverages curated clinical priors, ontological constraints, and pharmacological knowledge to shape the learning trajectory of the model. Through an interaction-aware optimization scheme, the model dynamically adjusts its learning focus based on latent treatment–response signals. This approach not only regularizes learning in data-sparse regimes but also encourages biologically plausible predictions that align with expert understanding.

To improve the interpretability of the proposed architecture for readers with clinical or non-technical backgrounds, a simplified and color-coded schematic is introduced, as shown in Figure 1. This figure presents the end-to-end structure of the model in a modular layout, with functional components visually grouped and labeled. The architecture is divided into four high-level blocks: latent state inference (preliminaries), patient-specific prediction (ResponseNet), counterfactual reasoning, and adaptive knowledge infusion (AKI). Each block is represented using distinct colors to highlight its role and to reduce cognitive load when tracing data flow. The figure emphasizes key interactions between learned representations and domain knowledge modules—for example, treatment actions are semantically embedded and passed to both predictive and counterfactual decoding modules. Latent health states are updated dynamically and passed into response prediction layers and symbolic constraints, while clinical priors guide the learning process through regularizers and ontology-based constraints. This design allows for a unified understanding of how data, treatments, and expert knowledge interact within the model. By presenting the architecture in this structured and clinically-oriented format, the figure enables practitioners to interpret the role of each component without relying on formal equations. The layout supports intuitive comprehension of model behavior, particularly how symbolic reasoning, learned dynamics, and decision-time explanations come together to support interpretable prediction. This visualization serves as a bridge between algorithmic detail and practical clinical insight, facilitating interdisciplinary understanding and communication.

Figure 1

Flowchart depicting the Adaptive Knowledge Infusion (AKI) process. It includes sections: Preliminaries, ResponseNet, and Appenrintes. Arrows indicate flow between components like Input Embedding, Predictive Decoding, Semantic Treatment Embedding, and Regularizer, leading to Latensic Anchoring. Processes like Ontology-Based Consistency Learning, and Discriminative Counterfactual Training are also part of the flow.

Figure 1. Simplified architecture of the proposed framework. The model is organized into modular components: latent state inference, predictive and counterfactual decoding, semantic treatment embedding, and adaptive knowledge infusion (AKI). Color coding and directional flow highlight interactions between patient history, symbolic priors, and treatment-aware predictive modules.

3.2 Preliminaries

This work aims to model the latent treatment response trajectory of a patient undergoing therapeutic interventions, using longitudinal historical data including clinical features, physiological measurements, and treatment events. The response modeling task is framed as a partially observed Markov decision process (POMDP), which allows reasoning under uncertainty and incorporates the influence of sequential interventions over time. Let $P$ denote the patient population. For each patient $p \in P$ , the temporal sequence $T_{p} = {(x_{t}^{p}, a_{t}^{p}, y_{t}^{p})}_{t = 1}^{T}$ represents observations over time, where $x_{t}^{p}$ are covariates, $a_{t}^{p}$ are treatments, and $y_{t}^{p}$ are response outcomes. The true underlying health status is captured by a latent state $z_{t}^{p} \in Z$ , evolving stochastically through a transition kernel (Equation 1):

\begin{array}{l} p (z_{t + 1}^{p} | z_{t}^{p}, a_{t}^{p}) = T (z_{t}^{p}, a_{t}^{p}), & (1) \end{array}

and generating observable variables via an emission model (Equation 2):

\begin{array}{l} p (x_{t}^{p}, y_{t}^{p} | z_{t}^{p}) = ℰ (z_{t}^{p}) . & (2) \end{array}

The initial state is drawn from a prior distribution (Equation 3):

\begin{array}{l} z_{1}^{p} \sim p_{0} (z) = N (μ_{0}, Σ_{0}) . & (3) \end{array}

To handle partial observability, a recognition network $q_{ϕ} (z_{t}^{p} | ℋ_{t}^{p})$ is introduced to approximate the posterior over latent states from historical data $ℋ_{t}^{p} = {(x_{s}^{p}, a_{s}^{p}, y_{s}^{p})}_{s = 1}^{t}$ . The variational evidence lower bound (ELBO) is optimized jointly with respect to generative and inference parameters (Equation 4):

\begin{array}{l} \begin{array}{l} ℒ (θ, ϕ) = E_{q_{ϕ}} [\sum_{t = 1}^{T} log p (x_{t}^{p}, y_{t}^{p} | z_{t}^{p}) + log p (z_{t + 1}^{p} | z_{t}^{p}, a_{t}^{p}) \\ - log q_{ϕ} (z_{t}^{p} | ℋ_{t}^{p})] . \end{array} & (4) \end{array}

The full training objective aggregates patient trajectories and includes a regularization term (Equation 5):

\begin{array}{l} J (θ, ϕ) = \sum_{p \in P} ℒ_{p} (θ, ϕ) - λ \cdot ℛ (θ) . & (5) \end{array}

To accommodate censored or partially missing responses, a binary mask $m_{t}^{p} \in {0, 1}^{k}$ is applied to the likelihood computation (Equation 6):

\begin{array}{l} log p (y_{t}^{p} | z_{t}^{p}) = \sum_{j = 1}^{k} m_{t, j}^{p} \cdot log N (y_{t, j}^{p}; μ_{j} (z_{t}^{p}), σ_{j}^{2} (z_{t}^{p})) . & (6) \end{array}

In addition to standard predictions, the framework enables counterfactual reasoning. A prediction operator is defined to estimate future outcomes under alternative, hypothetical treatments ${\tilde{a}}_{t}$ (Equation 7):

\begin{array}{l} {\hat{y}}_{t + 1}^{p, cf} = E_{z_{t}^{p} \sim q_{ϕ}} [E_{z_{t + 1}^{p} \sim T (z_{t}^{p}, {\tilde{a}}_{t})} [ℰ_{y} (z_{t + 1}^{p})]], & (7) \end{array}

which supports “what-if” scenario simulation and assists in evaluating alternative therapy options.

This section builds a probabilistic foundation for understanding how a patient’s health status evolves over time under different treatments. Rather than using raw features alone, the model constructs a hidden state that summarizes clinical information and allows prediction of future outcomes. By using a variational framework, it can handle uncertainty and missing values. The model also supports hypothetical simulations—what would happen if a different treatment had been used—making it useful for treatment planning and clinical decision support.

3.3 ResponseNet

To operationalize the symbolic formulation and latent-state structure introduced in the previous section, we propose ResponseNet, a deep sequence modeling architecture designed to capture and forecast patient-specific treatment response through temporally-grounded latent dynamics. ResponseNet encodes nonlinear dependencies between health status trajectories and administered interventions while enabling interpretable abstractions aligned with clinical variables (as shown in Figure 2).

Figure 2

Flowchart depicting a machine learning model architecture. It includes three main sections: input embedding, intervention module, and an auxiliary embedding process. Arrows indicate data flow. Components such as predictive and counterfactual decoding, latent dynamics modeling, and semantic treatment embedding are highlighted. Steps include calibration attention, random shuffle, and a cropping layer. A circular arrow shows iteration in the auxiliary section. A small image of a cell is above the input embedding.

Figure 2. An illustration of ResponseNet. The architecture of ResponseNet comprises a multi-module framework designed for treatment-aware clinical modeling, including latent dynamics modeling, semantic treatment embedding, and predictive as well as counterfactual decoding. The pipeline begins with input embedding, followed by latent state inference through gated recurrent units, a dedicated intervention module with semantic permutation and decoding, and a global local-attention encoder. Separate decoders generate both observed and counterfactual outcomes, allowing the model to simulate personalized treatment responses under varying hypothetical scenarios. Calibration attention mechanisms and alignment regularizations ensure robustness and interpretability in clinical prediction tasks.

3.3.1 Latent dynamics modeling

At its core, ResponseNet leverages a probabilistic latent state framework to model the evolution of patient-specific health trajectories in response to administered treatments over time. The system is designed to infer compact representations that capture both short-term variability and long-range dependencies in clinical dynamics, with the latent space serving as a hidden abstraction layer that unifies heterogeneous covariates and outcome signals. Each patient’s longitudinal record up to time t is denoted as $ℋ_{t}^{p} = {(x_{s}^{p}, a_{s}^{p}, y_{s}^{p})}_{s = 1}^{t}$ , encompassing observed covariates $x_{s}^{p}$ , intervention actions $a_{s}^{p}$ , and clinical outcomes $y_{s}^{p}$ . We posit a temporally evolving latent state $z_{t}^{p}$ that encodes the internal physiological status, updated through a history-aware encoder formulated as a deep recurrent posterior distribution. The encoder employs gated recurrence to model complex temporal dependencies and amortize inference across varying-length patient histories, parameterizing a multivariate Gaussian distribution over the latent variables as (Equation 8).

\begin{array}{l} q_{ϕ} (z_{t}^{p} | ℋ_{t}^{p}) = N (μ_{t}^{p}, Σ_{t}^{p}), (μ_{t}^{p}, Σ_{t}^{p}) = {GRU}_{ϕ} (ℋ_{t}^{p}), & (8) \end{array}

where $ϕ$ represents the learnable weights of the inference network. To characterize how clinical states evolve under the influence of treatment, we define a continuous latent transition function $f_{θ}$ that maps the current latent state $z_{t}^{p}$ and an embedded treatment action $e (a_{t}^{p})$ to a predictive shift in latent dynamics, capturing the modulating effects of pharmacological interventions and potential interactions between treatment and baseline state. This function is implemented as a multilayer perceptron whose output is perturbed by Gaussian noise to reflect uncertainty in clinical progression, yielding the one-step latent update as (Equation 9).

\begin{array}{l} z_{t + 1}^{p} = f_{θ} (z_{t}^{p}, e (a_{t}^{p})) + ϵ_{t}, ϵ_{t} \sim N (0, σ^{2} I), & (9) \end{array}

where θ denotes the generative parameters of the dynamics model and σ modulates diffusion in the latent space. However, to better account for latent inertia and delayed effects of therapy, we augment this formulation by introducing a second-order difference operator into the transition rule. The model maintains coherence across adjacent latent states by integrating change-of-change signals, allowing the representation to encode temporal acceleration or deceleration in response to treatment shifts. The refined latent transition equation is expressed as (Equation 10).

\begin{array}{l} z_{t + 1}^{p} = z_{t}^{p} + γ \cdot (f_{θ} (z_{t}^{p}, e (a_{t}^{p})) - f_{θ} (z_{t - 1}^{p}, e (a_{t - 1}^{p}))), & (10) \end{array}

where γ is a learnable scalar controlling the strength of coupling across temporal windows. The embedding function $e (a_{t}^{p})$ is jointly learned to reflect both pharmacological identity and dosage, and is trained end-to-end with the rest of the model. To ensure that the latent state remains clinically meaningful and temporally smooth, we introduce a pathwise regularizer that penalizes abrupt changes in latent evolution, stabilizing trajectory estimation and improving generalization in data-sparse regimes. This constraint is defined over the Euclidean distance of successive latent states as (Equation 11).

\begin{array}{l} ℛ_{temp} = \sum_{t = 2}^{T} | | z_{t}^{p} - z_{t - 1}^{p} {| |}_{2}^{2}, & (11) \end{array}

which effectively enforces a soft continuity constraint on the temporal latent manifold. This dynamic modeling framework empowers the architecture to flexibly represent diverse disease trajectories and adaptively adjust to the evolving effects of treatments across time and patients.

3.3.2 Semantic treatment embedding

To capture the pharmacological semantics and structural relations among treatments, we introduce a symbolic embedding mechanism that disentangles class-level and treatment-specific properties through a compositional representation strategy. Each administered treatment $a_{t}^{p}$ is mapped to a dense vector through an embedding function $Ψ (a_{t}^{p})$ , which integrates hierarchical ontology-informed semantics with fine-grained pharmacological deviations. Let $α (a_{t}^{p})$ denote the symbolic class or therapeutic category of treatment $a_{t}^{p}$ , such as hormone therapy, chemotherapy, or targeted inhibitors. We define the embedding as the sum of a class-shared vector $E_{sym} (α (a_{t}^{p}))$ and a specific offset vector $E_{spec} (a_{t}^{p})$ that encodes individual deviations from the class prototype, resulting in (Equation 12).

\begin{array}{l} e (a_{t}^{p}) = Ψ (a_{t}^{p}) = E_{sym} (α (a_{t}^{p})) + E_{spec} (a_{t}^{p}), & (12) \end{array}

where $E_{sym} : V \to ℝ^{m}$ and $E_{spec} : A \to ℝ^{m}$ are learned jointly. This formulation enables parameter sharing across pharmacologically related interventions, facilitating generalization in low-resource settings while retaining the ability to model treatment-specific behavior. To reinforce semantic smoothness and coherence across related treatments, we impose a class-aware regularization objective that penalizes excessive divergence between embeddings of treatments belonging to the same category. Let $C$ be the set of all intra-class treatment pairs, and $δ$ a positive scalar margin defining acceptable divergence within a class. The symbolic regularizer takes the form (Equation 13).

\begin{array}{l} ℛ_{sym} = \sum_{(a_{i}, a_{j}) \in C} max (0, | | e (a_{i}) - e (a_{j}) {| |}_{2}^{2} - δ), & (13) \end{array}

which effectively acts as a margin-based metric learning constraint in the embedding space. Furthermore, to introduce relational inductive bias based on treatment ontologies and pharmacodynamics, we define a symbolic affinity kernel $K$ (a_i,a_j) that measures knowledge-driven similarity between treatments a_iand a_j. This kernel is derived from co-membership in anatomical therapeutic chemical (ATC) codes, empirical co-prescription statistics, or expert-defined similarity graphs. We incorporate this structure into the embedding training via an additional alignment constraint that minimizes the discrepancy between geometric distances in embedding space and knowledge-based similarities. Letting ∥e(a_i)−e(a_j)∥₂ denote Euclidean distance in the learned space, we regularize towards monotonic alignment with $K$ (a_i,a_j) as (Equation 14).

\begin{array}{l} ℛ_{align} = {\sum_{a_{i}, a_{j}} (| | e (a_{i}) - e (a_{j}) {| |}_{2}^{2} - (1 - K (a_{i}, a_{j})))}^{2}, & (14) \end{array}

where larger values of $K$ (a_i,a_j) indicate stronger pharmacological similarity. This constraint encourages embedding geometry to reflect domain knowledge and induces latent semantic clusters consistent with pharmacological theory. To further integrate symbolic structure into the temporal modeling process, we modulate internal attention weights over treatment classes via similarity-weighted aggregation. Let $z_{t}^{p}$ be the latent state at time $t$ , and define the relevance score between $z_{t}^{p}$ and class embedding $e_{c}$ for each class $c$ as an inner product followed by softmax normalization, producing a class-discriminative attention distribution (Equation 15).

\begin{array}{l} α_{t}^{c} = \frac{exp (〈 z_{t}^{p}, e_{c} 〉)}{\sum_{c^{'}} exp (〈 z_{t}^{p}, e_{c^{'}} 〉)}, & (15) \end{array}

where e_c= E_sym(c) is the class-level prototype embedding. These attention scores are used to adaptively gate treatment effects according to temporal context and semantic proximity, allowing the model to selectively prioritize therapeutically relevant actions across dynamic states. By embedding treatment actions into a knowledge-aware latent space and aligning learning dynamics with symbolic ontologies, the model improves both interpretability and generalizability, while maintaining sensitivity to fine-grained pharmacological distinctions necessary for personalized therapeutic reasoning.

3.3.3 Predictive and counterfactual decoding

The latent state $z_{t}^{p}$ serves as a compact representation of the patient’s clinical condition at time t, integrating historical covariates, treatments, and inferred disease progression (as shown in Figure 3).

Figure 3

Diagram illustrating a neural network model for image analysis. It includes an input image of breast tissue with a highlighted area. The process flow is marked by arrows, showing “Feature Flow” and “Agent Flow”. Components include matrices labeled $W_Q$, $W_K$, $W_V$, “Predictive and Counterfactual Decoding”, “Softmax Attention”, and “Agent Bias”. The model transforms input into various features and agent tokens, generating an output.

Figure 3. Illustration of the predictive and counterfactual decoding framework. The diagram demonstrates the decoding process in which patient state representations are transformed into clinical outcome predictions and auxiliary variable reconstructions. Feature flow begins with image-derived inputs, which are linearly projected and pooled to form agent tokens. These tokens pass through the predictive and counterfactual decoding module, enabling response generation. A cross-attention mechanism integrates agent features with contextual bias to inform future predictions. This framework supports not only the accurate estimation of clinical outcomes, such as tumor metrics and lab variables, but also facilitates counterfactual simulation by conditioning the decoder on alternate treatment embeddings. Temporal regularization is incorporated to ensure consistency in decoded trajectories, aiding robust and interpretable clinical decision modeling.

To reconstruct observed variables from this latent representation, we employ dedicated decoder networks for both response outcomes and auxiliary covariates. The decoder for clinical outcomes maps $z_{t}^{p}$ to a predicted response ${\hat{y}}_{t}^{p}$ using a feedforward neural transformation, where nonlinear activation ensures expressivity in modeling complex effects, and the output is parameterized as a Gaussian mean for continuous-valued medical indicators such as tumor size, biomarker levels, or composite clinical scores. Simultaneously, auxiliary covariates ${\hat{x}}_{t}^{p}$ such as lab values or patient status are decoded to support downstream reconstruction objectives and regularization of the latent structure. The decoding equations are defined as follows (Equation 16):

\begin{array}{l} \begin{array}{l} {\hat{y}}_{t}^{p} = D_{y} (z_{t}^{p}) = W_{y} \cdot ReLU (z_{t}^{p}) + b_{y}, {\hat{x}}_{t}^{p} = D_{x} (z_{t}^{p}) \\ = W_{x} \cdot tanh (z_{t}^{p}) + b_{x}, \end{array} & (16) \end{array}

where W_y,W_xare weight matrices and b_y, b_x are biases for their respective decoders. In realistic clinical scenarios, outcome observations are often noisy or uncertain due to measurement variability or delayed manifestations. To model this uncertainty explicitly, we parameterize the conditional distribution of clinical responses as a heteroscedastic Gaussian whose mean and variance are both decoded from $z_{t}^{p}$ . Letting $μ_{j} (z_{t}^{p})$ and $σ_{j}^{2} (z_{t}^{p})$ denote the decoder outputs for the $j$ -th outcome dimension, the predictive likelihood is given by (Equation 17).

\begin{array}{l} p (y_{t}^{p} | z_{t}^{p}) = \prod_{j = 1}^{k} N (y_{t, j}^{p} | μ_{j} (z_{t}^{p}), σ_{j}^{2} (z_{t}^{p})), & (17) \end{array}

where $k$ denotes the number of predicted clinical targets. Beyond reconstruction and forward prediction, a critical function of the model is its ability to simulate hypothetical outcomes under alternative treatments, enabling counterfactual reasoning for decision support. Given a hypothetical intervention ${\tilde{a}}_{t}^{p} \in A$ distinct from the one actually administered, the model estimates the prospective response had this treatment been chosen instead. This is operationalized by feeding the current latent state $z_{t}^{p}$ through the dynamics model $f_{θ}$ in conjunction with the symbolic embedding $e ({\tilde{a}}_{t}^{p})$ of the counterfactual treatment. The resulting shifted latent is then decoded using the same outcome decoder $D_{y}$ , producing a synthetic estimate of the next clinical response (Equation 18):

\begin{array}{l} {\hat{y}}_{t + 1}^{cf} = D_{y} (f_{θ} (z_{t}^{p}, e ({\tilde{a}}_{t}^{p}))), & (18) \end{array}

which enables flexible generation of alternative trajectories across the treatment space. To evaluate the model’s internal consistency and regularize unrealistic fluctuations in predicted outcomes, we further introduce a temporal smoothness regularizer that penalizes excessive changes in decoded covariates over time. This promotes physiological plausibility and ensures the learned latent dynamics induce stable transitions in observed space. Letting ${\hat{x}}_{t}^{p}$ and ${\hat{x}}_{t - 1}^{p}$ denote the reconstructed covariates at adjacent time steps, we define the temporal regularization loss as (Equation 19).

\begin{array}{l} ℛ_{smooth} = \sum_{t = 2}^{T} | | {\hat{x}}_{t}^{p} - {\hat{x}}_{t - 1}^{p} {| |}_{2}^{2}, & (19) \end{array}

which can be integrated into the global training objective. This predictive and counterfactual decoding framework enables not only accurate estimation of future responses but also generates plausible “what-if” scenarios for interventions never observed during training, supporting clinical interpretability and robust policy simulation.

ResponseNet is a modular neural network designed to predict how patients will respond to cancer treatment over time. It works by compressing patient history—such as lab values, tumor measurements, and treatments—into a hidden “health state” that updates after each new treatment. This health state helps forecast future outcomes like tumor size or biomarker levels. To make the predictions understandable, the system uses attention mechanisms to highlight which features or treatment types were most influential, and it supports “what-if” simulations for alternative treatments. The symbolic treatment embedding module connects treatments to known medical classes, improving generalization and interpretability. These design choices together enable both high predictive accuracy and practical usability for clinical research and decision-making.

3.4 Adaptive knowledge infusion

In this section, we introduce adaptive knowledge infusion (AKI), a novel learning strategy designed to enhance the clinical fidelity, stability, and generalizability of ResponseNet. While the model presented previously can capture latent dynamics and decode treatment responses effectively, the integration of structured medical knowledge remains a critical aspect for clinical plausibility. AKI injects hierarchical, domain-driven inductive biases into the training process via structured regularization, latent alignment, and counterfactual discrimination (as shown in Figure 4).

Figure 4

Diagram of a neural network structure featuring four layers. Each layer consists of “Patch Embedding” followed by “Ontology-Based Consistency Learning.” Data scales from $C_1$ to $C_4$ as dimensions reduce through layers. At the bottom, a separate flow shows processes including “Latent Space Anchoring,” “Norm,” and “Discriminative Counterfactual Training,” integrated with the main structure.

Figure 4. Illustration of adaptive knowledge infusion (AKI). The figure outlines the architectural design of AKI, highlighting its three core mechanisms: ontology-based consistency learning, latent space anchoring, and discriminative counterfactual training. The upper pipeline illustrates a multi-stage encoder integrating patch embedding and conceptually structured consistency across resolution levels. The bottom path embeds regularization modules including norm layers, counterfactual training units, and anchoring blocks that align latent representations with medical ontologies and domain priors. These modules together enforce structured semantics, enhance interpretability, and improve generalization in clinical prognostic modeling.

3.4.1 Ontology-based consistency learning

In clinical prognostic modeling, particularly in domains involving high-stakes interventions such as breast cancer treatment, data-driven models often face limitations due to incomplete supervision, delayed outcomes, and inconsistent labeling. Treatment decisions are typically informed by domain knowledge codified in clinical guidelines, pharmacological taxonomies, and expert intuition, yet most sequence models remain agnostic to these structured priors. To address this discrepancy, we integrate symbolic knowledge into model training via ontology-based regularization, grounding latent treatment dynamics in known therapeutic semantics. Let $G = (V, ℰ)$ denote a treatment ontology, where $V$ is a finite set of treatment classes and $ℰ$ represents semantic relations such as subclass-of, similarity, or therapeutic proximity. Each administered treatment $a \in A$ is mapped to a class label $α (a) \in V$ , and relationships among these classes induce constraints on their latent effects. For any two treatments $a_{i}$ and $a_{j}$ linked by a similarity edge $(a_{i}, a_{j}) \in ℰ_{sim} \subseteq ℰ$ , we enforce consistency between their induced shifts in latent state via a variance-penalized deviation term. Letting $z$ denote the pre-treatment latent state and $Δ (z, a) = f_{θ} (z, e (a)) - z$ the treatment-induced transformation, the semantic consistency loss is expressed as (Equation 20).

\begin{array}{l} ℒ_{consist} = \sum_{(a_{i}, a_{j}) \in ℰ_{sim}} E_{z} [| | Δ (z, a_{i}) - Δ (z, a_{j}) {| |}_{2}^{2}], & (20) \end{array}

which regularizes the model to yield functionally similar predictions for pharmacologically similar drugs. To extend this structure beyond isolated treatment instances and account for longitudinal impact, we define a cumulative therapeutic influence over a trajectory. Let ${a_{t}}_{t = 1}^{T}$ be the sequence of administered treatments and $z_{t - 1}^{p}$ the latent state prior to each administration. We compute the aggregated therapeutic deviation as a weighted sum of instantaneous shifts, modulated by decay weights ${w_{t}}$ that reflect diminishing influence over time (Equation 21):

\begin{array}{l} Γ_{T}^{p} = \sum_{t = 1}^{T} w_{t} \cdot Δ (z_{t - 1}^{p}, a_{t}), & (21) \end{array}

where $Γ_{T}^{p}$ encodes the net pharmacodynamic effect accumulated by time $T$ . Clinical safety and plausibility constraints, derived from empirical studies or physiological theory, often define a feasible region $C_{safe} \subset ℝ^{d}$ within which accumulated effects are considered benign or therapeutically sound. To ensure that $Γ_{T}^{p}$ lies within this corridor, we introduce a projection-based regularizer that penalizes deviation from this trusted region. Let ${Proj}_{C_{safe}} (Γ_{T}^{p})$ denote the closest point in $C_{safe}$ to $Γ_{T}^{p}$ under the Euclidean norm. The safety-aware regularization is formulated as (Equation 22)

\begin{array}{l} ℛ_{corridor} = \sum_{p} E [I (Γ_{T}^{p} \notin C_{safe}) \cdot | | Γ_{T}^{p} - {Proj}_{C_{safe}} (Γ_{T}^{p}) {| |}_{2}^{2}], & (22) \end{array}

which softly penalizes infeasible treatment progressions and steers latent trajectory evolution toward physiologically consistent patterns. In practice, the region $C_{safe}$ can be specified by convex hulls derived from real-world patient clusters, dose–response curves from pharmacokinetic studies, or clinical endpoints observed under expert-recommended regimens. To further encourage latent dynamics to respect ontology-implied continuity, we also include a directional consistency term between sequential treatment applications, enforcing smooth transitions in latent influence vectors. Denoting two successive treatments as $a_{t - 1}$ and $a_{t}$ , we define a differential alignment loss (Equation 23).

\begin{array}{l} ℛ_{drift} = \sum_{t} | | Δ (z_{t - 1}, a_{t}) - Δ (z_{t - 2}, a_{t - 1}) {| |}_{2}^{2}, & (23) \end{array}

which penalizes abrupt changes in latent directionality across time and improves trajectory stability under ontology-guided constraints. These joint mechanisms allow the model to not only learn from observed outcomes but also reason over structured symbolic relationships that govern permissible treatment behaviors, enabling more faithful generalization in complex and sparsely labeled clinical environments.

3.4.2 Latent space anchoring

To enhance the physiological interpretability and clinical plausibility of latent representations, we introduce a principled anchoring mechanism that aligns the posterior distribution over latent variables with prior distributions derived from medical knowledge. We define a prior $π (z)$ over latent states $z_{t}^{p}$ that reflects domain-informed expectations regarding disease stage progression, biomarker distributions, or population-level clustering. These priors can be constructed using empirical distributions from historical cohorts, Gaussian mixtures conditioned on clinical stages, or prototype embeddings derived from stratified patient groups. During training, we minimize the Kullback–Leibler divergence between the learned variational posterior $q_{ϕ} (z_{t}^{p} | ℋ_{t}^{p})$ and the reference prior $π (z_{t}^{p})$ for each patient and timestep, resulting in the anchoring regularizer (Equation 24).

\begin{array}{l} ℛ_{anchor} = \sum_{t = 1}^{T} KL (q_{ϕ} (z_{t}^{p} | ℋ_{t}^{p}) | | π (z_{t}^{p})), & (24) \end{array}

which constrains posterior mass to reside in regions of latent space associated with physiologically reasonable states. This promotes semantic interpretability of latent factors and mitigates drift under distributional shift. Beyond distributional anchoring, we further enhance alignment between latent structure and clinical semantics by integrating symbolic treatment class information into the model’s internal attention dynamics. Given a treatment taxonomy that clusters drugs into shared classes based on therapeutic function, we define a set $A_{cluster}$ representing all such clusters, and associate each class $c$ with a learned centroid embedding $e_{c}$ . At each timestep $t$ , the model computes attention scores between the current latent state $z_{t}^{p}$ and all class centroids, reflecting the contextual relevance of each therapeutic group to the patient’s latent status. The class-level attention is defined via a softmax-normalized inner product (Equation 25):

\begin{array}{l} α_{t}^{c} = \frac{exp (〈 z_{t}^{p}, e_{c} 〉)}{\sum_{c^{'}} \exp (〈 z_{t}^{p}, e_{c^{'}} 〉)}, & (25) \end{array}

where $α_{t}^{c}$ denotes the attention weight assigned to class c at time t, and (·,·) is the dot-product similarity. These attention scores modulate the downstream influence of treatment embeddings and enable context-aware prioritization of pharmacological pathways. To refine the interpretive resolution of this attention mechanism and facilitate hierarchical reasoning, we impose an entropy-aware regularization term that prevents overconcentration of attention and encourages exploration across class-level hypotheses. To couple latent anchoring with downstream outcome dynamics, we regularize the decoder’s output trajectory to maintain consistency with stage-specific expectations. Let $μ_{stage} (t)$ represent the expected clinical outcome at time $t$ for a given disease stage, obtained from historical data or medical literature, and let ${\hat{y}}_{t}^{p}$ denote the predicted outcome. We define a stage-informed outcome penalty as (Equation 26).

\begin{array}{l} ℛ_{stage} = \sum_{t = 1}^{T} | | {\hat{y}}_{t}^{p} - μ_{stage} (t) {| |}_{2}^{2}, & (26) \end{array}

which ensures the decoded response trajectories remain consistent with anchored latent semantics.

These mechanisms together constrain latent dynamics within clinically meaningful manifolds, dynamically link representations to pharmacological structure, and induce outcome behavior consistent with domain priors.

3.4.3 Discriminative counterfactual training

In order to improve the fidelity, realism, and clinical reliability of counterfactual outcome estimation, we introduce a discriminative adversarial mechanism that imposes implicit supervision over hypothetical predictions (as shown in Figure 5).

Figure 5

Flowchart illustrating a neural network architecture with three main components: MatMul with Softmax and Rescale transformations in purple; Add & Norm with Feed Forward and Discriminative Counterfactual Training in green; and Embedding with Encoder and Decoder, followed by De-normalization and a Projector in peach. Arrows indicate the flow of data between components.

Figure 5. Illustration of discriminative counterfactual training. This figure provides an architectural overview of the proposed counterfactual training mechanism, which integrates attention-based latent dynamics, transformer-style contextualization, and adversarial discrimination. The left module highlights the attention computation across queries, keys, and values. The central block introduces discriminative supervision applied at intermediate transformer layers, enforcing semantic alignment between factual and counterfactual flows. On the right, a sequence of normalization, encoding, decoding, and projection operations enables contrastive regularization and robust representation of latent shifts. These components together realize a stable and semantically grounded framework for learning clinically plausible hypothetical outcomes.

In real-world healthcare applications, treatment-effect estimation often requires generating unobserved responses under alternative interventions ${\tilde{a}}_{t}^{p} \neq a_{t}^{p}$ , and ensuring the plausibility of these predictions is critical for deployment in clinical decision support systems. To this end, we define a discriminator network $D_{ψ} (z_{t}, a)$ that takes as input the latent state $z_{t}$ and a treatment $a$ and outputs a scalar probability indicating whether the associated response is drawn from a factual (observed) or counterfactual (synthetic) distribution. Let ${\tilde{a}}_{t}$ denote a randomly sampled alternative intervention and let ${\hat{y}}^{cf} = D_{y} (f_{θ} (z_{t}, e ({\tilde{a}}_{t})))$ represent the counterfactual prediction. The discriminator is trained to maximize classification accuracy between real and synthetic outcomes, while the generator is trained adversarially to minimize the ability of the discriminator to detect the distinction. This min–max game is captured by the following objective (Equation 27):

\begin{array}{l} ℒ_{disc} = E_{cf} [log (1 - D_{ψ} (z_{t}, {\tilde{a}}_{t}))] + E_{real} [log D_{ψ} (z_{t}, a_{t})], & (27) \end{array}

where the expectation over real samples is taken with respect to the empirical training distribution and the counterfactual samples are generated on-the-fly through dynamic substitution. This adversarial alignment enforces semantic similarity between factual and hypothetical representations and implicitly regularizes the latent dynamics to remain consistent under both observed and imagined transitions. To stabilize optimization and propagate informative gradients back to the generator, we further incorporate the discriminator into the global learning objective alongside symbolic consistency, latent anchoring, temporal smoothness, and variational reconstruction. The composite objective optimized by the generator becomes (Equation 28):

\begin{array}{l} \begin{matrix} J_{total} = ℒ_{ELBO} - λ_{1} \cdot ℒ_{consist} - λ_{2} \cdot ℛ_{corridor} - λ_{3} \cdot ℛ_{anchor} \\ + λ_{4} \cdot ℒ_{disc} - λ_{5} \cdot ℛ_{temp}, \end{matrix} & (28) \end{array}

with hyperparameters λ_i balancing the influence of domain-guided priors and adversarial supervision. Model parameters θ and ϕ are updated by minimizing $J_{total}$ , while the discriminator parameters ψ are optimized independently to maximize its classification capacity. This leads to a dual-loop adversarial learning process formalized as (Equation 29).

\begin{array}{l} θ, ϕ \leftarrow arg min J_{total}, ψ \leftarrow arg min - ℒ_{disc}, & (29) \end{array}

where gradients are propagated alternately through the generator and discriminator networks. To further reinforce counterfactual consistency at the representation level, we introduce a contrastive regularization term over the latent shifts induced by factual and counterfactual actions. Letting $Δ_{real} = f_{θ} (z_{t}, e (a_{t})) - z_{t}$ and $Δ_{cf} = f_{θ} (z_{t}, e ({\tilde{a}}_{t})) - z_{t}$ , we define the shift-alignment penalty (Equation 30).

\begin{array}{l} ℛ_{shift} = E_{(a_{t}, {\tilde{a}}_{t})} [| | Δ_{real} - Δ_{cf} {| |}_{2}^{2}], & (30) \end{array}

which encourages the model to produce smooth and structurally coherent latent transitions even when simulating hypothetical outcomes. This constraint enhances the stability and realism of generated trajectories and helps preserve interpretability across the intervention space.

Accurately modeling treatment response in clinical settings involves handling temporal dynamics, missing data, and heterogeneous patient characteristics. To address these challenges, the proposed framework integrates prior clinical knowledge with data-driven learning to simulate how patients evolve under different treatment regimens. The core idea is to abstract a patient’s physiological condition into a latent state that evolves over time in response to medical interventions. This latent representation serves as a compact summary of the patient’s health status and allows prediction of future clinical outcomes based on past trajectories. Two key principles guide the design of the system. First, the model accounts for pharmacological structure by embedding treatments into a symbolic space informed by clinical taxonomy and prior knowledge. This enables generalization across drugs with similar mechanisms. Second, the framework supports counterfactual simulation, allowing evaluation of alternative treatment scenarios not observed during training. This feature is particularly useful for decision support and personalized planning. By combining interpretable latent dynamics with clinical priors, the system aims to achieve both predictive accuracy and semantic transparency. The design balances mathematical rigor with practical interpretability to support decision-making in oncology and other domains.

4 Experimental setup

4.1 Dataset

The BreakHis dataset (42), the CBIS-DDSM dataset (43), the INbreast dataset (44), and the TCGA-BRCA dataset (45) are four widely utilized and publicly available breast cancer imaging datasets that serve as foundational resources for computer-aided diagnosis and machine learning research in medical imaging. BreakHis (Breast Cancer Histopathological Image Classification) consists of microscopic biopsy images of breast tumors, acquired using magnification factors of ×40, ×100, ×200, and ×400. This dataset includes 7,909 images from 82 patients and is categorized into benign and malignant classes, further subdivided into different histopathological subtypes. The diversity of magnification and histological patterns makes it suitable for deep learning tasks focused on feature representation and classification of breast cancer. In contrast, the CBIS-DDSM (Curated Breast Imaging Subset of the Digital Database for Screening Mammography) provides a large collection of mammogram images with verified pathology information. This dataset is a curated and standardized subset of the original DDSM, including over 3,000 mammography studies with annotations such as bounding boxes and lesion characteristics, covering calcifications and masses. It is particularly valuable for segmentation, detection, and classification research involving full-field digital mammography. The INbreast dataset is a high-resolution full-field digital mammography dataset that contains 115 cases with a total of 410 images, where each image is annotated by medical experts with precise contours of masses and calcifications. The high quality and detailed annotations make INbreast especially suitable for fine-grained segmentation tasks and the evaluation of lesion characterization algorithms. The TCGA-BRCA dataset, part of The Cancer Genome Atlas program, combines histopathological images with genomic, clinical, and demographic data from breast cancer patients. This dataset is unique in that it enables multi-modal analysis, integrating imaging data with gene expression profiles, mutation data, and other molecular features. TCGA-BRCA includes both hematoxylin and eosin (H&E)-stained whole-slide images and a wide array of omics data, offering a rich platform for research at the intersection of computational pathology and cancer genomics. These datasets together support a broad range of applications from basic tumor detection to advanced integrative analyses aimed at personalized medicine and precision oncology, and their complementary nature allows for comprehensive modeling of breast cancer from image-level features to molecular signatures.

4.2 Experimental details

In our experiments, we adopt a standard training and evaluation pipeline to ensure fair comparison across all datasets. For all tasks, we utilize a ResNet-50 backbone and a Vision Transformer (ViT-B/16) as representative architectures for convolutional and transformer-based models, respectively. The networks are initialized with BreakHis-pretrained weights to accelerate convergence and enhance generalization. For optimization, we use stochastic gradient descent (SGD) with a momentum of 0.9 and weight decay of 1 × 10⁻⁴. The initial learning rate is set to 0.01 and follows a cosine annealing schedule without restarts. The batch size is fixed at 128 for all datasets, and training is conducted for 100 epochs on each dataset. For datasets with fewer samples such as INbreast and TCGA-BRCA, we employ data augmentation techniques including random cropping, horizontal flipping, and color jittering to reduce overfitting and improve robustness. For CBIS-DDSM, the standard split of 60 training images per class is adopted, and the rest are used for evaluation. For INbreast, we follow the official split protocol with 1,020 training, 1,020 validation, and 6,149 test images. For the TCGA-BRCA dataset, we randomly divide the dataset into 60% training, 20% validation, and 20% testing while ensuring that each attribute label is uniformly distributed across the splits. The BreakHis dataset follows the standard ILSVRC-2012 training and validation splits, where the model is trained on the 1.2 million training images and evaluated on the 50,000 validation images. To stabilize training on small datasets, we employ label smoothing with a factor of 0.1 and dropout with a rate of 0.5 in the fully connected layers. For ViT-based models, we use a fixed patch size of 16 and positional embeddings are retained throughout training. The transformer model is optimized using the AdamW optimizer with a learning rate of 3 × 10⁻⁴ and a linear warm-up phase of 10 epochs followed by cosine decay. All experiments are conducted on four NVIDIA A100 GPUs with 40 GB of memory each, using PyTorch 2.1 and CUDA 12.2. Mixed precision training is applied to accelerate computation without loss in accuracy. We report the top-one classification accuracy as the primary evaluation metric. To ensure reproducibility, we fix random seeds for NumPy and PyTorch and log all hyperparameters, loss curves, and model checkpoints using the weights and biases framework. Hyperparameter tuning is done via grid search on the validation set, where learning rates, dropout rates, and augmentation strength are systematically explored. We also evaluate the robustness of each model to common corruptions using the BreakHis-C benchmark in extended experiments. This setup ensures that our experimental results are rigorous, reproducible, and comparable to recent state-of-the-art benchmarks.

4.3 Comparison with SOTA methods

We perform a comprehensive comparison between our proposed method ResponseNet and several state-of-the-art (SOTA) baselines across four benchmark datasets: BreakHis, CBIS-DDSM, INbreast, and TCGA-BRCA. In Tables 1, 2, ResponseNet consistently outperforms all other models across all metrics and datasets. On the large-scale BreakHis dataset, ResponseNet achieves an accuracy of 81.87%, surpassing the next best method, EfficientNet-B4, by a margin of 2.45%. Similar gains are observed for precision and F1 score, demonstrating ResponseNet’s ability to balance true positive recognition with low false positive rates. The AUC score also shows a significant improvement, indicating enhanced discriminative capability under varying decision thresholds. On CBIS-DDSM, ResponseNet achieves 88.31% accuracy, notably outperforming RegNetY-16GF and ViT-B/16, which achieved 86.02% and 85.39%, respectively. These improvements are attributed to ResponseNet’s hybrid architecture, which effectively captures both local and global features, leveraging multi-scale representations to handle object variability and background complexity. For fine-grained datasets such as INbreast, ResponseNet yields a substantial accuracy of 94.89%, outperforming ConvNeXt-T by 2.88%. Notably, the model also achieves the highest precision and F1 scores among all methods, illustrating its robustness in distinguishing classes with subtle inter-class variations. These gains can be attributed to ResponseNet’s class-aware attention mechanism, which enhances feature representation for visually similar categories. In terms of AUC, ResponseNet achieves 96.21%, reflecting its superior capability in confident classification. Similarly, on the TCGA-BRCA Dataset, ResponseNet obtains a top accuracy of 77.92%, improving upon RegNetY-16GF by 3.16%. The precision and F1 scores of ResponseNet are also significantly higher than those of conventional CNNs and vision transformers, affirming ResponseNet’s capability in modeling abstract and perceptual-level texture attributes. The enhanced performance on TCGA-BRCA stems from ResponseNet’s hierarchical decomposition module, which decomposes texture patterns into interpretable units, leading to more robust and generalizable learning. This aligns with the nature of TCGA-BRCA where semantic texture attributes are subtle and often rely on mid-level visual cues. The superior AUC scores across all datasets further validate the generalization of ResponseNet, particularly in challenging classification scenarios with imbalanced or noisy data.

Table 1

Table 1. Performance benchmarking of our approach against leading techniques on BreakHis and CBIS-DDSM datasets.

Table 2

Table 2. Performance benchmarking of our approach against leading techniques on INbreast and TCGA-BRCA datasets.

The consistent improvements of ResponseNet across all datasets can be explained by the following architectural advantages. ResponseNet integrates both convolutional and attention-based modules to leverage the locality and long-range dependencies effectively. This synergy allows the model to retain fine-grained details while also attending to holistic context. Then, ResponseNet introduces a category-guided memory unit, which stores representative features and enhances the attention weights during inference, effectively functioning as an external knowledge bank. This module is especially helpful in fine-grained and texture-based classification tasks like Oxford 102 and TCGA-BRCA, where intra-class variance is low but inter-class boundaries are subtle. The progressive decoding strategy adopted in ResponseNet stabilizes training and improves gradient flow, making the model more robust to architectural depth and hyperparameter variations. Unlike standard residual or transformer blocks that rely heavily on depth, ResponseNet’s progressive nature allows for smoother representation fusion. The training pipeline, including tailored data augmentations and loss function design, contributes to ResponseNet’s ability to generalize across domains. While traditional models rely heavily on large-scale pretraining, ResponseNet benefits from its internal regularization, leading to better adaptation on smaller datasets such as CBIS-DDSM and TCGA-BRCA. ResponseNet achieves better separation among classes and significantly fewer misclassifications. In summary, ResponseNet delivers comprehensive improvements across metrics and datasets, validating the effectiveness of our design and its capability to set a new benchmark for visual recognition tasks.

4.4 Ablation study

To validate the effectiveness of each key component in our proposed ResponseNet architecture, we conduct a series of ablation studies on four datasets: BreakHis, CBIS-DDSM, INbreast, and TCGA-BRCA. The ablation settings include three variants: without latent dynamics modeling, which removes the category-guided memory module; without semantic treatment embedding, which disables the hierarchical feature fusion; and without latent space anchoring, which eliminates the progressive decoding module. The results are shown in Tables 3, 4. Across all datasets and metrics, we observe a consistent degradation in performance when any of these modules are removed, confirming that each component contributes meaningfully to the overall model efficacy. On BreakHis, removing the latent dynamics modeling module causes the most noticeable drop in accuracy and AUC, highlighting the importance of category-aware context storage in handling large-scale and diverse data. Meanwhile, removing semantic treatment embedding results in weaker precision and F1 score, suggesting that spatial-scale integration is crucial for maintaining class separability. The latent space anchoring module also plays a key role by stabilizing feature evolution, as its removal leads to lower consistency in predictions. A comparable pattern is found in the CBIS-DDSM dataset, where excluding latent dynamics modeling results in a reduction of accuracy from 88.31% to 86.50%, accompanied by a decline in AUC from 89.42% to 87.23%. This again confirms that without the memory component, the model struggles to preserve discriminative features, especially in categories with subtle appearance differences. The removal of the semantic treatment embedding (without semantic treatment embedding) reduces the model’s ability to maintain spatial context, slightly decreasing performance but still retaining a relatively high margin, which implies that while this module is beneficial, it is partially complemented by the memory-guided features. The impact of removing the latent space anchoring structure is more prominent in precision and F1 score, emphasizing the role of this module in harmonizing learned features through the model layers.

Table 3

Table 3. Performance benchmarking of our approach against leading techniques on our model across BreakHis and CBIS-DDSM datasets.

Table 4

Table 4. Performance benchmarking of our approach against leading techniques on our model across INbreast and TCGA-BRCA datasets.

For fine-grained datasets such as INbreast and TCGA-BRCA, the effect of each module becomes even more pronounced. On Oxford 102, removal of the latent dynamics modeling module drops the accuracy by 3.27%, demonstrating how critical this component is for capturing subtle inter-class differences inherent in flower categories. Similarly, the semantic treatment embedding plays a pivotal role by improving the global-local balance in floral structures, while the latent space anchoring strategy enhances robustness against pose and color variation. On the TCGA-BRCA dataset, which requires recognition of abstract texture patterns, each module provides clear benefits. The latent dynamics modeling module provides a pseudo-semantic backbone that boosts precision and AUC, while semantic treatment embedding supports local pattern decoding, and latent space anchoring enables gradual abstraction—essential for perceptual-level recognition. In conclusion, the full ResponseNet model exhibits a holistic improvement over all ablations, and the clear performance drops across all variants underline the necessity of each core module. These results demonstrate that our architectural components are not only additive but also interact synergistically, enabling the model to generalize well across diverse and complex datasets.

To assess generalizability in practical clinical contexts, two real-world oncology datasets were incorporated for extended evaluation. The METABRIC dataset provides gene expression and clinical data for 1980 breast cancer patients, while the CAMELYON16 dataset contains high-resolution histopathology slides for tumor metastasis detection in lymph nodes. ResponseNet was adapted to process structured data in METABRIC and image tiles in CAMELYON16, with model variants incorporating lightweight encoders and symbolic treatment mappings. In both cases, predictive accuracy and interpretability were compared against standard multimodal baselines, including early fusion (feature concatenation), late fusion (modality-specific encoders with shared attention), and gradient-boosted decision trees with imputed features. Table 5 summarizes the results. The results show that ResponseNet outperforms baseline methods across both datasets in AUROC and F1-score, while uniquely offering interpretability through attention maps and symbolic reasoning modules. Its design enables integration of heterogeneous data types and maintains stability under modality dropout, which was tested by randomly masking clinical or genomic inputs during validation. Less than 5% performance degradation was observed at 20% masking, confirming robustness under incomplete observation—a common scenario in oncology practice.

Table 5

Table 5. Comparison of predictive performance and interpretability on two real-world multimodal oncology datasets.

To provide a concrete demonstration of interpretability in a clinical context, a simulated case study is presented based on a breast cancer patient undergoing neoadjuvant chemotherapy. The model predicts response to standard HER2-targeted therapy and simulates a counterfactual scenario under combination therapy. As shown in Figure 6, the left panel presents a histological attention map from the original slide, along with a predicted probability of response (0.82) and its evolution over time. The right panel illustrates the counterfactual simulation, in which the model estimates a higher disease-free survival probability (0.75) under combination therapy compared to 0.65 under the standard regimen. Additionally, attention-based interpretability highlights tumor regions most relevant to the model’s prediction. These outputs demonstrate how model-driven counterfactual reasoning and spatial attention can support clinicians in exploring multiple treatment options and understanding underlying factors influencing predictions. Such visual and quantitative aids can be integrated into multidisciplinary workflows to enhance transparency and trust in AI-assisted decision-making.

Figure 6

Case Study 1 displays a heatmap for neoadjuvant chemotherapy response prediction with a probability of 0.82. A line graph shows response probability increasing over time with breast radiotherapy. The Counterfactual scenario shows a similar heatmap with an attention map and disease-free survival probabilities of 0.65 for standard therapy and 0.75 for combination therapy. A line graph indicates improved disease-free probability with combination therapy compared to standard HER2-targeted therapy over 12 months.

Figure 6. Simulated decision support scenario for a breast cancer patient. Left: attention map and predicted response probability under factual treatment. Right: counterfactual simulation comparing disease-free survival probabilities under different therapies, with spatial attribution and projected trends.

To enhance interpretability in clinically actionable formats, the model’s outputs are further contextualized using visualization strategies tailored for medical professionals. Attention mechanisms are rendered not as standalone saliency maps, but as spatial overlays directly superimposed on histopathological images. These overlays highlight morphologically relevant tumor regions that contribute most significantly to model predictions, making them accessible to pathologists and oncologists accustomed to traditional slide examination. By preserving spatial continuity with native tissue structures, this form of visualization facilitates more intuitive interpretation than abstract heatmaps. Temporal interpretability is achieved through stratified response curves that simulate predicted outcomes over time under varying therapeutic scenarios—for example, in the presented case study, the model generates survival-like trajectories under both standard HER2-targeted therapy and an alternative combination regimen. These trajectory curves not only illustrate predicted differences in disease-free progression but also resemble conventional survival plots used in clinical oncology. This enables clinicians to visually compare risk profiles across treatment paths, supporting informed discussions about therapeutic trade-offs. These interpretability enhancements together shift the focus from model-centric explanation to clinician-facing insight. By embedding attention and prediction in domain-familiar representations—namely, slide overlays and longitudinal outcome charts—the framework enables practical decision support in oncology settings, bridging technical AI outputs with real-world clinical understanding.

The experimental evaluation focuses on two main aspects: the predictive performance of the model across multiple clinical datasets and its ability to provide interpretable insights into treatment outcomes. Predictive accuracy is measured by comparing forecasted clinical responses—such as tumor progression or biomarker levels—against ground truth values. Interpretability is assessed by examining visualizations such as attention maps, which highlight influential features or treatment time points that drive model predictions. The framework also supports counterfactual reasoning, enabling simulation of hypothetical outcomes under unobserved treatment scenarios. This capability is particularly relevant for exploring alternative therapeutic strategies and assessing individualized treatment effects. Results are reported on several benchmark datasets and compared against existing baseline models. The method demonstrates superior predictive performance while maintaining interpretability. Attention-based visual outputs and counterfactual predictions provide meaningful explanations, which may support informed decision-making in real-world clinical contexts.

5 Conclusions and future work

In this study, we aimed to address a pivotal challenge in precision oncology: predicting breast cancer treatment response and long-term prognosis using AI. Traditional models often fail to handle the temporal complexity and multimodal nature of clinical data. To overcome this, we proposed an innovative, dynamics-aware deep learning framework centered around a novel architecture, ResponseNet. This model captures both short- and long-term patient response dynamics through multi-level sequence encoding and latent stochastic inference. Complementing this, we introduced two key components: a symbolic treatment abstraction mechanism to ensure pharmacological consistency and an adaptive knowledge infusion (AKI) strategy to integrate clinical expertise via ontologies and treatment guidelines. Experiments conducted on real-world breast cancer datasets confirmed our model’s superiority over existing baselines in predicting treatment outcomes and stratifying survival risks. Notably, our approach balances predictive power with clinical interpretability—an essential criterion for deployment in healthcare settings.

Despite promising results, two main limitations remain. A model’s performance could be influenced by the quality and completeness of clinical data, especially in institutions with less structured electronic health records. Addressing this will require incorporating advanced imputation or semi-supervised techniques to better manage missing values. While AKI allows integration of domain knowledge, its current implementation may underutilize evolving, real-time clinical evidence and patient-specific nuance. Future work should explore dynamic knowledge graphs and continual learning mechanisms to enhance adaptability and relevance in fast-changing clinical environments. Overall, our study lays a foundation for intelligent, interpretable systems that support clinicians in personalizing breast cancer care.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding author.

Author contributions

BW: Writing – original draft, Writing – review & editing, Methodology, Supervision, Project administration, Validation, Resources, Visualization. SC: Writing – original draft, Writing – review & editing, Data curation, Conceptualization, Funding acquisition, Software. WL: Writing – original draft, Writing – review & editing.

Funding

The author declares financial support was received for the research and/or publication of this article. Joint funds for the Innovation of Science and Technology, Fujian Province (2024Y9636); National Clinical Key Specialty Construction Program, 2021; Natural Science Foundation of Fujian Province (Grant numbers:2025J01121548).

Conflict of interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Hong D, Gao L, Yao J, Zhang B, Plaza A, and Chanussot J. Graph convolutional networks for hyperspectral image classification. IEEE Trans Geosci Remote Sens. (2020) 59:5966–78. doi: 10.1109/TGRS.2020.3015157

Crossref Full Text | Google Scholar

2. Yang J, Shi R, Wei D, Liu Z, Zhao L, Ke B, et al. Medmnist v2 - a large-scale lightweight benchmark for 2d and 3d biomedical image classification. Sci Data. (2021) 2.

PubMed Abstract | Google Scholar

3. Sun L, Zhao G, Zheng Y, and Wu Z. Spectral–spatial feature tokenization transformer for hyperspectral image classification. IEEE Trans Geosci Remote Sens. (2022) 60. doi: 10.1109/TGRS.2022.3144158

Crossref Full Text | Google Scholar

4. Chen C-F, Fan Q, and Panda R. Crossvit: Cross-attention multi-scale vision transformer for image classification. IEEE Int Conf Comput Vision. (2021) 357–66. doi: 10.1109/ICCV48922.2021.00041

Crossref Full Text | Google Scholar

5. Maurício J, Domingues I, and Bernardino J. Comparing vision transformers and convolutional neural networks for image classification: A literature review. Appl Sci. (2023) 13:5521. doi: 10.3390/app13095521

Crossref Full Text | Google Scholar

6. Rao Y, Zhao W, Zhu Z, Lu J, and Zhou J. Global filter networks for image classification. Neural Inf Process Syst. (2021).

Google Scholar

7. Hong D, Han Z, Yao J, Gao L, Zhang B, Plaza A, et al. Spectralformer: Rethinking hyperspectral image classification with transformers. IEEE Trans Geosci Remote Sens. (2021) 60. doi: 10.1109/TGRS.2021.3130716

Crossref Full Text | Google Scholar

8. Touvron H, Bojanowski P, Caron M, Cord M, El-Nouby A, Grave E, et al. Resmlp: Feedforward networks for image classification with data-efficient training. IEEE Trans Pattern Anal Mach Intell. (2021) 45:5314–21. doi: 10.1109/TPAMI.2022.3206148

PubMed Abstract | Crossref Full Text | Google Scholar

9. Mai Z, Li R, Jeong J, Quispe D, Kim HJ, and Sanner S. Online continual learning in image classification: An empirical survey. Neurocomputing. (2021) 469:28:51. doi: 10.1016/j.neucom.2021.10.021

Crossref Full Text | Google Scholar

10. Wang X, Yang S, Zhang J, Wang M, Zhang J, Yang W, et al. Transformer-based unsupervised contrastive learning for histopathological image classification. Med Image Anal. (2022) 81:102559. doi: 10.1016/j.media.2022.102559

PubMed Abstract | Crossref Full Text | Google Scholar

11. Tian Y, Wang Y, Krishnan D, Tenenbaum J, and Isola P. Rethinking few-shot image classification: a good embedding is all you need? Eur Conf Comput Vision. (2020) 12359:266–82.

Google Scholar

12. Azizi S, Mustafa B, Ryan F, Beaver Z, Freyberg J, Deaton J, et al. Big self-supervised models advance medical image classification. IEEE Int Conf Comput Vision. (2021) 3478–88. doi: 10.1109/ICCV48922.2021.00346

Crossref Full Text | Google Scholar

13. Li B, Li Y, and Eliceiri K. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. Comput Vision Pattern Recognition. (2020) 14318–28. doi: 10.1109/CVPR46437.2021.01409

PubMed Abstract | Crossref Full Text | Google Scholar

14. Radenkovic S, Konjevic G, Jurisic V, Karadzic K, Nikitovic M, and Gopcevic K. Values of mmp-2 and mmp-9 in tumor tissue of basal-like breast cancer patients. Cell Biochem biophysics. (2014) 68:143–52. doi: 10.1007/s12013-013-9701-x

PubMed Abstract | Crossref Full Text | Google Scholar

15. Radenkovic S, Milosevic Z, Konjevic G, Karadzic K, Rovcanin B, Buta M, et al. Lactate dehydrogenase, catalase, and superoxide dismutase in tumor tissue of breast cancer patients in respect to mammographic findings. Cell Biochem biophysics. (2013) 66:287–95. doi: 10.1007/s12013-012-9482-7

PubMed Abstract | Crossref Full Text | Google Scholar

16. Jurisic V, Radenkovic S, and Konjevic G. The actual role of ldh as tumor marker, biochemical and clinical aspects. Adv Cancer biomarkers: Biochem to clinic Crit revision. (2015) 115–124.

PubMed Abstract | Google Scholar

17. Radenkovic S, Konjevic G, Isakovic A, Stevanovic P, Gopcevic K, and Jurisic V. Her2-positive breast cancer patients: correlation between mammographic and pathological findings. Radiat Prot dosimetry. (2014) 162:125–8. doi: 10.1093/rpd/ncu243

PubMed Abstract | Crossref Full Text | Google Scholar

18. Wang Q, Zhao L, Sun J, and Xu R. Emerging molecular mechanisms of resistance to targeted therapy in lung cancer. Front Oncol. (2025) 15:1540195. doi: 10.3389/fonc.2025.1540195

PubMed Abstract | Crossref Full Text | Google Scholar

19. Zhou L, Tang M, Huang Y, and Liu Q. Immunometabolic remodeling in colorectal cancer: Progress and therapeutic perspectives. Front Oncol. (2025) 15:1555369. doi: 10.3389/fonc.2025.1555369

PubMed Abstract | Crossref Full Text | Google Scholar

20. Coudray N, Ocampo PS, Sakellaropoulos T, Narula N, Snuderl M, Fenyö D, et al. Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nat. Med. (2018) 24:1559–67.

PubMed Abstract | Google Scholar

21. Bhojanapalli S, Chakrabarti A, Glasner D, Li D, Unterthiner T, and Veit A. Understanding robustness of transformers for image classification. IEEE Int Conf Comput Vision. (2021) 10231–41. doi: 10.1109/ICCV48922.2021.01007

Crossref Full Text | Google Scholar

22. Kim HE, Cosa-Linan A, Santhanam N, Jannesari M, Maros M, and Ganslandt T. Transfer learning for medical image classification: a literature review. BMC Med Imaging. (2022) 22.

Google Scholar

23. Zhang C, Cai Y, Lin G, and Shen C. Deepemd: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. Comput Vision Pattern Recognition. (2020) 12203–13. doi: 10.1109/CVPR42600.2020

Crossref Full Text | Google Scholar

24. Roy SK, Deria A, Hong D, Rasti B, Plaza A, and Chanussot J. Multimodal fusion transformer for remote sensing image classification. IEEE Trans Geosci Remote Sens. (2022) 61. doi: 10.1109/TGRS.2023.3286826

Crossref Full Text | Google Scholar

25. Zhu Y, Zhuang F, Wang J, Ke G, Chen J, Bian J, et al. Deep subdomain adaptation network for image classification. In: IEEE Transactions on Neural Networks and Learning Systems (2020) 32:1713–22. doi: 10.1109/TNNLS.2020.2988928

PubMed Abstract | Crossref Full Text | Google Scholar

26. Shehata M, Abouelkheir RT, Gayhart M, Van Bogaert E, Abou El-Ghar M, Dwyer AC, et al. Role of ai and radiomic markers in early diagnosis of renal cancer and clinical outcome prediction: a brief review. Cancers. (2023) 15:2835. doi: 10.3390/cancers15102835

PubMed Abstract | Crossref Full Text | Google Scholar

27. Li Y, Zhang W, Chen H, and Liu M. Hepatocellular carcinoma: Novel insights into tumor microenvironment and therapeutic targets. Front Oncol. (2025) 15: doi: 10.3389/fonc.2025

Crossref Full Text | Google Scholar

28. Chen L, Li S, Bai Q, Yang J, Jiang S, and Miao Y. Review of image classification algorithms based on convolutional neural networks. Remote Sens. (2021) 13:4712. doi: 10.3390/rs13224712

Crossref Full Text | Google Scholar

29. Ashtiani F, Geers AJ, and Aflatouni F. An on-chip photonic deep neural network for image classification. Nature. (2021) 606:501–6. doi: 10.1038/s41586-022-04714-0

PubMed Abstract | Crossref Full Text | Google Scholar

30. Masana M, Liu X, Twardowski B, Menta M, Bagdanov AD, and van de Weijer J. Class-incremental learning: Survey and performance evaluation on image classification. IEEE Trans Pattern Anal Mach Intell. (2020) 45:5513–33. doi: 10.1109/TPAMI.2022.3213473.

PubMed Abstract | Crossref Full Text | Google Scholar

31. Dai Y and Gao Y. Transmed: Transformers advance multi-modal medical image classification. Diagnostics. (2021) 11:1384. doi: 10.3390/diagnostics11081384

PubMed Abstract | Crossref Full Text | Google Scholar

32. Sheykhmousa M, Mahdianpari M, Ghanbari H, Mohammadimanesh F, Ghamisi P, and Homayouni S. Support vector machine versus random forest for remote sensing image classification: A meta-analysis and systematic review. IEEE J Selected Topics Appl Earth Observations Remote Sens. (2020) 13:6308–25. doi: 10.1109/JSTARS.2020.3026724

Crossref Full Text | Google Scholar

33. Mascarenhas S and l Agarwal M. A comparison between vgg16, vgg19 and resnet50 architecture frameworks for image classification, in: 2021 International Conference on Disruptive Technologies for Multi-Disciplinary Research and Applications (CENTCON): IEEE (2021). doi: 10.1109/CENTCON52345.2021.9687944,.

Crossref Full Text | Google Scholar

34. Shehata M and Elhosseini M. Charting new frontiers: Insights and future directions in ml and dl for image processing. (Switzerland: MPDI) (2024) 13:1345. doi: 10.3390/electronics13071345

Crossref Full Text | Google Scholar

35. Shehata M, Abouelkheir RT, and Elhosseini M. New diagnostic perspectives in urogenital radiology. Sec. Nephrology (2023) 10. doi: 10.3389/fmed.2023.1280300

PubMed Abstract | Crossref Full Text | Google Scholar

36. Zhang Y, Li W, Sun W, Tao R, and Du Q. Single-source domain expansion network for cross-scene hyperspectral image classification. IEEE Trans Image Process. (2022) 32:1498–1512. doi: 10.1109/TIP.2023.3243853

PubMed Abstract | Crossref Full Text | Google Scholar

37. Taori R, Dave A, Shankar V, Carlini N, Recht B, and Schmidt L. Measuring robustness to natural distribution shifts in image classification. Neural Inf Process Syst. (2020).

Google Scholar

38. Dong H, Zhang L, and Zou B. Exploring vision transformers for polarimetric sar image classification. IEEE Trans Geosci Remote Sens. (2022) 30. doi: 10.1109/TGRS.2021.3137383

Crossref Full Text | Google Scholar

39. Peng J, Huang Y, SUN W, Chen N, Ning Y, and Du Q. Domain adaptation in remote sensing image classification: A survey. IEEE J Selected Topics Appl Earth Observations Remote Sens. (2022) 15:9842–59. doi: 10.1109/JSTARS.2022.3220875

Crossref Full Text | Google Scholar

40. Bazi Y, Bashmal L, Rahhal MMA, Dayil RA, and Ajlan NA. Vision transformers for remote sensing image classification. Remote Sens. (2021) 13:516. doi: 10.3390/rs13030516

Crossref Full Text | Google Scholar

41. Zheng X, Sun H, Lu X, and Xie W. Rotation-invariant attention network for hyperspectral image classification. IEEE Trans Image Process. (2022) 31:4251–65. doi: 10.1109/TIP.2022.3177322

PubMed Abstract | Crossref Full Text | Google Scholar

42. Seo H, Brand L, Barco LS, and Wang H. Scaling multi-instance support vector machine to breast cancer detection on the breakhis dataset. Bioinformatics. (2022) 38:i92–i100. doi: 10.1093/bioinformatics/btac267

PubMed Abstract | Crossref Full Text | Google Scholar

43. Lee RS, Dunnmon JA, He A, Tang S, Re C, and Rubin DL. Comparison of segmentation-free and segmentation-dependent computer-aided diagnosis of breast masses on a public mammography dataset. J Biomed Inf. (2021) 113:103656. doi: 10.1016/j.jbi.2020.103656

PubMed Abstract | Crossref Full Text | Google Scholar

44. Rezaei Z. A review on image-based approaches for breast cancer detection, segmentation, and classification. Expert Syst Appl. (2021) 182:115204. doi: 10.1016/j.eswa.2021.115204

Crossref Full Text | Google Scholar

45. Villareal RJT and Abu PAR. Patch-based convolutional neural networks for tcga-brca breast cancer classification. In: Advances in visual computing: 16th international symposium, ISVC 2021, virtual event, october 4-6, 2021, proceedings, part II. Cham: Springer (2021). p. 29–40.

Google Scholar

46. Elpeltagy M and Sallam H. Automatic prediction of covid- 19 from chest images using modified resnet50. Multimedia Tools Appl. (2021) 80:26451–63. doi: 10.1007/s11042-021-10783-6

PubMed Abstract | Crossref Full Text | Google Scholar

47. Hong S, Wu J, and Zhu L. A brain tumor classification algorithm based on vit-b/16. In: 2024 36th chinese control and decision conference (CCDC). China: IEEE (2024). p. 3154–9.

Google Scholar

48. Preetha R, Priyadarsini MJP, and Nisha J. Automated brain tumor detection from magnetic resonance images using fine-tuned efficientnet-b4 convolutional neural network. IEEE Access. (2024) 12:112181–95. doi: 10.1109/ACCESS.2024.3442979

Crossref Full Text | Google Scholar

49. Yu W, Zhou P, Yan S, and Wang X. Inceptionnext: When inception meets convnext. In: Proceedings of the IEEE/cvf conference on computer vision and pattern recognition (2024) 5672–83.

Google Scholar

50. Mohandass G, Krishnan GH, Selvaraj D, and Sridhathan C. Lung cancer classification using optimized attention-based convolutional neural network with densenet-201 transfer learning model on ct image. Biomed Signal Process Control. (2024) 95:106330. doi: 10.1016/j.bspc.2024.106330

Crossref Full Text | Google Scholar

51. Pandey S, Sindhuja B, Nagamanjularani C, and Nagarajan S. Exploring transfer learning techniques for flower recognition using cnn. In: Data science and security: proceedings of IDSCS 2022. Springer (2022). p. 393–401.

Google Scholar

Keywords: breast cancer prognosis, treatment response prediction, latent dynamics modeling, symbolic knowledge infusion, AI in clinical decision support

Citation: Wang B, Chen S and Li W (2025) Predicting breast cancer treatment response and prognosis using AI-based image classification. Front. Oncol. 15:1619994. doi: 10.3389/fonc.2025.1619994

Received: 29 April 2025; Accepted: 19 September 2025;
Published: 21 October 2025.

Edited by:

Mohamed Shehata, Midway College, United States

Reviewed by:

Rashid Ibrahim Mehmood, Islamic University of Madinah, Saudi Arabia
Vladimir Jurisic, University of Kragujevac, Serbia

Copyright © 2025 Wang, Chen and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Wei Li, V2VpTGkyMDI1ZHNmaHNAMTYzLmNvbQ==

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.