
ORIGINAL RESEARCH article

Front. Oncol., 12 January 2026

Sec. Cancer Imaging and Image-directed Interventions

Volume 15 - 2025 | https://doi.org/10.3389/fonc.2025.1643504

Molecular-informed image classification for predicting drug sensitivity in cancer therapy

  • Internet Academy, Anhui University, Hefei, China

Introduction: Understanding and predicting drug sensitivity in cancer therapy demands innovative approaches that integrate multi-modal data to enhance treatment efficacy. In alignment with the advancing scope of precision oncology and the molecularly informed therapeutic decision-making emphasized by contemporary cancer research, this work proposes a dynamic and structure-aware imaging framework for robust molecular-informed image classification. Traditional methodologies often suffer from rigid modeling assumptions and inadequate handling of complex, heterogeneous noise prevalent in biological imaging, which limits their predictive accuracy and generalizability.

Methods: To address these challenges, we introduce a novel dynamic structure-aware imaging network (DSINet) coupled with a progressive structure-guided optimization (PSGO) strategy. DSINet dynamically adapts spatial filters based on local molecular content, preserves critical biological structures through attention mechanisms, and incorporates uncertainty-aware fusion across multiple resolutions. PSGO further refines the reconstruction by progressively focusing optimization on high-confidence regions and adaptively restructuring feature graphs to enhance robustness against variable imaging artifacts.

Results and Discussion: Extensive experimental evaluations demonstrate that our method significantly outperforms existing techniques in classifying molecular patterns correlated with drug sensitivity, offering a reliable and interpretable foundation for advancing personalized cancer therapy strategies. This approach seamlessly integrates cutting-edge adaptive imaging models with the emerging needs of molecular-insight-driven therapeutic optimization, bridging critical gaps in current cancer informatics research.

1 Introduction

The prediction of drug sensitivity in cancer therapy has become a central focus in precision medicine, aiming to tailor treatment strategies to individual patient profiles. While traditional biomarkers such as genetic mutations provide valuable insights, they are often insufficient to fully explain variations in therapeutic outcomes. This limitation arises from the complex and heterogeneous nature of tumors, including diverse phenotypic traits and dynamic tumor microenvironments that can interfere with drug efficacy (1).

To address these challenges, molecular-informed image classification has emerged as a promising approach. By integrating histopathological imaging with molecular-level data (such as gene expression and mutations), this method provides a more comprehensive view of tumor biology. It not only enhances the predictive accuracy of treatment outcomes but also facilitates the discovery of therapeutic targets and resistance mechanisms, thereby supporting more informed clinical decision-making and personalized therapy design (2).

Initial efforts in this field relied heavily on manual interpretation and expert-defined image descriptors. Researchers focused on predefined visual features such as nuclear size, tissue texture, and cellular organization (3). These handcrafted features were typically used in rule-based models guided by human expertise. Although interpretable and biologically grounded, such systems struggled to generalize across diverse cancer types (4) and lacked the flexibility to incorporate molecular-level variability, limiting their applicability for individualized prediction.

Subsequent advancements introduced computational models capable of learning patterns directly from labeled data, reducing dependence on manual feature engineering. Techniques such as support vector classifiers, decision forests, and boosting methods were applied to histopathological image analysis (5), demonstrating better generalization by leveraging statistical patterns in large datasets (6). These models also allowed for the integration of molecular profiles as auxiliary input features. However, with the increasing complexity of data—including whole-slide images and high-dimensional omics information—these approaches began to face challenges related to scalability and sensitivity (7).

More recently, deep learning models have gained prominence due to their ability to extract rich, hierarchical features from raw input data. Convolutional neural networks (CNNs) have been widely adopted for image-based tasks, while multi-branch architectures enable the simultaneous processing of molecular data (8). Joint learning frameworks further enhance the ability to model intricate associations between tissue morphology and molecular alterations, leading to improved performance in drug sensitivity prediction (9). Nevertheless, issues such as lack of interpretability, limited robustness across clinical settings, and the need for large, well-annotated datasets remain open challenges (10).

To overcome these limitations, this work presents a molecular-informed image classification framework that integrates multi-modal data through an efficient, interpretable, and robust architecture. The proposed method is designed to improve the accuracy of drug sensitivity prediction while maintaining adaptability to complex and heterogeneous biomedical data.

● Our method introduces a multi-modal transformer architecture that jointly models histopathological images and molecular data, capturing complex cross-modal relationships with minimal feature engineering.

● It features a modular design that ensures adaptability across different cancer types, demonstrating high efficiency and generalizability in multi-scenario clinical settings.

● Experimental results show that our model significantly outperforms state-of-the-art baselines in multiple benchmark datasets, achieving improved prediction accuracy, robustness, and interpretability.

2 Related work

2.1 Molecular representations in imaging

The integration of molecular information into medical imaging has emerged as a key direction for improving drug sensitivity prediction in cancer treatment. Traditional imaging methods primarily focus on visible tumor features, such as size, shape, and contrast enhancement patterns. While clinically useful, these visual cues often fail to reflect the molecular diversity that underlies variations in treatment response (10). Radiogenomics has established a foundational link between imaging features and molecular characteristics, enabling researchers to identify image-based biomarkers that correspond to specific biological pathways (11). In recent years, deep learning—particularly convolutional neural networks (CNNs)—has been increasingly used to automate this process. These models can learn complex patterns in imaging data that correlate with molecular traits, such as those linked to drug resistance (12). Public datasets like The Cancer Genome Atlas (TCGA) and The Cancer Imaging Archive (TCIA) have supported the development of such integrated models. Studies have shown that incorporating gene expression data into imaging pipelines can significantly improve predictive accuracy (13). Moreover, integration has expanded beyond transcriptomics to include other molecular dimensions such as somatic mutations, copy number variations, and DNA methylation profiles, enriching image-based classification with multi-omics information (14). Despite this progress, several challenges remain. One major issue is the variability across datasets—in both imaging modalities and molecular profiling platforms—which complicates model training and generalization. Addressing this requires advanced normalization methods and domain adaptation strategies (15). Another ongoing research focus is model interpretability. It is critical to ensure that the image features used for prediction correspond to meaningful biological phenomena rather than artifacts or correlations without causation (16). Improving these aspects is essential to gain clinical trust and to enable the practical deployment of molecular-informed imaging models in precision oncology (17).

2.2 Deep learning for drug response prediction

Deep learning models have emerged as essential tools for predicting drug response due to their capability to capture high-dimensional and non-linear relationships intrinsic to biomedical data (18). Traditional approaches to drug sensitivity prediction in cancer, which have predominantly utilized cell line assays or patient-derived xenografts, are constrained by substantial resource requirements and limited scalability (19). The availability of extensive pharmacogenomic datasets, including the Genomics of Drug Sensitivity in Cancer (GDSC) and the Cancer Cell Line Encyclopedia (CCLE), has enabled the development of deep learning architectures that map comprehensive molecular profiles to therapeutic outcomes (20). Autoencoders, graph neural networks, and multimodal deep learning frameworks have been deployed to integrate genomic, transcriptomic, and proteomic data with drug molecular characteristics to forecast efficacy (21). Incorporation of imaging modalities into predictive models has facilitated the learning of joint feature representations that simultaneously capture phenotypic traits and molecular determinants of drug response (22). Multimodal variational autoencoders (MVAEs) have been employed to simultaneously encode histopathology images and molecular profiles, resulting in enhanced prediction performance across heterogeneous cancer types (23). Optimization of these models frequently involves specialized loss functions, including contrastive loss and triplet loss, to ensure alignment between multimodal feature spaces and therapeutic responses (24). Despite advances, significant obstacles persist, notably the scarcity of labeled data, pronounced heterogeneity across cancer subtypes, and difficulties in achieving model generalization across diverse patient cohorts (25). Strategies such as transfer learning and few-shot learning are being actively explored to address these limitations, promoting the development of robust, scalable, and clinically translatable deep learning systems for drug response prediction (26).

2.3 Multimodal data fusion techniques

The integration of multimodal data, encompassing imaging, molecular profiles, clinical metadata, and therapeutic outcomes, represents a pivotal approach for advancing drug sensitivity prediction in oncology (27). Multimodal data fusion techniques are typically categorized into early fusion, intermediate fusion, and late fusion, with each strategy offering distinct trade-offs regarding information preservation and model complexity (28). Early fusion methods involve the concatenation of raw features from disparate modalities prior to modeling, although they often encounter challenges associated with dimensionality explosion and modality dominance (29). Intermediate fusion approaches, which entail learning modality-specific latent representations before their integration through mechanisms such as attention or latent alignment, have demonstrated a superior balance between modality fidelity and cross-modal interaction (30). Late fusion techniques independently model each modality and subsequently amalgamate predictions using ensemble strategies, enhancing robustness at the potential cost of synergistic feature utilization (31). The application of transformer architectures for multimodal fusion, leveraging cross-attention mechanisms to dynamically model inter-modal dependencies, has recently achieved notable success in enhancing predictive accuracy and interpretability (32). Cross-modal contrastive learning has further strengthened the ability of models to align heterogeneous modality representations within a unified embedding space, promoting generalization across diverse datasets (33). Concurrently, the incorporation of explainable artificial intelligence (XAI) techniques into multimodal fusion frameworks has facilitated the attribution of predictive outcomes to specific modalities, fostering transparency and clinical confidence in model outputs. As multimodal fusion methodologies continue to evolve, they offer transformative potential for personalizing cancer therapy through comprehensive and molecularly informed image-based drug response prediction.

3 Method

3.1 Overview

This section systematically introduces the core components of the proposed framework for imaging problems. Section 3.2 establishes the fundamental concepts and formal notations, where the imaging model is defined, key mathematical abstractions are articulated, and the problem setting is formalized. The image formation process is modeled as y = H(x) + n, where x denotes the latent clean image, y represents the observed degraded image, H is the degradation operator, and n denotes additive noise. The degradation operator H can encapsulate blurring, downsampling, or a mixture of complex distortions. This foundation ensures that subsequent developments are rooted in precise mathematical formulations and notational consistency. Section 3.3 presents the newly proposed imaging model, which addresses the complexities inherent in real-world visual degradations. Instead of adopting rigid assumptions on noise distribution or blur kernels, the model parameterizes the degradation process using an adaptable structure that adjusts to spatially varying conditions. Let the degradation operator be parameterized as H_θ, where θ represents learnable parameters inferred from the degraded observation y. A spatially adaptive convolutional mechanism is embedded within H_θ to model heterogeneous distortions across the image domain. The noise n is treated as a realization from a location-dependent distribution n ∼ N(0, σ²(x)), where σ²(x) varies with the underlying content. This formulation significantly enhances the model’s capacity to handle diverse degradation types and non-stationary noise patterns.

Following the model construction, Section 3.4 proposes a progressive optimization strategy for model training. Departing from conventional single-pass or heuristic-guided approaches, the optimization unfolds over multiple refinement stages. At each stage t, the reconstruction x̂_t is updated by selectively activating high-confidence regions, measured by a certainty map C(x̂_t). The certainty map is derived from the posterior distribution over the latent image space and guides the learning objective to prioritize reliable regions before addressing more uncertain parts. Formally, the update rule incorporates a masked loss function L_t = C(x̂_t) ⊙ (H_θ(x̂_t) − y)², where ⊙ denotes element-wise multiplication. This progressive mechanism not only stabilizes convergence but also mitigates the risk of overfitting to corrupted regions during early stages. Throughout these sections, mathematical rigor and methodological clarity are emphasized. Operators, distributions, and functions are carefully defined to ensure transparent interpretations. Optimization objectives are designed to be both theoretically sound and computationally tractable. The methodology combines classical principles from inverse problems with recent advances in deep learning, resulting in a hybrid framework that leverages domain-specific priors while maintaining flexibility through data-driven learning. The degradation operator H_θ, certainty map C(x̂_t), and noise modeling σ²(x) are seamlessly integrated into a unified architecture, allowing coherent end-to-end training. The overall approach decomposes the imaging reconstruction problem into three structured modules: degradation modeling, noise-adaptive regularization, and confidence-driven optimization. The degradation modeling module parameterizes distortions through a learnable convolutional structure, the noise-adaptive regularization module captures spatially varying uncertainty, and the confidence-driven optimization module progressively refines the reconstruction by exploiting spatial reliability. This decomposition not only improves interpretability but also yields significant empirical performance gains across multiple tasks.

The proposed method establishes a robust framework for a wide range of imaging tasks, including deblurring, denoising, and super-resolution. Extensive experimental results demonstrate that the method consistently surpasses state-of-the-art baselines across various benchmarks. The adaptability to non-uniform degradations and the progressive learning paradigm lead to improved reconstruction fidelity and robustness under both synthetic and real-world degradation scenarios. Each component introduced in the following sections is meticulously designed to interoperate, resulting in a coherent system that balances modeling accuracy, computational efficiency, and learning stability. Section 3.2 specifies the imaging problem setup and defines the notation used throughout the paper. Section 3.3 elaborates the detailed architecture and parameterization of the proposed degradation model. Section 3.4 describes the progressive optimization procedure, detailing how certainty maps are constructed and utilized to guide learning. This section provides a detailed exposition of the methodology, preparing the foundation for the theoretical analysis and experimental validation presented in subsequent parts. By carefully integrating model design, adaptive regularization, and progressive optimization, the proposed framework achieves superior performance with enhanced theoretical guarantees and practical effectiveness.

To provide a clearer understanding of the overall architecture, a system-level overview of the proposed framework is illustrated in Figure 1. The model consists of two main components: DSINet, which performs spatially adaptive dynamic filtering and multi-resolution refinement, and PSGO, which carries out progressive structure-guided optimization. The entire pipeline begins with degradation modeling and noise-adaptive regularization, followed by a dual-stage reconstruction strategy that combines content-awareness and confidence-driven refinement. Figure 2 depicts the internal structure of DSINet. It dynamically generates spatially adaptive filters via a vision transformer and text-guided token sampler. These filters are applied in a context-aware manner to the input image. A structure-preserving attention mechanism ensures that biologically important regions are retained. The model further integrates multi-resolution refinement and uncertainty modeling to support robust image representation. Figure 3 shows the three stages of PSGO, namely: adaptive confidence-guided decomposition, where reliable and uncertain regions are separated using a learned confidence map; iterative restructuring with uncertainty adaptation, which updates graph structures using attention-based mechanisms; and dynamic graph-regularized propagation, which propagates information across spatially consistent regions to obtain a high-fidelity reconstruction. These components together enable the model to adaptively handle noise, structural variation, and uncertainty in biomedical images.


Figure 1. Overall architecture of the proposed DSINet + PSGO framework. The system begins with degradation modeling of a latent image and proceeds through noise-adaptive regularization and progressive optimization. Certainty maps guide confidence-driven refinement steps, and the entire process is iteratively updated until convergence. This unified pipeline enables robust reconstruction and classification in the presence of molecular and visual heterogeneity.


Figure 2. Overview of DSINet architecture for robust image reconstruction. The DSINet framework is composed of three main modules: spatially adaptive dynamic filtering, where informative tokens are extracted via a vision transformer and text-guided sampler to generate dynamic convolutional filters A_θ(x); structure-preserving attention modulation, which combines structural features s with adaptive weights α to retain biologically meaningful regions; and multi-resolution refinement and uncertainty modeling, which aggregates multi-scale features using attention maps and applies uncertainty-aware corrections via Σ(x). The final prediction is computed as x̃ = x + βΣ(x), enabling structure-consistent and noise-robust reconstruction.


Figure 3. This figure illustrates the overall workflow of the progressive structure-guided optimization (PSGO) framework. The system consists of three sequential components: adaptive confidence-guided decomposition, where the input domain is separated into reliable and uncertain regions based on a structural confidence map; iterative restructuring with uncertainty adaptation, which applies attention-based graph refinement and dynamically adjusts the confidence map over iterations; and dynamic graph-regularized propagation, where structural consistency is reinforced using feature-based graphs, reward-driven updates, and gradient normalization. The pipeline produces a high-fidelity reconstruction robust to spatial and annotation uncertainty.

3.2 Preliminaries

This section formalizes the imaging problem considered in this work by introducing the mathematical notations, degradation models, and key assumptions that underpin subsequent developments.

Let x ∈ ℝⁿ denote the latent sharp image, and y ∈ ℝⁿ the observed degraded image. The degradation process is modeled as (Equation 1):

y = H(x) + ε,    (1)

where H : ℝⁿ → ℝⁿ is a degradation operator and ε ∈ ℝⁿ represents additive noise.

A standard instance of H(·) involves convolution with a blur kernel k ∈ ℝᵐ (Equation 2):

H(x) = k ∗ x,    (2)

where ∗ denotes the convolution operation under specified boundary conditions.

For spatially variant degradations, the degradation operator is expressed as (Equation 3):

H(x)(i) = ∑_{j ∈ Ω(i)} k_{i,j} x(j),    (3)

where Ω(i) defines the neighborhood around pixel i, and k_{i,j} are location-dependent kernel weights. The noise term ε accounts for both Gaussian and structured perturbations, decomposed as (Equation 4):

ε = ε_G + ε_S,    (4)

where ε_G represents Gaussian noise and ε_S models sparse outliers such as impulsive noise or saturation artifacts.

In matrix form, let H ∈ ℝ^{n×n} denote the convolution matrix associated with k, and E ∈ ℝ^{n×n} the covariance matrix of the noise. The degradation model becomes (Equation 5):

y = Hx + e,    (5)

where e ∼ N(0, E) is the Gaussian component.

The ill-posed nature of the inverse problem necessitates regularization. One widely adopted prior assumes sparsity in the gradient domain (Equation 6):

∇x ∼ sparse distribution,    (6)

where ∇x = (D_h x, D_v x), and D_h, D_v are horizontal and vertical difference operators, respectively.

To capture high-level structures, a feature extraction operator F : ℝⁿ → ℝᵖ is introduced (Equation 7):

z = F(x),    (7)

where z ∈ ℝᵖ encodes salient features including edges, textures, or semantic patterns.

Modeling complex degradations often requires incorporating nonlinearities. The forward model is thus extended to (Equation 8):

y = G(H(x)) + ε,    (8)

where G : ℝⁿ → ℝⁿ denotes a nonlinear transformation accounting for effects such as clipping, gamma correction, or sensor-specific distortions.

The imaging recovery objective is formulated as an optimization problem (Equation 9):

x̂ = arg min_x D(y, H(x)),    (9)

where D(·,·) measures the discrepancy between the observation and reconstruction.

In blind deconvolution settings, both the latent image x and the blur kernel k are unknown (Equation 10):

(x̂, k̂) = arg min_{x,k} D(y, k ∗ x).    (10)

To explicitly handle structured noise, an auxiliary variable o ∈ ℝⁿ is introduced, leading to the modified observation model (Equation 11):

y = Hx + o + ε_G,    (11)

and the corresponding joint estimation problem is (Equation 12):

(x̂, ô) = arg min_{x,o} D(y, Hx + o) + λΨ(o),    (12)

where Ψ(·) is a sparsity-promoting regularizer and λ is a positive parameter balancing fidelity and noise modeling.

To stabilize the inversion when H is ill-conditioned, a Tikhonov regularization is introduced (Equation 13):

R(x) = ‖Lx‖₂²,    (13)

where L denotes a Laplacian or higher-order differential operator.

Multiscale approaches are utilized to progressively refine reconstructions. Let {x_s}_{s=1}^{S} represent the latent images at multiple scales, with the degradation model at each scale formulated as (Equation 14):

y_s = H_s(x_s) + ε_s,    (14)

where H_s(·) and ε_s denote the degradation operator and noise at scale s, respectively.

An attention mechanism is introduced to adaptively weigh spatial locations. Defining an attention map α : ℝⁿ → [0,1]ⁿ, the degradation model becomes (Equation 15):

y = α ⊙ H(x) + (1 − α) ⊙ η,    (15)

where ⊙ denotes element-wise multiplication and η models dominant noise.

All subsequent methods are built upon this general formalism, enabling flexible modeling of diverse degradation processes and guiding the design of robust imaging recovery algorithms.
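
For concreteness, the sketch below (not part of the original formulation) simulates the degradation model of Equations 1–4 and evaluates a regularized recovery objective in the spirit of Equations 9 and 13; the blur kernel, noise levels, regularization weight, and test image are illustrative placeholders.

```python
# Minimal sketch: simulating y = k * x + eps_G + eps_S (Eqs. 2 and 4) and
# evaluating D(y, H(x)) + lambda * ||L x||_2^2 (Eqs. 9 and 13). All constants
# below are illustrative placeholders, not values from the paper.
import torch
import torch.nn.functional as F

def degrade(x, kernel, sigma_g=0.05, outlier_rate=0.01):
    """Apply blur, Gaussian noise eps_G, and sparse outliers eps_S."""
    pad = kernel.shape[-1] // 2
    y = F.conv2d(x, kernel, padding=pad)                # H(x) = k * x
    y = y + sigma_g * torch.randn_like(y)               # Gaussian component
    mask = (torch.rand_like(y) < outlier_rate).float()  # sparse outlier support
    return y + mask * torch.randn_like(y)

def objective(x_hat, y, kernel, lam=1e-2):
    """Data fidelity plus a Tikhonov (Laplacian) regularizer."""
    pad = kernel.shape[-1] // 2
    fidelity = ((F.conv2d(x_hat, kernel, padding=pad) - y) ** 2).mean()
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)
    smoothness = (F.conv2d(x_hat, lap, padding=1) ** 2).mean()
    return fidelity + lam * smoothness

x = torch.rand(1, 1, 64, 64)            # latent image (placeholder)
k = torch.ones(1, 1, 5, 5) / 25.0       # uniform blur kernel (placeholder)
y = degrade(x, k)
print(objective(x, y, k).item())
```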

3.3 Dynamic structure-aware imaging network

In this section, we present the proposed dynamic structure-aware imaging network (DSINet), a novel imaging framework specifically tailored to robustly reconstruct high-fidelity latent images from heavily degraded observations. Traditional imaging models often fail under severe degradations due to their reliance on static filters and inadequate modeling of structural information. DSINet fundamentally rethinks this paradigm by introducing three key innovations that jointly enable dynamic, structure-aware, and uncertainty-guided imaging. The detailed design and mathematical formulation of DSINet are provided below. The DSINet architecture incorporates multimodal interaction by integrating visual and textual tokens within the adaptive filtering module. As illustrated in Figure 2, both Vision Transformer embeddings and language-derived query embeddings are passed to a token-level sampler. The resulting informative tokens serve as semantic priors to guide the generation of spatially adaptive filters in Aθ(x). These tokens influence the selection and modulation of convolutional kernels in a data-dependent manner. The framework enables conditioning of local image processing on both morphological features and semantic textual cues, enhancing both accuracy and interpretability. The structure-preserving module further fuses adaptive outputs with anatomical priors to retain spatial fidelity. To improve consistency between the mathematical formulation and architectural illustration, a mapping table is provided (see Table 1) to clarify the correspondence between pixel-level variables and token-level representations. This alignment facilitates a better understanding of how vision and text modalities are integrated within DSINet. Pixel-wise operators such as Aθ(x) and K(fi) are dynamically modulated by token embeddings derived from both the Vision Transformer and Text-Guided Sampler. This joint representation supports spatially adaptive processing that remains semantically grounded in both visual and linguistic contexts.


Table 1. Mapping between pixel-level and token-level representations in DSINet.

3.3.1 Spatially adaptive dynamic filtering

DSINet departs from conventional convolutional operators by introducing a spatially adaptive dynamic convolution mechanism. Let x ∈ ℝⁿ represent the unknown latent image and y ∈ ℝⁿ denote the observed degraded image. The degradation process can be modeled as (Equation 16):

y = H(x) + ε,    (16)

where H(·) denotes the degradation operator and ε is the additive noise. Instead of using fixed filters, DSINet defines an adaptive operator A_θ(x) parameterized by learnable weights θ, dynamically adapting based on the local input content (Equation 17):

A_θ(x)(i) = ∑_{j ∈ Ω(i)} w_{i,j}(x) · x(j),    (17)

where Ω(i) denotes the neighborhood centered at pixel i and w_{i,j}(x) are context-dependent weights. To generate these dynamic weights, a feature encoder E : ℝⁿ → ℝᵈ is employed, followed by a dynamic kernel generator K : ℝᵈ → ℝ^{|Ω(i)|} (Equations 18, 19):

f_i = E(x)(i),    (18)
w_i = K(f_i).    (19)

The term E(x)(i) in Equation 18 represents an auxiliary feature embedding extracted from the input image x centered at spatial location i. This embedding is designed to capture both low-level and mid-level visual cues that are relevant for enhancing spatial adaptivity in the DSINet framework. E(x)(i) is implemented as a shallow convolutional block that aggregates local information from a fixed-size receptive field (e.g., a 5 × 5 or 7 × 7 window). Unlike handcrafted descriptors, this module is learnable and trained jointly with the rest of the network. The features captured by E(x)(i) include intensity variations, texture patterns, and local contrast, all of which are implicitly learned through convolutional filters. In addition to spatial gradients and edge-related features, the embedding also encodes contextual patterns that correlate with molecular characteristics, especially when integrated with the attention mechanism. While the primary focus is on local context, the use of dilated convolutions and multi-scale aggregation allows the embedding to incorporate a limited degree of broader contextual information. This ensures that E(x)(i) is sensitive not only to pixel-level changes but also to regional structure and texture, which is essential in medical imaging tasks involving subtle morphological variations.

Normalization of w_i through the softmax function ensures numerical stability (Equation 20):

w_{i,j} = exp(w_i(j)) / ∑_{j′ ∈ Ω(i)} exp(w_i(j′)).    (20)

The adaptive nature of Aθ(x) enables DSINet to effectively adjust its filtering behavior based on varying local degradations, improving robustness against diverse image corruptions.

The adaptive operator A_θ(x) employs a dynamic filtering mechanism in which the kernel generator K(f_i) plays a central role in capturing local context. Rather than using a globally shared kernel, the generator K(f_i) produces pixel-wise dynamic filters conditioned on the local feature embedding f_i at each spatial location. For every pixel i in the input feature map, a unique kernel is generated based on the local appearance and contextual features surrounding i. This design ensures that the filtering operation is spatially adaptive and content-aware, enabling the model to respond to heterogeneity in tissue morphology and molecular context. The kernel generator is implemented as a lightweight convolutional subnetwork that takes the intermediate feature map as input and outputs a set of per-pixel convolution kernels with fixed spatial size (e.g., 3 × 3). These generated kernels are then applied via depth-wise convolution over the local neighborhood of each pixel. As a result, different neighborhoods are processed using distinct, dynamically generated kernels, rather than a shared static kernel. This mechanism is crucial for capturing fine-grained structures, particularly in medical imaging scenarios where boundary precision and local variations are important. Computational complexity is managed through channel grouping and kernel compression techniques to maintain efficiency during training and inference.
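
A minimal PyTorch sketch of this per-pixel dynamic filtering (Equations 17–20) is given below; the module names, channel width, and 3 × 3 kernel size are illustrative assumptions rather than the released implementation.

```python
# Illustrative sketch of spatially adaptive dynamic filtering: a lightweight
# generator predicts a softmax-normalized 3x3 kernel per pixel, applied via unfold.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFiltering(nn.Module):
    def __init__(self, channels=16, ksize=3):
        super().__init__()
        self.ksize = ksize
        self.encoder = nn.Conv2d(channels, channels, 3, padding=1)   # E(x)
        self.kernel_gen = nn.Conv2d(channels, ksize * ksize, 1)      # K(f_i)

    def forward(self, x):
        b, c, h, w = x.shape
        f = self.encoder(x)                                          # f_i = E(x)(i)
        w_i = F.softmax(self.kernel_gen(f), dim=1)                   # Eq. 20 normalization
        # Gather the k*k neighborhood of every pixel: (b, c, k*k, h, w)
        patches = F.unfold(x, self.ksize, padding=self.ksize // 2)
        patches = patches.view(b, c, self.ksize * self.ksize, h, w)
        # A_theta(x)(i) = sum_j w_{i,j} x(j)  (Eq. 17), kernel shared across channels
        return (patches * w_i.unsqueeze(1)).sum(dim=2)

out = DynamicFiltering()(torch.randn(2, 16, 32, 32))
print(out.shape)  # torch.Size([2, 16, 32, 32])
```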

3.3.2 Structure-preserving attention modulation

To further enhance reconstruction quality, DSINet integrates a structure-preserving feature modulation mechanism. A structure feature map s ∈ ℝⁿ is extracted by applying a structure extractor S(·) to the input (Equation 21):

s = S(x),    (21)

capturing prominent edges, textures, or ridges. Based on both the dynamic features f_i and the structural cues s, an attention map α : ℝⁿ → [0,1]ⁿ is computed (Equation 22):

α(i) = σ(g(f_i, s(i))),    (22)

where g(·,·) is a gating function combining dynamic and structural information, and σ(·) denotes the sigmoid function. The final modulated feature at each location is (Equation 23):

x̃(i) = α(i) · A_θ(x)(i) + (1 − α(i)) · x(i).    (23)

This modulation effectively leverages both local adaptivity and global structural information, enabling DSINet to preserve important visual structures while correcting degradations.

Furthermore, DSINet introduces an adaptive normalization layer to dynamically accommodate spatially varying noise characteristics. A noise estimator N : ℝⁿ → ℝⁿ predicts the local noise level σ at each pixel, and intermediate features are normalized accordingly (Equation 24):

f̃_i = (f_i − μ_i(σ)) / √(σ_i²(σ) + ε),    (24)

where μ_i(σ) and σ_i²(σ) are the mean and variance conditioned on the noise estimate, and ε is a small constant ensuring numerical stability. This mechanism enables DSINet to maintain high performance under varying noise levels without explicit noise-level supervision.
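
The following sketch shows how the gating of Equations 21–23 and a noise-conditioned normalization in the spirit of Equation 24 could be wired together; the Sobel-based structure extractor, 1 × 1 gating convolution, and use of per-channel feature statistics are assumptions made for illustration only.

```python
# Illustrative sketch of structure-preserving attention modulation (Eqs. 21-23)
# with a noise-aware feature normalization (in the spirit of Eq. 24).
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureGate(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.gate = nn.Conv2d(channels + 1, 1, 1)     # g(f_i, s(i))
        self.noise_head = nn.Conv2d(channels, 1, 1)   # local noise estimate

    def structure_map(self, x):
        # S(x): gradient-magnitude structure cue from a Sobel pair (Eq. 21)
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
        gray = x.mean(dim=1, keepdim=True)
        dx = F.conv2d(gray, gx, padding=1)
        dy = F.conv2d(gray, gx.transpose(2, 3), padding=1)
        return torch.sqrt(dx ** 2 + dy ** 2 + 1e-8)

    def forward(self, f, x, a_theta_x):
        s = self.structure_map(x)                                    # Eq. 21
        alpha = torch.sigmoid(self.gate(torch.cat([f, s], dim=1)))   # Eq. 22
        x_tilde = alpha * a_theta_x + (1 - alpha) * x                # Eq. 23
        sigma = F.softplus(self.noise_head(f))                       # local noise level
        mu = f.mean(dim=(2, 3), keepdim=True)
        var = f.var(dim=(2, 3), keepdim=True)
        f_norm = (f - mu) / torch.sqrt(var + sigma ** 2 + 1e-6)      # noise-aware normalization
        return x_tilde, f_norm

m = StructureGate()
x = torch.randn(1, 16, 32, 32); f = torch.randn(1, 16, 32, 32)
x_tilde, f_norm = m(f, x, x)
print(x_tilde.shape, f_norm.shape)
```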

3.3.3 Multi-resolution refinement and uncertainty modeling

To effectively capture multi-scale contextual dependencies, DSINet employs a hierarchical refinement mechanism across S resolution scales {x_s}_{s=1}^{S} (as shown in Figure 4).


Figure 4. This figure illustrates the channel attention mechanism used in the multi-resolution refinement and uncertainty modeling framework. Multi-scale feature maps undergo average and max pooling, followed by a shared MLP to compute a channel-wise attention map. The resulting attention weights, after sigmoid activation, enhance feature consistency across scales and contribute to uncertainty-aware reconstruction in DSINet.
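
A compact sketch of the channel attention summarized in Figure 4 is shown below, assuming a CBAM-style design (a shared MLP over average- and max-pooled descriptors followed by a sigmoid); the reduction ratio is an illustrative choice.

```python
# Illustrative sketch of the channel attention in Figure 4: avg- and max-pooled
# descriptors pass through a shared MLP and a sigmoid to reweight channels.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP for both branches
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                              # x: (b, c, h, w)
        avg = self.mlp(x.mean(dim=(2, 3)))             # AvgPool branch
        mx = self.mlp(x.amax(dim=(2, 3)))              # MaxPool branch
        attn = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * attn                                # channel-reweighted features

print(ChannelAttention(32)(torch.randn(2, 32, 16, 16)).shape)
```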

At each scale s, the feature representation is recursively updated as (Equation 25):

x_{s+1} = U(F_s(x_s)),    (25)

where F_s(·) denotes the DSINet processing module at scale s, and U(·) is an upsampling operator. The observed degraded image y is similarly downsampled to match each scale, and a residual is computed (Equation 26):

r_s = y_s − H_s(x̃_s),    (26)

guiding the network toward progressively refined reconstructions.

To ensure coherence between different scales, a scale alignment loss is introduced (Equation 27):

L_align = ∑_{s=1}^{S−1} ‖D(x_{s+1}) − x_s‖₂²,    (27)

where D(·) is a downsampling operator compatible with U(·). This encourages consistent feature representations across scales, reducing artifacts due to scale mismatch.

The final high-resolution output x^ is obtained by merging the outputs from all scales via a fusion module Φ(·) (Equation 28):

x̂ = Φ(x_1, x_2, …, x_S),    (28)

allowing the network to integrate fine and coarse information.

To further improve reconstruction robustness, DSINet explicitly models the uncertainty associated with its predictions. A predictive uncertainty map Σ(x) is generated, and the final prediction is corrected based on this uncertainty (Equation 29):

x̂(i) = x(i) + β Σ(x)(i),    (29)

where β is a learnable scalar controlling the correction magnitude. By explicitly accounting for prediction uncertainty, DSINet can adaptively refine uncertain regions, leading to overall more reliable reconstructions.
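
The interplay of Equations 25–29 can be sketched as follows; the bilinear up/downsampling operators, the simple averaging fusion Φ, and the placeholder per-scale modules are simplifying assumptions, not the exact DSINet layers.

```python
# Illustrative sketch of multi-resolution refinement (Eq. 25), the scale-alignment
# loss (Eq. 27), fusion (Eq. 28), and uncertainty correction (Eq. 29).
import torch
import torch.nn.functional as F

def refine_scales(x_coarse, modules, scales=3):
    """x_{s+1} = U(F_s(x_s)) across scales."""
    xs = [x_coarse]
    for s in range(scales - 1):
        up = F.interpolate(xs[-1], scale_factor=2, mode="bilinear", align_corners=False)
        xs.append(modules[s](up))
    return xs

def scale_alignment_loss(xs):
    """L_align = sum_s || D(x_{s+1}) - x_s ||_2^2."""
    loss = 0.0
    for s in range(len(xs) - 1):
        down = F.interpolate(xs[s + 1], size=xs[s].shape[-2:],
                             mode="bilinear", align_corners=False)   # D(.) paired with U(.)
        loss = loss + ((down - xs[s]) ** 2).mean()
    return loss

def fuse_and_correct(xs, uncertainty, beta=0.1):
    """x_hat = Phi(x_1..x_S) followed by the uncertainty correction of Eq. 29."""
    size = xs[-1].shape[-2:]
    up = [F.interpolate(x, size=size, mode="bilinear", align_corners=False) for x in xs]
    fused = torch.stack(up, dim=0).mean(dim=0)        # simple averaging fusion Phi
    return fused + beta * uncertainty

modules = [torch.nn.Conv2d(1, 1, 3, padding=1) for _ in range(2)]
xs = refine_scales(torch.randn(1, 1, 16, 16), modules)
print(scale_alignment_loss(xs).item(),
      fuse_and_correct(xs, torch.zeros(1, 1, 64, 64)).shape)
```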

Through these innovations, DSINet successfully addresses the challenges posed by complex and severe degradations in real-world imaging tasks. Its design enables dynamic adjustment to spatial context, effective structural preservation, hierarchical refinement, and robust uncertainty-aware reconstruction without relying on any fixed prior assumptions.

3.4 Progressive structure-guided optimization

In this section, we introduce progressive structure-guided optimization (PSGO), a novel iterative framework designed to refine the imaging reconstruction produced by DSINet. The core idea of PSGO is to progressively focus on reliable structural components while dynamically adjusting the optimization pathway based on intermediate recovery states (as shown in Figure 3).

3.4.1 Adaptive confidence-guided decomposition

Given the degraded observation y ∈ ℝⁿ and the current estimation x_t ∈ ℝⁿ at iteration t, PSGO defines an adaptive objective function (Equation 30):

J_t(x) = D(y, H(x)) + λ_t R_t(x),    (30)

where D(·,·) measures the fidelity to the observation, and R_t(·) is a structure-aware regularizer evolving over iterations. At each iteration t, the image domain is decomposed into a reliable region I_t and an uncertain region I_t^c, according to a structural confidence map C_t : ℝⁿ → [0,1]ⁿ (Equation 31):

I_t = {i ∈ [n] : C_t(i) ≥ τ_t},    (31)

where τ_t is a dynamically decreasing threshold controlling the inclusion of pixels. The confidence map C_t is estimated by evaluating the consistency between the observation and the forward model (Equation 32):

C_t(i) = exp(−(y(i) − (H(x_t))(i))² / σ²),    (32)

where σ is a robustness parameter estimated empirically or via auxiliary networks. To emphasize reliable regions during optimization, PSGO applies a spatial weighting scheme (Equation 33):

ω_i^t = { 1, if i ∈ I_t;  γ_t, if i ∈ I_t^c },    (33)

where γ_t ∈ (0,1) progressively increases with t to allow gradual inclusion of uncertain regions. The modified discrepancy loss at iteration t becomes (Equation 34):

D_t(y, H(x)) = ∑_{i=1}^{n} ω_i^t (y(i) − (H(x))(i))².    (34)
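
A minimal sketch of this confidence-guided weighting (Equations 31–34) is given below; the forward operator, threshold τ, and weight γ used here are placeholders chosen only for illustration.

```python
# Illustrative sketch of the confidence map (Eq. 32), the reliable/uncertain
# decomposition (Eqs. 31 and 33), and the weighted fidelity loss (Eq. 34).
import torch

def confidence_map(y, hx, sigma=0.1):
    """C_t(i) = exp(-(y(i) - H(x_t)(i))^2 / sigma^2)."""
    return torch.exp(-((y - hx) ** 2) / (sigma ** 2))

def weighted_fidelity(y, hx, conf, tau=0.8, gamma=0.2):
    """D_t = sum_i w_i^t (y(i) - H(x)(i))^2, with w = 1 on I_t and gamma on I_t^c."""
    reliable = (conf >= tau).float()                  # indicator of the reliable region I_t
    weights = reliable + gamma * (1.0 - reliable)
    return (weights * (y - hx) ** 2).sum()

y = torch.rand(1, 1, 32, 32)
hx = y + 0.05 * torch.randn_like(y)                   # current estimate pushed through H
conf = confidence_map(y, hx)
print(weighted_fidelity(y, hx, conf).item())
```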

3.4.2 Dynamic graph-regularized propagation

The structure-guided regularization R_t(x) incorporates dynamic graph-based constraints. A graph G_t = (V, E_t) is defined over the image domain, where V = {1, 2, …, n} and E_t connects structurally similar pixels based on feature similarities (Equation 35):

E_t = {(i, j) : ‖F(x_t)(i) − F(x_t)(j)‖₂ ≤ ε_t}.    (35)

Here F(x) is a feature extractor and ε_t is a tightening threshold decreasing over time. The graph-based regularizer is defined as (Equation 36):

R_t(x) = ∑_{(i,j) ∈ E_t} w_{ij}^t ‖x(i) − x(j)‖₂²,    (36)

where w_{ij}^t are adaptive edge weights based on feature affinities (Equation 37):

w_{ij}^t = exp(−‖F(x_t)(i) − F(x_t)(j)‖₂² / β_t),    (37)

and β_t decreases gradually to sharpen structural attention. The optimization at each iteration t updates x_{t+1} according to (Equation 38):

x_{t+1} = x_t − η_t ∇J_t(x_t),    (38)

where η_t is an adaptive step size schedule satisfying (Equation 39):

η_t = η_0 / (1 + ρt),    (39)

with η_0 being the initial learning rate and ρ controlling the decay rate. To prevent over-smoothing and preserve fine structures, a residual correction term is introduced (Equation 40):

x_{t+1} ← x_{t+1} + ζ_t R_s(x_t),    (40)

where R_s(·) denotes a residual sharpening operator and ζ_t is a decaying coefficient.
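
The following sketch illustrates Equations 35–40 on a small example; for simplicity it uses fixed 4-neighbor grid edges in place of learned feature-similarity edges and an identity forward operator, both of which are assumptions made only for illustration.

```python
# Illustrative sketch of the graph regularizer (Eqs. 36-37) and the
# decaying-step gradient update (Eqs. 38-39) on a toy image.
import torch

def grid_edges(h, w):
    """Horizontal and vertical neighbor pairs over an h x w grid (stand-in for E_t)."""
    idx = torch.arange(h * w).view(h, w)
    hor = torch.stack([idx[:, :-1].reshape(-1), idx[:, 1:].reshape(-1)], dim=1)
    ver = torch.stack([idx[:-1, :].reshape(-1), idx[1:, :].reshape(-1)], dim=1)
    return torch.cat([hor, ver], dim=0)

def graph_regularizer(x_flat, feat_flat, edges, beta_t=1.0):
    """R_t(x) = sum_{(i,j)} w_ij^t ||x(i) - x(j)||^2 with Gaussian affinities."""
    i, j = edges[:, 0], edges[:, 1]
    w = torch.exp(-((feat_flat[i] - feat_flat[j]) ** 2).sum(dim=1) / beta_t)
    return (w * ((x_flat[i] - x_flat[j]) ** 2).sum(dim=1)).sum()

def psgo_step(x, y, feat, edges, t, lam=0.1, eta0=0.5, rho=0.05):
    """x_{t+1} = x_t - eta_t * grad J_t(x_t), with eta_t = eta0 / (1 + rho * t)."""
    x = x.detach().requires_grad_(True)
    fidelity = ((y - x) ** 2).sum()                   # identity H(.) for illustration
    J = fidelity + lam * graph_regularizer(x.view(-1, 1), feat, edges)
    grad, = torch.autograd.grad(J, x)
    return (x - (eta0 / (1.0 + rho * t)) * grad).detach()

h = w = 16
y = torch.rand(h, w); x = torch.zeros(h, w)
feat = y.view(-1, 1)                                  # stand-in for F(x_t)
edges = grid_edges(h, w)
for t in range(5):
    x = psgo_step(x, y, feat, edges, t)
print(float(((y - x) ** 2).mean()))
```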

3.4.3 Iterative restructuring with uncertainty adaptation

Every K iterations, PSGO performs a restructuring phase, re-estimating the graph G_t and recalibrating the confidence map C_t, ensuring adaptability to the evolving reconstruction (as shown in Figure 5).


Figure 5. Iterative restructuring with uncertainty adaptation for multimodal fusion. The figure presents a multimodal learning system integrating visual, speech, and text modalities. Inputs x_v, x_s, and x_t are encoded into modality-specific features z_v, z_s, and z_t, which are then fused via a multimodal fusion with uncertainty-aware reweighting module. The fused representation is processed through an iterative uncertainty-aware optimization (IUO) framework, enabling two tasks: generation via a latent decoder and classification via an IUO-guided classification head. Every K iterations, the system performs restructuring by updating the graph G_t and recalibrating the confidence map C_t, using adaptive thresholding and uncertainty-guided reweighting. This leads to structure-consistent, robust reconstruction and improved task performance through dynamically regularized optimization.

The re-structuring updates the confidence threshold as (Equation 41):

τ_{t+K} = κ τ_t,    (41)

where κ ∈ (0,1) controls the relaxation speed. Moreover, PSGO integrates an uncertainty-aware reweighting mechanism to focus learning efforts (Equation 42):

ω_i^t ← ω_i^t · (1 + δ Σ(x_t)(i)),    (42)

where Σ(x_t) is the predictive uncertainty map from DSINet, and δ is a positive scalar adjusting the modulation strength. To guarantee convergence, a diminishing residual energy condition is enforced (Equation 43):

‖y − H(x_t)‖₂² − ‖y − H(x_{t+1})‖₂² ≥ ξ_t,    (43)

where ξ_t is a non-increasing sequence ensuring steady improvement. Finally, the overall reconstruction after T iterations is (Equation 44):

x̂ = x_T,    (44)

providing a high-fidelity, structure-consistent, and noise-robust latent image.

Through this progressively structured, adaptively weighted, and dynamically regularized optimization procedure, PSGO enhances the reconstruction quality beyond traditional static optimization strategies, effectively handling complex degradation patterns and diverse noise characteristics across imaging tasks.

The feature extractor F(x_t) used for graph construction is implemented as a lightweight encoder module within DSINet, designed to capture mid-level semantic and structural cues. Specifically, the features are extracted from an intermediate layer of the encoder, balancing between spatial resolution and contextual richness. This design enables the graph construction process to preserve both local texture and global structure information without incurring excessive computational cost. Regarding the choice of hyperparameters in the equations above, each plays a role in controlling the dynamics of confidence propagation and structure-guided optimization. The parameter σ determines the sensitivity of edge weights in the graph and is empirically set based on the standard deviation of feature distances within a batch. β_t governs the strength of temporal smoothing and is scheduled to decay logarithmically with iteration t to allow early-stage exploration and late-stage stabilization. The coefficients γ_t and ζ_t are initialized to small positive constants (e.g., 0.01 and 0.05, respectively) and updated adaptively based on the gradient norm of the confidence map, encouraging stronger corrections in uncertain regions. The graph and confidence map are restructured every K = 5 iterations, which reflects a trade-off between computational cost and structural adaptability. A smaller K increases responsiveness but also overhead, while a larger K may delay convergence in rapidly changing regions. Empirically, K = 5 yielded stable and efficient convergence across all datasets. The adaptive step size η_t is updated using a momentum-based scheme that incorporates the variance of previous updates, promoting smoother convergence and avoiding oscillations. In our implementation, this mechanism leads to faster stabilization of confidence propagation compared to fixed-step schemes. Constructing a fully connected graph over all pixels is computationally infeasible for high-resolution images. Therefore, we adopt a local neighborhood approximation where each node connects to its k nearest neighbors (e.g., k = 16) in feature space using an efficient approximate nearest neighbor algorithm (e.g., FAISS). This approximation reduces graph complexity from O(N²) to O(Nk) while maintaining structural fidelity. Additional sparsity is enforced by thresholding weak edges to zero, resulting in a sparse adjacency matrix suitable for GPU-accelerated computation.
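
A sketch of this sparse k-NN graph construction is shown below; it substitutes an exact torch.cdist/topk search for an approximate nearest-neighbor library such as FAISS, and the values of k, β, and the edge threshold are placeholders.

```python
# Illustrative sketch of sparse k-NN graph construction with Gaussian edge
# weights (as in Eq. 37) and thresholding of weak edges to zero.
import torch

def knn_graph(features, k=16, weight_threshold=1e-3, beta=1.0):
    """Return a sparse (N x N) adjacency over the k nearest neighbors of each node."""
    n = features.shape[0]
    dist = torch.cdist(features, features)            # pairwise feature distances
    dist.fill_diagonal_(float("inf"))                 # exclude self-loops
    vals, idx = torch.topk(dist, k, dim=1, largest=False)
    weights = torch.exp(-vals ** 2 / beta)            # Gaussian affinity
    weights = torch.where(weights > weight_threshold, weights, torch.zeros_like(weights))
    rows = torch.arange(n).repeat_interleave(k)
    adj = torch.sparse_coo_tensor(
        torch.stack([rows, idx.reshape(-1)]), weights.reshape(-1), (n, n)
    ).coalesce()
    return adj                                        # O(Nk) edges instead of O(N^2)

feat = torch.randn(1024, 32)                          # per-pixel feature vectors F(x_t)
adj = knn_graph(feat)
print(adj.shape, adj.values().numel())
```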

4 Experimental setup

4.1 Dataset

The BraTS Dataset (34) is a widely used benchmark in the field of medical image analysis, specifically designed for the segmentation of brain tumors in multi-modal magnetic resonance imaging (MRI) scans. It contains annotated images from patients with gliomas, both high-grade and low-grade, across multiple institutions and scanners, ensuring diversity and robustness. The dataset includes several MRI sequences such as T1, T1-contrast enhanced, T2, and FLAIR, providing comprehensive structural information crucial for tumor delineation. The ground truth labels are manually segmented by experienced radiologists, distinguishing tumor core, enhancing tumor, and edema regions. BraTS has been the focus of multiple international challenges, driving advancements in automated brain tumor segmentation and serving as a reference for validating new methodologies. Its rigorous curation, multi-center acquisition, and detailed annotations make it indispensable for developing and benchmarking machine learning algorithms for neuro-oncology.

The OASIS Dataset (35) is an open-access collection of brain imaging data aimed at advancing the understanding of normal aging and Alzheimer’s disease. It includes cross-sectional and longitudinal MRI scans from a large cohort of participants ranging in age and cognitive status, from young healthy adults to elderly individuals with varying degrees of cognitive impairment and dementia. The dataset provides structural MRI volumes, demographic information, and clinical assessments such as the Clinical Dementia Rating (CDR). OASIS is widely used for studying brain morphometry, the progression of neurodegenerative diseases, and for training and validating automated brain segmentation and classification algorithms. The careful design and broad scope of OASIS enable comprehensive studies on aging, neuroanatomical changes, and disease progression, fostering reproducibility and comparison across different research groups and methodologies in the neuroscience community.

The LUNA16 Dataset (36) is a large-scale, curated resource for the detection of lung nodules in computed tomography (CT) images. Derived from the publicly available LIDC-IDRI database, LUNA16 consists of a carefully selected subset of CT scans with annotated pulmonary nodules by multiple radiologists. Each nodule annotation is provided with spatial coordinates and diameter, ensuring precise localization and quantification. The dataset focuses on nodules greater than 3 mm in diameter, reflecting clinical relevance in lung cancer screening. LUNA16 has served as the foundation for the Lung Nodule Analysis Grand Challenge, enabling standardized evaluation and comparison of algorithms for automated nodule detection and classification. Its comprehensive annotation protocol, high-resolution CT data, and public accessibility have made it a benchmark for advancing computer-aided detection systems in thoracic imaging.

The MURA Dataset (37) is a large-scale musculoskeletal radiograph dataset designed for the development and assessment of algorithms for abnormality detection in bone X-rays. It contains over 40,000 images from more than 14,000 studies, spanning seven standard upper extremity radiographic examinations, namely the elbow, finger, hand, humerus, forearm, shoulder, and wrist. Each study is labeled by board-certified radiologists as either normal or abnormal, reflecting clinically relevant findings encountered in routine practice.
The MURA dataset enables both classification and localization tasks, as it provides image-level labels and, for a subset, bounding box annotations. Its scale, diversity of anatomical regions, and expert labeling have made it a reference dataset for evaluating deep learning models in musculoskeletal imaging and for facilitating research on automated abnormality detection in radiographs. A detailed summary of all datasets used in this study is provided in Table 2, including the disease type, imaging modality, availability of molecular information, classification endpoints, sample size, and the data splitting strategy. This table is intended to clarify the experimental settings and facilitate reproducibility and comparison with related studies.


Table 2. Summary of datasets used in this study, including imaging modality, classification endpoint, molecular information, and sample allocation.

4.2 Experimental details

All experiments were conducted on a workstation equipped with NVIDIA RTX 3090 GPUs, 256GB RAM, and Intel Xeon CPUs, running Ubuntu 20.04 LTS. The full framework was implemented using PyTorch (version 1.12) with CUDA 11.3 acceleration, and random seeds were fixed at the beginning of each run to ensure reproducibility. The preprocessing steps included intensity normalization, resampling to a standardized spatial resolution (voxel spacing for 3D datasets or pixel spacing for 2D images), and cropping or padding to fixed input dimensions. For MRI datasets, additional neuroimaging-specific steps such as bias field correction and skull stripping were applied to remove non-brain tissue. Data augmentation strategies were applied during training, including random rotations, scaling, horizontal and vertical flipping, elastic deformations, and intensity jittering, to enhance robustness and reduce overfitting. In all experiments, the DSINet framework served as the primary backbone for both image enhancement and downstream medical image analysis. For volumetric 3D datasets such as BraTS and LUNA16, a 3D variant of DSINet was constructed using 3D convolutional layers and patch-based input processing. For 2D datasets such as MURA and OASIS, a 2D version of DSINet was used, in which a Vision Transformer extracted visual tokens that were then fused with text- or molecule-derived semantic embeddings through a cross-attention module. The progressive structure-guided optimization (PSGO) module was integrated across all configurations to refine predictions by iteratively updating confidence-aware spatial features. Model weights were initialized using He initialization for convolutional layers and Xavier initialization for fully connected layers. Training was conducted with a batch size of 4 for 3D datasets and 32 for 2D datasets, using the Adam optimizer with an initial learning rate of 1 × 10−4. The learning rate was reduced by a factor of 0.5 when the validation loss did not improve for 10 consecutive epochs. Early stopping with a patience of 20 epochs was applied to prevent overfitting, and training proceeded for a maximum of 200 epochs. The final model used for evaluation corresponded to the checkpoint with the lowest validation loss. Although DSINet is initially designed as a structure-aware image enhancement model, the learned representations from its encoder were reused for classification and segmentation tasks. For classification, a softmax layer was added after global average pooling on the encoder outputs. For segmentation, a lightweight decoder was attached to the multi-scale visual features, trained end-to-end using a combination of Dice loss and binary cross-entropy loss. For classification tasks, categorical cross-entropy was used. Evaluation was performed on held-out test sets using metrics appropriate to each task, including Dice similarity coefficient (DSC), intersection over union (IoU), sensitivity, specificity, area under the ROC curve (AUC), accuracy, and average precision. Each experiment was repeated three times with different random seeds, and mean and standard deviation were reported to ensure statistical robustness. Data splits were performed in a stratified manner to maintain class balance. External validation was also conducted when possible, using either publicly available test sets or cross-institutional data, to assess the generalizability of the proposed method.
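
The training schedule described above can be condensed into the following sketch; the model, data loaders, and loss function are placeholders, and only the optimizer choice, learning-rate decay, and early-stopping logic mirror the settings stated in the text.

```python
# Condensed sketch of the stated training schedule: Adam (lr 1e-4), halve the LR
# after 10 stagnant validation epochs, early stopping after 20, max 200 epochs,
# and keep the checkpoint with the lowest validation loss.
import torch

def train(model, train_loader, val_loader, loss_fn, device="cuda"):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=10)
    best, stagnant, best_state = float("inf"), 0, None
    for epoch in range(200):                          # maximum of 200 epochs
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x.to(device)), y.to(device)).item()
                      for x, y in val_loader) / len(val_loader)
        sched.step(val)                               # LR decay on plateau
        if val < best:
            best, stagnant = val, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            stagnant += 1
            if stagnant >= 20:                        # early stopping
                break
    model.load_state_dict(best_state)                 # lowest-validation-loss checkpoint
    return model
```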

To specify the molecular data utilized in this study, gene expression profiles were obtained from The Cancer Genome Atlas (TCGA) and the METABRIC dataset. In particular, the TCGA-LGG (low grade glioma) and TCGA-BRCA (breast cancer) cohorts provided matched histopathological images and transcriptomic data. The expression values were normalized, filtered using variance thresholds, and z-score standardized. Sample correspondence across modalities was ensured using unique patient identifiers. During model construction, molecular features were concatenated with visual embeddings to form a unified representation that incorporates both anatomical and biological information. The multimodal learning architecture adopts two parallel encoding pathways. Visual information is processed through a Vision Transformer (ViT) to extract semantic features from imaging data. In parallel, molecular vectors are processed using a lightweight transformer encoder. The resulting feature embeddings are fused via a cross-attention module, which dynamically aligns and weights information from both modalities. This design allows molecular context to influence spatial filtering and prediction, improving robustness under ambiguous or noisy visual conditions. The joint representation is subsequently refined through structure-aware filtering and confidence-guided optimization mechanisms implemented in DSINet and PSGO.
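
A minimal sketch of this two-pathway fusion is given below; the embedding dimension, number of heads and layers, and token counts are illustrative assumptions rather than the exact configuration used in the experiments.

```python
# Illustrative sketch of the cross-attention fusion: ViT image tokens act as
# queries against molecular feature tokens so that molecular context reweights
# the visual representation.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.mol_encoder = nn.TransformerEncoder(     # lightweight molecular transformer
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2,
        )
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens, mol_tokens):
        mol = self.mol_encoder(mol_tokens)            # molecular embeddings
        fused, _ = self.cross_attn(query=vis_tokens, key=mol, value=mol)
        return self.norm(vis_tokens + fused)          # molecularly conditioned visual tokens

vis = torch.randn(2, 196, 256)                        # ViT patch tokens
mol = torch.randn(2, 32, 256)                         # projected gene-expression tokens
print(CrossModalFusion()(vis, mol).shape)             # torch.Size([2, 196, 256])
```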

4.3 Comparison with SOTA methods

Tables 3 and 4 illustrate a systematic contrast between our algorithm and high-performing baseline methods across four commonly adopted datasets: BraTS, OASIS, LUNA16, and MURA. For each dataset, multiple standard evaluation metrics are reported, including accuracy, precision, recall, and F1 score, enabling a robust and multi-faceted assessment of classification and detection performance. On the BraTS dataset, which focuses on brain tumor segmentation, our method achieves a significant improvement over established models such as ResNet50, DenseNet121, EfficientNet-B0, ViT, ConvNeXt, and DeiT. Specifically, our model obtains an accuracy rate of 92.61 ± 0.03, which is 2.39% higher than ConvNeXt, the best-performing baseline. Similarly, on the OASIS dataset, which addresses Alzheimer’s disease prediction from structural MRI, our method attains an accuracy rate of 91.98 ± 0.02, surpassing the closest SOTA competitor, ConvNeXt, by nearly 3%. In addition to overall accuracy, our model consistently outperforms competitors across all other key metrics, demonstrating balanced improvements in both precision and recall, leading to higher F1 scores and thus more reliable detection. The standard deviations across repeated experiments are also notably lower for our approach, underscoring its robustness and reproducibility. Performance gains are not limited to classification; improvements are observed in segmentation precision, indicating that our method can better delineate boundaries in complex neuroimaging data, which is critical for clinical applicability. In the case of the LUNA16 and MURA datasets, which focus on lung nodule detection in chest CT and musculoskeletal abnormality detection in radiographs, respectively, the superiority of our approach remains evident. On LUNA16, our method records an accuracy of 91.48 ± 0.03, outperforming ConvNeXt-Tiny and Swin-Transformer by margins of nearly 3%. For the MURA dataset, which is particularly challenging due to high inter-class variability and subtle abnormality cues, our model achieves an accuracy rate of 87.76 ± 0.03, distinctly higher than all baseline models, with a significant boost in both recall and F1 score. This indicates that our method not only correctly identifies more abnormal cases but also maintains a low rate of false positives and negatives, which is vital in clinical screening settings. Our analysis reveals that while transformer-based methods such as ViT, Swin-Transformer, and MAE have demonstrated notable improvements over traditional convolutional neural network (CNN) architectures, especially in capturing global context and long-range dependencies, their performance still lags behind our proposed approach. One likely reason is that our model introduces adaptive feature fusion modules and attention mechanisms specifically designed for medical imaging, allowing for enhanced extraction of domain-relevant features and more effective integration of multi-scale contextual information. This design mitigates some of the common pitfalls encountered by standard transformers, such as overfitting on small datasets and insufficient representation of fine-grained patterns.

Table 3. Empirical study of our model versus top-performing methods on BraTS and OASIS datasets.

Table 4. Assessment of our solution compared with SOTA algorithms on LUNA16 and MURA.

The marked improvement of our method across diverse datasets and imaging modalities can be attributed to several core design principles and technical innovations. First, our method leverages advanced regularization strategies and cross-domain data augmentation, which collectively enhance generalization and mitigate dataset-specific biases. For instance, the inclusion of modality-adaptive normalization enables the model to dynamically adjust to intensity and distributional variations present in MR, CT, and X-ray images, addressing a key limitation of many existing models. Second, the hierarchical attention mechanism integrated into our network architecture enables both global and local context modeling, ensuring that the model can focus on subtle anatomical abnormalities while also leveraging broader contextual cues, a capability essential for accurate medical image interpretation. Third, as highlighted in method ablation studies and reflected in Table 3, our approach incorporates task-specific loss functions and balanced sampling strategies, effectively addressing the class imbalance issues prevalent in medical data sets. Furthermore, efficient parameterization of our network, with a focus on lightweight modules and reduced computational overhead, allows practical deployment in real-world clinical scenarios without sacrificing performance. This efficiency is further evidenced by the relatively low variance in performance metrics across repeated runs, demonstrating stability and reliability.
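To make the first of these points concrete, the snippet below sketches one plausible form of modality-adaptive normalization, in which the affine parameters of an instance-normalization layer are generated from a learned modality embedding (MR, CT, or X-ray). The exact layer used in our network is not specified here, so this should be read as an illustrative reading under stated assumptions rather than the implemented module.

```python
# One plausible form of modality-adaptive normalization: instance
# normalization whose scale/shift are produced from a learned modality
# embedding. Illustrative sketch, not the paper's exact layer.
import torch
import torch.nn as nn


class ModalityAdaptiveNorm(nn.Module):
    def __init__(self, channels: int, n_modalities: int = 3):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.embed = nn.Embedding(n_modalities, 2 * channels)  # -> gamma, beta

    def forward(self, x, modality_id):
        # modality_id: LongTensor of shape (B,) identifying MR / CT / X-ray.
        gamma, beta = self.embed(modality_id).chunk(2, dim=-1)
        gamma = gamma.view(x.size(0), -1, 1, 1)
        beta = beta.view(x.size(0), -1, 1, 1)
        # Normalize out scanner- and modality-specific intensity statistics,
        # then re-inject a learned modality-conditioned scale and shift.
        return self.norm(x) * (1 + gamma) + beta
```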

To contextualize the performance comparisons, a brief summary of the baseline models is presented. ResNet50 and ResNet34 are classical convolutional neural network (CNN) architectures that rely on residual connections to facilitate gradient flow and improve training stability. They are widely used in medical imaging tasks but are generally limited in modeling long-range dependencies and require large datasets to generalize effectively. DenseNet121 and DenseNet169 improve upon standard CNNs by introducing dense connectivity between layers, promoting feature reuse and better gradient propagation. These models are efficient and parameter-light but still rely on fixed receptive fields, which may be suboptimal for highly variable biomedical structures. EfficientNet-B0 and MobileNetV2 are lightweight CNN variants optimized for performance-efficiency trade-offs. While effective in resource-constrained environments, their relatively shallow architectures may limit expressiveness in complex classification tasks. Vision Transformers (ViT), Swin-Transformer, and MAE belong to the transformer-based family of models. ViT applies self-attention mechanisms directly to flattened image patches, enabling global context modeling. Swin-Transformer introduces hierarchical feature representations with shifted windows to balance local and global feature extraction. MAE leverages masked image modeling as a self-supervised pretraining method. These models offer improved representation learning but often require large-scale pretraining and are sensitive to data scarcity. ConvNeXt and ConvNeXt-Tiny represent a hybrid design that modernizes CNNs using architectural concepts from transformers, such as layer normalization and GELU activations, while retaining convolutional inductive biases. They offer strong performance in various vision tasks but still lack mechanisms to incorporate uncertainty or spatial confidence. In contrast, the proposed DSINet + PSGO architecture incorporates spatially adaptive filtering, structure-preserving attention, and uncertainty-aware optimization. This combination allows it to dynamically adjust to image content, selectively focus on reliable regions, and robustly integrate molecular and imaging data—capabilities that are not present in the baseline models.

4.4 Ablation study

To validate the contribution of each proposed component, we conducted an ablation study across the BraTS, OASIS, LUNA16, and MURA datasets. We systematically removed each key module (spatially adaptive dynamic filtering, structure-preserving attention modulation, and multi-resolution refinement with uncertainty modeling) from the full DSINet framework to examine its individual impact. The experimental results, summarized in Tables 5 and 6, show consistent performance degradation across all major evaluation metrics (accuracy, precision, recall, and F1) whenever any single module is removed.

Table 5. Ablation study results on BraTS and OASIS datasets.

Table 6. Ablation study results on LUNA16 and MURA datasets.

The exclusion of spatially adaptive dynamic filtering leads to the most substantial accuracy and recall drops, reflecting its role in dynamically adjusting filtering behavior according to local image degradations. Removing structure-preserving attention modulation notably reduces precision and F1 scores, indicating its effectiveness in preserving essential structural information during reconstruction. The omission of multi-resolution refinement and uncertainty modeling results in significant decreases in recall and F1 scores, underscoring the importance of capturing multi-scale dependencies and managing prediction uncertainty. These results affirm that all three components jointly address diverse challenges in medical image reconstruction and are indispensable for achieving robust and high-fidelity results.

To enhance the interpretability of the molecular-informed classification framework and evaluate robustness under clinical conditions, several additional experiments were conducted. Each dataset was processed independently, and separate models were trained due to variations in imaging modalities and the presence or absence of molecular annotations. Molecular features were incorporated only for datasets where such data were available (like APOE genotype in the OASIS dataset). A unified model across all datasets was intentionally avoided to prevent confounding due to heterogeneous input distributions.

To quantify the contribution of molecular features, a SHAP (SHapley Additive exPlanations) analysis was performed on the OASIS dataset. Table 7 presents the importance ranking of features used in the classification task. The APOE genotype was found to be the most predictive molecular factor, exceeding key imaging-derived features in impact. In a modality ablation study, molecular features were removed at test time to assess their impact on predictive performance. As shown in Table 8, a noticeable decline in F1 score was observed, particularly on datasets with informative molecular annotations, highlighting the value of incorporating such features.

To simulate real-world clinical scenarios involving incomplete annotations, molecular features were randomly masked at rates of 10%, 30%, and 50%. The classification performance degraded gracefully, with less than 6% decline in F1 score even under 50% missing molecular input, indicating that the model retained robustness under partial information.
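The masking experiment can be summarized by a short evaluation routine such as the one below. The data loader, the zero-imputation of masked features, and the use of macro-averaged F1 are assumptions made for illustration, not a description of the exact evaluation code.

```python
# Sketch of the test-time masking protocol: molecular inputs are randomly
# zeroed at a given rate and F1 is re-measured. `model` and `loader` are
# placeholders; zero-imputation of masked features is an assumption.
import torch
from sklearn.metrics import f1_score


@torch.no_grad()
def f1_under_masking(model, loader, mask_rate: float, device: str = "cuda"):
    model.eval()
    y_true, y_pred = [], []
    for img_tokens, mol, labels in loader:
        mol = mol.clone()
        drop = torch.rand_like(mol) < mask_rate   # Bernoulli mask per feature
        mol[drop] = 0.0                           # missing values imputed as 0
        logits = model(img_tokens.to(device), mol.to(device))
        y_pred.extend(logits.argmax(dim=1).cpu().tolist())
        y_true.extend(labels.tolist())
    return f1_score(y_true, y_pred, average="macro")


# Example sweep over the masking rates reported above:
# for rate in (0.0, 0.1, 0.3, 0.5):
#     print(rate, f1_under_masking(model, test_loader, rate))
```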

Table 7. SHAP-based feature importance ranking on OASIS dataset.

Table 8. Performance impact of removing molecular inputs across datasets.

To complement the ablation studies and provide insight into computational feasibility, this section reports the efficiency metrics of the full and simplified versions of the proposed architecture. Table 9 summarizes the key performance indicators: number of trainable parameters, GPU memory consumption, training time per epoch, and inference time per image. All experiments were conducted on a single NVIDIA RTX 3090 GPU with 24GB memory. The results indicate that the full version of DSINet + PSGO requires approximately 42.8 million parameters and 15.6 GB of GPU memory during training. Inference time is 247 ms per image on average. In contrast, the lightweight version—created by reducing ViT depth and disabling multiscale uncertainty refinement—achieves a 45% reduction in latency and over 50% memory savings, with only a 2.7% absolute drop in F1 score on the BraTS dataset. These findings suggest that while the full model offers maximum accuracy, lighter configurations may be more suitable for real-time or mobile deployment scenarios. Such trade-offs between accuracy and efficiency can be selected based on deployment context. Future work will further explore model pruning and quantization techniques to reduce inference overhead in clinical applications.
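The efficiency figures in Table 9 can be reproduced with a straightforward profiling routine of the kind sketched below; the warm-up length, run count, and the use of peak allocated memory as the reported value are assumptions about the measurement protocol.

```python
# Rough profiling recipe for parameter count, peak GPU memory, and
# per-image latency. Warm-up and run counts are assumptions.
import time
import torch


def profile(model, example_inputs, device: str = "cuda", n_runs: int = 100):
    model = model.to(device).eval()
    inputs = [x.to(device) for x in example_inputs]
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        for _ in range(10):                  # warm-up, excluded from timing
            model(*inputs)
        torch.cuda.synchronize(device)
        start = time.perf_counter()
        for _ in range(n_runs):
            model(*inputs)
        torch.cuda.synchronize(device)
    latency_ms = (time.perf_counter() - start) / n_runs * 1000
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    return n_params / 1e6, latency_ms, peak_gb
```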

Table 9. Comparison of computational efficiency for DSINet + PSGO and a lightweight variant.

To strengthen the translational relevance of the proposed method, additional experiments were conducted to explore whether the image regions highlighted by DSINet correspond to biologically meaningful pathways and molecular processes related to drug sensitivity. A subset of 116 matched cases from the TCGA-LGG cohort was used, where both histopathological whole-slide images and transcriptomic data were available. For each case, attention maps were extracted from DSINet outputs. The top 10% most salient regions were localized and spatially registered back to the original slides. Gene expression profiles from the same patients were used to calculate pathway activity scores using single-sample Gene Set Enrichment Analysis (ssGSEA), with KEGG, Reactome, and Hallmark gene sets as references. Figure-level attention saliency and patient-level transcriptomic profiles were then correlated. A two-group comparison (high-attention vs. low-attention regions) revealed that the former exhibited significantly higher activity in several known drug-response pathways. The enriched pathways include PI3K/AKT signaling, p53 pathway, mismatch repair, and DNA damage response, all of which are implicated in treatment resistance or sensitivity in glioma and other cancers. The top enriched pathways are summarized in Table 10, along with FDR-adjusted p-values computed via Benjamini–Hochberg correction. These findings provide supporting evidence that the attention-driven outputs of the model are not only spatially meaningful but also biologically interpretable, reinforcing the clinical transparency of the framework. Future work will focus on embedding these biological priors into the model architecture and further validating these findings through clinical collaborations and pathway-aware training strategies.
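The statistical portion of this analysis (a two-group comparison of ssGSEA pathway activity scores followed by Benjamini-Hochberg correction) can be expressed compactly as below. The data-frame layout and the one-sided Mann-Whitney test are illustrative choices for a sketch, not a prescription of the exact pipeline used.

```python
# Sketch: compare per-sample/region ssGSEA pathway scores between the
# high- and low-attention groups and apply Benjamini-Hochberg FDR control.
# Data-frame layout and the one-sided test are illustrative assumptions.
import pandas as pd
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests


def enriched_pathways(scores_high: pd.DataFrame, scores_low: pd.DataFrame):
    """scores_*: rows = samples/regions, columns = pathway activity scores."""
    records = []
    for pathway in scores_high.columns:
        _, p = mannwhitneyu(scores_high[pathway], scores_low[pathway],
                            alternative="greater")
        records.append((pathway, p))
    result = pd.DataFrame(records, columns=["pathway", "p_value"])
    reject, fdr, _, _ = multipletests(result["p_value"], method="fdr_bh")
    result["fdr"] = fdr
    # Return only pathways significant after FDR correction, best first.
    return result[reject].sort_values("fdr")
```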

Table 10. Top enriched biological pathways in high-attention regions using ssGSEA on TCGA-LGG matched cases.

To assess whether the proposed framework can generalize across cancer types, a cross-disease transfer experiment was performed. DSINet + PSGO was first trained on the TCGA-LGG cohort (glioma) and then evaluated without fine-tuning on a small subset of breast cancer (BRCA) and lung adenocarcinoma (LUAD) samples obtained from TCGA, where both histopathology and molecular response proxies (pathway activation scores) were available. Table 11 presents the F1 scores of the model on each dataset under two scenarios: direct inference (zero-shot) and light fine-tuning (3-epoch transfer). The full model trained on glioma data achieved reasonable performance in both BRCA and LUAD cases, and fine-tuning further improved performance. This suggests that core architectural components such as structure-aware filtering and confidence-guided optimization are transferrable across tissue types. In addition to the experimental evidence, it is important to consider the biological basis of drug sensitivity variation across cancers. As Solimando et al. (44) note, the tumor microenvironment, immune infiltration, and bone marrow niche interactions in multiple myeloma contribute significantly to treatment resistance. Similar heterogeneity exists in breast and lung cancers, where spatial proteomics has identified localized signaling changes that influence response. These findings highlight the necessity for predictive models to either incorporate domain adaptation mechanisms or demonstrate transferability across contexts. The modular nature of DSINet + PSGO, with separate components for image filtering, structure modulation, and uncertainty-based optimization, enables future extensions involving domain adaptation, few-shot fine-tuning, or federated learning to further improve real-world applicability.
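The transfer protocol behind Table 11 follows a standard zero-shot-then-fine-tune recipe, illustrated by the sketch below; the optimizer settings, learning rate, and helper functions are assumptions rather than the exact training configuration.

```python
# Sketch of the cross-cancer transfer protocol: direct (zero-shot)
# evaluation of the LGG-trained model, then a light 3-epoch fine-tune.
# Optimizer, learning rate, and helpers are assumptions.
import copy
import torch
from sklearn.metrics import f1_score


@torch.no_grad()
def evaluate_f1(model, loader, device: str = "cuda"):
    model.eval()
    y_true, y_pred = [], []
    for img, mol, y in loader:
        logits = model(img.to(device), mol.to(device))
        y_pred += logits.argmax(dim=1).cpu().tolist()
        y_true += y.tolist()
    return f1_score(y_true, y_pred, average="macro")


def transfer(model_lgg, train_loader, test_loader, epochs=3, lr=1e-5, device="cuda"):
    zero_shot = evaluate_f1(model_lgg, test_loader, device)
    model = copy.deepcopy(model_lgg).train()          # keep the source model intact
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):                           # "light" fine-tuning
        for img, mol, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(img.to(device), mol.to(device)), y.to(device))
            loss.backward()
            opt.step()
    return zero_shot, evaluate_f1(model, test_loader, device)
```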

Table 11. Cross-cancer transferability of DSINet + PSGO across TCGA tumor types.

To better assess the clinical relevance of the proposed framework, additional evaluations were conducted using metrics that extend beyond conventional classification accuracy. In the context of drug sensitivity prediction, clinically meaningful assessment requires not only correct classification but also reliable probability estimates and decision-level utility. Probability calibration was first evaluated by constructing calibration curves for both the TCGA-LGG and TCGA-BRCA cohorts. The proposed DSINet + PSGO produced probability outputs that closely aligned with observed outcome frequencies, indicating strong calibration properties. In addition, the concordance index (C-index) was used to measure how well predicted probabilities ranked patients according to sensitivity likelihood. DSINet + PSGO achieved a C-index of 0.821 and 0.801 on the TCGA-LGG and TCGA-BRCA datasets, respectively, surpassing baseline models such as ConvNeXt and Swin-Transformer. To evaluate clinical decision-making utility, decision curve analysis (DCA) was performed across a range of decision thresholds. The DSINet + PSGO consistently yielded higher net benefit across clinically relevant threshold intervals (0.3–0.7), compared to default treatment strategies (like treat-all or treat-none). These results suggest that the model provides not only high predictive performance but also well-calibrated and actionable outputs for potential clinical use. A summary of the clinically relevant evaluation metrics is presented in Table 12.
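For reference, the clinically oriented metrics can be computed as in the sketch below. For a binary sensitivity label, the concordance index reduces to the area under the ROC curve, and the decision-curve net benefit follows the standard threshold-odds formula; the threshold grid and calibration binning are assumptions.

```python
# Sketch of the clinically oriented evaluation: calibration curve, rank
# concordance of predicted probabilities, and decision-curve net benefit.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score


def clinical_metrics(y_true, y_prob, thresholds=np.arange(0.3, 0.71, 0.05)):
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)

    frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
    c_index = roc_auc_score(y_true, y_prob)   # concordance for a binary outcome

    n = len(y_true)
    net_benefit = []
    for pt in thresholds:
        pred = y_prob >= pt
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        # Standard decision-curve formula: benefit of true positives minus
        # harm of false positives weighted by the threshold odds pt/(1-pt).
        net_benefit.append(tp / n - fp / n * pt / (1 - pt))
    return frac_pos, mean_pred, c_index, list(zip(thresholds, net_benefit))
```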

Table 12. Clinically relevant evaluation metrics for drug sensitivity prediction across cancer cohorts.

To further evaluate the effectiveness of the proposed progressive structure-guided optimization (PSGO) module, a comparative experiment was conducted against several widely used traditional refinement techniques. These include Dense Conditional Random Fields (DenseCRF), mean-field inference-based smoothing, and post-hoc Gaussian filtering. All methods were applied in conjunction with DSINet, where the refinement strategy was used after the initial segmentation output from the encoder-decoder backbone.

As shown in Table 13, the PSGO module consistently outperformed traditional optimization methods across all major segmentation metrics. Compared to DenseCRF, which smooths label maps by leveraging low-level intensity similarities, PSGO utilizes confidence-aware spatial features and structure-guided refinement that adaptively modulate ambiguous regions. Furthermore, unlike fixed-rule filtering methods such as Gaussian smoothing, PSGO is task-specific and data-adaptive, optimizing both feature propagation and boundary delineation. The results demonstrate that PSGO not only preserves fine-grained anatomical boundaries but also enhances semantic coherence under challenging imaging conditions. In particular, improvements were observed in Dice score and precision, indicating reduced over-segmentation and better focus on relevant structures. This confirms the advantage of structure-aware, learnable optimization mechanisms over static or heuristic refinement techniques in the context of medical image segmentation.
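As a point of comparison, the sketch below shows how a fixed-rule baseline such as post-hoc Gaussian filtering is applied and scored in this setting; PSGO or DenseCRF would be substituted at the same refinement step. The smoothing strength, threshold, and Dice formulation are illustrative assumptions.

```python
# Sketch of the refinement comparison set-up: a post-hoc Gaussian-smoothing
# baseline applied to a foreground probability map, scored with Dice.
import numpy as np
from scipy.ndimage import gaussian_filter


def dice(pred_mask: np.ndarray, gt_mask: np.ndarray, eps: float = 1e-7) -> float:
    inter = np.logical_and(pred_mask, gt_mask).sum()
    return (2.0 * inter + eps) / (pred_mask.sum() + gt_mask.sum() + eps)


def gaussian_refine(prob_map: np.ndarray, sigma: float = 1.0, thr: float = 0.5):
    # Fixed-rule smoothing: blur the probability map, then re-threshold.
    # Unlike PSGO, this step is neither confidence- nor structure-aware.
    return gaussian_filter(prob_map, sigma=sigma) > thr


# Example: baseline_dice = dice(gaussian_refine(prob_map), gt_mask)
```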

Table 13. Comparison of PSGO with traditional refinement methods on the BraTS segmentation task.

Molecularly informed image-based drug response prediction refers to a modeling approach that infers drug sensitivity by integrating histopathological imaging features with transcriptomic signals. Although no direct prediction of drug response metrics such as IC50 or AUC is performed, the classification tasks conducted in this work—such as molecular subtype prediction, receptor status classification, and mutation-based grouping—serve as surrogate indicators of therapeutic response. For example, the use of ER/HER2 status in the TCGA-BRCA cohort and IDH mutation status in TCGA-LGG is clinically linked to specific treatment outcomes. The proposed framework enhances prediction accuracy by embedding molecular signals into the imaging feature extraction process. Visual and molecular features are fused using a cross-attention mechanism, and the resulting representation undergoes structure-aware filtering and confidence-based optimization. The downstream analyses, including SHAP-based interpretation and pathway enrichment, further indicate that the model captures biologically relevant drug-response pathways. While explicit drug response prediction is not conducted, the framework supports indirect inference through biologically stratified image phenotypes.
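The surrogate labels referred to above can be derived directly from clinical and molecular annotations, for example as in the hypothetical mapping below; the column names are assumptions about the cohort tables rather than a fixed schema.

```python
# Illustrative construction of surrogate response labels from clinical and
# molecular annotations. Column names are hypothetical, not a fixed schema.
import pandas as pd


def surrogate_labels(clinical: pd.DataFrame, cohort: str) -> pd.Series:
    if cohort == "BRCA":
        # HER2-positive status used as a surrogate for anti-HER2 therapy benefit.
        return (clinical["her2_status"] == "Positive").astype(int)
    if cohort == "LGG":
        # IDH-mutant status, a marker clinically linked to treatment outcome.
        return (clinical["idh_status"] == "Mutant").astype(int)
    raise ValueError(f"Unsupported cohort: {cohort}")
```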

5 Discussion

One limitation of the current framework lies in its reliance on fully labeled multimodal datasets, which are often unavailable in clinical settings due to the cost and effort required for molecular annotation. To improve applicability, future extensions of the framework will explore both semi-supervised and self-supervised learning paradigms. For semi-supervised learning, one promising direction involves using consistency regularization between labeled and unlabeled samples. Given that DSINet provides uncertainty-aware spatial features, these features can be used to propagate pseudo-labels from high-certainty to low-certainty regions within unlabeled images. In addition, PSGO’s confidence-guided optimization offers a natural mechanism for bootstrapping predictions during iterative self-labeling. For self-supervised learning, the multi-branch structure of the model lends itself well to contrastive or predictive objectives. For example, image-only reconstruction tasks (like masked autoencoding) could be used to pretrain the DSINet encoder, while cross-modal alignment losses (like contrastive learning between image patches and gene expression embeddings) could enable joint representation learning in the absence of paired labels. In both scenarios, the structure-preserving and uncertainty-aware design of the framework can serve as inductive biases that enhance learning under weak supervision. These strategies will be essential to support real-world deployment, where multimodal labels are often incomplete or noisy. Demonstrating performance under partial supervision is a key direction for future work.
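One concrete instantiation of this direction is confidence-gated consistency training, sketched below. The confidence threshold, the weak/strong augmentation pairing, and the loss form are assumptions about a possible extension, not components of the current framework.

```python
# Sketch of confidence-gated consistency regularization for unlabeled data:
# pseudo-labels from a weakly augmented view supervise a strongly augmented
# view, but only where predictive confidence is high.
import torch
import torch.nn.functional as F


def consistency_loss(model, img_weak, img_strong, mol, conf_thresh: float = 0.9):
    with torch.no_grad():
        probs = F.softmax(model(img_weak, mol), dim=1)
        conf, pseudo = probs.max(dim=1)          # confidence-gated pseudo-labels
        keep = conf >= conf_thresh
    if keep.sum() == 0:
        return torch.tensor(0.0, device=img_weak.device)
    logits = model(img_strong, mol)
    return F.cross_entropy(logits[keep], pseudo[keep])
```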

Another major limitation of the current framework is its static design, which relies on a single time-point imaging-molecular snapshot to predict baseline drug sensitivity. In clinical oncology, however, therapeutic response is often a moving target, shaped by selective pressure, microenvironmental feedback, and clonal adaptation over time. Resistance mechanisms are known to evolve during treatment, leading to progressive loss of drug efficacy and shifts in tumor biology. This limitation is particularly relevant in hematologic malignancies such as multiple myeloma, where resistance can emerge rapidly due to genomic instability and dynamic changes in the bone marrow microenvironment. A static model, while useful for initial stratification, cannot capture temporal changes in tumor architecture or molecular signaling pathways that underpin acquired resistance. To address this challenge, future extensions of the framework may incorporate longitudinal data from serial biopsies, sequential imaging scans, or liquid biopsies. Recurrent or transformer-based architectures could be applied to model temporal dependencies, enabling prediction of future resistance based on prior treatment response trajectories. Dynamic graph representations may also be used to update molecular-image interactions across time points. Self-supervised pretraining using temporal consistency (contrastive learning across time) may enhance the model’s capacity to track resistance evolution. Integrating time-aware uncertainty modeling could further support real-world deployment in adaptive treatment planning. While the current study does not include temporal data, the modular structure of DSINet + PSGO provides a flexible foundation for longitudinal extensions. Exploring dynamic resistance tracking remains an important direction for future research, especially in diseases characterized by high clonal plasticity and rapid evolution under therapeutic pressure.

Imaging-molecular predictive models, such as the proposed DSINet + PSGO framework, offer increasing potential in the landscape of translational oncology. As precision medicine continues to evolve, integrating high-dimensional histopathological imaging with molecular profiles can support multiple facets of clinical decision-making. One key application is patient stratification, where predictive models can help identify subgroups likely to benefit from specific therapies. For example, accurate prediction of molecularly driven drug sensitivity can support early identification of responders vs. non-responders, enabling more personalized and effective treatment allocation. Another important avenue lies in treatment adaptation. In cancers where resistance mechanisms evolve dynamically, predictive models may be used to monitor phenotypic or molecular shifts over time, guiding timely changes in therapeutic strategy. Although the current study focuses on baseline prediction, future extensions with longitudinal integration (as discussed above) could support dynamic monitoring. Probabilistic output from such models—when calibrated appropriately—can be incorporated into risk-benefit frameworks to inform joint decision-making between clinicians and patients. This supports not only technical performance but also clinical trust and interpretability, both of which are crucial for translational adoption. Positioned within this broader context, the proposed framework provides a foundational step toward multimodal decision support systems in oncology, with potential impact on drug development, response tracking, and individualized therapy planning in diverse cancer populations.

6 Conclusions and future work

In this study, we address the pressing challenge of accurately predicting drug sensitivity in cancer therapy by integrating molecular and imaging data. We propose a novel framework, the dynamic structure-aware imaging network (DSINet), combined with a progressive structure-guided optimization (PSGO) strategy. DSINet is designed to dynamically adapt spatial filters based on local molecular content, utilize attention mechanisms to preserve essential biological structures, and fuse information across multiple resolutions while accounting for uncertainty. The PSGO strategy refines image reconstruction by progressively focusing on regions with high confidence and adaptively restructuring feature graphs to increase robustness against diverse imaging artifacts. Extensive experimental evaluations show that our approach significantly surpasses traditional methods in classifying molecular patterns related to drug sensitivity. The results suggest that our model offers a robust, interpretable, and reliable foundation for advancing personalized cancer therapy, effectively integrating adaptive imaging models with the evolving needs of precision oncology.

However, our approach presents two primary limitations. First, while DSINet and PSGO substantially improve prediction accuracy and robustness, their dependence on high-quality annotated molecular imaging data may restrict applicability in scenarios with limited or noisy labels. Second, although our method showed strong performance in controlled experimental settings, real-world clinical implementation will require further validation to address variations in imaging protocols and patient heterogeneity. In future work, we plan to enhance data efficiency through semi-supervised or transfer learning techniques and to collaborate with clinical partners on prospective studies to confirm the generalizability and practical impact of our framework in diverse clinical environments.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding author.

Author contributions

CQ: Writing – original draft, Writing – review & editing.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Acknowledgments

This is a short text to acknowledge the contributions of specific colleagues, institutions, or agencies that aided the efforts of the authors.

Conflict of interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Chen C-F, Fan Q, and Panda R. Crossvit: Cross-attention multi-scale vision transformer for image classification. IEEE Int Conf Comput Vision. (2021). doi: 10.1109/ICCV48922.2021.00041

2. Maurício J, Domingues I, and Bernardino J. Comparing vision transformers and convolutional neural networks for image classification: A literature review. Appl Sci. (2023). doi: 10.3390/app13095521

3. Hong D, Han Z, Yao J, Gao L, Zhang B, Plaza A, et al. Spectralformer: Rethinking hyperspectral image classification with transformers. IEEE Trans Geosci RemoteSensing. (2021). doi: 10.1109/TGRS.2021.3130716

4. Touvron H, Bojanowski P, Caron M, Cord M, El-Nouby A, Grave E, et al. Resmlp: Feedforward networks for image classification with data-efficient training. IEEE Trans Pattern Anal Mach Intell. (2021). Available online at: https://ieeexplore.ieee.org/abstract/document/9888004.

5. Wang X, Yang S, Zhang J, Wang M, Zhang J, Yang W, et al. Transformer-based unsupervised contrastive learning for histopathological image classification. Med Image Anal. (2022). doi: 10.1016/j.media.2022.102559

6. Tian Y, Wang Y, Krishnan D, Tenenbaum J, and Isola P. Rethinking few-shot image classification: a good embedding is all you need? Eur Conf Comput Vision. (2020). doi: 10.1007/978-3-030-58568-6_16

7. Yang J, Shi R, Wei D, Liu Z, Zhao L, Ke B, et al. Medmnist v2 - a large-scale lightweight benchmark for 2d and 3d biomedical image classification. Sci Data. (2021). Available online at: https://www.nature.com/articles/s41597-022-01721-8.

8. Hong D, Gao L, Yao J, Zhang B, Plaza A, and Chanussot J. Graph convolutional networks for hyperspectral image classification. IEEE Trans Geosci Remote Sens. (2020). doi: 10.1109/TGRS.2020.3015157

9. Sun L, Zhao G, Zheng Y, and Wu Z. Spectral–spatial feature tokenization transformer for hyperspectral image classification. IEEE Trans Geosci Remote Sens. (2022). doi: 10.1109/TGRS.2022.3144158

10. Rao Y, Zhao W, Zhu Z, Lu J, and Zhou J. Global filter networks for image classification. Neural Inf Process Syst. (2021). Available online at: https://proceedings.neurips.cc/paper/2021/hash/07e87c2f4fc7f7c96116d8e2a92790f5-Abstract.html.

11. Mai Z, Li R, Jeong J, Quispe D, Kim HJ, and Sanner S. Online continual learning in image classification: An empirical survey. Neurocomputing. (2021). doi: 10.1016/j.neucom.2021.10.021

12. Azizi S, Mustafa B, Ryan F, Beaver Z, Freyberg J, Deaton J, et al. Big self-supervised models advance medical image classification. In: IEEE international conference on computer vision (2021). Available online at: https://openaccess.thecvf.com/content/ICCV2021/html/Azizi_Big_Self-Supervised_Models_Advance_Medical_Image_Classification_ICCV_2021_paper.html.

13. Li B, Li Y, and Eliceiri K. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. Comput Vision Pattern Recognition. (2020). doi: 10.1109/CVPR46437.2021.01409

14. Bhojanapalli S, Chakrabarti A, Glasner D, Li D, Unterthiner T, and Veit A. Understanding robustness of transformers for image classification. IEEE Int Conf Comput Vision. (2021). doi: 10.1109/ICCV48922.2021.01007

15. Kim HE, Cosa-Linan A, Santhanam N, Jannesari M, Maros M, and Ganslandt T. Transfer learning for medical image classification: a literature review. BMC Med Imaging. (2022). doi: 10.1186/s12880-022-00793-7

16. Zhang C, Cai Y, Lin G, and Shen C. Deepemd: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. Comput Vision Pattern Recognition. (2020). doi: 10.1109/CVPR42600.2020

17. Roy SK, Deria A, Hong D, Rasti B, Plaza A, and Chanussot J. Multimodal fusion transformer for remote sensing image classification. IEEE Trans Geosci Remote Sens. (2022). doi: 10.1109/TGRS.2023.3286826

18. Zhu Y, Zhuang F, Wang J, Ke G, Chen J, Bian J, et al. Deep subdomain adaptation networkfor image classification. IEEE Trans Neural Networks Learn Syst. (2020). Available online at: https://ieeexplore.ieee.org/abstract/document/9085896.

19. Chen L, Li S, Bai Q, Yang J, Jiang S, and Miao Y. Review of image classification algorithms based on convolutional neural networks. Remote Sens. (2021). doi: 10.3390/rs13224712

20. Ashtiani F, Geers AJ, and Aflatouni F. An on-chip photonic deep neural network for image classification. Nature. (2021). doi: 10.1038/s41586-022-04714-0

21. Masana M, Liu X, Twardowski B, Menta M, Bagdanov AD, and van de Weijer J. Class- incremental learning: Survey and performance evaluation on image classification. IEEE Trans Pattern Anal Mach Intell. (2020). Available online at: https://ieeexplore.ieee.org/abstract/document/9915459.

22. Mascarenhas S and Agarwal M. A comparison between vgg16, vgg19 and resnet50 architecture frameworks for image classification. 2021 Int Conf Disruptive Technol Multi-Disciplinary Res Appl (CENTCON). (2021). doi: 10.1109/CENTCON52345.2021.9687944

23. Sheykhmousa M, Mahdianpari M, Ghanbari H, Mohammadimanesh F, Ghamisi P, and Homayouni S. Support vector machine versus random forest for remote sensing image classification: A meta-analysis and systematic review. IEEE J Selected Topics Appl Earth Observations Remote Sens. (2020). doi: 10.1109/JSTARS.2020.3026724

24. Zhang Y, Li W, Sun W, Tao R, and Du Q. Single-source domain expansion network for cross-scene hyperspectral image classification. IEEE Trans Image Process. (2022). doi: 10.1109/TIP.2023.3243853

25. Bansal M, Kumar M, Sachdeva M, and Mittal A. Transfer learning for image classification using vgg19: Caltech-101 image data set. J Ambient Intell Humanized Computing. (2021). doi: 10.1007/s12652-021-03488-z

26. Dai Y and Gao Y. Transmed: Transformers advance multi-modal medical image classification. Diagnostics. (2021). doi: 10.3390/diagnostics11081384

27. Taori R, Dave A, Shankar V, Carlini N, Recht B, and Schmidt L. Measuring robustness to natural distribution shifts in image classification. Neural Inf Process Syst. (2020). Available online at: https://proceedings.neurips.cc/paper/2020/hash/d8330f857a17c53d217014ee776bfd50-Abstract.html.

28. Peng J, Huang Y, Sun W, Chen N, Ning Y, and Du Q. Domain adaptation in remote sensing image classification: A survey. IEEE J Selected Topics Appl Earth Observations Remote Sens. (2022). doi: 10.1109/JSTARS.2022.3220875

29. Bazi Y, Bashmal L, Rahhal MMA, Dayil RA, and Ajlan NA. Vision transformers for remote sensing image classification. Remote Sens. (2021). doi: 10.3390/rs13030516

30. Zheng X, Sun H, Lu X, and Xie W. Rotation-invariant attention network for hyperspectral image classification. IEEE Trans Image Process. (2022). doi: 10.1109/TIP.2022.3177322

31. Brumfield GL, Knoche SM, Doty KR, Larson AC, Poelaert BJ, Coulter DW, et al. Amyloid precursor-like protein 2 expression in macrophages: differentiation and m1/m2 macrophage dynamics. Front Oncol. (2025) 15:1570955. doi: 10.3389/fonc.2025.1570955

32. Cao W, Ma S, Han L, Xing H, Li Y, Jiang Z, et al. Bispecific antibodies in immunotherapy for acute leukemia: latest updates from the 66th annual meeting of the American Society of Hematology. Front Oncol. (2025) 15:1566202. doi: 10.3389/fonc.2025.1566202

33. Sun L, Zhao Q, and Miao L. Combination therapy with oncolytic viruses for lung cancer treatment. Front Oncol. (2025) 15:1524079. doi: 10.3389/fonc.2025.1524079

34. Dequidt P, Bourdon P, Tremblais B, Guillevin C, Gianelli B, Boutet C, et al. Exploring radiologic criteria for glioma grade classification on the brats dataset. IRBM. (2021) 42:407–14. doi: 10.1016/j.irbm.2021.04.003

35. Basheer S, Bhatia S, and Sakri SB. Computational modeling of dementia prediction using deep neural network: analysis on oasis dataset. IEEE Access. (2021) 9:42449–62. doi: 10.1109/ACCESS.2021.3066213

36. Lalitha S and Murugan D. Segmentation and classification of 3d lung tumor diagnoses using convolutional neural networks. In: 2023 second international conference on augmented intelligence and sustainable systems (ICAISS). IEEE (2023). p. 230–8 Available online at: https://ieeexplore.ieee.org/abstract/document/10250625.

37. Kandel I and Castelli M. Improving convolutional neural networks performance for image classification using test time augmentation: a case study using mura dataset. Health Inf Sci Syst. (2021) 9:33. doi: 10.1007/s13755-021-00163-7

38. Dong H, Zhang L, and Zou B. Exploring vision transformers for polarimetric sar image classification. IEEE Trans Geosci Remote Sens. (2022). doi: 10.1109/TGRS.2021.3137383

39. He X, Chen Y, and Lin Z. Spatial-spectral transformer for hyperspectral image classification. Remote Sens. (2021). doi: 10.3390/rs13030498

40. Lanchantin J, Wang T, Ordonez V, and Qi Y. General multi-label image classification with transformers. Comput Vision Pattern Recognition. (2020). doi: 10.1109/CVPR46437.2021.01621

41. Dong Y, Fu Q-A, Yang X, Pang T, Su H, Xiao Z, et al. Benchmarking adversarial robustness on image classification. Comput Vision Pattern Recognition. (2020). doi: 10.1109/CVPR42600.2020

42. Cai L, Gao J, and Zhao D. A review of the application of deep learning in medical image classification and segmentation. Ann Trans Med. (2020). doi: 10.21037/atm.2020.02.44

43. Vermeire T, Brughmans D, Goethals S, de Oliveira RMB, and Martens D. Explainable image classification with evidence counterfactual. Pattern Anal Appl. (2022). doi: 10.1007/s10044-021-01055-y

44. Solimando AG, Malerba E, Leone P, Prete M, Terragna C, Cavo M, et al. Drug resistance in multiple myeloma: soldiers and weapons in the bone marrow niche. Front Oncol. (2022) 12:973836. doi: 10.3389/fonc.2022.973836

Keywords: drug sensitivity prediction, molecular-informed imaging, adaptive imaging model, structure-aware optimization, cancer therapy classification

Citation: Qu C (2026) Molecular-informed image classification for predicting drug sensitivity in cancer therapy. Front. Oncol. 15:1643504. doi: 10.3389/fonc.2025.1643504

Received: 13 June 2025; Accepted: 29 October 2025;
Published: 12 January 2026.

Edited by:

Stefano Trebeschi, The Netherlands Cancer Institute (NKI), Netherlands

Reviewed by:

Shailesh Tripathi, University of Applied Sciences Upper Austria, Austria
Antonio Giovanni Solimando, University of Bari Aldo Moro, Italy

Copyright © 2026 Qu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Chunmei Qu, l5eolzo0@163.com
