ORIGINAL RESEARCH article

Front. Phys., 21 October 2025

Sec. Radiation Detectors and Imaging

Volume 13 - 2025 | https://doi.org/10.3389/fphy.2025.1681254

Multi-subspace mapping and adaptive learning: MMAL-CL for cross-domain few-shot image identification across scenarios

Qian Du, Xingyou Xia, Qilin Liu, Yanfei Lv, Lu Li and Zhuang Miao*
  • Information Research Center of Military Science, Academy of Military Science of the People’s Liberation Army, Beijing, China

Image detection plays a critical role in quality control across manufacturing and healthcare sectors, yet existing methods struggle to meet real-world requirements due to their heavy reliance on large labeled datasets, poor generalization across different domains, and limited adaptability to diverse application scenarios. These limitations significantly hinder the deployment of AI solutions in practical industrial settings where data scarcity and domain variations are common. To address these issues, we propose MMAL-CL, a unified deep learning framework that integrates an Edge Feature Module (EFM) with multi-subspace mapping attention and an Adaptive Deep Learning Module (ADLM) for cross-domain feature decoupling. The EFM extracts translation-invariant features through residual convolution blocks and a novel multi-subspace attention mechanism, enhancing the model’s ability to capture interdependencies between features. The ADLM enables few-shot learning by mixing task-irrelevant auxiliary data with target domain samples and optimizing feature separation via a dual-classifier strategy. Finally, evaluations on five datasets (two industrial and three medical) demonstrate that MMAL-CL achieves 99.7% precision on the NEU-CLS dataset with full data and maintains 71.3% precision with only 20 samples per class, outperforming other methods in few-shot settings. The framework shows remarkable cross-domain generalization capability, with an average 12.8% improvement in F1-score over existing methods. These results highlight MMAL-CL’s potential as a practical solution for image detection that can operate effectively with limited training data while maintaining high accuracy across diverse application scenarios.

1 Introduction

Image identification and classification have emerged as fundamental technologies supporting modern industrial systems and quality control processes [1, 2]. These techniques enable automated identification and analysis of critical features in manufacturing inspection, medical diagnosis, and other application domains [3]. The integration of machine learning with neural networks has significantly advanced this field, driving progress in diverse areas ranging from precision surface defect identification to complex medical image analysis [4–6]. However, as industrial applications diversify, current deep learning-based solutions reveal challenges that demand urgent attention. Although they have achieved good performance under certain conditions, traditional image identification methods rely on manually designed feature extraction, so identification performance is limited by the quality of feature selection and design [7–10]. Compared with these time-consuming and labor-intensive traditional methods, deep learning-based methods have gradually become a research hotspot in the field of image identification due to their robust feature recognition capabilities and high accuracy [11–15].

With the outstanding performance of AlexNet in the 2012 ImageNet competition, deep learning opened a new chapter in the field of image identification [16–20, 27]. Many studies have demonstrated that AI technology has broad application prospects in medical diagnosis, industrial identification, and other fields. For image identification, Yu et al. proposed a RegNet that can identify sewer pipe defects; the model uses dropout to mitigate overfitting and LeakyReLU to further optimize performance [21]. Joon-Hyung et al. summarized the characteristics of industrial PCB images, analyzed the factors that may cause image data changes in the industrial field, and on this basis proposed a convolution-based PCB defect identification method [22]. Said et al. proposed a ResNet model based on transfer learning, which can use pathological images to diagnose whether a patient has breast cancer [23]. Sasank et al. built a model based on a deep residual network that can analyze a patient’s brain health from CT images [24]. Gabriele et al. proposed a weakly supervised deep learning framework based on convolutional networks, which can analyze cancer from whole-slide pathological images [25]. In addition, deep learning technology has received widespread attention due to its ability to achieve performance similar to human experts in a short period of time [26, 27]. Jun et al. proposed a data augmentation and weight allocation method to address insufficient samples and uneven sample distribution [28]. Xiao et al. proposed a model for assisting in the diagnosis of COVID-19 using CT images; the model uses contrastive learning to train the encoder to capture the necessary features on large-scale datasets, thereby reducing the model’s demand for raw data [29]. Tae Keun et al. proposed a feasibility study that improves model performance on small-scale samples through data amplification [30]. Ling et al. proposed DLA-MatchNet, which integrates channel attention, spatial attention, and feature networks so that it can handle small amounts of data samples [54].

Despite their demonstrated potential, current deep learning-based identification methods still have limitations. A primary challenge stems from the substantial data dependence of these models. Variations in imaging devices and acquisition protocols across different facilities often lead to significant domain shifts, resulting in poor generalization performance when applying models to new clinical or industrial settings. While data augmentation strategies can partially mitigate this issue, collecting large-scale annotated datasets remains prohibitively expensive in many real-world scenarios. This is particularly evident in medical imaging, where patient privacy concerns restrict data availability, and in industrial inspection systems, where rare defects are inherently difficult to capture. Furthermore, existing approaches are typically designed for specific application domains, with limited transferability across different identification tasks. This domain specificity necessitates the development of specialized models for each application scenario, significantly increasing deployment costs and complexity. Transfer learning was proposed to alleviate these problems, and although it has shown promise in reducing data requirements, it still relies heavily on the availability of relevant source domain data: transfer learning typically uses semantically similar data to pre-train the model for a target task, yet in practice such data is often difficult to obtain, particularly for novel or rare defect types. More and more visual tasks therefore show a need to reduce the number of training samples [31–34]. Architectural limitations of conventional CNNs present additional challenges. The local receptive fields of convolutional operations constrain the model’s ability to capture long-range feature dependencies, while simply increasing network depth leads to prohibitive computational costs. Recent studies have shown that attention mechanisms can improve feature learning, but most implementations fail to maintain an optimal balance between performance and computational efficiency, especially in resource-constrained industrial environments.

To solve the above problems, this paper proposes a novel deep learning-based image identification method with the following contributions. First, we develop a unified deep learning architecture that integrates multi-scale feature extraction with attention mechanisms, enabling high-precision identification while maintaining computational efficiency. This design effectively balances model performance and resource requirements. Second, the proposed model significantly reduces data requirements through an innovative few-shot learning paradigm. Third, our solution introduces a scenario-adaptive mechanism that automatically adjusts feature representation according to different application domains. This innovation enables seamless deployment across diverse industrial and medical scenarios without architectural modifications. Fourth, we introduce a cross-domain learning mechanism that utilizes task-irrelevant auxiliary data. This component enhances model generalization while optimizing storage and computation costs. Finally, we propose a multi-subspace attention module that overcomes the limitations of conventional CNNs. The principle of the proposed framework is illustrated in Figure 1.

Figure 1
Diagram illustrating an adaptive deep learning module used in factory and hospital scenarios. In the factory scenario, edge devices collect data, leading to the Edge Feature Module, which sends primary features to the adaptive module. This module processes and updates the model, sending detect results back to the factory. In the hospital scenario, a similar process occurs, with diagnosis results sent back to the hospital. The adaptive module uses cloud and network representations.

Figure 1. Workflow of proposed framework.

2 Methods

This section presents the technical details of MMAL-CL, a unified framework designed for multi-scenario image identification in both medical and industrial settings. As illustrated in Figure 1, MMAL-CL mainly includes an Edge Feature Module (EFM) and an Adaptive Deep Learning Module (ADLM). First, the edge devices in each scenario collect the required data, which is then fed into the EFM to obtain primary edge features. The edge features are then sent to the ADLM for scenario allocation and feature identification. Scenario allocation means that the ADLM sends information to the corresponding neural network based on the data source. After obtaining the feature identification results, the ADLM returns the results to the doctor or engineer and updates the network parameters of the EFM as needed. Next, we describe the model construction in detail.

2.1 Edge feature module

The Edge Feature Module is responsible for automatically extracting primary features of the scenario data on each edge device, including pathological images collected from hospitals and steel surface images collected from factories. To make the model easier to converge, we use an auxiliary dataset to initialize its parameters, as shown in Figure 2.

Figure 2
Diagram displaying a neural network architecture with two main modules: the Edge Feature Module and the Classification Module. The Edge Feature Module includes components labeled Conv-2D, BN, MMB, and NL. The Classification Module includes elements labeled dense, FC, and softmax, leading to pre-train classification. Arrows indicate data flow between layers.

Figure 2. EFM parameter initialization schematic diagram (Conv-2D means 2D convolution layer, BN means batch normalization, MMB means multi-subspace mapping block, NL means normalization, FC means fully connected layer; the same color represents the same layer).

We use Mini-ImageNet as the auxiliary dataset for initialization. It is worth noting that the samples used in pre-training are target-independent data, which is consistent with the original intention mentioned above. In the parameter initialization stage, task-independent auxiliary data is sent to the EFM to obtain latent features, and these features are then used by the classification module (CM) to obtain classification results. When the expected classification performance is reached on the auxiliary dataset, the pre-training phase is complete, and the EFM is taken out and embedded into the edge servers and the cloud.

Specifically, the EFM mainly consists of a convolution block and a multi-subspace mapping block (MMB). The convolution block is designed as a residual structure consisting of convolution layers and batch normalization. The residual structure helps the model learn features at different scales and improves gradient flow. After being processed by the residual convolution block, the information is fed into a convolution layer to further refine the required features. Afterwards, the feature map is sent to the MMB, whose principle is shown in Figure 3.
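For concreteness, the residual convolution block can be sketched as follows; the kernel sizes, channel widths, and the 1x1 projection on the skip path are assumptions, since the text only specifies convolution layers, batch normalization, and a residual link.

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """Minimal sketch of the residual convolution block in the EFM (hyper-parameters assumed)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection so the skip connection matches the output channels.
        self.skip = (nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
                     if in_ch != out_ch else nn.Identity())
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual link: features of different scales flow around the convolutional body.
        return self.act(self.body(x) + self.skip(x))
```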

Figure 3
Diagram of a neural network architecture showing feature maps flowing into feature mapping and feature channel components. It includes multi-subspace mapping with Q, K, V, LN, and R-LN layers. Processes involve concatenation, multiplication, softmax, dense layers, and dropout, organized within distinct sections demarcated by dashed lines.

Figure 3. Schematic diagram of the multi-subspace mapping block.

To enable the model to learn feature dependencies, this paper designs a data analysis structure based on self-attention, inspired by the transformer. After obtaining the feature maps from the above convolution layer, the model linearly maps these feature maps. This step allows the model to adjust the input information more flexibly according to the target requirements, so that the model converges more robustly. In addition, to a certain extent, this step can also be regarded as a process of appropriately mixing in noise, which can improve the robustness of the model. After the feature mapping process, the feature maps are converted into feature vectors. At this point, we embed the feature channel encoding into the corresponding feature vector. The feature encoding represents the identity information of the different feature channels, which helps the model learn the interdependence between different features. The embedded feature vector is then sent to the LN layer, where LN refers to layer normalization. The principle is shown in Equation 1.

$LN(efv)=\mathrm{GeLU}\left(\beta_1\frac{efv-\overline{efv}}{\sqrt{\sigma^2+\theta}}+\beta_2\right)$ (1)

Among them, efv represents the above-mentioned embedded feature vector, GeLU(*) is the GeLU activation function, β1 and β2 are learnable parameters, σ² is the variance, θ is a small constant that keeps the denominator meaningful, and $\overline{efv}$ is the mean of efv. As shown in Equation 1, layer normalization is not affected by the data batch size. Compared with classical batch normalization, whose performance is limited by the batch size, LN provides more stable statistics for the model, making its convergence more robust. After this, the corresponding vector is fed into the multi-subspace mapping layer. Q, K, and V are three input channels, each with independent mapping parameters, as shown in Equation 2.

$Q=\theta_Q\,Q,\quad K=\theta_K\,K,\quad V=\theta_V\,V$ (2)

Here, θ_* is the mapping parameter of the respective channel, and Q, K, V are the mapped channel vectors. Independent mapping parameters not only improve the flexibility of the model, but can also be regarded, to a certain extent, as mapping the same input into different feature spaces, which helps improve the robustness of the model. After mapping, the dot product of Q and K is computed, as shown in Equation 3.

$Mul(Q,K)=QK^{T}$ (3)

In order to suppress the variance and avoid the learning difficulty caused by vanishing gradients, we further perform a scaling operation on Equation 3, which corresponds to S(*) in Figure 3. The principle is shown in Equation 4.

$S\left(Mul(Q,K)\right)=\frac{Mul(Q,K)}{\sqrt{O_Q}}$ (4)

Here, $O_Q$ is the dimension of the input vector Q. Combining Equations 3, 4, the role of the dot product operation is to learn the interdependence between features. After being processed by S(*), the softmax function can be used to obtain the weights assigned to different features by the model. Based on these weights, the output of a single subspace is obtained by weighting the feature expression, as shown in Equation 5.

$space(Q,K,V)=\mathrm{softmax}\left(S\left(Mul(Q,K)\right)\right)V$ (5)

Here, space(*) represents the output of a single subspace. By concatenating the outputs of multiple subspaces and assigning corresponding weights, the final output of the multi-subspace mapping layer can be obtained, as shown in Equation 6.

$MML(Q,K,V)=f\left(space_1,space_2,\ldots,space_n\right)\theta_{mml}$ (6)

In the equation, MML(*) is the output of the multi-subspace mapping layer, f(*) represents the concatenation operation, $space_n$ represents the output of the nth subspace, and $\theta_{mml}$ is the corresponding parameter. Because it combines the outputs of multiple feature spaces, the layer is called a multi-subspace mapping layer.

As mentioned above, in order to improve the model’s ability to grasp features of different scales, this paper also designs residual links. As shown in Figure 3, after passing through the multi-subspace mapping layer, the data is also processed by the residual normalization layer, as shown in Equation 7.

$R\_LN(x)=LN\left(x+MML(x)\right)$ (7)

Here, R_LN(*) represents the output of the R-LN layer, and x represents the channel-embedded feature vector. As shown in Figure 3, after being processed by the dense layer and dropout layer, the final output of the MMB is obtained.
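For concreteness, a minimal PyTorch sketch of the MMB is given below, combining the layer normalization of Equation 1, the multi-subspace mapping of Equations 2–6, and the residual normalization of Equation 7. The number of subspaces, hidden width, dropout rate, and the add-then-normalize reading of Equation 7 are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeluLayerNorm(nn.Module):
    """Equation 1: layer normalization over the feature dimension followed by GeLU.
    beta1/beta2 are the learnable scale and shift; theta keeps the denominator non-zero."""
    def __init__(self, dim: int, theta: float = 1e-5):
        super().__init__()
        self.beta1 = nn.Parameter(torch.ones(dim))
        self.beta2 = nn.Parameter(torch.zeros(dim))
        self.theta = theta

    def forward(self, efv: torch.Tensor) -> torch.Tensor:
        mean = efv.mean(dim=-1, keepdim=True)
        var = efv.var(dim=-1, unbiased=False, keepdim=True)
        return F.gelu(self.beta1 * (efv - mean) / torch.sqrt(var + self.theta) + self.beta2)

class MultiSubspaceMappingBlock(nn.Module):
    """Equations 2-7 plus the dense/dropout tail of the MMB (hyper-parameters assumed)."""
    def __init__(self, dim: int, n_subspaces: int = 4, p_drop: float = 0.1):
        super().__init__()
        assert dim % n_subspaces == 0
        self.n, self.dk = n_subspaces, dim // n_subspaces
        self.ln = GeluLayerNorm(dim)
        # Independent mapping parameters theta_Q, theta_K, theta_V (Equation 2).
        self.theta_q = nn.Linear(dim, dim, bias=False)
        self.theta_k = nn.Linear(dim, dim, bias=False)
        self.theta_v = nn.Linear(dim, dim, bias=False)
        self.theta_mml = nn.Linear(dim, dim)          # weighting after concatenation (Equation 6)
        self.r_ln = nn.LayerNorm(dim)                 # residual normalization (Equation 7)
        self.tail = nn.Sequential(nn.Linear(dim, dim), nn.Dropout(p_drop))  # dense + dropout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, dim) -- the channel-embedded feature vectors.
        b, t, d = x.shape
        h = self.ln(x)                                # Equation 1
        q = self.theta_q(h).view(b, t, self.n, self.dk).transpose(1, 2)
        k = self.theta_k(h).view(b, t, self.n, self.dk).transpose(1, 2)
        v = self.theta_v(h).view(b, t, self.n, self.dk).transpose(1, 2)
        # Equations 3-5: dot product, scaling, softmax weighting per subspace.
        attn = F.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)
        spaces = (attn @ v).transpose(1, 2).reshape(b, t, d)   # concatenate the n subspace outputs
        mml = self.theta_mml(spaces)                  # Equation 6
        out = self.r_ln(x + mml)                      # Equation 7, read as LN(x + MML(x))
        return self.tail(out)
```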

After two consecutive MMB stages, the feature vector is normalized and then sent to the CM. As shown in Figure 2, the CM is a classic classification network. In the pre-training stage, the role of the CM is to classify the auxiliary data; by training this classification model, good initial weights are determined for the EFM. The loss function of the CM is cross-entropy. Compared with the mean square error, cross-entropy is not attenuated by the gradient of the activation function, which avoids the vanishing-gradient problem to a certain extent and makes the model easier to converge.
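For illustration, the pre-training stage can be sketched as follows; the optimizer, learning rate, and number of epochs are assumptions, since the text only specifies a cross-entropy objective on the task-independent auxiliary data.

```python
import torch
import torch.nn as nn

def pretrain_efm(efm: nn.Module, cm: nn.Module, loader, epochs: int = 10, lr: float = 1e-3):
    """Sketch of EFM parameter initialization on the auxiliary (Mini-ImageNet) data."""
    params = list(efm.parameters()) + list(cm.parameters())
    opt = torch.optim.Adam(params, lr=lr)             # optimizer choice is an assumption
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:                 # task-independent auxiliary batches
            logits = cm(efm(images))                  # primary features -> class logits
            loss = ce(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return efm                                        # the pre-trained EFM is embedded at the edge/cloud
```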

2.2 Adaptive deep learning module

The Adaptive Deep Learning Module is responsible for using task-independent data to assist the model in learning the target task. This reduces the amount of data required to achieve the expected performance, saves storage space and deployment costs, and improves the model’s generalization ability for multi-center or multi-scenario data, thereby accelerating the further extension of AI-based image identification applications. It is worth noting that the ADLM uses task-independent data, rather than similar data, to assist model learning, which reduces the model’s dependence on original data to a certain extent. The principle of the ADLM is shown in Figure 4.

Figure 4
Flowchart illustrating a data analysis framework. The Task Independent Data is processed through an Edge Feature Module, followed by Dense and BN layers, generating Sample and Source features. These features are classified into Sample Type and Data Source by Sample and Source Classifiers. Transition Data is mixed with Task Data, and further analysis branches continue. The image includes a photo of a goose and a microscopic image as data examples.

Figure 4. Adaptive deep learning module principal diagram.

We hope that the model can use task-independent data to assist itself in learning the target data, but there are large domain differences between data in different fields, which makes model learning difficult. To alleviate this problem, we mix task-independent data and task data to form transition data, which eases the difficulty of bridging domain differences. The mixing principle is shown in Equation 8.

$TRD\left(TD_m,TID_n\right)=\varphi\cdot TD_m+\left(1-\varphi\right)\cdot TID_n$ (8)

Here, TRD is the transition data, $TD_m$ represents a randomly selected sample of the mth class in the task data, $TID_n$ represents a sample of the nth class in the task-independent data, and φ is the mixing parameter, with φ ~ Beta(ε, ε) and ε ∈ (0, +∞). The transition samples only involve mixing between single-class samples and do not involve cross-mixing of multi-class samples.
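A minimal sketch of this mixing step is shown below; the value of ε and batch-level (rather than per-element) mixing are assumptions, since the paper only fixes φ ~ Beta(ε, ε) and the restriction to single-class mixing.

```python
import torch

def make_transition_batch(task_batch: torch.Tensor,
                          independent_batch: torch.Tensor,
                          eps: float = 2.0) -> torch.Tensor:
    """Equation 8: mix single-class task samples with single-class task-independent
    samples using a Beta-distributed coefficient phi."""
    phi = torch.distributions.Beta(eps, eps).sample()          # phi ~ Beta(eps, eps)
    return phi * task_batch + (1.0 - phi) * independent_batch  # no cross-class mixing
```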

As shown in Figure 4, the three types of samples are sent to three branches respectively. The structures of the transition data analysis branch and the task data analysis branch are the same as that of the task-independent data analysis branch. After obtaining the relevant information, the sample features are sent to the sample classifier. The function of this classifier is to identify the type of target data, that is, to output the disease type or part surface defect type. The sample features and source features are fed into the source classifier. The function of this classifier is to identify the source of the data, that is, whether the data comes from the target data or the auxiliary data. The purpose of designing two classifiers is to enable the model to automatically decouple sample features and source features. In other words, this design allows the model to separate, from the task-independent data, features that are similar to the target task (sample features) and features that are not highly correlated with the target task (source features). This is because sample features can determine the sample type and can therefore be considered relevant to the target task. On the contrary, source features can determine the source of the sample, so they carry a large domain bias and differ considerably from the target task. When processing transition data and task data, the model classifies them separately based on the categories of their respective datasets, obtaining classification losses $L_{TD}$ and $L_{TID}$. Because this is a classic classification setting, both losses are cross-entropy. As stated in Equation 8, the transition data is a mixture of the above two types of data, so the loss function of the sample classifier, $L_{SC}$, can be expressed as Equation 9.

$L_{SC}=\varphi\cdot L_{TD}+\left(1-\varphi\right)\cdot L_{TID}$ (9)

The source classifier is responsible for identifying the source of the data. Based on the above, we hope that the sample classifier can separate latent features that are closely related to the target task from the task-irrelevant data, while the source classifier is intended to distinguish domain-specific differences. To achieve this, we designed a special loss function for the source classifier. We want the sample features to be domain-independent, so their goal is to confuse the source classifier. When the classifier takes sample features as input, the labels of all corresponding samples are set to 0.5, and the loss function is shown in Equation 10.

$L_{SOUC\_TD}=\frac{1}{3N}\sum_{x}\left[KL\left(SOUC\left(\hat{y}_{TID}\right),label\right)+KL\left(SOUC\left(\hat{y}_{TRD}\right),label\right)+KL\left(SOUC\left(\hat{y}_{TD}\right),label\right)\right]$ (10)

Here, KL(*) is the KL divergence, N is the sample size, SOUC(*) is the output of the source classifier, label refers to the aforementioned 0.5 label, and $\hat{y}_{TID}$, $\hat{y}_{TRD}$, $\hat{y}_{TD}$ are the outputs of the task-independent data analysis branch, the transition data analysis branch, and the task data analysis branch respectively. When the classifier processes source features, it is expected to distinguish data sources, and the loss function is shown in Equation 11.

$L_{SOUC\_TID}=\frac{1}{3}\left[CE\left(SOUC\left(\hat{y}_{TID}\right),l_{TID}\right)+CE\left(SOUC\left(\hat{y}_{TD}\right),l_{TD}\right)+\varphi\cdot CE\left(SOUC\left(\hat{y}_{TRD}\right),l_{TD}\right)+\left(1-\varphi\right)\cdot CE\left(SOUC\left(\hat{y}_{TRD}\right),l_{TID}\right)\right]$ (11)

Here, CE(*) is the cross-entropy, and $l_{TID}$ and $l_{TD}$ are the data source labels, corresponding to task-independent data and task data respectively. Finally, the overall loss function of the model can be written as Equation 12.

$L=L_{SC}+L_{SOUC\_TD}+L_{SOUC\_TID}$ (12)
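For illustration, the three loss terms can be assembled as in the following sketch; the argument names, the source-label encoding, and the mixup-style reading of Equation 9 on the transition batch are assumptions not fixed by the text.

```python
import torch
import torch.nn.functional as F

def adlm_loss(phi,
              sample_logits_trd, task_labels, tid_labels,               # sample classifier, Eq. 9
              src_on_sample_tid, src_on_sample_trd, src_on_sample_td,   # source classifier on sample features, Eq. 10
              src_on_source_tid, src_on_source_trd, src_on_source_td):  # source classifier on source features, Eq. 11
    """Sketch of the overall ADLM objective (Equations 9-12)."""

    def kl_to_uniform(logits):
        # Sample features should confuse the source classifier: target label is (0.5, 0.5).
        target = torch.full_like(logits, 0.5)
        return F.kl_div(F.log_softmax(logits, dim=-1), target, reduction="batchmean")

    def src_ce(logits, is_tid):
        # Source labels: 0 = task data, 1 = task-independent data (encoding assumed).
        target = torch.full((logits.size(0),), int(is_tid), dtype=torch.long, device=logits.device)
        return F.cross_entropy(logits, target)

    # Equation 9: phi-weighted cross-entropy of the sample classifier on the transition batch.
    l_sc = phi * F.cross_entropy(sample_logits_trd, task_labels) \
        + (1 - phi) * F.cross_entropy(sample_logits_trd, tid_labels)

    # Equation 10: drive the three branches towards the uniform source label.
    l_souc_td = (kl_to_uniform(src_on_sample_tid)
                 + kl_to_uniform(src_on_sample_trd)
                 + kl_to_uniform(src_on_sample_td)) / 3.0

    # Equation 11: source features should separate the data sources.
    l_souc_tid = (src_ce(src_on_source_tid, True)
                  + src_ce(src_on_source_td, False)
                  + phi * src_ce(src_on_source_trd, False)
                  + (1 - phi) * src_ce(src_on_source_trd, True)) / 3.0

    # Equation 12: overall objective.
    return l_sc + l_souc_td + l_souc_tid
```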

Finally, the ADLM returns the recognition results to the terminal devices located in the different scenarios and returns the model parameters to the EFM. Through the above method, the proposed model can adapt to the data analysis needs of various scenarios, including hospitals and factories, using the same auxiliary data. At the same time, this method also reduces the amount of data required by the model, saves deployment costs, improves the model’s generalization ability for multi-center or multi-scenario data, and accelerates the further extension of AI-based image identification applications.

3 Experiment and discussion

In this section, we describe the experimental data and evaluate the performance of the proposed model using precision, recall, and F1-score as evaluation metrics. Additionally, we analyze and discuss the experimental results. The proposed approach is implemented in PyTorch on a workstation with an NVIDIA GeForce RTX 2080 Ti GPU.
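For reference, these metrics can be computed as in the sketch below; the use of scikit-learn and macro averaging over classes are assumptions about the exact computation.

```python
from sklearn.metrics import precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Macro-averaged precision, recall and F1-score over all classes."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"precision": precision, "recall": recall, "f1": f1}
```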

3.1 Data description

In this paper, we use five datasets from different scenarios, including two factory datasets (NEU-CLS and SEVERSTAL) [35, 36] and three hospital datasets [37–39]. In addition, to explore the identification ability of our model on small-scale samples, we extracted 5, 10, and 20 samples per class from all datasets for training; these settings are referred to as 5-data, 10-data, and 20-data respectively. Specifically, 5-data refers to randomly selecting five samples from each class, with the remaining data used as the test set; the extraction methods for 10-data and 20-data are the same. Mini-ImageNet [40] is used as the universal pre-training dataset. Each reported result is the average of five repeated experiments.
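A minimal sketch of how such a k-shot split can be constructed is given below; the data structure and the random-seed handling are assumptions, since the paper only states that k samples per class are drawn and the rest form the test set.

```python
import random
from collections import defaultdict

def sample_k_per_class(dataset, k: int, seed: int = 0):
    """Build a k-shot training split (k = 5, 10, or 20) and use the rest as the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, (_, label) in enumerate(dataset):   # dataset assumed to yield (image, label) pairs
        by_class[label].append(idx)
    train_idx = []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        train_idx.extend(indices[:k])            # k randomly chosen samples per class
    test_idx = [i for i in range(len(dataset)) if i not in set(train_idx)]
    return train_idx, test_idx
```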

3.2 Results and analysis

We demonstrate the performance of our model and conduct extensive comparative experiments. For the factory scenario, the model performance is shown in Table 1 and Figure 5. The confusion matrices are shown in Supplementary Appendix Figures A1–A6. Here, task a is NEU-CLS and task b is SEVERSTAL. The comparison methods include AlexNet [41], VGG19 [42], ResNet50 [43], Inception [44], DPN92 [45], and Transformer [46].

Table 1. The performance of each model in the factory scenario for different sample sizes in tasks a and b, where task a and task b represent NEU-CLS and SEVERSTAL respectively, OURS represents MMAL-CL.

Figure 5
Bar charts comparing F1-score, precision, and recall for different models on Tasks a and b. Models include AlexNet, VGG19, ResNet50, Inception, DPN92, Transformer, and Ours, shown at 5-shot, 10-shot, and 20-shot intervals. Task a is at the top, Task b at the bottom.

Figure 5. Model’s performance in the factory scenario.

From the experimental results, the proposed framework achieves better performance. Even though all comparison methods also use relevant data for pre-training, they still do not exceed the proposed framework. We speculate that the reason for this is that the feature decoupling structure designed in this paper allows the model to mine potentially valuable information in task-irrelevant data to assist its own learning. The performance of the ResNet-based model is higher than that of the classic convolutional network VGG, which illustrates the necessity of the residual structure designed in this paper: the residual structure helps improve the model’s ability to learn features at different scales. The performance of the Transformer is also lower than that of ResNet, which likewise supports the necessity of the feature extraction module designed in this paper, since it can effectively alleviate the Transformer’s difficulty in learning translation invariance. In addition, the performance of ResNet does not exceed that of the proposed model; the reason may be that the MMB can effectively alleviate the difficulty CNNs have in learning feature dependencies. This suggests that models built on a single structure can be further enhanced by combining complementary components [47–49].

For the hospital scenario, we used data from three centers, including five tasks, namely, task a (assessing benign or colon cancer with colon tissue), task b (assessing the lung squamous cell carcinoma, lung adenocarcinoma or benign with lung tissue), task c (assessing the tumor, inflammation or benign with colon tissue), task d (assessing lobular carcinoma, mucinous carcinoma and papillary carcinoma) and task e (assessing adenosis, fibroadenoma, phyllodes tumor, tubular adenoma). The model performance is shown in Table 2 and Figure 6. The confusion matrices are shown in Supplementary Appendix Figures A7–A24. The methods of comparison include PROTO [50], ANIL [51], MAML [52], DN4 [53].

Table 2. The performance of each model in the hospital scenario for different sample sizes in tasks a, b, c, d and e, where OURS represents MMAL-CL.

Figure 6
A series of nine heatmap graphs showing evaluation metrics (F1-score, precision, recall) for different models (OURS, w/o MMB, PROTO, MAML, etc.) across the tasks and three settings (5-shot, 10-shot, 20-shot). Color gradients range from yellow to purple, indicating varying value intensities.

Figure 6. Model performance on tasks a, b, c, d and e in the hospital scenario.

It can be seen from the experimental results that, compared with other methods, the proposed method has advantages in all evaluation indicators. We speculate that this is because models based only on convolution have difficulty mining useful information from task-unrelated data to assist their own learning; therefore, when faced with lightweight data, these models struggle to demonstrate good generalization ability. At the same time, the performance of w/o MMB does not exceed that of the proposed model. Here, w/o MMB refers to replacing the MMB with convolutional layers and using only the feature decoupling strategy proposed in this paper. This experimental phenomenon again shows that the designed MMB can effectively alleviate the difficulty classic convolutional networks have in learning feature dependencies. In addition, compared with other models, this method does not suffer from performance differences across tasks. This also confirms the above viewpoint that the multi-subspace mapping method designed in this paper enables the model to master different forms of feature representation, thereby enhancing the robustness of the model.

Meanwhile, it can be noted that the performance of the model in tasks d and e is limited. We speculate that there may be the following reasons. First, task e involves four pathological categories and task d includes three carcinoma subtypes, whereas task a (two categories) and task b (three less-overlapping categories) have fewer or more distinguishable classes; more categories increase inter-class feature overlap, making it harder for the model to learn discriminative representations. Second, potential subtle class imbalance and finer-grained pathological differences further challenge feature decoupling. Third, the difficulty of obtaining high-quality medical samples is usually high, so image quality is also one of the reasons for this phenomenon. How to further improve the performance of the model is our main future research direction.

The performance of different models in cross-scenario settings is also compared. To express the results concisely, the F1-score was chosen for comparison, as it objectively represents the comprehensive performance of the model. Meanwhile, we assessed model stability by randomly selecting samples and repeating the experiment five times; the results are shown in Table 3. The comparison methods include DLA-MatchNet [54] and LFT [55].

Table 3. Cross-scenario performance validation, where F-Task a and F-Task b represent Tasks a and b in the industrial scenario, and H-Task a and H-Task b represent Tasks a and b in the hospital scenario.

The core challenge of cross-domain few-shot detection is to maintain stable performance when scenarios switch. MMAL-CL achieves this goal through the unified “EFM + ADLM” framework. As shown in the experimental results, the LFT model outperforms traditional models but lags behind DLA-MatchNet and MMAL-CL. This confirms that feature transformation can alleviate domain shift; however, the multi-subspace attention (EFM) and adaptive feature disentanglement (ADLM) proposed in this paper achieve more accurate cross-domain feature alignment. Moreover, this behavior is consistent across both industrial and medical scenarios. Such consistency verifies that the multi-subspace attention of the EFM (which captures cross-scale feature dependencies) and the adaptive feature disentanglement of the ADLM (which separates domain-invariant features) can effectively address scenario-specific challenges without modifying the model architecture. Furthermore, even under the 5-shot setting, where data is most scarce, the proposed method still achieves acceptable performance. This indicates that the method can mine task-related information from task-irrelevant auxiliary data via the ADLM, reduce reliance on target domain samples, and thus exhibit low sensitivity to the number of samples. Additionally, MMAL-CL maintains a low standard deviation in both industrial and medical scenarios, which shows that the proposed method possesses a degree of robust adaptability to different data distributions.

Combining the experimental results of the above two scenarios, our framework shows better performance across multiple application scenarios and multiple data centers. The model is applicable to multiple scenarios, including hospitals and factories, and can achieve acceptable performance with only a small amount of data. The model achieves different performance on different tasks; we speculate that this is due to variations in the quality of data collected by different edge devices. At the same time, the proposed model uses the same auxiliary data for different application scenarios, without having to find different auxiliary data according to task requirements, which further increases the framework’s application potential.

4 Conclusion

This paper proposes MMAL-CL, a novel unified deep learning model for cross-domain image identification that addresses challenges in current systems. First, our cross-domain learning mechanism establishes a principled approach for decoupling domain-invariant and domain-specific features through a dual-pathway design. This enables effective knowledge transfer from task-irrelevant auxiliary data while preventing negative transfer, as demonstrated by consistent performance across medical and industrial testbeds.

Second, the unified feature representation framework achieves significant improvements in both data efficiency and deployment flexibility. The system maintains robust performance with limited training samples through its hybrid attention-convolution feature extractor and adaptive scenario allocation module. In addition, when faced with multiple data centers, the model can still provide auxiliary analysis results stably and efficiently without needing to be retrained after collecting a large amount of data. Meanwhile, the proposed method can use the same task-independent dataset to assist the model in learning target tasks in different scenarios, further paving the way for the development of AI-based devices. The experimental results across multiple datasets and application scenarios (including hospitals and factories) show that our method achieves better performance, which demonstrates that the proposed method has better robustness to a certain extent. The next step is to further improve the model’s ability to mine data features and to explore its auxiliary capabilities in more application scenarios.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

Ethics statement

Ethical approval was not required for the studies involving humans because all biological data used are from publicly available datasets. The studies were conducted in accordance with the local legislation and institutional requirements. The human samples used in this study were acquired from publicly available datasets. Written informed consent to participate in this study was not required from the participants or the participants’ legal guardians/next of kin in accordance with the national legislation and the institutional requirements.

Author contributions

QD: Conceptualization, Validation, Visualization, Writing – original draft, Writing – review and editing. XX: Conceptualization, Validation, Writing – review and editing. QL: Conceptualization, Validation, Writing – review and editing. YL: Conceptualization, Funding acquisition, Validation, Visualization, Writing – review and editing. LL: Conceptualization, Validation, Writing – review and editing. ZM: Conceptualization, Project administration, Validation, Writing – review and editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This paper was supported by National Key Research and Development Program of China (2024YFB3312904).

Acknowledgments

The authors thank the reviewers and editors for their comments and suggestions.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphy.2025.1681254/full#supplementary-material

Abbreviations

DL, Deep Learning; CNN, Convolutional Neural Networks; BN, Batch Normalization; CM, Classification Module; ADLM, Adaptive Deep Learning Module; EFM, Edge Feature Module.

References

1. Bajić F, Job J. Review of chart image detection and classification. Int J Document Anal Recognition (IJDAR) (2023) 26(4):453–74. doi:10.1007/s10032-022-00424-5

2. Liu L, Ouyang W, Wang X, Fieguth P, Chen J, Liu X, et al. Deep learning for generic object detection: a survey. Int J Comput Vis (2020) 128:261–318. doi:10.1007/s11263-019-01247-4

3. Sharma S, Bhatt M, Sharma P. Face recognition system using machine learning algorithm. In: 2020 5th International Conference on Communication and Electronics Systems (ICCES). IEEE (2020). p. 1162–8.

4. Chowdhury A, Kautz E, Yener B, Lewis D. Image driven machine learning methods for microstructure recognition. Comput Mater Sci (2016) 123:176–87. doi:10.1016/j.commatsci.2016.05.034

5. Lai Y. A comparison of traditional machine learning and deep learning in image recognition. J Phys Conf Ser (2019) 1314(1):012148. doi:10.1088/1742-6596/1314/1/012148

6. Ahammed MT, Ghosh S, Ashik MAR. Human and object detection using machine learning algorithm. In: 2022 Trends in Electrical, Electronics, Computer Engineering Conference (TEECCON). IEEE (2022). p. 39–44.

7. Wang P, Fan E, Wang P. Comparative analysis of image classification algorithms based on traditional machine learning and deep learning. Pattern Recognition Lett (2021) 141:61–7. doi:10.1016/j.patrec.2020.07.042

8. Rani P, Kotwal S, Manhas J, Sharma V, Sharma S. Machine learning and deep learning based computational approaches in automatic microorganisms image recognition: methodologies, challenges, and developments. Arch Comput Methods Eng (2022) 29(3):1801–37. doi:10.1007/s11831-021-09639-x

9. O’Mahony N, Campbell S, Carvalho A, Harapanahalli S, Hernandez GV, Krpalkova L, et al. Deep learning vs. traditional computer vision. In: Advances in Computer Vision: Proceedings of the 2019 Computer Vision Conference (CVC), vol 1. Springer International Publishing (2020). p. 128–44.

10. Caroppo A, Leone A, Siciliano P. Comparison between deep learning models and traditional machine learning approaches for facial expression recognition in ageing adults. J Computer Sci Technology (2020) 35:1127–46. doi:10.1007/s11390-020-9665-4

11. Jiang H, Diao Z, Shi T, Zhou Y, Wang F, Hu W, et al. A review of deep learning-based multiple-lesion recognition from medical images: classification, detection and segmentation. Comput Biol Med (2023) 157:106726. doi:10.1016/j.compbiomed.2023.106726

12. Huang Q, Liang G, Dong F. Res2-UNet++: a deep learning image post-processing method for electrical resistance tomography. Meas Sci Technology (2024) 35(10):105403. doi:10.1088/1361-6501/ad57e0

13. Li S, Zhao X. Pixel-level detection and measurement of concrete crack using faster region-based convolutional neural network and morphological feature extraction. Meas Sci Technology (2021) 32(6):065010. doi:10.1088/1361-6501/abb274

14. Tsai D-M, Huang Y-Q, Chiu W-Y. Deep learning from imbalanced data for automatic defect detection in multicrystalline solar wafer images. Meas Sci Technology (2021) 32(12):124003. doi:10.1088/1361-6501/ac1fbf

15. Huang SC, Shen L, Lungren MP, Yeung S. GLoRIA: a multimodal global-local representation learning framework for label-efficient medical image recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2021).

16. Zhang Q, Xu Y, Zhang J, Tao D. ViTAEv2: vision transformer advanced by exploring inductive bias for image recognition and beyond. Int J Computer Vis (2023) 131(5):1141–62. doi:10.1007/s11263-022-01739-w

17. Guo M-H, Liu ZN, Mu TJ, Hu SM. Beyond self-attention: external attention using two linear layers for visual tasks. IEEE Trans Pattern Anal Machine Intelligence (2022) 45(5):5436–47. doi:10.1109/TPAMI.2022.3211006

18. Ma J, Tang L, Xu M, Zhang H, Xiao G. STDFusionNet: an infrared and visible image fusion network based on salient target detection. IEEE Trans Instrumentation Meas (2021) 70:1–13. doi:10.1109/tim.2021.3075747

19. Bansal M, Kumar M, Sachdeva M, Mittal A. Transfer learning for image classification using VGG19: Caltech-101 image data set. J Ambient Intelligence Humanized Comput (2023) 14:3609–20. doi:10.1007/s12652-021-03488-z

20. Luo X, Wen W, Wang J, Xu S, Gao Y, Huang J. Health classification of Meibomian gland images using keratography 5M based on AlexNet model. Computer Methods Programs Biomed (2022) 219:106742. doi:10.1016/j.cmpb.2022.106742

21. Chen Y, Sharifuzzaman SASM, Wang HX, Li YF, Dang LM, Song HK, et al. Deep learning based underground sewer defect classification using a modified RegNet. CMC-Computers Mater Continua (2023) 75(3):5451–69.

22. Tao X, Gong X, Zhang X, Yan S, Adak C. Deep learning for unsupervised anomaly localization in industrial images: a survey. IEEE Trans Instrumentation Meas (2022) 71:1–21. doi:10.1109/tim.2022.3196436

23. Boumaraf S, Liu XB, Zheng ZS, Ma XH, Ferkous C. A new transfer learning based approach to magnification dependent and independent classification of breast cancer in histopathological images. Biomed Signal Process Control (2021) 66:102192. doi:10.1016/j.bspc.2020.102192

24. Chilamkurthy S, Ghosh R, Tanamala S, Biviji M, Campeau NG, Venugopal VK, et al. Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study. Lancet (2018) 392(10162):2388–96. doi:10.1016/s0140-6736(18)31645-3

25. Campanella G, Hanna MG, Geneslaw L, Miraflor A, Silva VWK, Busam KJ, et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat Med (2019) 25(8):1301–9. doi:10.1038/s41591-019-0508-1

26. Wang JQ, Zhang Q, Liu GH. DRCDCT-net: a steel surface defect diagnosis method based on a dual-route cross-domain convolution-transformer network. Meas Sci Technology (2022) 33(9):095404. doi:10.1088/1361-6501/ac6fb2

27. Zhang Q, Wang JQ, Liu GH, Zhang WJ. Artificial intelligence can use physiological parameters to optimize treatment strategies and predict clinical deterioration of sepsis in ICU. Physiol Meas (2023) 14(1):015003. doi:10.1088/1361-6579/acb03b

28. Qian Q, Yi Q, Luo J, Wang Y, Wu F. Deep discriminative transfer learning network for cross-machine fault diagnosis. Mech Syst Signal Process (2023) 186:109884. doi:10.1016/j.ymssp.2022.109884

29. Chen XC, Yao LN, Zhou T, Dong JM, Zhang Y. Momentum contrastive learning for few-shot COVID-19 diagnosis from chest CT images. Pattern Recognition (2021) 113:107826. doi:10.1016/j.patcog.2021.107826

30. Yoo TK, Choi JY, Kim HK. Feasibility study to improve deep learning in OCT diagnosis of rare retinal diseases with few-shot classification. Med Biol Eng Comput (2021) 59(2):401–15. doi:10.1007/s11517-021-02321-1

31. Zhang R, Lu L, Qi Z, Zhang J, Xu L, Zhang B, et al. Differential feature awareness network within antagonistic learning for infrared-visible object detection. IEEE Trans Circuits Syst Video Technology (2024) 34(8):6735–48.

32. Zhang R, Xu L, Yu Z, Shi Y, Mu C, Xu M. Deep-IRTarget: an automatic target detector in infrared imagery using dual-domain feature extraction and allocation. IEEE Trans Multimedia (2021) 24:1735–49. doi:10.1109/tmm.2021.3070138

33. Zhang R, Yang B, Xu L, Huang Y, Xu X, Zhang Q, et al. A benchmark and frequency compression method for infrared few-shot object detection. IEEE Trans Geosci Remote Sensing (2025) 63:5001711. doi:10.1109/tgrs.2025.3540945

34. Zhang R, Tan J, Cao Z, Xu L, Liu Y, Si L, et al. Part-aware correlation networks for few-shot learning. IEEE Trans Multimedia (2024) 26:9527–38. doi:10.1109/tmm.2024.3394681

35. Song KC, Yan YH. A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Appl Surf Sci (2013) 285:858–64. doi:10.1016/j.apsusc.2013.09.002

36. Grishin A, Bardintsev, inversion and Oleg. Severstal steel defect detection dataset (2019).

37. Borkowski AA, Bui MM, Thomas LB, Wilson CP, DeLand LA, Mastorides SM. Lung and colon cancer histopathological image dataset (LC25000). arXiv preprint (2019). doi:10.48550/arXiv.1912.12142

38. Javed S, Mahmood A, Fraz MM, Koohbanani NA, Benes K, Tsang YW, et al. Cellular community detection for tissue phenotyping in colorectal cancer histology images. Med Image Anal (2020) 63:101696. doi:10.1016/j.media.2020.101696

39. Spanhol FA, Oliveira LS, Petitjean C, Heutte L. A dataset for breast cancer histopathological image classification. IEEE Trans Biomed Eng (2016) 63(7):1455–62. doi:10.1109/tbme.2015.2496264

40. Vinyals O, Blundell C, Lillicrap T, Kavukcuoglu K, Wierstra D. Matching networks for one shot learning. In: International Conference on Neural Information Processing Systems (NIPS) (2016).

41. Han X, Zhong Y, Cao L, Zhang L. Pre-trained AlexNet architecture with pyramid pooling and supervision for high spatial resolution remote sensing image scene classification. Remote Sens (2017) 9:848. doi:10.3390/rs9080848

42. Zhang H, Han L, Chen K, Peng Y, Lin J. Diagnostic efficiency of the breast ultrasound computer-aided prediction model based on convolutional neural network in breast cancer. J Digital Imaging (2020) 33:1218–23. doi:10.1007/s10278-020-00357-7

43. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016). p. 770–8.

44. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016). p. 2818–26.

45. Chen Y, Li J, Xiao H, Hu X, Yang J. Dual path networks. arXiv preprint arXiv:1707.01629 (2017). doi:10.48550/arXiv.1707.01629

46. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. arXiv preprint (2017).

47. Bao S, Bao H, Jin M, Ruan Y, Shi Y, Yang C. Prediction of bundle-conductor ampacity based on transformer-LSTM. Front Phys (2025) 13:1603239. doi:10.3389/fphy.2025.1603239

48. Wang C, Zhang X, Gao W, Wang F, Lu J, Yan Z. Accurate earthquake and mining tremor identification via a CEEMDAN-LSTM framework. Front Phys (2025) 13:1510629. doi:10.3389/fphy.2025.1510629

49. Zhao J, Li J, Li Z, Ma Z. Transformer network enhanced by dual convolutional neural network and cross-attention for wheelset bearing fault diagnosis. Front Phys (2025) 13:1546620. doi:10.3389/fphy.2025.1546620

50. Snell J, Swersky K, Zemel R. Prototypical networks for few-shot learning. In: Proceedings of the Conference on Neural Information Processing Systems (NeurIPS) (2017). p. 4077–87.

51. Raghu A, Raghu M, Bengio S, Vinyals O. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In: International Conference on Learning Representations (ICLR) (2020). doi:10.48550/arXiv.1909.09157

52. Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the International Conference on Machine Learning (ICML) (2017). p. 1126–35.

53. Li W, Wang L, Xu J, Huo J, Gao Y, Luo J. Revisiting local descriptor based image-to-class measure for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019). p. 7260–8.

54. Li L, Han J, Yao X, Cheng G, Guo L. DLA-MatchNet for few-shot remote sensing image scene classification. IEEE Trans Geosci Remote Sensing (2021) 59(9):7844–53. doi:10.1109/tgrs.2020.3033336

55. Tseng H-Y, Lee H-Y, Huang J-B, Yang M-H. Cross-domain few-shot classification via learned feature-wise transformation. In: International Conference on Learning Representations (ICLR) (2020).

Keywords: deep learning, image identification, multi-subspace attention, cross-domain learning, few-shot learning, extended cross entropy loss

Citation: Du Q, Xia X, Liu Q, Lv Y, Li L and Miao Z (2025) Multi-subspace mapping and adaptive learning: MMAL-CL for cross-domain few-shot image identification across scenarios. Front. Phys. 13:1681254. doi: 10.3389/fphy.2025.1681254

Received: 15 August 2025; Accepted: 06 October 2025;
Published: 21 October 2025.

Edited by:

Feng Xu, Nanjing University of Posts and Telecommunications, China

Reviewed by:

Min Li, Suqian University, China
Ruiheng Zhang, Beijing Institute of Technology, China
Xiaojun Hu, Nanjing Vocational College of Information Technology, China

Copyright © 2025 Du, Xia, Liu, Lv, Li and Miao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Zhuang Miao, 13910369318@163.com
