<?xml version="1.0" encoding="utf-8"?>
    <rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
      <channel>
        <title>Frontiers in Computer Science | Computer Vision section | New and Recent Articles</title>
        <link>https://www.frontiersin.org/journals/computer-science/sections/computer-vision</link>
        <description>RSS Feed for Computer Vision section in the Frontiers in Computer Science journal | New and Recent Articles</description>
        <language>en-us</language>
        <generator>Frontiers Feed Generator,version:1</generator>
        <pubDate>2026-04-27T22:15:22.716+00:00</pubDate>
        <ttl>60</ttl>
        <item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2026.1824259</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2026.1824259</link>
        <title><![CDATA[Correction: An improved contrastive learning loss function for automated clock-drawing test grading with implications for cognitive impairment screening]]></title>
        <pubDate>2026-03-31T00:00:00Z</pubDate>
        <category>Correction</category>
        <author>Ning Liu</author><author>Qian Sun</author><author>Xiaoyin Xu</author><author>Haifeng Mou</author><author>Xinhai Liao</author><author>Bokai Rong</author><author>Lingxing Wang</author>
        <description></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2026.1753764</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2026.1753764</link>
        <title><![CDATA[Single-item training for multi-dish recognition: a class-agnostic framework for Indian food platters]]></title>
        <pubDate>2026-03-10T00:00:00Z</pubDate>
        <category>Original Research</category>
        <author>Keerthi Garisa</author><author>Ravi Kant Kumar</author><author>Priyanka Singh</author>
        <description><![CDATA[Accurate dietary assessment is increasingly dependent on automated food recognition systems capable of operating effectively in real-world environments. While most vision-based models perform well on single-item datasets, their performance degrades significantly in complex multi-dish settings. This scenario is particularly evident in Indian thalis, which contain overlapping food items with diverse textures and high visual variability. These challenges make large-scale multi-dish annotation expensive and limit practical deployment of such systems. To address this gap, we propose a novel two-stage framework that enables recognition of multi-dish food images using only single-item training data. The proposed pipeline incorporates class-agnostic segmentation using the Segment Anything Model (SAM), followed by classification with an SE-DenseNet121 network optimized via Optuna-based hyperparameter tuning. The model is trained exclusively on single-item annotated images and generalizes to multi-item thali images at inference time through a segmentation-classification mapping strategy. This zero-shot segmentation approach eliminates the need for multi-dish ground-truth annotations. As a result, the annotation complexity is reduced from O(N × M) to O(N). The proposed system achieves an accuracy of 97.48% on single-item food image classification and demonstrates strong applicability to multi-dish Indian thali images through region-wise inference on segmented food items. Furthermore, the framework is computationally efficient, achieving 2× faster inference with a latency of 1.58 ms while using only 70% of the parameters required by transformer-based baselines. It operates with low computational cost (2.90 GFLOPs), significantly fewer parameters (8.06M compared to 26.69–86.77M), and delivers higher throughput (633.32 samples/s). These results demonstrate that the proposed method provides a scalable and practical solution for real-time dietary assessment applications.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2026.1690044</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2026.1690044</link>
        <title><![CDATA[An improved contrastive learning loss function for automated clock-drawing test grading with implications for cognitive impairment screening]]></title>
        <pubDate>2026-02-20T00:00:00Z</pubDate>
        <category>Original Research</category>
        <author>Ning Liu</author><author>Qian Sun</author><author>Xiaoyin Xu</author><author>Haifeng Mou</author><author>Xinhai Liao</author><author>Bokai Rong</author><author>Lingxing Wang</author>
        <description><![CDATA[Contrastive learning has been attracting much interest in recent years for its ability to train without labeled data. An important factor in its success is the loss function, which guides the search for prominent features that separate the positive and negative classes. The triplet loss function is widely used in contrastive learning, in which the objective is to attract a pair of positive instances while pushing away a negative instance from the anchor instance, where one of the positive instances is often an augmented version of the anchor. To improve the performance of contrastive learning in automated Clock-Drawing Test (CDT) grading, this paper proposes a more comprehensive triplet loss function that aims to not only keep the distance between the anchor and a positive instance small and the distance between the anchor and a negative instance large, but also keep the distance between the positive and negative instances large. Experimental results show that the improved loss function significantly improves the model’s accuracy, precision, recall, and F1-score by 3–5% on both CIFAR-10 and CDT datasets, providing a new method for improving the accuracy of automatic CDT scoring and early detection of cognitive impairments.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2026.1763780</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2026.1763780</link>
        <title><![CDATA[PrecisionMicro-DETR: enhancing small pulmonary nodule detection in CT scans with multi-scale feature fusion and lightweight design]]></title>
        <pubDate>2026-02-11T00:00:00Z</pubDate>
        <category>Methods</category>
        <author>Jianle Chen</author><author>Jianyu Zhu</author><author>YuYan Lin</author><author>Fuqin Deng</author><author>Lanhui Fu</author><author>Huilian Liao</author>
        <description><![CDATA[To address the common issue of insufficient accuracy in existing detection models when dealing with morphologically complex and minute pulmonary nodules, this study proposes an enhanced detection model called PrecisionMicro-DETR based on the RT-DETR architecture. The model introduces a feature enhancement fusion module in the detection head that strengthens the integration of small-target features (SSTF), improving feature extraction for subtle structures. It also incorporates a Modulation Fusion Module (MFM) to effectively improve discriminative performance in areas with blurred boundaries between lesions and normal tissues. Additionally, a lightweight neck network based on SNI-GSConvE is introduced to optimize computational load while maintaining high accuracy. Experimental evaluation shows that PrecisionMicro-DETR achieves a mean average precision (mAP) of 94.9% on the publicly available Tianchi dataset. Its robustness and generalization ability in real diagnostic environments are further validated through clinical CT images from hospital PACS systems. This study provides a high-precision and efficient solution for CT pulmonary nodule detection, contributing positively to advancing the clinical application of intelligent assisted diagnostic systems.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2026.1721892</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2026.1721892</link>
        <title><![CDATA[Adaptive self-attention for enhanced segmentation of adult gliomas in multi-modal MRI]]></title>
        <pubDate>2026-02-05T00:00:00Z</pubDate>
        <category>Original Research</category>
        <author>Evan P. Savaria</author><author>Jiangwen Sun</author>
        <description><![CDATA[Every year there are an estimated 80,000–90,000 new glioma cases, highlighting the need for reliable imaging-based decision support. Although deep learning has improved tumor sub-region segmentation, many state-of-the-art models fail to fully capture complementary information across T1, T1Gd, T2, and FLAIR MRI modalities and often operate as “black boxes,” limiting physician trust when precise delineation is critical for surgical planning, radiation targeting, and treatment monitoring. To address these limitations, we propose AIMS, an Adaptive Integrated Multi-Modal Segmentation framework that maintains modality-specific feature streams and employs adaptive self-attention within a hierarchical CNN-Transformer architecture to prioritize and fuse multi-modal MRI features. We evaluated AIMS on the BraTS 2019 adult glioma dataset using five-fold cross-validation and compared it against strong hybrid baselines with paired statistical testing; generalization was assessed on an independent BraTS 2021 cohort without fine-tuning. AIMS achieved high ensemble Dice Similarity Coefficients of 0.936 for enhancing tumor, 0.942 for tumor core, and 0.931 for whole tumor on BraTS 2019, with statistically significant improvements over competing methods, and maintained strong performance on BraTS 2021 despite protocol and scanner variability. Finally, Grad-CAM-based explanations applied to adaptive attention and fusion layers, together with quantitative sanity checks, provided modality-aware and spatially meaningful visualizations that support clinical interpretation. By improving both segmentation accuracy and model transparency relative to strong baselines, AIMS advances multi-modal glioma segmentation and strengthens human–machine teaming by enabling faster, clinician-aligned tumor delineation without sacrificing reliability.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2025.1639421</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2025.1639421</link>
        <title><![CDATA[Using domain adaptation and transfer learning techniques to enhance performance across multiple datasets in COVID-19 detection]]></title>
        <pubDate>2026-01-22T00:00:00Z</pubDate>
        <category>Original Research</category>
        <author>Vengai Musanga</author><author>Colin Chibaya</author><author>Serestina Viriri</author>
        <description><![CDATA[This study presents a hybrid neuro-symbolic framework for COVID-19 detection in chest CT that combines multiple deep learning architectures with rule-based reasoning and domain-adversarial adaptation. By aligning features across four heterogeneous public datasets, the system maintains high, site-independent performance (average accuracy = 97.7%, AUC-ROC = 0.996) without retraining. Symbolic rules and Grad-CAM visualizations provide clinician-level interpretability, achieving near-perfect agreement with board-certified radiologists (κ = 0.89). Real-time inference (23.4 FPS) and low cloud latency (1.7 s) meet hospital PACS throughput requirements. Additionally, the framework predicts key treatment outcomes, such as intensive care unit (ICU) admission risk and steroid responsiveness, using retrospective EHR data. Together, these results demonstrate a scalable, explainable solution that addresses cross-institutional generalization and clinical acceptance challenges in AI-driven COVID-19 diagnosis.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2025.1692523</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2025.1692523</link>
        <title><![CDATA[DeepGeoFusion: personalized facial beauty prediction through geometric-visual fusion]]></title>
        <pubDate>2026-01-15T00:00:00Z</pubDate>
        <category>Original Research</category>
        <author>Kunwei Wang</author><author>Yanzhi Li</author><author>Dong Huang</author><author>Junmei Feng</author><author>Xiaoyi Feng</author>
        <description><![CDATA[Introduction: Personalized facial beauty prediction is a critical advancement beyond population-level models, with transformative applications in aesthetic surgery planning and user-centric recommendation systems. However, contemporary methods face limitations in modeling aesthetically sensitive facial regions, fusing heterogeneous geometric and visual features, and reducing the extensive annotation dependency of personalization. Methods: We propose DeepGeoFusion, a novel framework that synergizes Vision Mamba-extracted global visual features with anatomically constrained facial graphs (constructed from 86 landmarks via Delaunay triangulation), using the Graph Node Attention Projection Fusion (GNAPF) block for cross-modal alignment and a lightweight adaptation mechanism that generates personalized preference vectors from 10 seed images via confidence-gated optimization. Results: Extensive experiments on SCUT-FBP5500 demonstrate statistically significant improvements in personalized prediction accuracy and robust performance across genders and ethnicities compared to state-of-the-art methods. Discussion: DeepGeoFusion effectively addresses key limitations of existing methods by integrating complementary geometric and visual features, enabling efficient personalization with minimal annotation and highlighting practical value for aesthetics-related applications requiring personalized assessments.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2025.1714394</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2025.1714394</link>
        <title><![CDATA[Toward real-time emotion recognition in fog computing-based systems: leveraging interpretable PCA_CNN, YOLO with self-attention mechanism]]></title>
        <pubDate>2026-01-14T00:00:00Z</pubDate>
        <category>Original Research</category>
        <author>Nora EL Rashidy</author><author>Eman Allogmani</author><author>Esraa Hassan</author><author>Khaled Alnowaiser</author><author>Hela Elmannai</author><author>Zainab H. Ali</author>
        <description><![CDATA[Emotion estimation from facial expression analysis has been extensively examined in computer science, and classifying expressions depends on appropriate facial features and their dynamics. Despite promising accuracy under controlled and favorable conditions, processing faces acquired at a distance, which entails low-quality images, still incurs a substantial performance reduction. The primary objective of this study is to introduce a fog-computing-based real-time emotion recognition system, developed to track and observe human emotional states in real time. This paper provides a comprehensive integration of PCA-based feature selection with a specific version of YOLO (YOLOv8), in addition to spatial attention, for real-time recognition. The developed system demonstrates superior edge-deployment capabilities compared to existing approaches. The proposed model is compared with the CNN_PCA hybrid model, in which Principal Component Analysis (PCA) is first employed as a dimension-reduction tool, focusing on the most informative characteristics during training, and a CNN then serves as the classification layer. The proposed system's performance is assessed on a dataset of 35,888 facial photos classified into seven classes: anger, fear, happiness, neutral, sadness, surprise, and disgust. The constructed model surpasses established pre-trained models, such as VGG, ResNet, and MobileNet, on different evaluation metrics. The PCA_CNN model achieved accuracy, precision, recall, and Area Under the Curve (AUC) scores of 0.936, 0.971, 0.843, 0.871, and 0.943, while the YOLOv8-with-attention model achieved 0.986, 0.902, 0.941, and 0.952. Additionally, the model exhibits significantly faster processing than other pre-trained models, completing computations in just 610 seconds. Extensive testing on additional datasets consistently yields promising results, further validating the efficiency and effectiveness of the developed model in real-time emotion recognition for advancing affective computing applications.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2025.1700167</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2025.1700167</link>
        <title><![CDATA[EAC-YOLO: a surface damage identification method of lightweight membrane structure based on improved YOLO11]]></title>
        <pubDate>2026-01-12T00:00:00Z</pubDate>
        <category>Original Research</category>
        <author>Zihang Yin</author><author>Limei Zhang</author><author>Huarong Liu</author><author>Qiuyue Du</author><author>Chongchong Yu</author>
        <description><![CDATA[Various types of surface damage can harm membrane structures, and traditional manual inspection methods are inefficient and prone to missed detections and false alarms. At the same time, current mainstream detection algorithms are highly complex, which hinders deployment on resource-constrained devices. To achieve automatic identification of typical surface damage in membrane structures, we construct a dataset comprising five common damage types and propose a lightweight identification algorithm for membrane structure surface damage, EAC-YOLO. First, the SPPF module is reconstructed and the ECA lightweight attention mechanism is introduced to enhance the model’s ability to distinguish easily confused features. Second, ADown is introduced to replace the original down-sampling method, improving the retention of multi-scale damage features. Finally, the CGBlock and C3k2 modules are combined and reconstructed in the neck network to reduce interference from background factors and capture more features of the damage and its surrounding environment. Experimental results on the established dataset show that the improved model reaches an mAP50 of 87.5%, while its parameter count, computational cost, and model size are reduced by approximately 28%, 25%, and 28%, respectively, compared with the original model, demonstrating the advantages of a small size and high accuracy.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2025.1744581</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2025.1744581</link>
        <title><![CDATA[Editorial: Foundation models for healthcare: innovations in generative AI, computer vision, language models, and multimodal systems]]></title>
        <pubDate>2025-11-26T00:00:00Z</pubDate>
        <category>Editorial</category>
        <author>Sokratis Makrogiannis</author>
        <description></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2025.1626359</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2025.1626359</link>
        <title><![CDATA[Segments-aware universal adversarial perturbations purification on 3D point cloud classifiers]]></title>
        <pubDate>2025-11-25T00:00:00Z</pubDate>
        <category>Original Research</category>
        <author>Yang Gao</author><author>Xianrui Chang</author><author>Haoran Li</author><author>Jian Xu</author>
        <description><![CDATA[Introduction: 3D point cloud classifiers, while powerful for representing real-world objects and environments, are vulnerable to adversarial perturbations, particularly Universal Adversarial Perturbations (UAPs). These UAPs pose significant security threats due to their input-agnostic nature. Current purification methods exhibit critical limitations: they typically operate independently of the target classifier and treat perturbations as isolated points without considering the coherent, structural nature of UAPs in 3D point clouds (such as outlier-like shapes with continuous curvature). This fundamental oversight limits their effectiveness, primarily because distinguishing between genuine geometric features and structured adversarial patterns presents a significant challenge. Methods: We propose a novel purification framework that leverages model interpretability to identify and remove adversarial regions in a holistic manner. Our approach uniquely identifies influential regions within adversarial samples that maximally impact the classifier's predictions. Recognizing that UAPs often manifest as structured segments rather than random points, we employ graph wavelet transforms to isolate suspicious curvature segments. These identified segments undergo a transplantation test where they are transferred to clean samples; segments are classified as adversarial if this transfer consistently induces misclassification. The identified adversarial regions are then removed to sanitize the point cloud. This model-guided, structure-aware approach treats UAPs as coherent structures rather than isolated perturbations. Results: We conducted extensive experiments on two public 3D point cloud datasets using four different state-of-the-art classifiers. Our framework demonstrated remarkable improvements in robustness against various UAP attacks compared to existing purification methods. The results show significant accuracy recovery rates after purification, with consistent performance across different classifier architectures and attack methods. Our method particularly excels at preserving genuine geometric features while removing adversarial structures, maintaining high classification accuracy on clean samples while effectively neutralizing UAP threats. Discussion: Our findings demonstrate that considering the structural nature of UAPs and leveraging model interpretability are crucial for effective defense. Unlike previous point-wise approaches, our framework's ability to identify and process coherent adversarial segments addresses the fundamental limitation in current purification methods. The transplantation test provides a reliable mechanism to distinguish between legitimate features and adversarial artifacts. This work highlights the importance of model-guided purification strategies and opens new directions for defending geometric deep learning systems against structured adversarial attacks. Future work could extend this approach to other geometric data representations and explore adaptive defense mechanisms against evolving attack strategies.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2025.1542813</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2025.1542813</link>
        <title><![CDATA[RWAFormer: a lightweight road LiDAR point cloud segmentation network based on transformer]]></title>
        <pubDate>2025-11-06T00:00:00Z</pubDate>
        <category>Original Research</category>
        <author>Zirui Li</author><author>Lei Chen</author><author>Ying Liu</author><author>Shuang Zhao</author><author>Qinghe Guan</author>
        <description><![CDATA[Point cloud semantic segmentation technology for road scenes plays an important role in the field of autonomous driving. However, accurate semantic segmentation of large-scale and non-uniformly dense LiDAR road point clouds still faces severe challenges. To this end, this paper proposes a road point cloud semantic segmentation algorithm called RWAFormer. First, a sparse tensor feature encoding module (STFE) is introduced to enhance the network’s ability to extract local features of point clouds. Second, a radial window attention module (RWA) is designed to dynamically select the neighborhood window size according to the distance of the point cloud data from the center point, effectively aggregating information from long-distance sparse point clouds to adjacent dense areas and significantly improving the segmentation of long-distance point clouds. Experimental results show that our method achieves a mean intersection over union (mIoU) of 75.3% and 82.0% on the SemanticKITTI and nuScenes datasets, respectively, and an accuracy (Acc) of 94.5% and 97.4%. These results validate the effectiveness and superiority of RWAFormer in road point cloud semantic segmentation.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2025.1658556</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2025.1658556</link>
        <title><![CDATA[Optimized encoder-based transformers for improved local and global integration in railway image classification]]></title>
        <pubDate>2025-11-05T00:00:00Z</pubDate>
        <category>Original Research</category>
        <author>Lilan Li</author><author>Xuemei Zhan</author><author>TianTian Wu</author><author>Hua Ma</author>
        <description><![CDATA[Railway image classification (RIC) represents a critical application in railway infrastructure monitoring, involving the analysis of hyperspectral datasets with complex spatial-spectral relationships unique to railway environments. Nevertheless, Transformer-based methodologies for RIC face obstacles pertaining to the extraction of local features and the efficiency of training processes. To address these challenges, we introduce the Pure Transformer Network (PTN), an entirely Transformer-centric framework tailored for the effective execution of RIC tasks. Our approach improves the amalgamation of local and global data within railway images by utilizing a Patch Embedding Transformer (PET) module that employs an “unfold + attention + fold” mechanism in conjunction with a Transformer module that incorporates relative attention. The PET module harnesses attention mechanisms to replicate convolutional operations, enabling adaptive receptive fields for varying spatial patterns in railway infrastructure, thus circumventing the constraints imposed by fixed convolutional kernels. Additionally, we propose a Memory Efficient Algorithm that achieves a 35% reduction in training time while preserving accuracy. Thorough assessments conducted on four hyperspectral railway image datasets validate the PTN's exceptional performance, demonstrating superior accuracy compared to existing CNN- and Transformer-based baselines.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2025.1613648</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2025.1613648</link>
        <title><![CDATA[Improving remote sensing scene classification with data augmentation techniques to mitigate class imbalance]]></title>
        <pubDate>2025-10-08T00:00:00Z</pubDate>
        <category>Original Research</category>
        <author>Ping Wang</author><author>Xin Zhao</author><author>Yuanhui Chen</author><author>Lili Zhan</author>
        <description><![CDATA[High-resolution remote sensing imagery is a powerful tool that provides rich information about ground objects. However, conventional methods often fail to achieve satisfactory results for complex urban scene classification: they cannot meet the requirements of high-accuracy remote sensing image scene classification (RSSC) and are hindered by challenges such as limited labeled samples and class imbalance, which may lead to classification bias in classifiers. In contrast, deep learning-based RSSC represents an important approach to understanding semantic information. This paper explores the feasibility of mitigating classification bias by reducing the imbalance ratio (IR) of the dataset. First, a class-imbalanced dataset was constructed using very high-resolution (VHR) images, labeled into nine land use/land cover (LULC) categories. Second, comprehensive data augmentation techniques (mirroring, rotation, cropping, hue-saturation-value (HSV) perturbation, and gamma transformation) were applied, successfully reducing the dataset's IR from 9.38 to 1.28. Subsequently, four architectures, MobileNet-v2, ResNet101, ResNeXt101_32×32d, and Transformer, were trained and evaluated on both the class-balanced and class-imbalanced datasets. The results indicate that the classification bias caused by class imbalance was alleviated, significantly improving classifier performance. For the most severely underrepresented category (intersections), precision and recall improvements reached up to 128% and 102%, respectively, narrowing the gap with other categories and reducing classification bias. Furthermore, the average Kappa and overall accuracy (OA) increased by 11.84% and 12.97%, respectively, with reduced standard deviations in recall and precision, demonstrating enhanced model stability.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2025.1644044</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2025.1644044</link>
        <title><![CDATA[Efficient rotation invariance in deep neural networks through artificial mental rotation]]></title>
        <pubDate>2025-09-19T00:00:00Z</pubDate>
        <category>Original Research</category>
        <author>Lukas Tuggener</author><author>Thilo Stadelmann</author><author>Jürgen Schmidhuber</author>
        <description><![CDATA[Humans and animals recognize objects irrespective of the beholder's point of view, which may drastically change their appearance. Artificial pattern recognizers strive to achieve this as well, e.g., through translational invariance in convolutional neural networks (CNNs). However, both CNNs and vision transformers (ViTs) perform poorly on rotated inputs. Here we present AMR (artificial mental rotation), a method for handling in-plane rotations that focuses on large datasets and architectural flexibility; our simple AMR implementation works with all common CNN and ViT architectures. We test it on randomly rotated versions of ImageNet, Stanford Cars, and Oxford Pet. With a top-1 accuracy (averaged across datasets and architectures) of 0.743, AMR outperforms rotational data augmentation (average top-1 accuracy of 0.626) by 19%. We also easily transfer a trained AMR module to a downstream task, improving the performance of a pre-trained semantic segmentation model on rotated COCO from 32.7 to 55.2 IoU.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2025.1626641</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2025.1626641</link>
        <title><![CDATA[From shades to vibrance: a comprehensive review of modern image colorization techniques]]></title>
        <pubDate>2025-09-18T00:00:00Z</pubDate>
        <category>Mini Review</category>
        <author>Oshen Geenath</author><author>Y. H. P. P. Priyadarshana</author>
        <description><![CDATA[Image colorization has become a significant task in computer vision, addressing the challenge of transforming grayscale images into realistic, vibrant color outputs. Recent advancements leverage deep learning techniques, ranging from generative adversarial networks (GANs) to diffusion models, and integrate semantic understanding, multi-scale features, and user-guided controls. This review explores state-of-the-art methodologies, highlighting innovative components such as semantic class distribution learning, bidirectional temporal fusion, and instance-aware frameworks. Evaluation metrics, including PSNR, FID, and task-specific measures, ensure a comprehensive assessment of performance. Despite remarkable progress, challenges like multimodal uncertainty, computational cost, and generalization remain. This paper provides a thorough analysis of existing approaches, offering insights into their contributions, limitations, and future directions in automated image colorization.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2025.1626346</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2025.1626346</link>
        <title><![CDATA[LLaVA-GM: lightweight LLaVA multimodal architecture]]></title>
        <pubDate>2025-09-01T00:00:00Z</pubDate>
        <category>Original Research</category>
        <author>Zhiyin Han</author><author>Xiaoqun Liu</author><author>Juan Hao</author>
        <description><![CDATA[Multimodal large language models have become a mainstream approach to natural language processing tasks and have been applied to various cross-modal fields such as image description and visual question answering. However, large language models have high computational complexity and operate at a large scale, which presents significant challenges for deployment in resource-constrained scenarios. To address these problems, we propose LLaVA-GM, a lightweight multimodal framework based on LLaVA that can be deployed on devices with low resource requirements, has greatly reduced model parameters, and achieves good performance on common VQA tasks. The main contributions are as follows. First, we find that the Vicuna language-model backbone in LLaVA is redundant: when fine-tuning downstream tasks, a very small dataset can hardly affect the language model. We therefore replace it with the newer Gemma language model, achieving fast task-specific adaptation with fewer parameters and less data. Second, to address information redundancy, we introduce a Mixture-of-Experts (MoE) model and combine it with Gemma, reducing the amount of computation while maintaining performance. Because directly training the entire model leads to a decline in performance, a multi-stage training strategy is adopted: first, the MLP layer is trained for visual adaptation; then the entire Gemma model is trained to improve multimodal capabilities; and finally only the MoE layer is trained for sparsification, ensuring a smooth transition from a dense model to a sparse one. Experiments on multiple VQA datasets achieve good performance, confirming the potential of this compact model in downstream multimodal applications.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2025.1569017</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2025.1569017</link>
        <title><![CDATA[Heterogeneous ensemble learning: modified ConvNextTiny for detecting molecular expression of breast cancer on standard biomarkers]]></title>
        <pubDate>2025-09-01T00:00:00Z</pubDate>
        <category>Original Research</category>
        <author>Indo Intan</author><author>Andrea Stevens Karnyoto</author><author>Sitti Harlina</author><author>Berti Julian Nelwan</author><author>Devin Setiawan</author><author>Amalia Yamin</author><author>Ririn Endah Puspitasari</author>
        <description><![CDATA[Breast cancer is the most commonly diagnosed cancer, with 2.3 million new cases each year. Immunohistochemistry (IHC) is the gold-standard examination for determining the expression of cancer malignancies in patients, with the ultimate goal of determining prognosis and therapy. Immunohistochemistry refers to the four WHO standard biomarkers: estrogen receptor, progesterone receptor, human epidermal growth factor receptor-2, and Ki-67. These biomarkers are assessed based on the quantity of cell nuclei and the intensity of brown cell membranes. Our study aims to detect the expression of breast cancer malignancy as an initial step in determining prognosis and therapy. We implemented homogeneous and heterogeneous ensemble learning models. The homogeneous ensemble learning model uses the majority-vote technique to select the best performer among the Xception, ResNet50V2, InceptionResNet50V2, and ConvNextTiny models. The heterogeneous ensemble learning model takes ConvNextTiny as its best model; its feature engineering fuses convolution and cell-quantification features. ConvNextTiny with this feature fusion can detect the expression of cancer malignancy, and heterogeneous ensemble learning outperforms homogeneous ensemble learning. The model achieves accuracy, precision, recall, F1-score, and receiver operating characteristic-area under the curve (ROC-AUC) of 0.997, 0.973, 0.991, 0.982, and 0.994, respectively. These results indicate that the model can classify the malignancy expressions of breast cancer well. Testing the model's real-time capabilities still requires configuring the visual laboratory device.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2025.1576958</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2025.1576958</link>
        <title><![CDATA[Deep learning for vision screening in resource-limited settings: development of multi-branch CNN for refractive error detection based on smartphone image]]></title>
        <pubDate>2025-07-30T00:00:00Z</pubDate>
        <category>Original Research</category>
        <author>Muhammad Syauqie</author><author>Harry Patria</author><author>Sutanto Priyo Hastono</author><author>Kemal Nazaruddin Siregar</author><author>Nila Djuwita Farieda Moeloek</author>
        <description><![CDATA[Introduction: Uncorrected refractive errors are a leading cause of preventable vision impairment globally, particularly affecting individuals in low-resource regions where timely diagnosis and screening access remain significant challenges despite the availability of economical treatments. Aim: This study introduces a novel deep learning-based system for automated refractive error classification using photorefractive images acquired with a standard smartphone camera. Methods: A multi-branch convolutional neural network (CNN) was developed and trained on a dataset of 2,139 corneal images collected from an Indonesian public eye hospital. The model was designed to classify refractive errors into four categories: significant myopia, significant hypermetropia, insignificant refractive error, and not classifiable. Grad-CAM visualization was employed to provide insight into the model’s interpretability. Results: The 3-branch CNN architecture demonstrated superior performance, achieving an overall test accuracy of 91%, precision of 96%, and recall of 98%, with an area under the curve (AUC) score of 0.9896. Its multi-scale feature extraction pathways were pivotal in effectively handling overlapping red reflex patterns and subtle variations between classes. Conclusion: This study establishes the feasibility of smartphone-based photorefractive assessment integrated with artificial intelligence for scalable and cost-effective vision screening. By training the CNN model on a real-world dataset representative of Southeast Asian populations, this system offers a reliable solution for early refractive error detection, with significant implications for improving access to eye care services in resource-limited settings.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fcomp.2025.1576775</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fcomp.2025.1576775</link>
        <title><![CDATA[Convolutional spatio-temporal sequential inference model for human interaction behavior recognition]]></title>
        <pubDate>2025-07-15T00:00:00Z</pubDate>
        <category>Original Research</category>
        <author>Lizhong Jin</author><author>Rulong Fan</author><author>Xiaoling Han</author><author>Xueying Cui</author>
        <description><![CDATA[Introduction: Human action recognition is a critical task with broad applications and remains a challenging problem due to the complexity of modeling dynamic interactions between individuals. Existing methods, including skeleton sequence-based and RGB video-based models, have achieved impressive accuracy but often suffer from high computational costs and limited effectiveness in modeling human interaction behaviors. Methods: To address these limitations, we propose a lightweight Convolutional Spatiotemporal Sequence Inference Model (CSSIModel) for recognizing human interaction behaviors. The model extracts features from skeleton sequences using DINet and from RGB video frames using ResNet-18. These multi-modal features are fused and processed using a novel multiscale two-dimensional convolutional peak-valley inference module to classify interaction behaviors. Results: CSSIModel achieves competitive results across several benchmark datasets: 87.4% accuracy on NTU RGB+D 60 (XSub), 94.1% on NTU RGB+D 60 (XView), 80.5% on NTU RGB+D 120 (XSub), and 84.9% on NTU RGB+D 120 (XSet). These results are comparable to or exceed those of state-of-the-art methods. Discussion: The proposed method effectively balances accuracy and computational efficiency. By significantly reducing model complexity while maintaining high performance, CSSIModel is well-suited for real-time applications and provides a valuable reference for future research in multi-modal human behavior recognition.]]></description>
      </item>
      </channel>
    </rss>