ORIGINAL RESEARCH article
Front. Artif. Intell.
Sec. Machine Learning and Artificial Intelligence
Volume 8 - 2025 | doi: 10.3389/frai.2025.1608837
This article is part of the Research Topic: The Applications of AI Techniques in Medical Data Processing.
Comparative Analysis of Multimodal Architectures for Effective Skin Lesion Detection Using Clinical and Image Data
Provisionally accepted
Manipal Institute of Technology, Manipal, Karnataka, India
Skin lesion classification is a critical task in dermatology, with significant implications for early diagnosis and treatment. In this study, a novel multimodal data fusion approach is proposed that integrates dermatoscopic images and clinical metadata to improve classification accuracy. We systematically evaluate various fusion techniques, including simple concatenation, weighted concatenation, and advanced methods such as self-attention and cross-attention, using the widely recognized HAM10000 dataset. Our results demonstrate that combining clinical features with image data significantly enhances classification performance, with cross-attention fusion achieving the highest accuracy because it effectively captures inter-modal dependencies and contextual relationships between the two modalities. Additionally, we use Grad-CAM to improve the interpretability of our model, shedding light on the contribution of clinical attributes to the decision-making process. Despite these improvements, obstacles such as class imbalance and the computational demands of sophisticated fusion techniques remain. We also explore potential avenues for improving model performance and interpretability, particularly for clinical applications in resource-limited settings.
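To make the cross-attention fusion idea concrete, the following is a minimal NumPy sketch, not the authors' implementation: image features act as queries and clinical metadata features as keys/values in scaled dot-product attention, so each image token attends to the clinical attributes most relevant to it. The function name, feature dimensions, and random projection matrices are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(img_feats, clin_feats, d_k=16, seed=0):
    """Fuse image tokens (queries) with clinical tokens (keys/values)
    via scaled dot-product cross-attention.

    img_feats:  (n_img_tokens, d_img)   e.g. CNN patch embeddings
    clin_feats: (n_clin_tokens, d_clin) e.g. embedded age/sex/site
    Returns:    (n_img_tokens, d_k)     clinical context per image token
    """
    rng = np.random.default_rng(seed)
    d_img, d_clin = img_feats.shape[-1], clin_feats.shape[-1]
    # Learned projections in a real model; random here for illustration.
    W_q = rng.standard_normal((d_img, d_k)) / np.sqrt(d_img)
    W_k = rng.standard_normal((d_clin, d_k)) / np.sqrt(d_clin)
    W_v = rng.standard_normal((d_clin, d_k)) / np.sqrt(d_clin)
    Q, K, V = img_feats @ W_q, clin_feats @ W_k, clin_feats @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (n_img, n_clin) weights
    return attn @ V
```

In a full classifier, the attended clinical context would typically be concatenated with (or added to) the image features before the final classification head; simple and weighted concatenation correspond to skipping the attention step and joining the raw feature vectors directly.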
Keywords: Skin lesion classification, multimodal fusion, Dermatoscopic images, Clinical metadata, cross-attention, HAM10000, Interpretability, deep learning
Received: 09 Apr 2025; Accepted: 21 Jul 2025.
Copyright: © 2025 Das, Agarwal and SHETTY. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: NISHA SHETTY, Manipal Institute of Technology, Manipal, 576104, Karnataka, India
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.