AUTHOR=Gan Quan , Wang Chen , Zhong Zhaoman , Wu Jiaying , Ge Qiwei , Shi Lei , Shang Jiaqing , Liu Chuanxia TITLE=Cross-modal attention model integrating tongue images and descriptions: a novel intelligent TCM approach for pathological organ diagnosis JOURNAL=Frontiers in Physiology VOLUME=Volume 16 - 2025 YEAR=2025 URL=https://www.frontiersin.org/journals/physiology/articles/10.3389/fphys.2025.1580985 DOI=10.3389/fphys.2025.1580985 ISSN=1664-042X ABSTRACT=IntroductionTongue diagnosis is a fundamental technique in traditional Chinese medicine (TCM), where clinicians evaluate the tongue’s appearance to infer the condition of pathological organs. However, most existing research on intelligent tongue diagnosis primarily focuses on analyzing tongue images, often neglecting the important descriptive text that accompanies these images. This text is an essential component of clinical diagnosis. To overcome this gap, we propose a novel Cross-Modal Pathological Organ Diagnosis Model that integrates tongue images and textual descriptions for more accurate pathological classificationMethodsOur model extracts features from both the tongue images and the corresponding textual descriptions. These features are then fused using a cross-modal attention mechanism to enhance the classification of pathological organs. The cross-modal attention mechanism enables the model to effectively combine visual and textual information, addressing the limitations of using either modality aloneResultsWe conducted experiments using a self-constructed dataset to evaluate our model’s performance. The results demonstrate that our model outperforms common models regarding overall accuracy. Additionally, ablation studies, where either tongue images or textual descriptions were used alone, confirmed the significant benefit of multimodal fusion in improving diagnostic accuracy.DiscussionThis study introduces a new perspective on intelligent tongue diagnosis in TCM by incorporating visual and textual data. The experimental findings highlight the importance of cross-modal feature fusion for improving the accuracy of pathological diagnosis. Our approach not only contributes to the development of more effective diagnostic systems but also paves the way for future advancements in the automation of TCM diagnosis.