ORIGINAL RESEARCH article

Front. Plant Sci.

Sec. Technical Advances in Plant Science

Volume 16 - 2025 | doi: 10.3389/fpls.2025.1616020

This article is part of the Research Topic "Machine Vision and Machine Learning for Plant Phenotyping and Precision Agriculture, Volume II".

Automatic Fused Multimodal Deep Learning for Plant Identification

Provisionally accepted
  • Stockholm University, Stockholm, Sweden

The final, formatted version of the article will be published soon.

Introduction: Plant classification is vital for ecological conservation and agricultural productivity, enhancing our understanding of plant growth dynamics and aiding species preservation. The advent of deep learning (DL) techniques has revolutionized this field by enabling autonomous feature extraction and significantly reducing the dependence on manual expertise. However, conventional DL models often rely on a single data source and thus fail to capture the full biological diversity of plant species. Recent research has turned to multimodal learning to overcome this limitation by integrating multiple data types, which enriches the representation of plant characteristics. This shift introduces the challenge of determining the optimal point at which modalities should be fused.

Methods: In this paper, we introduce a pioneering multimodal DL-based approach for plant classification with automatic modality fusion. Using multimodal fusion architecture search, our method integrates images of multiple plant organs (flowers, leaves, fruits, and stems) into a single cohesive model. To address the lack of multimodal datasets, we contribute Multimodal-PlantCLEF, a restructured version of the PlantCLEF2015 dataset tailored for multimodal tasks.

Results: Our method achieves 82.61% accuracy on the 979 classes of Multimodal-PlantCLEF, outperforming late fusion by 10.33%. Through the incorporation of multimodal dropout, our approach demonstrates strong robustness to missing modalities. We validate our model against established benchmarks using standard performance metrics and McNemar's test, further underscoring its superiority.

Discussion: The proposed model surpasses state-of-the-art methods, highlighting the effectiveness of multimodality and of an optimal fusion strategy. Our findings open a promising direction for future plant classification research.
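The abstract's central idea is to search over where modalities are fused, rather than fixing fusion at the output layer as late fusion does. As a rough illustration only, the sketch below exhaustively scores candidate fusion points between two toy organ encoders; all class names and the scoring callback are hypothetical, and the paper's actual multimodal fusion architecture search is more sophisticated than this brute-force loop.

```python
# Toy sketch: search over fusion points between two organ encoders.
# OrganEncoder, FusedModel, and search_fusion_point are hypothetical names;
# this is not the authors' implementation.
import itertools
import torch
import torch.nn as nn

class OrganEncoder(nn.Module):
    """Small CNN whose intermediate feature maps serve as candidate fusion points."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU()),
        ])

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x.mean(dim=(2, 3)))  # global-average-pool each stage
        return feats  # one vector per candidate fusion point

class FusedModel(nn.Module):
    """Fuses one chosen stage from each encoder and classifies the result."""
    def __init__(self, enc_a, enc_b, layer_a, layer_b, n_classes):
        super().__init__()
        self.enc_a, self.enc_b = enc_a, enc_b
        self.layer_a, self.layer_b = layer_a, layer_b
        dims = [16, 32, 64]
        self.head = nn.Linear(dims[layer_a] + dims[layer_b], n_classes)

    def forward(self, x_a, x_b):
        fa = self.enc_a(x_a)[self.layer_a]
        fb = self.enc_b(x_b)[self.layer_b]
        return self.head(torch.cat([fa, fb], dim=1))

def search_fusion_point(x_a, x_b, y, n_classes, evaluate):
    """Score every (layer_a, layer_b) fusion configuration and keep the best."""
    best, best_score = None, -float("inf")
    for la, lb in itertools.product(range(3), range(3)):
        model = FusedModel(OrganEncoder(), OrganEncoder(), la, lb, n_classes)
        score = evaluate(model, x_a, x_b, y)  # e.g. validation accuracy after short training
        if score > best_score:
            best, best_score = (la, lb), score
    return best, best_score
```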
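The abstract attributes robustness to missing modalities to multimodal dropout. A common reading of that term is to randomly zero out entire modality embeddings during training, so the downstream classifier learns to cope when an organ image is absent at test time. The minimal sketch below assumes that interpretation, which may differ from the paper's exact formulation.

```python
# Minimal sketch of multimodal dropout (general idea only, not the paper's code):
# each modality's embedding is dropped with probability p during training.
import torch
import torch.nn as nn

class MultimodalDropout(nn.Module):
    def __init__(self, p: float = 0.3):
        super().__init__()
        self.p = p  # probability of dropping each modality independently

    def forward(self, embeddings: list[torch.Tensor]) -> list[torch.Tensor]:
        if not self.training:
            return embeddings  # no dropout at inference time
        keep = torch.rand(len(embeddings)) >= self.p
        if not keep.any():  # never drop every modality at once
            keep[torch.randint(len(embeddings), (1,))] = True
        return [e if k else torch.zeros_like(e) for e, k in zip(embeddings, keep)]

# Usage: organ_feats = MultimodalDropout(p=0.3)([flower_z, leaf_z, fruit_z, stem_z])
```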
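Model comparisons in the abstract rely on McNemar's test, which checks whether two classifiers evaluated on the same samples disagree more often in one direction than the other. The sketch below implements the exact (binomial) variant on paired predictions; the function and variable names are illustrative, not the authors'.

```python
# Exact McNemar's test on paired classifier predictions.
import numpy as np
from scipy.stats import binomtest

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar's test on the discordant pairs."""
    a_right = np.asarray(pred_a) == np.asarray(y_true)
    b_right = np.asarray(pred_b) == np.asarray(y_true)
    b = int(np.sum(a_right & ~b_right))  # A correct, B wrong
    c = int(np.sum(~a_right & b_right))  # B correct, A wrong
    if b + c == 0:
        return 1.0  # no disagreements: no evidence of a difference
    return binomtest(min(b, c), n=b + c, p=0.5, alternative="two-sided").pvalue

# Example: p = mcnemar_exact(labels, fused_preds, late_fusion_preds)
# A small p-value indicates the two models' error patterns differ significantly.
```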

Keywords: plant identification, plant phenotyping, multimodal learning, fusion automation, multimodal fusion, architecture search, neural architecture search, multimodal dataset

Received: 22 Apr 2025; Accepted: 04 Jul 2025.

Copyright: © 2025 Lapkovskis, Nefedova and Beikmohammadi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Ali Beikmohammadi, Stockholm University, Stockholm, Sweden

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.