A Lightweight Tri-Modal Few-Shot Detection Framework for Fruit Diversity Recognition Toward Digital Orchard Archiving

Xu, Huaqiang; Li, Honghan; Zhao, Ji

doi:10.3389/fpls.2025.1696622

ORIGINAL RESEARCH article

Front. Plant Sci.

Sec. Sustainable and Intelligent Phytoprotection

Provisionally accepted

Huaqiang Xu¹

Honghan Li¹,²

Ji Zhao¹,³*

¹University of Science and Technology Liaoning, Anshan, China
²Oulun yliopisto, Oulu, Finland
³National University of Singapore, Singapore, Singapore

The final, formatted version of the article will be published soon.

Few-shot object detection (FSOD) targets object recognition when annotated examples are scarce, which makes it practical for smart agriculture, where large-scale labeling of diverse fruit cultivars is often infeasible. To handle the visual complexity of orchard environments, such as occlusion, subtle morphological differences, and dense foliage, this study presents a lightweight tri-modal fusion framework. The model first uses a CLIP-based semantic prompt encoder to extract category-aware cues, which guide the Segment Anything Model (SAM) in producing structure-preserving masks. These masks are then incorporated through a Semantic Fusion Module (SFM), which comprises a Mask-Saliency Adapter (MSA) and a Feature Enhancement Recomposer (FER) and enables spatially aligned, semantically enriched feature modulation. An Attention-Aware Weight Estimator (AWE) further refines the fusion by adaptively balancing the semantic and visual streams using global saliency cues. A YOLOv12 detection head then generates the final predictions. Experiments on four fruit detection benchmarks (Cantaloupe.v2, Peach.v3, Watermelon.v2, and Orange.v8) show that the proposed method consistently surpasses five representative FSOD baselines. Performance improvements include +7.9% AP@0.5 on Cantaloupe.v2, +5.4% Precision on Peach.v3, +7.4% Precision on Watermelon.v2, and +5.9% AP@0.75 on Orange.v8. These results underscore the model's effectiveness in orchard-specific scenarios and its potential to support cultivar identification, digital recordkeeping, and cost-efficient agricultural monitoring.
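The abstract does not give the equations behind the fusion stage, so the following is only a minimal sketch of one common way such adaptive balancing can be realized: a scalar gate computed from globally pooled statistics mixes a mask-modulated (semantic) stream with the raw visual stream. All function names, the sigmoid gate, and the parameters `strength`, `w_sem`, `w_vis`, and `bias` are hypothetical illustrations, not the authors' MSA, FER, or AWE definitions.

```python
import math


def global_average(feature_map):
    """Mean over all spatial positions of a 2-D map (list of rows)."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mask_modulate(visual, mask, strength=1.0):
    """Hypothetical stand-in for mask-guided modulation:
    amplify features at positions where the mask is active."""
    return [
        [v * (1.0 + strength * m) for v, m in zip(v_row, m_row)]
        for v_row, m_row in zip(visual, mask)
    ]


def estimate_weight(semantic, visual, w_sem=1.0, w_vis=-1.0, bias=0.0):
    """Hypothetical gate: a scalar in (0, 1) derived from the
    globally pooled response of each stream (assumed parameters)."""
    score = w_sem * global_average(semantic) + w_vis * global_average(visual) + bias
    return sigmoid(score)


def fuse(semantic, visual, alpha):
    """Convex combination of the two streams, weighted by alpha."""
    return [
        [alpha * s + (1.0 - alpha) * v for s, v in zip(s_row, v_row)]
        for s_row, v_row in zip(semantic, visual)
    ]


# Toy 2x2 feature map and binary mask (illustrative values only).
visual = [[0.2, 0.4], [0.6, 0.8]]
mask = [[0, 1], [1, 0]]

semantic = mask_modulate(visual, mask)
alpha = estimate_weight(semantic, visual)
fused = fuse(semantic, visual, alpha)
```

In this sketch, positions outside the mask are left unchanged by the fusion, while positions inside it move toward the amplified value in proportion to `alpha`. In a trained model the gate parameters would be learned rather than fixed by hand.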

Keywords: Fruit detection, Digital Orchard, FSOD, CLIP Prompt, SAM Mask, multimodal fusion, Attention weighting, Lightweight Agriculture AI

Received: 01 Sep 2025; Accepted: 31 Oct 2025.

Copyright: © 2025 Xu, Li and Zhao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Ji Zhao, zhaoji_ustl@outlook.com
