
ORIGINAL RESEARCH article

Front. Plant Sci.

Sec. Sustainable and Intelligent Phytoprotection

This article is part of the Research Topic: Integrating Visual Sensing and Machine Learning for Advancements in Plant Phenotyping and Precision Agriculture

SSViT-YOLOv11: Fusing lightweight YOLO & ViT for Coffee fruit maturity detection

Provisionally accepted
Yifan Liu¹, Qiudong Yu¹*, Shuze Geng¹, Shiyi Guo², Ling Liu²
  • ¹ Tianjin University of Technology and Education, Tianjin, China
  • ² Tianjin Agricultural University, Tianjin, China

The final, formatted version of the article will be published soon.

Accurate identification of coffee fruit maturity is critical for optimizing harvest timing and ensuring bean quality, but manual inspection is time-consuming and subjective. Automated visual detection faces several challenges: subtle color differences among maturity stages, frequent occlusions within fruit clusters, variable lighting, and abundant small-scale targets. In this paper, we propose SSViT-YOLOv11, an improved YOLOv11n-based framework that integrates a Single Scale Vision Transformer (SSViT) into the backbone and refines multi-scale feature fusion to enhance context modeling and small-object representation. In addition, the C3K2 modules in YOLOv11n are combined with Arbitrary Kernel Convolution (AKConv), and a multi-scale convolutional attention (MSCA) module is added to the head, which improves detection accuracy while making the model more lightweight. In experiments, SSViT-YOLOv11 attains a precision of 81.1%, a recall of 77.4%, and a mean Average Precision (mAP@50) of 84.54%, while operating at 23 FPS and requiring only 2.16 million parameters. These results indicate that the proposed model offers a favorable balance of accuracy, inference speed, and model compactness, making it well suited for assisting farmers in coffee fruit maturity assessment.
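
To put the reported figures in context, the following minimal Python sketch derives three secondary quantities from the numbers stated in the abstract: the F1 score implied by the reported precision and recall, the per-frame latency implied by 23 FPS, and the approximate weight size implied by 2.16 million parameters. None of these derived values is reported in the abstract. The function names are hypothetical, and the 4 bytes per parameter (float32 storage) is an assumption, not something the source states.

```python
# Illustrative arithmetic only; the inputs come from the abstract, while the
# helper names and the float32 storage assumption do not.

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def latency_ms(fps: float) -> float:
    """Average time per frame in milliseconds for a given frames-per-second rate."""
    return 1000.0 / fps


def weight_size_mb(num_params: float, bytes_per_param: int = 4) -> float:
    """Approximate weight size in megabytes (10^6 bytes).

    bytes_per_param=4 assumes float32 weights; this is an assumption.
    """
    return num_params * bytes_per_param / 1e6


if __name__ == "__main__":
    # Values reported in the abstract.
    precision, recall = 0.811, 0.774
    fps = 23
    num_params = 2.16e6

    print(f"Implied F1 score: {f1_score(precision, recall):.3f}")
    print(f"Implied latency: {latency_ms(fps):.1f} ms per frame")
    print(f"Implied float32 weight size: {weight_size_mb(num_params):.2f} MB")
```

These are back-of-the-envelope estimates: actual latency depends on the hardware and input resolution used for the 23 FPS measurement, and the on-disk model size depends on the export format and numeric precision.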

Keywords: object detection, coffee fruit maturity, YOLOv11, Vision Transformer, lightweight

Received: 24 Aug 2025; Accepted: 13 Nov 2025.

Copyright: © 2025 Liu, Yu, Geng, Guo and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Qiudong Yu, 2020070006@tute.edu.cn

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.