
ORIGINAL RESEARCH article

Front. Plant Sci.

Sec. Sustainable and Intelligent Phytoprotection

Volume 16 - 2025 | doi: 10.3389/fpls.2025.1597673

Using Preprocessed Datasets to Construct and Interpret Multiclass Identification Models

Provisionally accepted
Cong Wang1, Yufeng Fu2, Ran Wan1, Zhao Le1, Hongbo Wang1, Junwei Guo1, Qiang Liu2, Shan Li3, Shengtao Ma2, Zhicai Wang3, Wei Huang3, Huimin Liu1, Song Yang1*, Cong Nie1
  • 1Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, China
  • 2China Tobacco Henan Industrial Co., Ltd., Zhengzhou, Henan Province, China
  • 3China Tobacco Gansu Industrial Co., Ltd., Lanzhou, China

The final, formatted version of the article will be published soon.

Introduction: Image and near-infrared (NIR) spectroscopic data are widely used for constructing analytical models in precision agriculture. While model interpretation can provide valuable insights for quality control and improvement, the inherent ambiguity of individual image pixels or spectral data points often hinders practical interpretability when raw data are used directly. Furthermore, imbalanced datasets can lead to model overfitting and, consequently, poor robustness. Developing alternative approaches for constructing interpretable and robust models from these data types is therefore crucial.

Methods: This study proposes using preprocessed data, specifically morphological features extracted from images and chemical component concentrations predicted from NIR spectra, to build multiclass identification models. Combined-kernel support vector machine (SVM) models were proposed to identify rice varieties and the cultivation regions of tobacco. The kernel parameters and the proportions of the different kernel function types were determined by particle swarm optimization (PSO), which makes the approach self-adaptive. Feature importance and contribution analyses were conducted using Shapley additive explanations (SHAP). The resulting models demonstrated high robustness and accuracy, achieving classification success rates of 97.9% and 97.4% via n-fold cross-validation on the rice and tobacco datasets, respectively, and 97.7% on an independent test set (tobacco dataset 2). The SHAP analysis identified key variables and elucidated their specific contributions to the model predictions.

Discussion: This study expands the applicability of image and NIR spectroscopic data, offering researchers an effective methodology for investigating factors crucial to the quality control and improvement of agricultural products.
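The combined-kernel idea described in the Methods can be illustrated with a minimal sketch. The abstract does not state which kernel functions were combined or how they were weighted, so everything here is an assumption made for illustration: the choice of an RBF and a polynomial kernel, the convex mixing weight `weight`, and the names `rbf_kernel`, `poly_kernel`, `combined_kernel`, and `gram_matrix` are all hypothetical. The parameters `weight`, `gamma`, `degree`, and `coef0` stand in for the quantities that an optimizer such as PSO would tune.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def poly_kernel(x, y, degree, coef0):
    """Polynomial kernel: (x . y + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


def combined_kernel(x, y, weight, gamma, degree, coef0):
    """Convex combination of the two kernels (an assumed form).

    weight is the share given to the RBF kernel; the polynomial
    kernel receives 1 - weight.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (weight * rbf_kernel(x, y, gamma)
            + (1.0 - weight) * poly_kernel(x, y, degree, coef0))


def gram_matrix(samples, **params):
    """Pairwise combined-kernel values for a list of feature vectors."""
    return [[combined_kernel(a, b, **params) for b in samples]
            for a in samples]


# Hypothetical feature vectors (e.g. two morphological features per sample).
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = gram_matrix(samples, weight=0.5, gamma=1.0, degree=2, coef0=1.0)
```

A Gram matrix of this kind can be passed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`. A tuning loop would then score each candidate parameter set by cross-validated accuracy.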

Keywords: Multiclass identification, preprocessed data, Kernel support vector machine, Model interpretation, SHAP, image analysis, near-infrared spectroscopy

Received: 16 Apr 2025; Accepted: 28 Jul 2025.

Copyright: © 2025 Wang, Fu, Wan, Le, Wang, Guo, Liu, Li, Ma, Wang, Huang, Liu, Yang and Nie. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Song Yang, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, China

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.