
ORIGINAL RESEARCH article

Front. Chem.

Sec. Analytical Chemistry

Rapid discrimination of geographical origin and analysis of chemical characterization of tobacco leaves from multiple countries

Provisionally accepted
Ranran Kou1, Cong Wang1, Ran Wan1, Mingliang Su2, Heng Xu3, Yufeng Fu3, Yun Lin4, Le Zhao1, Junwei Guo1, Hongbo Wang1, Zechun Liu2*, Song Yang1*, Cong Nie1*
  • 1Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, China
  • 2Technology Center, China Tobacco Fujian Industrial Co., Ltd., Xiamen 361021, China
  • 3Technology Center, China Tobacco Henan Industrial Co., Ltd., Zhengzhou 450000, China
  • 4Technology Center, China Tobacco Guangdong Industrial Co., Ltd., Guangzhou 510385, China

The final, formatted version of the article will be published soon.

Tobacco is a globally cultivated crop whose leaf quality varies markedly with geographical origin. To develop a rapid, robust, and accurate method for multi-origin traceability, this study used near-infrared spectroscopy combined with rapid chemical composition analysis to quantify 70 chemical components in samples from nine major tobacco-producing regions in China and four other countries (the United States, Brazil, Zimbabwe, and Zambia). One-way analysis of variance (ANOVA) and hierarchical cluster analysis (HCA) were used to investigate regional chemical differences. Discrimination models were built using a support vector machine (SVM), a backpropagation neural network, and a random forest. The best model was interpreted using permutation feature importance (PFI) to identify key markers for origin discrimination. One-way ANOVA revealed significant differences among origins (p ≤ 0.001), and HCA demonstrated clear regional patterns. The SVM with a hybrid kernel achieved the best performance, with 97.96% test accuracy and macro-average recall, precision, and F1 scores of 0.9836, 0.9806, and 0.9821, respectively. PFI was used to rank the top 20 chemical components influencing geographical origin discrimination; the top ten were Fru-Asn, succinic acid, rutin, Fru-Val, sulfate, serine, phosphate, starch, potassium, and Fru-Gly. By integrating chemometrics, near-infrared spectroscopy, rapid chemical analysis, and interpretable machine learning, this study accurately distinguished tobacco origins, revealed regional traits, and offers insights into geographical traceability and chemical profiling.
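The modelling workflow summarized above (a hybrid-kernel SVM, macro-averaged metrics, and PFI ranking) can be illustrated with a minimal Python sketch using scikit-learn. This is not the authors' implementation. The abstract does not state the form of the hybrid kernel, so the sketch assumes a weighted sum of an RBF and a polynomial kernel; the function name `hybrid_kernel`, its parameters, and the synthetic data (20 features, 4 classes) are all hypothetical stand-ins for the 70 measured components and the real origins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def hybrid_kernel(A, B, weight=0.5, gamma=0.05, degree=2):
    """Weighted sum of an RBF and a polynomial kernel.

    ASSUMPTION: the source does not define its hybrid kernel; this convex
    combination is only one common way to build one.
    """
    k_rbf = rbf_kernel(A, B, gamma=gamma)
    k_poly = polynomial_kernel(A, B, degree=degree, gamma=gamma, coef0=1.0)
    return weight * k_rbf + (1.0 - weight) * k_poly


# Synthetic stand-in data (NOT the tobacco data set from the study).
X, y = make_classification(
    n_samples=390, n_features=20, n_informative=10, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Standardize features, then fit an SVM that uses the callable kernel.
model = make_pipeline(StandardScaler(), SVC(kernel=hybrid_kernel, C=1.0))
model.fit(X_train, y_train)

# Test accuracy plus macro-averaged precision, recall, and F1.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0
)

# Permutation feature importance: shuffle one feature at a time on the
# held-out set and record the mean drop in accuracy.
pfi = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0, scoring="accuracy"
)
ranking = np.argsort(pfi.importances_mean)[::-1]  # most important first
top_features = ranking[:5]

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
print("top feature indices:", top_features.tolist())
```

Computing PFI on held-out data, as here, measures how much each feature contributes to generalization rather than to fitting the training set; whether the study ranked features on the test or training split is not stated in the abstract.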

Keywords: chemical composition, chemical feature interpretation, chemometrics, geographical origin traceability of tobacco, permutation feature importance, support vector machine with hybrid kernel

Received: 16 Oct 2025; Accepted: 10 Feb 2026.

Copyright: © 2026 Kou, Wang, Wan, Su, Xu, Fu, Lin, Zhao, Guo, Wang, Liu, Yang and Nie. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence:
Zechun Liu
Song Yang
Cong Nie

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.