ORIGINAL RESEARCH article
Front. Chem.
Sec. Analytical Chemistry
Rapid discrimination of geographical origin and analysis of chemical characterization of tobacco leaves from multiple countries
Provisionally accepted
- 1 Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou, China
- 2 Technology Center, China Tobacco Fujian Industrial Co., Ltd., Xiamen 361021, China
- 3 Technology Center, China Tobacco Henan Industrial Co., Ltd., Zhengzhou 450000, China
- 4 Technology Center, China Tobacco Guangdong Industrial Co., Ltd., Guangzhou 510385, China
Tobacco is a globally cultivated crop whose leaf quality varies distinctly with geographical origin. To develop a rapid, robust, and accurate method for multi-origin traceability, this study employed near-infrared spectroscopy combined with rapid chemical composition analysis to quantify 70 chemical components in samples from nine major tobacco-producing regions in China and four other countries (the United States, Brazil, Zimbabwe, and Zambia). One-way analysis of variance (ANOVA) and hierarchical cluster analysis (HCA) were used to investigate regional chemical differences. Discrimination models were built using a support vector machine (SVM), a backpropagation neural network, and a random forest. The best model was interpreted using permutation feature importance (PFI) to identify key markers for origin discrimination. One-way ANOVA revealed significant differences among origins (p ≤ 0.001), and HCA demonstrated clear regional patterns. The SVM with a hybrid kernel achieved the best performance, with 97.96% test accuracy and macro-average recall, precision, and F1 scores of 0.9836, 0.9806, and 0.9821, respectively. PFI ranked the top 20 chemical components influencing geographical origin discrimination; the top ten were Fru-Asn, succinic acid, rutin, Fru-Val, sulfate, serine, phosphate, starch, potassium, and Fru-Gly. By integrating chemometrics, near-infrared spectroscopy, rapid chemical analysis, and interpretable machine learning, this study accurately distinguished tobacco origins, revealed regional traits, and offers insights into geographical traceability and chemical profiling.
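The idea behind permutation feature importance can be illustrated with a minimal sketch. This is not the authors' pipeline: the classifier here is a toy nearest-centroid model standing in for the hybrid-kernel SVM, the two-feature data set is synthetic, and all function names (`nearest_centroid_fit`, `permutation_importance`, and so on) are hypothetical. For brevity the importance is scored on the same data used to fit the model, whereas a real study would normally score it on held-out samples. The sketch shuffles one feature column at a time and records the mean drop in accuracy; a feature whose shuffling costs more accuracy is ranked as more important.

```python
import random

def nearest_centroid_fit(X, y):
    """Toy stand-in classifier: return {label: centroid} (assumption, not the paper's SVM)."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, row):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])),
    )

def accuracy(centroids, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(centroids, r) == t for r, t in zip(X, y)) / len(y)

def permutation_importance(centroids, X, y, n_repeats=10, seed=0):
    """Mean accuracy drop when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(centroids, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(centroids, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return baseline, importances

# Synthetic data (assumption): feature 0 separates the two "origins",
# feature 1 is pure noise and should receive an importance near zero.
data_rng = random.Random(42)
X, y = [], []
for label, centre in (("A", 0.0), ("B", 6.0)):
    for _ in range(50):
        X.append([data_rng.gauss(centre, 1.0), data_rng.gauss(0.0, 1.0)])
        y.append(label)

model = nearest_centroid_fit(X, y)
baseline, importances = permutation_importance(model, X, y)
ranking = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)
print("baseline accuracy:", baseline)
print("importance per feature:", importances)
print("features ranked by importance:", ranking)
```

Because the procedure only needs predictions from a fitted model, the same loop applies to any classifier, which is why it suits a kernel SVM that has no built-in feature weights.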
Keywords: chemical composition, chemical feature interpretation, chemometrics, geographical origin traceability of tobacco, permutation feature importance, support vector machine with hybrid kernel
Received: 16 Oct 2025; Accepted: 10 Feb 2026.
Copyright: © 2026 Kou, Wang, Wan, Su, Xu, Fu, Lin, Zhao, Guo, Wang, Liu, Yang and Nie. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence:
Zechun Liu
Song Yang
Cong Nie
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
