- 1Shanghai Key Laboratory of Anesthesiology and Brain Functional Modulation, Clinical Research Center for Anesthesiology and Perioperative Medicine, Translational Research Institute of Brain and Brain-Like Intelligence, Shanghai Fourth People’s Hospital, School of Medicine, Tongji University, Shanghai, China
- 2School of Traditional Chinese Medicine, Guangdong Pharmaceutical University, Guangzhou, China
- 3State Key Laboratory of Genetic Engineering, School of Life Sciences, Zhangjiang Fudan International Innovation Center, Human Phenome Institute, Fudan University, Shanghai, China
- 4College of Chinese Materia Medica and Yunnan Key Laboratory of Southern Medicinal Utilization, Yunnan University of Chinese Medicine, Kunming, China
Background: The dried ripe fruit or seed of Amomum tsaoko is a widely used spice and food additive in Eastern and Southeastern Asia. Approximately 90% of the global production of this spice occurs in Yunnan province, China. Over years of cultivation, genetic variations have emerged, leading to wide regional variation. Authenticating geographical origin has therefore become essential for quality assessment and control, as origin directly influences a product’s commercial value.
Objective: This study aims to authenticate the geographical origins of A. tsaoko seeds sourced from distinct and narrow geographical regions.
Methods: Near-infrared spectroscopy (NIRS) combined with machine learning (ML) techniques was used to determine the specific geographical origins of A. tsaoko seeds.
Results: Fourier transform near-infrared spectroscopy (FT-NIR) followed by a multi-layer perceptron (MLP) was the optimal strategy among all methods tested, achieving an accuracy of 96.97%. Additionally, feature dimensionality reduction was performed using the CatBoost model, which identified the spectral ranges containing the features most important to the model.
Conclusion: This study indicates that pretreatment of NIRS raw data and the use of ML are potential strategies for rapid and specific geographic authentication of plants.
1 Introduction
Amomum tsaoko Crevost et Lemaire is a perennial herb belonging to the Zingiberaceae family. Its dried ripe fruits and seeds, known as black cardamom or Caoguo, are popular food additives and spices in Eastern and Southeastern Asia, particularly in China, Korea, Japan, and Indonesia. In traditional Chinese medicine, these plant parts are used to treat several ailments, including cold-dampness obstruction, epigastric pain and abdominal distension, stuffiness and fullness, vomiting, malaria with cold and fever, and febrile pestilence (The State Pharmacopoeia Commission, 2015; Zhang et al., 2023; Li et al., 2025). Recent pharmacological studies revealed a spectrum of bioactivities for A. tsaoko. These include anti-inflammatory properties against insects, antitumor effects in liver cancer cells, anti-angiogenesis effects in ovarian cancer, and constipation-relieving properties (Yang et al., 2008; Kim et al., 2019; Chen et al., 2020; Hu et al., 2023; Ying, 2023). These properties underscore the importance of A. tsaoko as a valuable crop serving both culinary and medicinal purposes.
However, increasing demand, coupled with declining wild populations, has heightened concerns regarding the authenticity and geographical origin of A. tsaoko. Approximately 90% of A. tsaoko is produced in Yunnan province, China (Ma et al., 2017). In this region, variations in climate, topography, and ecology significantly influence fruit morphology, essential oil composition, and concentration (Cui et al., 2017; Wei et al., 2019; Li et al., 2021a). Regional variations in A. tsaoko pose a considerable challenge for maintaining the herb’s quality and consistency. Consequently, authenticating the geographical origin of A. tsaoko is essential for ensuring quality control and providing consumers with genuine products.
Previous studies have explored methods for authenticating A. tsaoko. For example, gas chromatography-mass spectrometry (GC–MS) (Qin et al., 2021) and molecular marker techniques, including simple sequence repeats (SSR) and expressed sequence tag-simple sequence repeats (EST-SSR) (Ma et al., 2022), have been applied. However, these studies have primarily focused on distinguishing between species, such as A. tsaoko, A. paratsaoko, and other Zingiberaceae plants, rather than authenticating specific geographical origins. Furthermore, existing methods for geographical tracing typically operate at a coarse resolution (for example, discriminating between provinces or cities) and lack the precision needed to differentiate among more localized growing regions (Liu et al., 2021a). In a previous study, our group developed a multi-element fingerprinting method based on absolute quantification of elements in A. tsaoko seeds, which enabled the geographical authentication of A. tsaoko seed samples (Liu et al., 2023). While this approach demonstrated strong performance, it faced several limitations, such as high cost, complex sample preparation, reliance on specialized instrumentation, low throughput, and potential sample destruction. These limitations underscore the need for a more efficient and accessible method for geographical authentication. To contextualize the novelty of the current study, a comparative summary of prior research on A. tsaoko authentication is provided in Supplementary Table 1.
NIRS was established in the 1960s for cereal analysis (Agelet et al., 2010). Currently, NIRS is widely used for the authentication of geographical origin (Liu et al., 2021b; Nguyen Minh et al., 2022; Schütz et al., 2022; Santos-Rivera et al., 2024). For instance, it has been successfully used to distinguish varieties of millet sourced from different regions of China (Kabir et al., 2021). It offers several advantages such as low cost, rapid operation, high throughput, and non-destructive testing. These advantages make it an ideal solution for quality control and geographical authentication of A. tsaoko seeds. To date, NIRS has been effectively utilized to identify the drying temperatures of A. tsaoko (He et al., 2023).
In this study, we propose a protocol that integrates Fourier Transform Near-infrared spectroscopy (FT-NIR) with machine learning techniques to precisely identify and trace the narrow geographical origin of A. tsaoko seeds. Our primary objective was to establish an accurate and interpretable model for geographical authentication. To achieve this, we leveraged explainable artificial intelligence for optimal feature selection and conducted a comprehensive comparison of multiple machine learning algorithms to identify the most effective classifier.
2 Materials and methods
2.1 Sample information
The plant material used in this study was the same as previously reported (Liu et al., 2023). Briefly, in autumn 2018, A. tsaoko fruits were collected from 12 populations. The botanical identity was confirmed by Professor Yaowen Yang. The geographical origin for each sample was not determined by expert opinion but was an objective, ground-truthed fact based on its precise GPS-recorded collection location (Figure 1). This verifiable geographical label served as the definitive benchmark for our classification model. The geographical separation among these populations was relatively small. For example, the shortest inter-population distance was 22.9 km, occurring between Tengchong (TC) and Yingjiang (YJ) sites (Figure 1). The voucher specimens were deposited at the Yunnan University of Chinese Medicine Museum. The morphological characteristics of the individual fruit and powdered seed samples from all 12 geographical origins are visually documented in Supplementary Figure 1. Three fruit pools were created from six or more randomly selected plants in each population. The 36 fruit samples were then dried until their mass stabilized. The seeds were extracted from the fruits, then ground and sifted. The resulting powder materials (50–65 mesh) were stored in sealed sample bottles at 4 °C pending subsequent analysis. Each individual pooled powder sample was subjected to spectral acquisition in triplicate (three technical replicates), resulting in a total dataset of 108 spectra (12 populations × 3 pooled biological replicates × 3 technical replicates).
2.2 FT-NIR spectra acquisition and data pretreatment
FT-NIR spectra were obtained using a Bruker TANGO FT-NIR spectrometer (Bruker, Karlsruhe, Germany) equipped with a diffuse integrating sphere and a sample rotator. The scan ranged from 12,000 to 4,000 cm⁻¹ at a resolution of 8 cm⁻¹. Samples were placed in quartz cups, filling approximately two-thirds of each cup’s volume. Before the sample spectra were collected, a background spectrum was collected in air to exclude H2O and CO2 interference. Each sample underwent 64 scans at 22 °C with a relative humidity of 25%–30%. All measurements were conducted in triplicate, resulting in a total of 108 spectra, which were used as the dataset for this study. Data preprocessing was subsequently conducted using Omnic and TQ software (Thermo Fisher Scientific, MA, USA) prior to model development and validation. The final dataset for machine learning consisted of 108 samples (rows), each characterized by 1900 spectral data points (wavenumbers) acting as features (columns).
2.3 Feature selection
Feature screening was performed using six ML algorithms combined with SHapley Additive exPlanations (SHAP): CatBoost (Dorogush et al., 2018), Decision Tree (DT; Utgoff, 1989), Extra Trees (ET; Sharaff and Gupta, 2019), Light Gradient Boosting Machine (LightGBM; Fan et al., 2019), Random Forest (RF; Tin Kam, 1995), and eXtreme Gradient Boosting (XGBoost; Chen and He, 2015). For each algorithm, the top 20 features ranked by SHAP importance were initially selected. Subsequently, the ten most important features from this subset were retained. This multi-step screening process yielded a final set of ten spectral features, which were used for model construction.
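The two-step ranking described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the attribution matrix is synthetic, the wavenumber axis is hypothetical, and importance is taken as the mean absolute attribution per feature, which is a common way to summarize SHAP values. For a multi-class model the attributions would also carry a class dimension, which is ignored here.

```python
import random

def mean_abs_importance(attributions):
    """Mean absolute attribution per feature (column) over all samples (rows)."""
    n_samples = len(attributions)
    n_features = len(attributions[0])
    return [sum(abs(row[j]) for row in attributions) / n_samples
            for j in range(n_features)]

def top_k(importance, labels, k):
    """Return the k labels with the largest importance, most important first."""
    order = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
    return [labels[j] for j in order[:k]]

# Synthetic stand-in for a SHAP matrix: 108 samples x 50 features.
rng = random.Random(0)
wavenumbers = [4000 + 4 * j for j in range(50)]   # hypothetical axis, cm^-1
informative = {5, 12, 30}                         # columns made important on purpose
attributions = [
    [rng.gauss(0, 5.0 if j in informative else 0.1) for j in range(50)]
    for _ in range(108)
]

importance = mean_abs_importance(attributions)
top20 = top_k(importance, wavenumbers, 20)   # step 1: top 20 by importance
top10 = top20[:10]                           # step 2: keep the ten best of those
print(top10)
```

In practice the attribution matrix would come from a fitted model and a SHAP explainer; only the ranking logic is shown here.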
2.4 Model development and validation
All ML experiments were conducted on a Windows 10 computer using Python (version 3.10) and R (version 4.3.3). To ensure a robust and comprehensive benchmark, a total of 12 classification algorithms were evaluated. This set encompassed a wide range of machine learning paradigms, including Logistic Regression (LR; LaValley, 2008), CatBoost, DT, RF, AdaBoost (Freund and Schapire, 1997), ET, Support Vector Machine (SVM; Patle and Chouhan, 2013), Gaussian Naive Bayes (NB; Webb et al., 2010), K-Nearest Neighbors (KNN; Liao and Vemuri, 2002), Multi-layer Perceptron (MLP; Yang, 2010), XGBoost, and LightGBM. This approach allowed for a systematic comparison to identify the best-performing model for our specific dataset without a priori bias. The key Python packages used in this study, with their versions, were: scikit-learn (1.6.1) for traditional machine learning models and hyperparameter tuning; TensorFlow (2.8.1) and PyTorch (2.6.0+cu118) for deep learning model development (CNN and Transformer); CatBoost (1.2.7), LightGBM (4.6.0), and XGBoost (2.1.4) for gradient boosting algorithms; and SHAP (0.47.2) for feature importance analysis, with additional support from pandas (2.2.3) and NumPy (1.24.0) for data manipulation. To ensure full reproducibility, the complete code and a detailed environment configuration file have been deposited in the GitHub repository (https://github.com/aibiobrain/FT-NIR).
Given the limited sample size, which is common in spectroscopic studies, we employed a rigorous validation strategy to assess model stability and generalizability. Training and validation were conducted using both cross-validation and hold-out methods. We varied the training-to-test set ratio (8:2, 7:3, and 6:4) to examine the sensitivity of model performance to different data partitions. This practice helps ensure that the reported performance is not an artifact of a single, fortunate data split. To further verify statistical robustness, we repeated the experiments with 10 different random seeds for each splitting ratio. Hyperparameter tuning was conducted for all models using randomized search over hyperparameter grids, as implemented in the scikit-learn library.
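The repeated hold-out design (three split ratios, ten seeds each) can be illustrated with a short standard-library sketch. Everything here is an assumption for illustration: the data are synthetic, the split is a plain shuffled split whose test size is rounded up, and a nearest-centroid rule stands in for the real classifiers. It shows only how 30 split-and-evaluate runs are organized and summarized.

```python
import math
import random
import statistics

def holdout_split(n_samples, test_ratio, seed):
    """Shuffle sample indices with a fixed seed and cut off the test share."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_ratio * n_samples)  # assumption: test size rounded up
    return indices[n_test:], indices[:n_test]

def nearest_centroid_accuracy(X, y, train_idx, test_idx):
    """Placeholder classifier: assign each test sample to the closest class mean."""
    grouped = {}
    for i in train_idx:
        grouped.setdefault(y[i], []).append(X[i])
    centroids = {c: [sum(col) / len(rows) for col in zip(*rows)]
                 for c, rows in grouped.items()}

    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

    hits = sum(predict(X[i]) == y[i] for i in test_idx)
    return hits / len(test_idx)

# Synthetic stand-in data: 12 classes x 9 spectra, 10 selected features each.
rng = random.Random(0)
y = [c for c in range(12) for _ in range(9)]
X = [[c + rng.gauss(0, 0.2) for _ in range(10)] for c in y]

results = {}
for ratio_name, test_ratio in [("8:2", 0.2), ("7:3", 0.3), ("6:4", 0.4)]:
    accuracies = []
    for seed in range(10):  # ten different random seeds per ratio
        train_idx, test_idx = holdout_split(len(X), test_ratio, seed)
        accuracies.append(nearest_centroid_accuracy(X, y, train_idx, test_idx))
    results[ratio_name] = accuracies
    print(ratio_name, round(statistics.mean(accuracies), 3),
          round(statistics.stdev(accuracies), 3))
```

Reporting the mean and spread over seeds, rather than a single split, is what the accuracy boxplots summarize.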
We also compared two deep learning models, a Convolutional Neural Network (CNN) and a Transformer network, using a categorical cross-entropy loss function and accuracy as the evaluation metric. The detailed architectures of these models are provided in Supplementary Figures 2 and 3. In the CNN architecture, the input 10-dimensional feature vector was processed through three sequential 1D convolution layers, with channel numbers optimized over (32, 64, 128), (64, 128, 256), and (128, 256, 512). A max pooling layer was used for dimensionality reduction, with batch normalization and dropout layers incorporated to prevent overfitting. The classification was performed by a fully connected output layer with 12 neurons and a Softmax activation function. In the Transformer architecture, the input features were first converted into a 64-, 128-, or 256-dimensional embedding. Feature modeling was then performed through 2, 3, or 4 Transformer blocks, each employing a multi-head attention mechanism (2, 4, or 8 heads) to capture feature correlations. Each attention layer was followed by a feedforward network, with layer normalization and residual connections to stabilize training. Finally, global average pooling was applied before the classification layer to output the results.
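To make the output layer and loss concrete, the sketch below computes a 12-way Softmax and the categorical cross-entropy for a single sample. It is a plain-Python illustration with made-up logits, not the TensorFlow or PyTorch code used in the study.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])

N_CLASSES = 12  # one output neuron per geographical origin

# Hypothetical logits: an uninformed network and one that favours class 3.
uninformed = softmax([0.0] * N_CLASSES)
confident = softmax([5.0 if c == 3 else 0.0 for c in range(N_CLASSES)])

loss_uninformed = categorical_cross_entropy(uninformed, true_class=3)
loss_confident = categorical_cross_entropy(confident, true_class=3)
print(round(loss_uninformed, 4), round(loss_confident, 4))
```

A network that guesses uniformly over 12 origins incurs a loss of ln(12), which is a useful reference point when reading training loss curves.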
2.5 Model evaluation
To comprehensively assess the models’ predictive ability, a set of core evaluation metrics was adopted: accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (AUC). These metrics capture different aspects of model performance, and their combined use prevents one-sided judgments caused by relying on a single metric.
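As an illustration of how these metrics relate to one another, the following sketch derives accuracy and macro-averaged precision, recall, and F1-score from a confusion matrix. The labels are a toy three-class example, and macro averaging is an assumption, since the averaging scheme is not stated in the text; AUC is omitted because it requires predicted probabilities rather than hard labels.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def macro_metrics(matrix):
    """Accuracy plus macro-averaged precision, recall and F1 (assumed averaging)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[c][c] for c in range(n)) / total
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = matrix[c][c]
        predicted = sum(matrix[r][c] for r in range(n))  # column sum
        actual = sum(matrix[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Toy labels for three classes (not data from the study).
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 0]
metrics = macro_metrics(confusion_matrix(y_true, y_pred, 3))
print(metrics)
```

In this toy case precision and recall differ per class even though overall accuracy is a single number, which is the reason for reporting them together.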
2.6 External validation and robustness assessment
To further assess the model’s generalization ability beyond the single-season dataset and to address concerns regarding the limited sample size, we performed external validation using data augmentation techniques. Given the high-dimensional nature of the raw spectral data (>100 dimensions) and the small number of samples per class, direct application of oversampling algorithms could be unreliable. Therefore, we first applied our established feature selection procedure (Section 2.3) to reduce the dimensionality to the top 10 most important features. We then employed the Synthetic Minority Oversampling Technique (SMOTE) for external validation. The procedure was as follows:
1. Model training: The optimal Multi-Layer Perceptron (MLP) model was first trained on the entire original dataset (108 samples) using the previously identified top 10 features and the best-performing hyperparameters.
2. Synthetic dataset generation: Using the imbalanced-learn library (v0.10.1) in Python, the external validation set was synthetically oversampled to create a balanced dataset with 18 samples per class, resulting in a total of 216 synthetic samples across 12 classes. This process allowed us to simulate a scenario with a larger, unseen validation set.
3. Generalization assessment: The pre-trained MLP model (from Step 1) was then used to predict the geographical origins of the synthetically generated samples in the augmented external validation set.
For comparison, we also evaluated two alternative resampling methods: random oversampling (simple duplication) and bootstrapping (sampling with replacement). The performance of the optimal classifier (MLP) was then evaluated on these augmented datasets to assess robustness and to verify that the synthetic samples maintained the statistical characteristics of the original data.
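The core idea of SMOTE, generating a new sample by interpolating between an existing sample and one of its nearest same-class neighbours, can be sketched without any external library. This is a simplified illustration on synthetic data, not the imbalanced-learn implementation used in the study; the neighbour count `k=5` and the helper name `smote_class` are assumptions.

```python
import random

def smote_class(samples, n_new, k, rng):
    """Create n_new synthetic points for one class by neighbour interpolation."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # k nearest same-class neighbours of the base sample (excluding itself)
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(s, base)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Synthetic stand-in data: 12 classes x 9 samples, 10 selected features each.
rng = random.Random(0)
original = {c: [[c + rng.gauss(0, 0.1) for _ in range(10)] for _ in range(9)]
            for c in range(12)}

# 18 synthetic samples per class, 216 in total, mirroring the sizes in the text.
synthetic = {c: smote_class(rows, n_new=18, k=5, rng=rng)
             for c, rows in original.items()}
print(sum(len(rows) for rows in synthetic.values()))
```

Because every synthetic point is a convex combination of two real points from the same class, it always lies within the range already spanned by that class, which is worth keeping in mind when interpreting accuracy on such data.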
2.7 GC-MS analysis of the essential oil
The GC-MS method was reported in our previous paper (Li et al., 2021a). Briefly, the GC-MS analysis of the essential oil was carried out using an Agilent 9000 GC system coupled with an Agilent 5977B MSD. The sample was introduced by split injection at a pressure of 25 psi and 280 °C, with an injection volume of 1 μL. Compounds were separated on an HP-5MS capillary column (30 m × 0.25 mm × 0.25 μm film thickness). GC-MS data were obtained at split ratios of 44:1 and 8:1. The carrier gas was helium at a flow rate of 1.0 mL/min. MS detection was performed in electron impact mode at 70 eV. The temperatures of the MS transfer line, quadrupole, and ion source were set at 280 °C, 150 °C, and 230 °C, respectively. The full scan m/z range was 15–300 Da. Samples were run in random order, and one QC injection was added after every seven sample injections to ensure data stability.
3 Results and discussion
3.1 FT-NIR spectra
The raw spectra of A. tsaoko seeds from the twelve populations are shown in Figure 2. Despite the substantial overlap and similarity in the raw spectra, differences in peak values were observed across regions (Figure 2). Notably, even within the wavenumber range of 7740 to 9920 cm⁻¹, where spectral variations are subtle and not easily distinguished by visual assessment, consistent variations are still present. These subtle but consistent spectral differences enable the use of ML to distinguish the regional origins of A. tsaoko seeds: the high-dimensional absorbance data provide a sufficient number of informative features to classify the twelve geographical origins.
3.2 Feature selection and model performance
Model accuracy improved rapidly with the inclusion of top-ranked features but plateaued after approximately 10 features, indicating that additional inputs provided minimal improvement (Figure 3A). We also compared the top 20 most important features across the six ML models and observed notable differences in the characteristic wavenumbers prioritized by each algorithm (Figure 3B). Based on AUC performance, CatBoost was selected for feature filtering, and the top 10 features it identified were used in the subsequent analysis (Figure 3C). Taking the most important feature, at 5972 cm⁻¹, as an example, low feature values (shown in blue) and high feature values (shown in red) are separated on opposite sides of the reference line, indicating that the absorbance at this wavenumber differs between sample types. This distinction allows for effective classification of the samples.
Figure 3. (A) Feature filtering by the different methods, (B) UpSet plot of overlapping features among the different methods, (C) SHAP values of the optimal model, CatBoost, with feature importance plots for the top 20 seed profiles.
To evaluate the performance of different machine learning models in classifying the geographical origins of A. tsaoko seeds, hyperparameters were optimized using randomized search in scikit-learn. A total of 12 ML models were trained and evaluated using a comprehensive set of performance metrics. The models were evaluated under both 3-fold cross-validation and hold-out validation schemes, with training-to-test set ratios of 8:2, 7:3, and 6:4. The resulting metrics, including accuracy, precision, recall, and F1-score for the training set, validation set, and cross-validation average, are summarized in heatmaps (Figures 4A–C). The top three models with the highest accuracy under each data split are further visualized (Figures 4D–F). Among all configurations, the MLP model achieved the highest identification accuracy of 96.97% under the 7:3 split ratio (Figure 4E), and was selected as the optimal algorithm for this geographical traceability task.
Figure 4. Comparative performance evaluation of 12 machine learning algorithms under different dataset splitting schemes. (A–C) Heatmaps of core evaluation metrics (including AUC, accuracy, precision, recall, and F1-score) for all models under (A) 8:2, (B) 7:3, and (C) 6:4 training-to-test set split ratios. (D–F) Bar plots showing the prediction accuracy of the top three performing models under the (D) 8:2, (E) 7:3, and (F) 6:4 split ratios, respectively.
The following results refer to the 7:3 split of training and test sets. Receiver operating characteristic (ROC) curves generated for the 12 ML algorithms across the training, test, and cross-validation datasets indicated that MLP achieved the best performance (Figures 5–7). The confusion matrices for the individual classes likewise indicated that MLP was the best algorithm (Figures 8–10). To assess the robustness and stability of all models, we repeated the training process using 10 different random seeds for each data splitting ratio. The results were summarized using accuracy boxplots (Figures 11A–C). Within this assessment, the MLP model exhibited the best and most stable performance under the 7:3 splitting scheme, displaying minimal variance in accuracy across iterations (Figure 11B).
Figure 5. The ROC curve of the training set with 12 machine learning algorithms. The algorithms, in order, are: Logistic Regression (LR), CatBoost, Decision Tree (DT), Random Forest (RF), AdaBoost, Extra Trees (ET), Support Vector Machine (SVM), Gaussian Naive Bayes (NB), K-Nearest Neighbors (KNN), Multi-layer Perceptron (MLP), XGBoost, and LightGBM.
Figure 6. The ROC curve of the test set with 12 machine learning algorithms. The algorithms, in order, are: Logistic Regression (LR), CatBoost, Decision Tree (DT), Random Forest (RF), AdaBoost, Extra Trees (ET), Support Vector Machine (SVM), Gaussian Naive Bayes (NB), K-Nearest Neighbors (KNN), Multi-layer Perceptron (MLP), XGBoost, and LightGBM.
Figure 7. The ROC curve of the cross-validation with 12 machine learning algorithms. The algorithms, in order, are: Logistic Regression (LR), CatBoost, Decision Tree (DT), Random Forest (RF), AdaBoost, Extra Trees (ET), Support Vector Machine (SVM), Gaussian Naive Bayes (NB), K-Nearest Neighbors (KNN), Multi-layer Perceptron (MLP), XGBoost, and LightGBM.
Figure 8. The confusion matrix of the training set with 12 machine learning algorithms. The algorithms, in order, are: Logistic Regression (LR), CatBoost, Decision Tree (DT), Random Forest (RF), AdaBoost, Extra Trees (ET), Support Vector Machine (SVM), Gaussian Naive Bayes (NB), K-Nearest Neighbors (KNN), Multi-layer Perceptron (MLP), XGBoost, and LightGBM.
Figure 9. The confusion matrix of the test set with 12 machine learning algorithms. The algorithms, in order, are: Logistic Regression (LR), CatBoost, Decision Tree (DT), Random Forest (RF), AdaBoost, Extra Trees (ET), Support Vector Machine (SVM), Gaussian Naive Bayes (NB), K-Nearest Neighbors (KNN), Multi-layer Perceptron (MLP), XGBoost, and LightGBM.
Figure 10. The confusion matrix of the cross-validation with 12 machine learning algorithms. The algorithms, in order, are: Logistic Regression (LR), CatBoost, Decision Tree (DT), Random Forest (RF), AdaBoost, Extra Trees (ET), Support Vector Machine (SVM), Gaussian Naive Bayes (NB), K-Nearest Neighbors (KNN), Multi-layer Perceptron (MLP), XGBoost, and LightGBM.
Figure 11. Model robustness assessment and external validation. (A–C) Box plots of classification accuracy across 10 independent runs with different random seeds, demonstrating model stability under (A) 8:2, (B) 7:3, and (C) 6:4 training-test split ratios. (D) Confusion matrix of the optimal Multi-layer Perceptron (MLP) model, annotated with class-specific precision and recall values. (E) Prediction accuracy of the MLP model on the externally augmented validation set generated via the Synthetic Minority Oversampling Technique (SMOTE). (F) Confusion matrix illustrating the model’s performance on the SMOTE-augmented external validation set. (G, H) Prediction accuracy of the MLP model on the externally augmented validation set generated via the bootstrap oversampling data and random oversampling data.
To further explore the potential of automatic feature extraction, we implemented and compared three deep learning architectures: a CNN, a Transformer, and a hybrid CNN-Transformer model. The training dynamics, as reflected in the loss and accuracy curves (Figures 12A, B), showed that the CNN model achieved the most stable convergence and the best performance among the deep learning approaches. On the independent test set, the CNN model substantially outperformed the others, achieving an identification accuracy of 90.9% (Figures 12C, D). In contrast, the Transformer model struggled with this task, attaining an accuracy of only 69.7%. The hybrid CNN-Transformer architecture, designed to leverage both local feature extraction and global contextual modeling, achieved an intermediate accuracy of 84.8%. This result indicates that while integrating Transformer modules provides a benefit over the standalone Transformer, the hybrid still falls short of the simpler CNN model on our dataset. Ultimately, the top-performing MLP model, which achieved 96.97% accuracy, surpassed even the best deep learning model, underscoring the efficacy of our feature selection strategy combined with traditional machine learning for this specific task.
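The reported percentages can be read as counts of correctly classified test spectra. The sketch below assumes that all four accuracies come from the same 7:3 hold-out test set and that its size is 30% of 108 rounded up (33 spectra); both points are assumptions, not statements from the text. Under them, each reported value corresponds to exactly one integer count.

```python
import math

N_SPECTRA = 108
n_test = math.ceil(0.3 * N_SPECTRA)  # assumption: 7:3 split, test size rounded up

def matching_counts(reported_percent, n, decimals):
    """All counts k of correct predictions out of n that round to the reported value."""
    return [k for k in range(n + 1)
            if round(100 * k / n, decimals) == reported_percent]

# (reported accuracy in %, number of decimals given in the text)
reported = {
    "MLP": (96.97, 2),
    "CNN": (90.9, 1),
    "CNN-Transformer": (84.8, 1),
    "Transformer": (69.7, 1),
}

counts = {name: matching_counts(pct, n_test, dec)
          for name, (pct, dec) in reported.items()}
for name, ks in counts.items():
    print(f"{name}: {ks} correct out of {n_test}")
```

Seen this way, the gap between the MLP and the CNN amounts to two test spectra, which is a useful reminder of how coarse accuracy differences are on a test set of this size.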
Figure 12. Comparative performance analysis between CNN, Transformer and hybrid models. (A) Training loss curves of the CNN, Transformer and hybrid models over 100 epochs. (B) Training accuracy curves throughout the training process. (C) Quantitative evaluation of model performance on the independent test set. (D) Confusion matrix of the CNN model on the test set, detailing the per-class classification results (True Class vs. Predicted Class).
To promote practical application, the final MLP model was implemented as a simplified prediction tool. This tool takes the top 10 spectral features of A. tsaoko seeds (5972, 6000, 11208, 11504, 5268, 11336, 9076, 8956, 8496, and 5260 cm⁻¹) as input, enabling rapid, non-invasive prediction of the geographical origin of A. tsaoko. The web application of this AI model is accessible online at https://frp-act.com:44782/.
3.3 External model validation
To address potential concerns about model generalizability given the single-season dataset, we conducted an external validation exercise using synthetic data augmentation on the optimal 10-feature subset. After dimensionality reduction, SMOTE was applied to generate an expanded validation set. The distributions of the synthetic samples in the reduced feature space were visually and statistically consistent with the original data, indicating that no significant distortion was introduced (Supplementary Figures 4–6). We compared the model’s performance on datasets augmented via SMOTE, random oversampling, and bootstrapping. The optimal MLP model maintained high and comparable accuracy across all three augmented validation sets (Figures 11E–H), with the SMOTE-augmented set yielding an accuracy of 94.4%. This result suggests that, within the carefully selected, low-dimensional feature space, SMOTE provided a valid mechanism for robustness assessment without compromising data reliability.
3.4 Correlation between key spectral features and metabolites
Due to the overlapping nature of bands within each signal, it is not possible to attribute a distinct wavenumber to a single substance in the metabolome (Wang et al., 2020). Nevertheless, the vibrations of chemical bonds still provide abundant spectral information. The FT-NIR spectra reveal absorption peaks commonly associated with various molecular vibrations (Table 1). Putative assignments based on common spectral libraries suggest that the range of 4250–4350 cm⁻¹ may correspond to combination bands of C-H and C-C stretching vibrations, which are present in various terpenes. The spectral region of 5000–5200 cm⁻¹ is often associated with the first overtones of C=O stretching vibrations, while absorptions between 5300 and 6000 cm⁻¹ are typically attributed to the overtones of C-H and CH2 stretching vibrations. The broad feature around 6500–7300 cm⁻¹ can be assigned to the first overtone of O-H stretching vibrations. The observed spectral differences across origins suggest underlying variations in chemical composition, thereby supporting the feasibility of FT-NIR-based geographical authentication.
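The putative assignments above can be expressed as a small lookup that maps a wavenumber to the region it falls in. This is only an illustrative sketch: the four ranges are those quoted in the paragraph, treating the range limits as inclusive is an assumption, and the feature list is the set of top 10 wavenumbers reported for the final model.

```python
# (lower bound, upper bound, putative assignment), in cm^-1, as quoted in the text
BAND_ASSIGNMENTS = [
    (4250, 4350, "C-H and C-C combination bands"),
    (5000, 5200, "first overtone of C=O stretching"),
    (5300, 6000, "overtones of C-H and CH2 stretching"),
    (6500, 7300, "first overtone of O-H stretching"),
]

def assign_band(wavenumber):
    """Return the putative assignment for a wavenumber, or None if outside all ranges."""
    for lower, upper, label in BAND_ASSIGNMENTS:
        if lower <= wavenumber <= upper:  # assumption: limits are inclusive
            return label
    return None

TOP_FEATURES = [5972, 6000, 11208, 11504, 5268, 11336, 9076, 8956, 8496, 5260]

for wn in TOP_FEATURES:
    print(wn, assign_band(wn) or "not covered by the listed ranges")
```

With these four ranges, only 5972 and 6000 cm⁻¹ receive an assignment (C-H and CH2 overtones); the remaining selected features fall outside the regions listed in this paragraph.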
To enhance the biological interpretability of the important wavenumbers identified by the SHAP analysis, we investigated their potential correlations with known metabolites. Critical volatile esters were measured by GC-MS. We then performed a correlation analysis between the absorbance values of the top 10 spectral features (5972, 6000, 11208, 11504, 5268, 11336, 9076, 8956, 8496, and 5260 cm⁻¹) and the concentrations of a series of flavonoids and volatile aroma compounds obtained from our previous metabolomic studies on the same set of A. tsaoko samples (Li et al., 2021b; Wang et al., 2021).
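A feature-versus-metabolite comparison of this kind can be sketched as a Pearson correlation with a least-squares regression line. The numbers below are hypothetical placeholders, not measurements from the study, and Pearson correlation is used here only as a simple stand-in for whichever correlation statistic was applied.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical values: absorbance at one wavenumber vs. one ester concentration.
absorbance = [0.41, 0.43, 0.46, 0.48, 0.52, 0.55]
concentration = [1.2, 1.5, 1.9, 2.0, 2.6, 2.9]

r = pearson_r(absorbance, concentration)
slope, intercept = fit_line(absorbance, concentration)
print(round(r, 3), round(slope, 2), round(intercept, 2))
```

Repeating this for every feature and metabolite pair yields the kind of matrix that a correlation heatmap displays.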
The results revealed that the key spectral features were broadly correlated with a diverse range of flavonoids (Figure 13A). More specifically, the important wavenumber 5972 cm⁻¹ showed strong positive correlations with several critical volatile esters, including (E)-2-decenyl acetate, (E)-2-dodecenyl acetate, and geranyl acetate (Figures 13B–D). These compounds are well-characterized as key aroma constituents that define the distinctive flavor profile of A. tsaoko (Li et al., 2021b; Wang et al., 2021).
Figure 13. (A) Heatmap of Mantel correlation coefficients between the top 10 most important infrared wavenumbers and selected flavor metabolites. (B–D) Scatter plots with fitted regression lines illustrating the significant correlations between the absorbance at 5972 cm⁻¹ and three key acetate esters: (B) (E)-2-decenyl acetate, (C) (E)-2-dodecenyl acetate, and (D) geranyl acetate. The strong correlations suggest the potential of using the 5972 cm⁻¹ feature as a key marker for predicting the concentrations of these flavor compounds.
This finding provides a plausible biochemical basis for the high discriminatory power of the 5972 cm⁻¹ feature. It suggests that our FT-NIR model, guided by machine learning, is effectively capturing intrinsic chemical variations in important flavor-related metabolites, which are themselves influenced by geographical growing conditions. This linkage strengthens the scientific value of our spectral authentication model by connecting spectral patterns to tangible quality attributes.
4 Discussion
The quality of plant-derived products often varies across different regions (Dhami and Mishra, 2015; Jangra et al., 2021; Nolden and Forde, 2023; Benomar et al., 2025). Such variation typically stems from a complex interplay of factors, including genetic background, ecological conditions, and human cultivation practices (Wang et al., 2023; Liu et al., 2024; Hu et al., 2025). Due to these inherent differences among plants and cultivation environments, geographic authentication and traceability (GAT) of plant origin can be challenging. However, market globalization and improved transportation have increased the value of GAT to consumers, governments, and manufacturers, particularly for plants from specific, narrow geographic regions (Khayatan et al., 2024). In this study, we used FT-NIR spectroscopy combined with ML to address this challenge by accurately distinguishing A. tsaoko seeds from distinct origins within Yunnan Province. Our approach achieved identification accuracies of up to 96.97%, successfully classifying samples from origins separated by distances as short as 22.9 km. Such fine-scale discrimination has been predominantly studied in high-value products like olive oil (Wang et al., 2023) and wine (Rodríguez-Bencomo et al., 2011; Styger et al., 2011). Its application to A. tsaoko is novel and can serve as a paradigm extendable to other species, thereby advancing quality control and authenticity verification of regional agricultural products.
Previous research used combined NIR and ultraviolet-visible (UV-Vis) spectroscopy to identify A. tsaoko fruits from five distinct geographical regions within Yunnan province (Liu et al., 2021a). However, that study employed only basic analytic methods, such as principal component analysis (PCA) and partial least squares discriminant analysis (PLS-DA), which have limited ability to generalize across regions and sample types. Furthermore, it did not fully address the complex variations in topography, climate, and ecology that exist within Yunnan province. Our study expanded the number of sampled regions to 12 and applied more accurate ML models to achieve better geographic authentication.
ML comprises algorithms that extract previously unknown features from large datasets for prediction or classification. As an analytical technique, it is useful primarily for finding relationships between inputs and outputs in sample data (Kabir et al., 2021). The results of this study align with the current trajectory in food and agricultural science, where the integration of NIRS with ML is rapidly establishing itself as a versatile and powerful paradigm for non-destructive analysis (Zhang et al., 2022; Miao et al., 2025). This is demonstrated by a surge of recent research across diverse applications. For instance, in the context of cereal quality and composition, studies have deployed this combination for rapid protein detection in rice through Raman-NIR fusion (Wang et al., 2023), non-destructive assessment of moisture and fatty acids in rice via hyperspectral imaging (Song et al., 2023), and nutrient quantification in sorghum (Wu et al., 2025). Beyond staple grains, the technique has also proven effective for quality control and defect identification in tubers. Notable examples include the identification of internal defects in potatoes (Semyalo et al., 2024) and the online inspection of blackheart using interpretable deep learning models (Guo et al., 2024). Furthermore, NIR applications extend to physiological monitoring and origin traceability, ranging from estimating maize leaf water content across growth stages (Ren et al., 2025) to classifying mung beans of different origins (Wu et al., 2023). The convergence of these studies highlights a field moving beyond mere feasibility toward sophisticated, application-specific solutions. Our work on A. tsaoko traceability contributes to this expanding ecosystem of NIR-AI applications.
This study also provided an in-depth comparison of deep learning architectures. The finding that the CNN model outperformed both the Transformer and a hybrid CNN-Transformer architecture offers valuable insights. It suggests that for the task of geographical authentication of A. tsaoko using FT-NIR spectra, the local morphological patterns and short-range dependencies within the spectral data are more discriminative than the long-range, global contextual relationships that Transformers excel at capturing. The underperformance of the more complex hybrid model relative to the standalone CNN can likely be attributed to the limited sample size, which may have hindered the effective training of the increased number of parameters and the learning of meaningful global representations on top of local features. This aligns with the recognized challenge of training deep, complex models on small-scale spectroscopic datasets. Therefore, while hybrid architectures hold theoretical promise, our empirical results indicate that for the present scope, a well-tuned CNN provides a more effective and efficient deep learning solution. Future work, leveraging larger and more diverse multi-seasonal datasets, will be essential to fully unlock the potential of more sophisticated hybrid and ensemble architectures for further improving the robustness and accuracy of geographical traceability models.
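The notion of local, short-range feature extraction can be illustrated with a minimal sketch. The Python snippet below is not the CNN used in this study (its architecture is given in Supplementary Figure 2). It applies a single hand-set three-point kernel, a hypothetical stand-in for a learned filter, to a toy signal, and shows that a convolutional filter responds to a local peak shape wherever that peak occurs along the spectral axis. The function names and the toy data are assumptions made for illustration only.

```python
def conv1d_valid(signal, kernel):
    """Slide the kernel along the signal without padding ('valid' mode).

    This is the cross-correlation used in CNN layers: each output value
    depends only on len(kernel) neighbouring input points.
    """
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def toy_spectrum(peak_index, length=20):
    """Hypothetical flat signal with a single unit-height peak."""
    values = [0.0] * length
    values[peak_index] = 1.0
    return values


# Hand-set peak detector; a trained CNN would learn its own kernel weights.
kernel = [-1.0, 2.0, -1.0]

response_a = conv1d_valid(toy_spectrum(5), kernel)
response_b = conv1d_valid(toy_spectrum(12), kernel)

# Position of the strongest filter response for each toy spectrum.
peak_a = response_a.index(max(response_a))
peak_b = response_b.index(max(response_b))
print(peak_a, peak_b, peak_b - peak_a)
```

In this toy case, moving the input peak by seven points moves the strongest response by the same seven points, which is the sense in which convolutional filters capture local patterns independently of their absolute position.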
It is important to acknowledge the inherent limitations of FT-NIR spectroscopy in pinpointing specific metabolites. The absorption bands in the NIR region arise from broad, overlapping overtones and combination vibrations of fundamental mid-IR modes, primarily involving C–H, O–H, and N–H bonds (Wang et al., 2020). Consequently, while our correlation analysis linked key discriminatory wavenumbers to metabolites quantified by orthogonal methods (GC-MS for volatile compounds), these associations should be interpreted as reflecting general chemical moieties and generating hypotheses, rather than providing definitive identifications of single compounds. The strong correlation between the top SHAP feature (5972 cm⁻¹) and key acetate esters, for instance, plausibly indicates the model’s sensitivity to variations in the C–H bonding environment associated with these quality attributes, without implying exclusivity. This perspective reinforces that the observed spectral differences across origins are suggestive of underlying chemical variations that support authentication, while avoiding overinterpretation of the specific biochemical mechanisms.
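As a worked illustration of why a feature near 5972 cm⁻¹ is usually discussed in terms of C–H overtones, the short Python sketch below converts a wavenumber to a wavelength and estimates the corresponding fundamental under a purely harmonic approximation. The function names are hypothetical, and the harmonic estimate ignores anharmonicity, so it indicates only the approximate mid-IR region involved and does not constitute a band assignment.

```python
def wavenumber_to_nm(wavenumber_cm1):
    """Convert a wavenumber in cm^-1 to a wavelength in nm (lambda = 1e7 / nu)."""
    return 1e7 / wavenumber_cm1


def harmonic_fundamental_estimate(overtone_cm1, multiple=2):
    """Estimate the fundamental from an overtone, assuming a harmonic oscillator.

    The first overtone is taken as exactly twice the fundamental. Real bonds
    are anharmonic, so the true fundamental lies somewhat above this value.
    """
    return overtone_cm1 / multiple


top_feature_cm1 = 5972  # top SHAP feature reported in the text
top_feature_nm = wavenumber_to_nm(top_feature_cm1)
fundamental_cm1 = harmonic_fundamental_estimate(top_feature_cm1)

print(f"{top_feature_cm1} cm^-1 corresponds to about {top_feature_nm:.1f} nm")
print(f"harmonic estimate of the fundamental: {fundamental_cm1:.0f} cm^-1")
```

The estimated fundamental falls close to 3000 cm⁻¹, the mid-IR region generally associated with C–H stretching, which is consistent with interpreting this feature at the level of a bond type and not of a single compound.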
A detailed, class-wise evaluation of our optimal MLP model offers further confidence in its practical utility and reveals insightful nuances (Supplementary Table 2). The model achieved perfect classification (precision, recall, and F1-score of 1.0) for 10 of the 12 geographical origins. This high performance across the vast majority of classes underscores the robustness of the FT-NIR and machine learning approach for fine-scale geographical authentication. The class-wise metrics did, however, identify a specific misclassification between the JP and LS origins. Statistical analysis of the top SHAP-selected spectral features revealed no significant difference between these two classes (Supplementary Figure 7), providing a data-driven explanation for this confusion: the chemical profiles of seeds from JP and LS, as captured by the most discriminatory FT-NIR wavenumbers, are inherently similar. Crucially, since JP and LS are not geographically adjacent, this similarity is unlikely to be due to an environmental continuum and may instead stem from shared but unknown agricultural practices, soil compositions, or genetic backgrounds. This finding is therefore less a failure of the model than a reflection of the underlying chemical similarity between these two distinct origins. It highlights the sensitivity of our method and suggests that distinguishing these two sites might require incorporating additional, orthogonal data (e.g., genetic markers or elemental analysis). Nevertheless, the model’s performance remains high overall, and its ability to clearly delineate origins separated by very short distances (e.g., 22.9 km) is a significant achievement.
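For readers who wish to reproduce a class-wise evaluation of this kind, the following Python sketch computes per-class precision, recall, and F1-score from paired true and predicted labels. It is a minimal illustration using only the standard library, and the toy labels are hypothetical: they mimic a single JP sample assigned to LS and are not the study data reported in Supplementary Table 2.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} from paired label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = Counter()  # samples correctly assigned to their class
    fp = Counter()  # samples wrongly assigned to a class
    fn = Counter()  # samples of a class assigned elsewhere
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    metrics = {}
    for lab in labels:
        predicted = tp[lab] + fp[lab]
        actual = tp[lab] + fn[lab]
        precision = tp[lab] / predicted if predicted else 0.0
        recall = tp[lab] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[lab] = (precision, recall, f1)
    return metrics


# Hypothetical toy labels: three origins, one JP sample predicted as LS.
y_true = ["JP", "JP", "JP", "LS", "LS", "LS", "OTHER", "OTHER", "OTHER"]
y_pred = ["JP", "JP", "LS", "LS", "LS", "LS", "OTHER", "OTHER", "OTHER"]

metrics = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"overall accuracy={accuracy:.4f}")
```

In this toy case the single confusion lowers recall for JP and precision for LS while the third class remains perfect, which mirrors how one pairwise confusion can coexist with perfect scores for all other origins.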
This study has certain limitations that should be considered when interpreting the results. The most significant is limited sample diversity, as our model was developed and validated using samples collected solely during the 2018 harvest season (36 pooled samples: 12 production regions × 3 mixed pools). This single-season dataset does not account for environmental or seasonal variation across growing seasons, which could reduce model generalizability and increase the risk of overfitting. Future studies incorporating samples across multiple seasons and years are essential to validate and further strengthen the robustness of these findings.
Despite this limitation, we indirectly evaluated the reproducibility and transferability of our FT-NIR methodology by comparing our spectral dataset with an independent, publicly available FT-NIR spectral dataset of related Zingiberaceae species (Li et al., 2024). The analysis revealed an exceptionally strong correlation (r > 0.999, Supplementary Figure 8). This finding supports the reproducibility of our spectral acquisition protocol and the consistency of our spectral profiles with previously published measurements. Moreover, previous research (Li et al., 2024) indicates that the first overtone of the C–H stretching vibration associated with –CH2 groups (6000–5400 cm⁻¹) serves as a discriminative feature for classifying A. tsaoko and A. maximum. Consistent with this finding, our analysis identified wavenumbers near 5972 and 6000 cm⁻¹ as among the most informative features for discrimination. This concordance with previously reported spectral regions further reinforces the robustness of our approach. The reproducibility of these results strengthens confidence in the reliability of our data and suggests the potential for transferring our analytical approach. Nevertheless, future validation with larger and more diverse sample sets of A. tsaoko remains an important next step.
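A comparison of this type can be sketched as a Pearson correlation between two spectra sampled on a common wavenumber grid. The Python example below is a minimal illustration under stated assumptions: both spectra are synthetic (Gaussian bands at arbitrary positions and a small deterministic perturbation), the helper names are hypothetical, and the snippet does not reproduce the analysis behind Supplementary Figure 8.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def gaussian_band(nu, center, width, height):
    """Synthetic absorption band; all parameters are arbitrary."""
    return height * math.exp(-0.5 * ((nu - center) / width) ** 2)


# Shared grid from 4000 to 10000 cm^-1 in 10 cm^-1 steps (assumed for illustration).
wavenumbers = [4000 + 10 * i for i in range(601)]

# Synthetic "reference" spectrum: flat baseline plus two bands.
spectrum_a = [
    0.2 + gaussian_band(nu, 5972, 120, 0.5) + gaussian_band(nu, 5200, 150, 0.8)
    for nu in wavenumbers
]

# Second synthetic spectrum: same shape, rescaled and offset, with a small ripple.
spectrum_b = [
    1.05 * a + 0.01 + 0.002 * math.sin(i) for i, a in enumerate(spectrum_a)
]

r = pearson_r(spectrum_a, spectrum_b)
print(f"Pearson r = {r:.5f}")
```

Because Pearson's r is invariant to linear rescaling and constant offsets, a value close to 1 indicates that two spectra share the same overall shape; it does not by itself show that absolute intensities agree.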
5 Conclusion
In summary, this study demonstrates that FT-NIR spectroscopy combined with machine learning and SHAP-based feature selection provides a rapid, non-destructive, and highly effective strategy for the precise geographical authentication of A. tsaoko seeds. By identifying ten critical spectral features and determining the MLP as the optimal model, we achieved a high identification accuracy of 96.97% for distinguishing seeds from 12 narrowly separated geographical origins. While the model’s performance is promising, its generalizability is currently limited by the single-harvest dataset. Nevertheless, this work establishes a powerful and practical framework, implemented via a web application, for the quality control and origin traceability of A. tsaoko and potentially other high-value agricultural products.
Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.
Author contributions
YZ: Data curation, Formal analysis, Visualization, Writing – original draft. SL: Data curation, Formal analysis, Writing – original draft. HH: Formal analysis, Writing – original draft, Writing – review & editing. XH: Data curation, Methodology, Writing – original draft. HW: Formal analysis, Writing – original draft. HC: Data curation, Writing – original draft. XL: Conceptualization, Resources, Writing – original draft. YY: Conceptualization, Resources, Writing – original draft. SJ: Conceptualization, Project administration, Supervision, Writing – original draft. HX: Conceptualization, Project administration, Supervision, Writing – original draft, Writing – review & editing.
Funding
The author(s) declared that financial support was not received for this work and/or its publication.
Acknowledgments
We thank the Home for Researchers editorial team (www.homefor-researchers.com) for language editing services.
Conflict of interest
The authors declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2025.1717851/full#supplementary-material
Supplementary Figure 1 | Morphological characteristics of Amomum tsaoko from 12 different geographical origins. For each pair, the left panel shows the powdered form, and the right panel shows a representative individual fruit.
Supplementary Figure 2 | Architecture of the CNN.
Supplementary Figure 3 | Architecture of the Transformer.
Supplementary Figure 4 | Distribution comparison between original data and SMOTE data.
Supplementary Figure 5 | Distribution comparison between original data and bootstrap oversampling data.
Supplementary Figure 6 | Distribution comparison between original data and random oversampling data.
Supplementary Figure 7 | Statistical analysis of the top SHAP-selected spectral features between the JP and LS origins.
Supplementary Figure 8 | High consistency between FT-NIR spectra from the current study and published data (Li et al., 2024).
Supplementary Table 3 | Raw spectral data.
References
Agelet, L. E. and Hurburgh, C. R. Jr. (2010). A tutorial on near infrared spectroscopy and its calibration. Crit. Rev. Anal Chem. 40, 246–260. doi: 10.1080/10408347.2010.515468
Benomar, A., Salime, G. M., Mabrouki, M., Fekrouni, M., Chahboune, R., and Bouatia, M. (2025). Bibliometric analysis of chemometric research applied to phytomedicine: ensuring quality control of medicinal plants. Vegetos. doi: 10.1007/s42535-025-01406-8
Chen, T. and Guestrin, C. (2015). “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16) (New York, NY, USA: Association for Computing Machinery), 785–794. doi: 10.1145/2939672.2939785
Chen, C., You, F., Wu, F., Luo, Y., Zheng, G., Xu, H., et al. (2020). Antiangiogenesis Efficacy of Ethanol Extract from Amomum tsaoko in Ovarian Cancer through Inducing ER Stress to Suppress p-STAT3/NF-kB/IL-6 and VEGF Loop. Evidence-Based Complement Altern. Med. 2020, 2390125. doi: 10.1155/2020/2390125
Cui, Q., Wang, L.-T., Liu, J.-Z., Wang, H.-M., Guo, N., Gu, C.-B., et al. (2017). Rapid extraction of Amomum tsao-ko essential oil and determination of its chemical composition, antioxidant and antimicrobial activities. J. Chromatogr. B 1061-1062, 364–371. doi: 10.1016/j.jchromb.2017.08.001
Dhami, N. and Mishra, A. D. (2015). Phytochemical variation: How to resolve the quality controversies of herbal medicinal products? J. Herbal Med. 5, 118–127. doi: 10.1016/j.hermed.2015.04.002
Dorogush, A. V., Ershov, V., and Gulin, A. (2018). CatBoost: gradient boosting with categorical features support. ArXiv. doi: 10.48550/arXiv.1810.11363
Fan, J., Ma, X., Wu, L., Zhang, F., Yu, X., and Zeng, W. (2019). Light Gradient Boosting Machine: An efficient soft computing model for estimating daily reference evapotranspiration with local and external meteorological data. Agric. Water Manag 225, 105758. doi: 10.1016/j.agwat.2019.105758
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139. doi: 10.1006/jcss.1997.1504
Guo, Y., Zhang, L., He, Y., Lv, C., Liu, Y., Song, H., et al. (2024). Online inspection of blackheart in potatoes using visible-near infrared spectroscopy and interpretable spectrogram-based modified ResNet modeling. Front. Plant Sci. 15, 1403713. doi: 10.3389/fpls.2024.1403713
He, G., Lin, Q., Yang, S.-B., and Wang, Y.-Z. (2023). A rapid identification based on FT-NIR spectroscopies and machine learning for drying temperatures of Amomum tsao-ko. J. Food Compos Anal. 118, 105199. doi: 10.1016/j.jfca.2023.105199
Hu, Y., Gao, X., Zhao, Y., Liu, S., Luo, K., Fu, X., et al. (2023). Flavonoids in Amomum tsaoko Crevost et Lemarie Ameliorate Loperamide-Induced Constipation in Mice by Regulating Gut Microbiota and Related Metabolites. Int. J. Mol. Sci. 24, 7191. doi: 10.3390/ijms24087191
Hu, H., Huang, W., Chen, Q., Ru, Y., Liu, X., Yang, Y., et al. (2025). Analysis of potential antidiarrheal lipids in Valeriana jatamansi reveals effects of lipophilic active substances on inflammation. Ind. Crops Prod 226, 120595. doi: 10.1016/j.indcrop.2025.120595
Jangra, S., Chaudhary, V., Yadav, R. C., and Yadav, N. R. (2021). High-throughput phenotyping: A platform to accelerate crop improvement. Phenomics 1, 31–53. doi: 10.1007/s43657-020-00007-6
Kabir, M. H., Guindo, M. L., Chen, R., and Liu, F. (2021). Geographic origin discrimination of millet using vis-NIR spectroscopy combined with machine learning techniques. Foods. 10 (11), 2767. doi: 10.3390/foods10112767
Khayatan, D., Nouri, K., Momtaz, S., Roufogalis, B. D., Alidadi, M., Jamialahmadi, T., et al. (2024). Plant-derived fermented products: an interesting concept for human health. Curr. Dev Nutr. 8, 102162. doi: 10.1016/j.cdnut.2024.102162
Kim, J. G., Jang, H., Le, T. P. L., Hong, H. R., Lee, M. K., Hong, J. T., et al. (2019). Pyranoflavanones and pyranochalcones from the fruits of Amomum tsao-ko. J. Natural Prod 82, 1886–1892. doi: 10.1021/acs.jnatprod.9b00155
LaValley, M. P. (2008). Logistic regression. Circulation 117, 2395–2399. doi: 10.1161/CIRCULATIONAHA.106.682658
Li, Y., Huang, X., Zhao, X., Zou, Y., Wang, X., He, J., et al. (2025). The refrigerated preservation effect of Amomum tsao-ko essential oil and its microcapsules loaded on polyvinyl alcohol/starch composite film based on broad-spectrum antibacterial properties on golden pompanos. LWT 222, 117595. doi: 10.1016/j.lwt.2025.117595
Li, G., Lu, Q., Wang, J., Hu, Q., Liu, P., Yang, Y., et al. (2021a). Correlation analysis of compounds in essential oil of Amomum tsaoko seed and fruit morphological characteristics, geographical conditions, locality of growth. Agronomy-Basel 11, 744. doi: 10.3390/agronomy11040744
Li, G., Lu, Q., Wang, J., Hu, Q., Liu, P., Yang, Y., et al. (2021b). Correlation analysis of compounds in essential oil of Amomum tsaoko seed and fruit morphological characteristics, geographical conditions, locality of growth. Agronomy-Basel 11, 744. doi: 10.3390/agronomy11040744
Li, F., Yang, W., Yang, M., Wang, Y., and Zhang, J. (2024). Differences between two plants fruits: Amomum tsaoko and Amomum maximum, using the SPME-GC–MS and FT-NIR to classification. Arab J. Chem. 17, 105665. doi: 10.1016/j.arabjc.2024.105665
Liao, Y. and Vemuri, V. R. (2002). Use of K-Nearest Neighbor classifier for intrusion detection. Comput. Secur. 21, 439–448. doi: 10.1016/S0167-4048(02)00514-X
Liu, X., Mu, X., Hu, H., Chen, Q., Yang, Y., Tang, H., et al. (2024). Analysis of potential antidiarrheal metabolites in fibrous root, rhizome, and basal leaf samples from Valeriana jatamansi. Ind. Crops Prod 219, 118887. doi: 10.1016/j.indcrop.2024.118887
Liu, X., Mu, X., Peng, L., Liu, J., Lu, Q., Yang, Y., et al. (2023). Multi-element fingerprinting approach for geographical authentication of Amomum tsaoko seed. Ind. Crops Prod 195, 116345. doi: 10.1016/j.indcrop.2023.116345
Liu, Z., Yang, S., Wang, Y., and Zhang, J. (2021a). Multi-platform integration based on NIR and UV–Vis spectroscopies for the geographical traceability of the fruits of Amomum tsao-ko. Spectrochim Acta Part A: Mol. Biomol Spectrosc. 258, 119872. doi: 10.1016/j.saa.2021.119872
Liu, Z., Yang, S., Wang, Y., and Zhang, J. (2021b). Multi-platform integration based on NIR and UV–Vis spectroscopies for the geographical traceability of the fruits of Amomum tsao-ko. Spectrochim Acta Part A: Mol. Biomol Spectrosc. 258, 119872. doi: 10.1016/j.saa.2021.119872
Ma, M., Lei, E., Meng, H., Wang, T., Xie, L., Shen, D., et al. (2017). Cluster and principal component analysis based on SSR markers of Amomum tsao-ko in Jinping County of Yunnan Province. AIP Conf. Proc. 1864 (1), 020070. doi: 10.1063/1.4992887
Ma, M., Meng, H., Lei, E., Wang, T., Zhang, W., and Lu, B. (2022). De novo transcriptome assembly, gene annotation, and EST-SSR marker development of an important medicinal and edible crop, Amomum tsaoko (Zingiberaceae). BMC Plant Biol. 22, 467. doi: 10.1186/s12870-022-03827-y
Miao, Y.-H., Zong, L.-Y., and Su, W.-H. (2025). Machine learning in VIS/NIR spectroscopy and hyperspectral imaging for rapid detection of staple foods quality and safety: A review. J. Food Compos Anal. 148, 108422. doi: 10.1016/j.jfca.2025.108422
Nguyen Minh, Q., Lai, Q. D., Nguy Minh, H., Tran Kieu, M. T., Lam Gia, N., Le, U., et al. (2022). Authenticity green coffee bean species and geographical origin using near-infrared spectroscopy combined with chemometrics. Int. J. Food Sci. Technol. 57, 4507–4517. doi: 10.1111/ijfs.15786
Nolden, A. A. and Forde, C. G. (2023). The nutritional quality of plant-based foods. Sustainability. 15 (4), 3324. doi: 10.3390/su15043324
Patle, A. and Chouhan, D. S. (2013). “SVM kernel functions for classification,” in Proceedings of the 2013 International Conference on Advances in Technology and Engineering (ICATE), 23–25 Jan. (Mumbai, India) 1–9. doi: 10.1109/ICAdTE.2013.6524743
Qin, H., Wang, Y., Yang, W., Yang, S., and Zhang, J. (2021). Comparison of metabolites and variety authentication of Amomum tsao-ko and Amomum paratsao-ko using GC-MS and NIR spectroscopy. Sci. Rep. 11, 15200. doi: 10.1038/s41598-021-94741-0
Ren, Y., Zhang, W., Wang, H., Zhang, Z., Sheng, W., Qiu, R., et al. (2025). Estimation models for maize leaf water content at various stages using near-infrared spectroscopy. Infrared Phys. Technol. 145, 105732. doi: 10.1016/j.infrared.2025.105732
Rodríguez-Bencomo, J. J., Cabrera-Valido, H. M., Pérez-Trujillo, J. P., and Cacho, J. (2011). Bound aroma compounds of Gual and Listán blanco grape varieties and their influence in the elaborated wines. Food Chem. 127, 1153–1162. doi: 10.1016/j.foodchem.2011.01.117
Santos-Rivera, M., Montagnon, C., and Sheibani, F. (2024). Identifying the origin of Yemeni green coffee beans using near infrared spectroscopy: a promising tool for traceability and sustainability. Sci. Rep. 14, 13342. doi: 10.1038/s41598-024-64074-9
Schütz, D., Riedl, J., Achten, E., and Fischer, M. (2022). Fourier-transform near-infrared spectroscopy as a fast screening tool for the verification of the geographical origin of grain maize (Zea mays L.). Food Contr 136, 108892. doi: 10.1016/j.foodcont.2022.108892
Semyalo, D., Kim, Y., Omia, E., Arief, M. A. A., Kim, H., Sim, E.-Y., et al. (2024). Nondestructive identification of internal potato defects using visible and short-wavelength near-infrared spectral analysis. Agriculture. 14 (11), 2014. doi: 10.3390/agriculture14112014
Sharaff, A. and Gupta, H. (2019). Extra-tree classifier with metaheuristics approach for email classification. Proc. Adv. Comput. Comm Comput. Sci. (Springer, Singapore). 924, 189–197. doi: 10.1007/978-981-13-6861-5_17
Song, Y., Cao, S., Chu, X., Zhou, Y., Xu, Y., Sun, T., et al. (2023). Non-destructive detection of moisture and fatty acid content in rice using hyperspectral imaging and chemometrics. J. Food Compos Anal. 121, 105397. doi: 10.1016/j.jfca.2023.105397
Styger, G., Prior, B., and Bauer, F. F. (2011). Wine flavor and aroma. J. Ind. Microbiol. Biotechnol. 38, 1145–1159. doi: 10.1007/s10295-011-1018-4
Tin Kam, H. (1995). “Random decision forests,” in Proceedings of 3rd International Conference on Document Analysis and Recognition, 14–16 Aug, (Montreal, QC, Canada) Vol. 1, 278–282. doi: 10.1109/ICDAR.1995.598994
The State Pharmacopoeia Commission (2015). Pharmacopoeia of the People’s Republic of China: 2015 Edition (Beijing: China Medical Science Press).
Utgoff, P. E. (1989). Incremental induction of decision trees. Mach. Learn. 4, 161–186. doi: 10.1023/A:1022699900025
Wang, J., Li, Y., Lu, Q., Hu, Q., Liu, P., Yang, Y., et al. (2021). Drying temperature affects essential oil yield and composition of black cardamom (Amomum tsao-ko). Ind. Crops Prod 168, 113580. doi: 10.1016/j.indcrop.2021.113580
Wang, Z., Liu, J., Zeng, C., Bao, C., Li, Z., Zhang, D., et al. (2023). Rapid detection of protein content in rice based on Raman and near-infrared spectroscopy fusion strategy combined with characteristic wavelength selection. Infrared Phys. Technol. 129, 104563. doi: 10.1016/j.infrared.2023.104563
Wang, C.-Y., Tang, L., Li, L., Zhou, Q., Li, Y.-J., Li, J., et al. (2020). Geographic authentication of Eucommia ulmoides leaves using multivariate analysis and preliminary study on the compositional response to environment. Front. Plant Sci. 11, 79. doi: 10.3389/fpls.2020.00079
Wang, X., Yang, Z., Jiang, L., Liu, Z., Dong, X., Sui, M., et al. (2023). Assessment of germplasm resource and detection of genomic signature under artificial selection of Zhikong scallop (Chlamys farreri). Aquaculture 574, 739730. doi: 10.1016/j.aquaculture.2023.739730
Wang, Y., Yu, L., Shehzad, Q., Kong, W., Wu, G., Jin, Q., et al. (2023). A comprehensive comparison of Chinese olive oils from different cultivars and geographical origins. Food Chem: X 18, 100665. doi: 10.1016/j.fochx.2023.100665
Webb, G. I., Keogh, E., and Miikkulainen, R. (2010). Naïve bayes. Encycl Mach. Learn. 15, 713–714. doi: 10.1007/978-0-387-30164-8_576
Wei, Z., Bingyue, L., Hengling, M., Xiang, W., Zhiqing, Y., and Shengchao, Y. (2019). Phenotypic diversity analysis of the fruit of Amomum tsao-ko Crevost et Lemarie, an important medicinal plant in Yunnan, China. Genet. Resour. Crop Evol. 66, 1145–1154. doi: 10.1007/s10722-019-00765-x
Wu, M., Li, Y., Yuan, Y., Li, S., Song, X., and Yin, J. (2023). Comparison of NIR and Raman spectra combined with chemometrics for the classification and quantification of mung beans (Vigna radiata L.) of different origins. Food Contr 145, 109498. doi: 10.1016/j.foodcont.2022.109498
Wu, K., Zhang, Z., He, X., Li, G., Zheng, D., and Li, Z. (2025). Using visible and NIR hyperspectral imaging and machine learning for nondestructive detection of nutrient contents in sorghum. Sci. Rep. 15, 6067. doi: 10.1038/s41598-025-90892-6
Yang, Z. R. (2010). “Multi-layer perceptron,” in Handbook of machine learning. (Tuck Link, Singapore: World Scientific Publishing). doi: 10.1142/9789813271234_0002
Yang, Y., Yan, R.-W., Cai, X.-Q., Zheng, Z.-L., and Zou, G.-L. (2008). Chemical composition and antimicrobial activity of the essential oil of Amomum tsao-ko. J. Sci. Food Agric. 88, 2111–2116. doi: 10.1002/jsfa.3321
Ying, W. (2023). Phenomic studies on diseases: potential and challenges. Phenomics 3, 285–299. doi: 10.1007/s43657-022-00089-4
Zhang, Z., Guan, W., Liang, M., Wang, R., Wu, Y., and Liu, Y. (2023). Characterization of the key odorants in fresh Amomum tsao-ko Crevost et Lemaire fruit by gas chromatography-olfactometry, quantitative analysis and aroma reconstitution. LWT 185, 115154. doi: 10.1016/j.lwt.2023.115154
Keywords: Amomum tsaoko, feature reduction analysis, FT-NIR, geographical authentication, machine learning
Citation: Zheng Y, Lan S, Hu H, Huang X, Wang H, Cao H, Liu X, Yang Y, Ji S and Xie H (2026) Geographic authentication of Amomum tsaoko seeds using fourier transform-near infrared spectroscopy combined with machine learning techniques and feature reduction analysis. Front. Plant Sci. 16:1717851. doi: 10.3389/fpls.2025.1717851
Received: 02 October 2025; Revised: 15 December 2025; Accepted: 25 December 2025; Published: 22 January 2026.
Edited by:
Changkai Wen, China Agricultural University, China
Reviewed by:
Chenghao Fei, Nanjing Agricultural University, China
Agustami Sitorus, National Research and Innovation Agency (BRIN), Indonesia
Juan Liu, Chongqing University of Education, China
Copyright © 2026 Zheng, Lan, Hu, Huang, Wang, Cao, Liu, Yang, Ji and Xie. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Shengguo Ji; Hui Xie
†These authors have contributed equally to this work