
ORIGINAL RESEARCH article

Front. Plant Sci., 06 February 2026

Sec. Sustainable and Intelligent Phytoprotection

Volume 17 - 2026 | https://doi.org/10.3389/fpls.2026.1746869

This article is part of the Research Topic: Non-Destructive Phenotyping from Seeds to Plants: Advancements in Sensing Technologies, Algorithms, and Applications

Near-infrared prediction of tannin content in walnut kernels using wavelet transform combined with interpretable machine learning models

Qiuhao Xia1,2,3†, Langqin Luo2,3,4†, Yerhazi Yerzati1,2,3, Mian Muhammad Ahmed2,3,4, Yonghao Chen5, Shiwei Wang6, Jiangnan Qin1,2, Liping Chen7, Qiang Jin1,2,3, Zhongzhong Guo1,2,3, Rui Zhang1,2,3*
  • 1College of Horticulture and Forestry, Tarim University, Alar, China
  • 2Efficient and High-Quality Cultivation and Deep Processing Technology of Characteristic Fruit Trees in Southern Xinjiang, National Local Joint Engineering Laboratory, Alar, China
  • 3Xinjiang Production and Construction Corps, Southern Xinjiang Characteristic Forest and Fruit Technology Innovation Center, Alar, China
  • 4College of Life Science and Technology, Tarim University, Alar, China
  • 5Beijing Academy of Agricultural and Forestry Sciences Forestry Fruit Tree Research Institute, Beijing, China
  • 6School of Forestry and Landscape Architecture, Xinjiang Agricultural University, Urumqi, China
  • 7School of Information Engineering, Tarim University, Alar, China

Introduction: Tannin content is a key factor influencing the taste of walnuts and serves as an important index for evaluating walnut quality. Rapid and accurate detection of tannin levels in walnut kernels is therefore significant for quality assessment and management. This study aims to develop an efficient method for predicting tannin content in walnut kernels using near-infrared (NIR) spectroscopy combined with machine learning techniques.

Methods: A total of 180 samples of ‘Wen 185’ walnut kernels were used as the research objects. The NIR reflectance spectra of the samples were measured within the range of 4000–10000 cm⁻¹. The spectral data were processed using mathematical transformations and continuous wavelet transform (CWT), both separately and in combination. Pearson correlation analysis was applied to extract characteristic bands related to tannin content. Based on these features, a random forest (RF) model was constructed to quantitatively predict tannin content. Additionally, the SHAP algorithm was employed to interpret and visualize the machine learning model.

Results: The results indicated that within the spectral range of 4000–10000 cm⁻¹, the NIR reflectance of walnut kernels increased with tannin content under different orchard management modes. Both first-order differential transformation and CWT, as well as their combination, significantly enhanced the correlation between spectral data and tannin content. The combination of first-order differential transformation and CWT notably improved the model's prediction performance. The optimal prediction model was achieved using the feature lg’(1/R)_CWT_28, with training set metrics of R² = 0.880, RMSE = 1.188, RPD = 2.904, and validation set metrics of R² = 0.831, RMSE = 1.620, RPD = 2.459.

Discussion: The study demonstrates that combining mathematical transformations with wavelet transform can effectively improve the prediction accuracy of models for tannin content in walnut kernels. The RF model based on processed spectral data showed strong performance, indicating its potential for rapid and non-destructive tannin quantification. The use of SHAP algorithm further enhances model interpretability. These findings provide a valuable reference for the accurate prediction of tannin content in walnut kernels and may support quality control in walnut production and processing.

1 Introduction

Walnut (Juglans regia L.) is a significant woody oil and economic tree species in China, valued for its drought tolerance and the high nutritional quality of its fruit. In particular, the southern Xinjiang region has seen extensive cultivation of walnuts, which have become a crucial component of local economic development and a primary source of income for farmers (Pei and Lu, 2011; Qu, 1980). Currently, however, most walnut products in China are either consumed directly or subjected to minimal processing. The astringency of the fruit notably affects its taste, with tannin compounds being the principal contributors to this astringency (Liu et al., 2023). Walnut tannin is a highly polymerized polyphenolic compound that interacts with salivary proteins, resulting in astringency. Research indicates that walnut tannin possesses antioxidant and antibacterial properties and may also play a role in the prevention of chronic diseases (Xiao et al., 2006; Aoki et al., 2006; Li et al., 2009).

Traditional methods for detecting tannins in walnuts typically include EDTA titration, the phosphomolybdic acid-sodium tungstate colorimetric method, capillary gas chromatography, and potassium ferrous hexachloride (III) spectrophotometry (Huang and Ni, 2022; Cai, 1997; Hernes and Hedges, 2000; Yang and Qu, 1989). However, these techniques are often costly, exhibit low time efficiency, and pose challenges for large-scale rapid detection. Furthermore, the use of chemical reagents during testing can endanger the health of testers, while the disposal of chemical waste can contribute to environmental pollution (Ma et al., 2022).

With the rapid advancement of spectroscopic technologies, near-infrared (NIR) spectroscopy has evolved from a well-established diagnostic method into a continuously innovating tool with expanding applications and significant practical value. NIR spectroscopy enables real-time, non-destructive, and dynamic monitoring of crop quality at specific spatial and temporal scales, offering considerable advantages over traditional chemical or sensory methods. As such, developing a rapid, universal, and efficient approach to predict tannin content in Juglans regia ‘Wen 185’ walnuts grown in southern Xinjiang has become increasingly important for rapid quality assessment and classification.

Zhang et al. (2011) established a calibration model for soluble tannin content in astringent persimmons using visible and near-infrared diffuse reflectance (Vis/NIR) spectroscopy. By applying an improved partial least squares regression (PLSR) algorithm combined with first derivative and scatter correction preprocessing, the model demonstrated superior predictive performance, highlighting the utility of Vis/NIR spectroscopy for internal quality assessment. Similarly, Cheng (2020) employed NIR hyperspectral imaging integrated with chemometric methods, machine learning, and deep learning techniques to rapidly classify wine grape varieties, determine their geographic origins, and predict tannin levels at different maturation stages. Jensen et al. (2008) utilized Fourier transform mid-infrared (FT-MIR) spectroscopy for the rapid quantification of various wine constituents, including tannins. However, due to overlapping spectral responses from other compounds, accurate quantification of tannins remains challenging. Their study explored four variable selection methods to identify key spectral regions relevant to tannin determination using PLSR. In another study, Ying et al. (2006) applied wavelet transform (WT) to denoise NIR spectra of 90 apple samples, exploiting the multiscale differences in the evolution of wavelet modulus maxima between singular signals and random noise, and successfully predicted sugar content via stepwise regression.

NIR spectroscopy has thus been widely applied in predicting tannin content in crops such as grapes, apples, persimmons, and sorghum. However, limited studies have specifically addressed tannin quantification in walnut kernels. Existing research has predominantly focused on optimizing model performance, while relatively little attention has been given to enhancing model interpretability.

In this experiment, spectral information from walnut kernels was collected within the wavenumber range of 4000–10000 cm⁻¹. Various spectral processing methods, including mathematical transformation, wavelet transformation, and their combinations, were investigated to identify the most suitable pretreatment method for detecting tannin content in walnut kernels. Building on this foundation, a prediction model for tannin content was developed using random forests. The SHAP algorithm was applied to ascertain feature significance and facilitate internal model visualization, enabling swift tannin content detection in walnut kernels.

2 Materials and methods

2.1 Plant materials and instrumentation

The experimental material for this study was the ‘Wen 185’ walnut selected from the walnut forest farm in Wensu County, Aksu Prefecture, Xinjiang. A total of 180 walnut samples were collected from 9 walnut orchards with varying management levels: 3 high-yield, 3 medium-yield, and 3 low-yield orchards. The walnut trees in these orchards are spaced 5 m × 6 m apart and are all 10 years old. After harvest at ripeness, the green husks were removed and the walnuts were dried in a well-ventilated environment until their moisture content reached approximately 6%. The walnuts were then shelled, and the kernels were extracted, crushed for 3 minutes using a FW-80 high-speed universal crusher, and thoroughly mixed. The crushed walnut kernels were sealed in plastic bags and stored at 4 °C for subsequent spectral scanning and determination of tannin content.

2.2 Acquisition and processing of raw spectral data from walnut kernels

NIR spectral data were collected using a Fourier-transform near-infrared spectrometer (Antaris II, Thermo Fisher Scientific, USA). The instrument was operated at a resolution of 8 cm⁻¹ with a gain setting of 2, using the built-in background as reference. Each spectrum was obtained as the average of 32 scans. Prior to spectral acquisition, walnut samples were equilibrated under controlled environmental conditions (25 °C and 40% relative humidity) for 24 hours to ensure consistency with the instrument’s ambient environment, thereby minimizing spectral variability.

The instrument was preheated for 60 minutes before measurement. Spectra were recorded over the wavenumber range of 4000–10000 cm⁻¹. Ground walnut kernel powder was uniformly packed into quartz sample cups (30 mm diameter, 5 mm height, 1 mm wall thickness), with the sample surface leveled and aligned with the rim of the cup. Each sample was scanned three times, resulting in a total of 540 spectra for 180 samples. The final representative spectrum for each sample was obtained by averaging its three replicate scans. After each measurement, the sample cups were sequentially rinsed with tap water and distilled water, and then wiped clean with ethanol to prevent cross-contamination.

2.2.1 Outlier detection and removal

Outlier removal was performed using the Monte Carlo method (Meng et al., 2022), which is effective for identifying data points that deviate significantly from the distribution of the dataset. Such outliers may result from instrumental noise, measurement errors, or data entry mistakes. Eliminating these anomalous points is essential to enhance the accuracy and robustness of subsequent model development.

2.2.2 Correlation analysis

Selection of informative spectral features is a critical step in improving the sensitivity of NIR data to tannin content. In the preliminary phase of this study, various feature selection strategies were evaluated, including Pearson correlation analysis. The comparison indicated that spectral bands selected based on a significance threshold of p < 0.01 yielded superior modeling performance. Consequently, Pearson correlation analysis was conducted using MATLAB R2023a (MathWorks, USA) to assess the strength of association between each spectral band and tannin content. Spectral bands exhibiting statistically significant correlations (p < 0.01) were retained for model construction (Mao et al., 2023). The Pearson correlation coefficient (r), ranging from -1 to 1, quantifies the degree of linear association between variables, with larger absolute values indicating stronger correlations.
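The band-selection step described above can be sketched in Python. This is only an illustration under stated assumptions: the study used MATLAB, the function name `select_bands` is hypothetical, and the data below are synthetic stand-ins rather than walnut spectra.

```python
import numpy as np
from scipy import stats

def select_bands(spectra, tannin, alpha=0.01):
    """Return indices of bands with p < alpha, plus r and p for every band."""
    n_bands = spectra.shape[1]
    r = np.empty(n_bands)
    p = np.empty(n_bands)
    for k in range(n_bands):
        # Pearson r and two-sided p-value for band k against tannin content
        r[k], p[k] = stats.pearsonr(spectra[:, k], tannin)
    keep = np.flatnonzero(p < alpha)
    return keep, r, p

# Synthetic stand-in data (assumption): 171 samples, 50 bands,
# with band 10 deliberately made to depend on tannin content.
rng = np.random.default_rng(0)
tannin = rng.uniform(5, 20, size=171)
spectra = rng.normal(size=(171, 50))
spectra[:, 10] += 0.5 * tannin

keep, r, p = select_bands(spectra, tannin)
print("retained bands:", keep)
```

With real data, `spectra` would hold one row per sample and one column per wavenumber, and `keep` would index the characteristic bands passed on to modeling.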

2.2.3 Traditional mathematical transformations

To evaluate the impact of different preprocessing methods on spectral feature extraction and avoid information omission, this study used a total of 11 mathematical transformations, including reciprocal transformation, logarithmic transformation, and their derivatives, for horizontal comparison. Among them, reciprocal transformation aims to compress high reflection areas to enhance low value signals, while logarithmic transformation is used to reduce dynamic range and improve spectral response linearity. Both are commonly used benchmark methods to verify the effectiveness of spectral preprocessing.
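A minimal sketch of the transformation set follows, assuming the 11 transformations are 1/R, lg R, lg(1/R), and the first and second derivatives of R and of those three forms (as listed in the Figure 4 caption). The derivative scheme is not stated in the text, so finite differences via `np.gradient` are an assumption, and the function and key names are hypothetical.

```python
import numpy as np

def transform_spectra(R, wavenumber):
    """Return R plus 11 transformed versions, keyed by name.

    R: (n_samples, n_bands) reflectance, strictly positive.
    wavenumber: (n_bands,) axis used for the derivatives.
    """
    base = {
        "R": R,
        "1/R": 1.0 / R,
        "lgR": np.log10(R),
        "lg(1/R)": np.log10(1.0 / R),
    }
    out = dict(base)
    for name, s in base.items():
        d1 = np.gradient(s, wavenumber, axis=1)   # first derivative
        d2 = np.gradient(d1, wavenumber, axis=1)  # second derivative
        out[name + "_d1"] = d1
        out[name + "_d2"] = d2
    return out

# Synthetic reflectance (assumption): 3 samples, 100 bands
rng = np.random.default_rng(0)
wavenumber = np.linspace(4000, 10000, 100)
R = rng.uniform(0.2, 0.8, size=(3, 100))
transformed = transform_spectra(R, wavenumber)
print(sorted(transformed))
```

In practice a smoothing derivative (for example Savitzky-Golay) is often preferred because plain finite differences amplify high-frequency noise.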

2.2.4 Wavelet transform processing

Continuous Wavelet Transform (CWT) is a time-frequency analytical method that decomposes spectral reflectance into components of different frequencies and scales, allowing the identification of subtle spectral variations across multiple resolutions. This study employed various mother wavelets, including Bior, Morlet, Haar, and Gabor functions (Li X. et al., 2024), to convolve with the spectral data, thereby generating wavelet coefficients corresponding to different scales and frequency domains (Guan et al., 2024). The multiscale decomposition enabled by CWT improves feature resolution by capturing localized changes in spectral patterns while suppressing random noise, ultimately enhancing data interpretability and model performance (Lin et al., 2021). The wavelet decomposition is mathematically expressed as:

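$$\omega_{ij} = \int_{-\infty}^{+\infty} v_{ij}\,\psi_{a,b}(j)\,\mathrm{d}j \tag{1}$$

In the formula, $\omega_{ij}$ and $v_{ij}$ represent the wavelet coefficient and reflectance of the j-th band of the i-th tannin sample, respectively; $a$ is the scale factor, ranging from $2^{1}$ to $2^{10}$; $b$ is the translation factor; and $\psi_{a,b}(j)$ denotes the wavelet basis function:

$$\psi_{a,b}(j) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{j-b}{a}\right) \tag{2}$$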

In addition, CWT encompasses a variety of wavelet basis functions, each of which may yield different decomposition outcomes. The selection of an appropriate wavelet function and optimal decomposition scale is therefore critical for effective spectral preprocessing. To select the optimal wavelet basis function, this study preliminarily compared various commonly used wavelets, including Daubechies (db4), Symlets (sym8), Morlet (morl), Mexican hat (mexh), and Gaussian function (gaus4). Based on the comprehensive performance of feature band separation and noise suppression in preliminary experiments, the gaus4 wavelet was ultimately selected for subsequent analysis (Li et al., 2024).
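The decomposition in Equations 1 and 2 can be sketched with NumPy alone. This is an illustrative discretization, not the authors' implementation: the wavelet is taken as the unnormalized fourth derivative of exp(-x²), the integral is replaced by a sum over band indices, the kernel is truncated so it never exceeds the spectrum length, and the function names are hypothetical. A library routine such as `pywt.cwt(data, scales, 'gaus4')` from PyWavelets would normally be used instead and applies its own normalization.

```python
import numpy as np

def gaus4(x):
    """Fourth derivative of exp(-x**2), left unnormalized (assumption)."""
    return (16 * x**4 - 48 * x**2 + 12) * np.exp(-x**2)

def cwt_gaus4(v, scales, support=6.0):
    """Discrete version of Eq. 1: one row of coefficients per scale a."""
    v = np.asarray(v, dtype=float)
    n = v.size
    coeffs = np.empty((len(scales), n))
    for k, a in enumerate(scales):
        # Sample psi((j - b) / a) / sqrt(a) on integer offsets j - b.
        # The kernel is truncated at large scales so it fits the spectrum.
        half = min(int(support * a), (n - 1) // 2)
        offsets = np.arange(-half, half + 1)
        kernel = gaus4(offsets / a) / np.sqrt(a)
        # Convolving with the flipped kernel equals correlating with psi.
        coeffs[k] = np.convolve(v, kernel[::-1], mode="same")
    return coeffs

# Synthetic spectrum (assumption): smooth curve plus one narrow feature
band = np.arange(1557)
spectrum = 0.5 + 0.0001 * band + 0.05 * np.exp(-((band - 800) / 10.0) ** 2)
scales = 2.0 ** np.arange(1, 11)  # 2^1 ... 2^10, as in the text
coeffs = cwt_gaus4(spectrum, scales)
print(coeffs.shape)
```

Because the wavelet has zero mean, flat regions of a spectrum give coefficients near zero, while localized absorption features produce responses at the scales that match their width.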

2.3 Determination of tannin content in walnut kernels

Tannin content was determined according to the Chinese agricultural industry standard NY/T 1600-2008, Determination of tannin content in fruits, vegetables, and their products: Spectrophotometric method (Li et al., 1999). Precisely 1.00 g of ground walnut kernel sample was weighed and placed in a 100 mL volumetric flask. The sample was extracted using a boiling water bath for 30 minutes. After extraction, the mixture was cooled to room temperature and diluted to the mark with distilled water. The extract was centrifuged, and 2 mL of the supernatant was transferred into a 50 mL volumetric flask. Then, 1 mL of a sodium tungstate–sodium molybdate reagent and sodium carbonate solution was added, and the mixture was shaken thoroughly. After standing at room temperature for 2 hours, the absorbance of the solution was measured at 765 nm using a UV–Vis spectrophotometer. The tannin content was calculated using the following equation:

$$X_1 = \frac{C \times V_1 \times N}{M \times 1000} \tag{3}$$

In the formula, $X_1$ represents the tannin content in the sample (mg/g); $C$ represents the gallic acid content obtained from the standard curve (mg); $V_1$ represents the volume of the sample determination solution (mL); $M$ represents the mass of the walnut sample taken (g); $N$ represents the dilution factor; and 1000 is the conversion coefficient.

2.4 Model construction and performance evaluation

With the rapid development of machine learning algorithms, numerous advanced modeling techniques have been applied to the prediction of fruit quality traits, often outperforming traditional statistical methods (Tan et al., 2023). Random forest (RF) (Breiman, 2001), an ensemble learning algorithm composed of multiple decision trees, exhibits robustness to multicollinearity and performs well with imbalanced or incomplete datasets (Guo et al., 2025). In this study, the dataset was randomly split into a training set and a validation set at a 6:4 ratio. Model performance was evaluated using the coefficient of determination (R²), root mean square error (RMSE), and ratio of performance to deviation (RPD), that is, the standard deviation of the measured values divided by the RMSE. A model with R² approaching 1 and low RMSE indicates strong predictive capability. An RPD value between 1.4 and 2.0 suggests moderate reliability suitable for estimation, while RPD > 2.0 indicates a robust predictive model (Li et al., 2024).
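The modeling and evaluation step can be sketched as follows, assuming scikit-learn's `RandomForestRegressor` as a stand-in for the RF implementation. The hyperparameters and the `evaluate` helper are hypothetical, the data are synthetic, and the use of the sample standard deviation (ddof=1) in RPD is an assumption, since the text does not specify it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def evaluate(y_true, y_pred):
    """Return R2, RMSE and RPD (SD of measured values divided by RMSE)."""
    resid = y_true - y_pred
    rmse = float(np.sqrt(np.mean(resid**2)))
    r2 = float(1 - np.sum(resid**2) / np.sum((y_true - y_true.mean()) ** 2))
    rpd = float(np.std(y_true, ddof=1) / rmse)
    return r2, rmse, rpd

# Synthetic stand-in data (assumption): 171 samples, 20 features,
# with the response depending mainly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(171, 20))
y = 13 + 3 * X[:, 0] + rng.normal(scale=0.5, size=171)

# Random 6:4 split: 102 training and 69 validation samples
idx = rng.permutation(171)
train, val = idx[:102], idx[102:]

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X[train], y[train])
r2, rmse, rpd = evaluate(y[val], rf.predict(X[val]))
print(f"validation R2={r2:.3f} RMSE={rmse:.3f} RPD={rpd:.3f}")
```

Reporting the same three metrics on both the training and validation sets, as the study does, makes overfitting visible as a gap between the two.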

2.5 SHAP-based feature importance analysis

Due to the “black-box” nature of many machine learning models, their internal decision-making processes are often opaque and difficult to interpret (Ye et al., 2024). The SHAP (Shapley additive explanations) algorithm addresses this issue by applying Shapley values, which originate from cooperative game theory, to decompose the output of a model into contributions from each input feature (Li et al., 2025). This allows for a more transparent understanding of the model’s predictions and facilitates interpretability in complex systems. The experimental flowchart is shown in Figure 1.

Figure 1
The experimental flowchart outlines the complete research methodology, starting from the collection of walnut samples and acquisition of their near-infrared (NIR) spectra, followed by data preprocessing (including outlier removal and spectral transformations), feature selection using correlation analysis and Continuous Wavelet Transform (CWT), the construction of a predictive Random Forest (RF) model, and concluding with model interpretation using the SHAP algorithm to explain feature importance.

Figure 1. Experimental flowchart.
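The Shapley decomposition described in Section 2.5 can be illustrated with a brute-force calculation on a toy model. This is not the SHAP package, which uses efficient approximations such as tree-specific explainers; here absent features are simply replaced by a fixed baseline value, which is a simplifying assumption, and all names are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value, the rest the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                s = set(combo)
                # Marginal contribution of feature i when added to coalition s
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_model(z):
    """Hypothetical model: one additive term and one interaction term."""
    return 2.0 * z[0] + z[1] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # contributions sum to toy_model(x) - toy_model(baseline)
```

The additive feature receives exactly its own term, while the two interacting features share the interaction equally, which is the behavior that makes Shapley values useful for attributing a prediction to individual wavelet features.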

2.6 Software and implementation

All data preprocessing, spectral feature selection, and model development were performed using MATLAB® (Version R2023a; MathWorks, 2023) and Python. Chemical and spectral mean values were calculated using Microsoft Excel® (Version 2016; Microsoft Corporation, 2016). All visualizations and figures were generated using Origin® (Version 2021; OriginLab Corporation, 2021). For model interpretation, the SHAP Python package (Lundberg and Lee, 2017) was employed, which is based on the Shapley additive explanations framework.

3 Results and analysis

3.1 Analysis of tannin content in walnut kernels and removal of outliers

The tannin content in walnut kernels was calculated using Equation 3. The results indicate considerable variability in tannin content among samples, demonstrating notable distinctions and representativeness. A comparison of orchard sample data under various management models is presented in Figure 2a, revealing distinct differences. The average tannin content differs among orchards managed under different models, facilitating model development. Figure 2b illustrates the near-infrared spectral range of 4000–10000 cm⁻¹. The spectral curves under different management modes exhibit similar overall shapes, running roughly parallel to each other, with spectral reflectance increasing as wavelength increases. The absorption features observed in the spectra, particularly in the regions around 4000–5000 cm⁻¹ and 7000–9000 cm⁻¹, are likely associated with the characteristic vibrational modes of tannin molecules. Tannins, as polyphenolic compounds, contain abundant hydroxyl (-OH) groups, whose overtone and combination bands typically appear in the NIR region. Specifically, the first overtone of O-H stretching vibrations often occurs around 7000 cm⁻¹, while combination bands involving O-H bending and stretching vibrations may contribute to the absorption features in the 4000–5000 cm⁻¹ range.

Figure 2
Panel (a) presents a comparative analysis of tannin content measured in walnut kernel samples collected from orchards under different management modes (high, medium, and low yield), highlighting the variability used for modeling. Panel (b) displays the raw NIR reflectance spectra (4000–10000 cm⁻¹) of all samples, showing the overall spectral shape and key absorption regions potentially associated with tannin's molecular vibrations.

Figure 2. Tannin content data and reflectance curves under different management modes. (a) Tannin content data under different management modes; (b) reflectance curves of tannin content in walnut kernels under different management modes. The absorption features in the spectra, particularly around 4000–5000 cm⁻¹ and 7000–9000 cm⁻¹, are associated with characteristic vibrational modes of tannin molecules, primarily involving O-H overtone and combination bands.

Outliers in the dataset were identified and removed using the Monte Carlo simulation method. A total of 2000 iterations were performed. In each iteration, 60% of the samples were randomly selected as the training set, and the remaining 40% were used for validation. The training data were preprocessed using mean centering (“center”), and a partial least squares (PLS) regression model was constructed using 20 latent variables. The resulting regression coefficients were applied to the validation samples to obtain predicted values, and prediction errors were calculated accordingly. As shown in Figure 3, samples that fell outside the dashed boundary lines were identified as outliers. These samples exhibited either a mean prediction error greater than the overall mean or a standard deviation exceeding the global standard deviation. The criteria for outlier elimination were set as: standard deviation > 2 and mean > 10. Based on these thresholds, nine samples (No. 30, 42, 57, 69, 101, 128, 131, 133, and 161) were identified as outliers and removed from the dataset. After eliminating these nine samples, the remaining 171 samples were retained for subsequent modeling. This outlier removal step led to a significant improvement in model performance, enhancing the reliability and stability of the predictive results.
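The Monte Carlo procedure above can be sketched in NumPy. The sketch assumes a single-response PLS fitted with the NIPALS algorithm on mean-centered data, uses fewer iterations and latent variables than the study to keep the demo fast, and runs on synthetic data with one planted outlier. The function names are hypothetical, and flagging thresholds are left to the user because they depend on the dataset.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """Fit PLS1 (NIPALS) on mean-centered data; return coefficients and means."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)          # weight vector
        t = Xk @ w                      # scores
        tt = t @ t
        p = Xk.T @ t / tt               # X loadings
        qk = (yk @ t) / tt              # y loading
        Xk = Xk - np.outer(t, p)        # deflate X
        yk = yk - qk * t                # deflate y
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, x_mean, y_mean

def monte_carlo_errors(X, y, n_iter=2000, train_frac=0.6, n_components=20, seed=0):
    """Per-sample mean absolute error and SD of validation prediction errors."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_frac * n))
    errors = [[] for _ in range(n)]
    for _ in range(n_iter):
        idx = rng.permutation(n)
        train, val = idx[:n_train], idx[n_train:]
        beta, x_mean, y_mean = pls1_coefficients(X[train], y[train], n_components)
        pred = (X[val] - x_mean) @ beta + y_mean
        for i, e in zip(val, y[val] - pred):
            errors[i].append(e)
    mean_err = np.array([np.mean(np.abs(e)) for e in errors])
    std_err = np.array([np.std(e) for e in errors])
    return mean_err, std_err

# Synthetic demo (assumption): 60 samples, 40 bands, 3 latent factors
rng = np.random.default_rng(1)
scores = rng.normal(size=(60, 3))
X = scores @ rng.normal(size=(3, 40)) + 0.01 * rng.normal(size=(60, 40))
y = scores @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.normal(size=60)
y[7] += 10.0  # planted outlier
mean_err, std_err = monte_carlo_errors(X, y, n_iter=200, n_components=3)
print("most suspicious sample:", int(np.argmax(mean_err)))
```

Plotting `mean_err` against `std_err` for every sample gives the kind of scatter shown in Figure 3, where points beyond the chosen boundaries are treated as outliers.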

Figure 3
This figure illustrates the process of outlier detection and removal using the Monte Carlo simulation method. Data points falling outside the defined dashed boundary lines, which indicate samples with high prediction error or high standard deviation relative to the model, were identified as statistical outliers and excluded to enhance the dataset's quality and subsequent model robustness.

Figure 3. Removal of outliers in data.

3.2 Spectral feature analysis of mathematical transformations

After applying 11 mathematical transformations to the raw reflectance spectra (R), including reciprocal (1/R), logarithmic (log R), reciprocal-logarithmic (log(1/R)), as well as first- and second-order derivatives, significant differences in spectral reflectance patterns were observed (Figure 4). As shown in Figure 4a, the original spectra are relatively smooth, lacking prominent peaks or absorption valleys, making it difficult to extract subtle features associated with walnut kernel tannin content. Figures 4b-d display the results of the 1/R, log R, and log(1/R) transformations. The spectral shapes remain smooth, and there are still no clearly distinguishable peaks or valleys. These transformations are designed to compress high reflectance values or balance the distribution of reflectance intensities, thereby enhancing blended spectral features. However, they do not effectively amplify small differences between adjacent wavelengths, limiting their ability to highlight features relevant to tannin concentration.

In contrast, Figures 4e-h demonstrate the effect of the first derivative transformation, which significantly improves spectral sensitivity to tannins. This is achieved by computing the rate of change between adjacent wavelengths, resulting in sharper spectral variation and an increased number of peaks and valleys. The first derivative effectively emphasizes local variations, inflection points, and abrupt changes, while suppressing low-frequency noise and reducing the impact of spectral overlap. However, this method is also more susceptible to high-frequency noise, which may reduce model robustness (Li et al., 2024; Liu et al., 2024).

Figures 4i-l illustrate the spectral changes after applying the second derivative transformation. Compared to the first derivative, the number of spectral peaks and valleys is further increased. This occurs because the second derivative reflects the rate of change of curvature in the spectral profile, which smooths flatter regions and removes weaker spectral information, thereby retaining only the most prominent features. While this transformation enhances key features, it also amplifies noise and may cause loss of useful bands, leading to reduced model stability (Wang et al., 2023; Zhong et al., 2023). Overall, although both derivative methods improve feature extraction, excessive noise sensitivity in the second derivative makes the first derivative more suitable for further modeling.

Figure 4
This composite figure shows the average spectral reflectance curves resulting from 11 different mathematical transformations applied to the original NIR data (R), including reciprocal (1/R), logarithmic (lgR, lg(1/R)), and their first and second derivatives. It visually compares the effects of these preprocessing techniques on the spectral profiles.

Figure 4. The original spectral reflectance (R) and the average reflectance of spectra processed by 11 mathematical transformations. (a) R; (b) 1/R; (c) lgR; (d) lg(1/R); (e) R’; (f) (1/R)’; (g) lg’R; (h) lg’(1/R); (i) R’’; (j) (1/R)’’; (k) lg’’R; (l) lg’’(1/R).

3.3 Extraction of characteristic wavelengths

The large number of wavelength variables in the original spectral data may introduce redundancy, which can impair modeling efficiency and increase computational load during subsequent analysis. Therefore, effective selection of characteristic wavelengths is essential for building efficient and accurate predictive models. Pearson correlation analysis was performed using MATLAB R2023a to assess the relationships between the measured tannin content and spectral reflectance values from both the raw spectra (R) and its 11 mathematically transformed forms. Spectral variables that passed a significance threshold of P < 0.01 and had correlation coefficients exceeding the critical value (|r| > 0.123) were retained as characteristic wavelengths. As shown in Figure 5, the correlation coefficient (r) reflects the strength and direction of the relationship between individual wavelengths and measured tannin content. The color gradient in the figure indicates the magnitude of correlation: darker tones represent stronger positive or negative associations, while lighter tones indicate weaker correlations.

Figure 5
A heatmap visualization of the Pearson correlation coefficients (r) between each individual NIR wavelength and the measured tannin content. The color gradient (from dark to light) represents the strength and direction (positive or negative) of the linear relationship, aiding in the identification of spectrally informative regions correlated with tannin levels.

Figure 5. Correlation between Spectral Characteristic Bands and Tannin Content in Walnut Kernel.

Table 1 summarizes the number of characteristic wavelengths identified under the original spectrum and various mathematically transformed spectra, along with the maximum, minimum, and mean values of their positive and negative Pearson correlation coefficients. Compared to the original reflectance spectrum (R), transformations such as 1/R, log R (lgR), and log(1/R) increased the number of selected wavelengths but did not yield significant improvements in correlation strength. In contrast, derivative-based preprocessing significantly enhanced the correlation between spectral variables and tannin content in walnut kernels. Specifically, under first derivative transformations, spectra such as R′, (1/R)′, lg′R, and lg′(1/R) produced a moderate number of characteristic wavelengths but showed substantially improved correlations, with maximum absolute r values of 0.399, 0.392, 0.424, and 0.386, respectively. Second derivative transformations also improved correlation levels, particularly for lg″(1/R) and lg″R, which reached correlation values as high as ±0.400. Taking both the number of selected wavelengths and the correlation strength into consideration, the mean absolute correlation coefficient (|r|) was used as the evaluation criterion. As a result, R′, lg′R, and lg′(1/R) were selected as the most effective mathematical preprocessing methods for subsequent model development.

Table 1

Table 1. Statistics of the number of characteristic bands and correlation values.

Research indicates that the wavelet transform offers significant advantages over traditional mathematical transformations in spectral processing. To investigate these advantages, continuous wavelet transform (CWT) analyses were conducted on R, R’, lg’R, and lg’(1/R) (Equations 1 and 2), denoted as R_CWT, R’_CWT, lg’R_CWT, and lg’(1/R)_CWT, respectively. The correlation coefficient (r) was used to represent the correlation between the wavelet coefficients and tannin levels in walnut kernels. As illustrated in Figure 6, CWT processing resulted in an overall increase in the correlation between the spectral data and walnut tannin. The correlation exhibited a trend of initially increasing and then decreasing from scale 2¹ to 2¹⁰. Notably, at scale 2⁴, R_CWT and lg’R_CWT reached their maximum correlation values of 0.446 and 0.430, respectively. The maximum value for R’_CWT, at scale 2⁵, was 0.448, while the maximum r for lg’(1/R)_CWT, at scale 2⁷, was 0.388. As the decomposition scale increased, the number of characteristic bands across the four CWT treatments generally exhibited an upward trend (Figure 6).

Figure 6
This set of four subplots depicts the correlation between wavelet coefficients (derived via CWT at scales 2¹ to 2¹⁰) and tannin content for four differently preprocessed spectral inputs: R_CWT, R'_CWT, lg'R_CWT, and lg'(1/R)_CWT. It demonstrates how CWT processing enhances and modulates the correlation across different decomposition scales.

Figure 6. Correlation between Wavelet Coefficients and Tannins under Four Different Continuous Wavelet Transform Processing Methods. (a) R_CWT; (b) R’_CWT; (c) lg’R_CWT; (d) lg’(1/R)_CWT.

As shown in Figure 7, the mean absolute Pearson correlation coefficients (|r|) for R_CWT and R′_CWT reached their respective maximum values at scale 2⁹, with values of 0.178 and 0.208. For lg′R_CWT and lg′(1/R)_CWT, the highest mean |r| values (both 0.240) were observed at scale 2⁸. Considering both the number of characteristic wavelengths and the strength of their correlations, scale 2⁹ was selected as the optimal decomposition scale for R_CWT and R′_CWT, while scale 2⁸ was determined to be optimal for lg′R_CWT and lg′(1/R)_CWT. In subsequent tannin prediction modeling, wavelet coefficients extracted from these four spectral forms at their respective optimal scales were used as independent variables for model construction.
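The scale-selection logic above can be sketched as follows: for each decomposition scale, count the bands whose wavelet coefficients correlate significantly with tannin content and average their |r|, then pick the scale with the highest mean |r|. The function names, the dictionary layout keyed by scale exponent, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def summarize_scale(coeffs, tannin, alpha=0.01):
    """Return (number of significant bands, mean |r| over those bands)."""
    n_bands = coeffs.shape[1]
    r = np.empty(n_bands)
    p = np.empty(n_bands)
    for k in range(n_bands):
        r[k], p[k] = stats.pearsonr(coeffs[:, k], tannin)
    sig = p < alpha
    mean_abs_r = float(np.abs(r[sig]).mean()) if sig.any() else 0.0
    return int(sig.sum()), mean_abs_r

def pick_scale(coeffs_by_scale, tannin):
    """coeffs_by_scale maps a scale exponent to an (n_samples, n_bands) array."""
    table = {k: summarize_scale(c, tannin) for k, c in coeffs_by_scale.items()}
    best = max(table, key=lambda k: table[k][1])
    return best, table

# Synthetic stand-in (assumption): ten scales of pure noise, with the
# coefficients at exponent 8 made to depend on tannin content.
rng = np.random.default_rng(0)
tannin = rng.uniform(5, 20, size=171)
coeffs_by_scale = {k: rng.normal(size=(171, 30)) for k in range(1, 11)}
coeffs_by_scale[8] = coeffs_by_scale[8] + 0.2 * tannin[:, None]

best, table = pick_scale(coeffs_by_scale, tannin)
print("selected scale: 2^%d" % best, table[best])
```

The study weighs the number of characteristic bands together with correlation strength, so in practice the per-scale counts in `table` would be inspected alongside the mean |r| rather than relying on the maximum alone.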

Figure 7
The figure presents detailed statistics for the four CWT methods, plotting the maximum positive and negative correlation values, as well as the number of selected characteristic bands, across the ten decomposition scales. This analysis was crucial for determining the optimal scale (e.g., 2⁸ or 2⁹) for each CWT-preprocessed dataset to maximize feature relevance.

Figure 7. Extreme positive and negative correlations between wavelet coefficients and tannins, as well as the number of characteristic bands, under four different CWT processing methods. (a) R_CWT; (b) R’_CWT; (c) lg’R_CWT; (d) lg’(1/R)_CWT.

3.4 Sample set partitioning

Prior to model construction, the full dataset was randomly divided into training and validation sets at a ratio of 6:4. A total of 102 samples were assigned to the training set for model development, while the remaining 69 samples were used as the validation set for performance evaluation. Figure 8 presents violin plots of tannin content for the three datasets and illustrates the distribution of the standard deviation, median, and coefficient of variation (CV) across them. The outer contours reflect kernel density estimates, with wider sections indicating larger sample concentrations. Tannin content across all samples ranged from 4.73 to 20.17 mg/g, and the distribution patterns were generally consistent among the full dataset, training set, and validation set. The mean tannin content of the full dataset was 13.17 mg/g, while the training and validation sets had mean values of 13.06 mg/g and 13.33 mg/g, respectively. The mean, median, and standard deviation of tannin content were comparable across the three sets. Moreover, the full dataset exhibited CV and mean values intermediate between those of the training and validation sets, indicating that the partitioning was statistically balanced and representative. These results support the suitability of the sample division for constructing a robust and generalizable prediction model for walnut kernel tannin content.
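
A random 102/69 partition and the descriptive statistics compared in Figure 8 can be reproduced along the following lines. The sketch uses only the Python standard library; the tannin values are random placeholders within the reported range, and the seed and the use of the sample standard deviation are assumptions rather than details taken from the study.

```python
import random
import statistics

def describe(values):
    """Mean, median, sample SD and coefficient of variation (%) of a list."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return {
        "n": len(values),
        "mean": mean,
        "median": statistics.median(values),
        "sd": sd,
        "cv_percent": 100.0 * sd / mean,
    }

random.seed(42)  # arbitrary seed, chosen only for reproducibility of the sketch
# Placeholder tannin contents (mg/g); the measured values are not reproduced here.
tannin = [random.uniform(4.73, 20.17) for _ in range(171)]

indices = list(range(len(tannin)))
random.shuffle(indices)
n_train = 102  # 171 samples split at roughly 6:4, as reported
train = [tannin[i] for i in indices[:n_train]]
valid = [tannin[i] for i in indices[n_train:]]

for name, subset in [("full", tannin), ("training", train), ("validation", valid)]:
    print(name, describe(subset))
```

Comparing the three summaries side by side is the numerical counterpart of the violin plots: a balanced split should give similar means, medians, and CVs in all three sets.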

Figure 8
Panel (a) uses violin plots to compare the distribution of tannin content across the full dataset, the training set, and the validation set, confirming their statistical similarity. Panel (b) provides key descriptive statistics (mean, median, standard deviation, coefficient of variation) for these sets, validating that the 6:4 random split produced balanced and representative subsets for modeling.

Figure 8. Descriptive statistical characteristics of tannins in each sample set.

3.5 Model construction and accuracy evaluation

In this study, random forest (RF) models were developed to predict tannin content in walnut kernels using both full-spectrum data and selected characteristic wavelengths derived from various spectral preprocessing techniques. The spectral variables served as independent variables, while measured tannin content was used as the dependent variable. To optimize RF model performance, key hyperparameters were tuned: the number of decision trees was set to 200, the minimum leaf size was fixed at 10, Bayesian optimization was run for 50 iterations, and five-fold cross-validation was used to assess predictive accuracy and model generalizability. The modeling results are presented in Figure 9.

The results showed notable differences in model performance between full-spectrum and characteristic wavelength inputs under different spectral preprocessing strategies. Overall, models based on selected characteristic wavelengths outperformed those using full-spectrum data, particularly in terms of the coefficient of determination (R²). Most models based on characteristic wavelengths achieved R² values above 0.70, whereas nearly half of the full-spectrum models showed signs of overfitting. This indicates that selecting informative wavelengths can remove irrelevant spectral information and thereby improve model performance. Moreover, under the same preprocessing method, the difference in R² between the training and validation sets was smaller when using characteristic wavelengths, further demonstrating improved model stability. Although full-spectrum models exhibited slightly better RMSE and RPD values in the training set, their RPD values differed substantially between training and validation sets, with a maximum difference of 0.928, suggesting weaker generalization capability. In contrast, the RPD values of characteristic wavelength models in the validation set were typically above 2.2, indicating higher predictive accuracy and better robustness across different preprocessing methods. Nearly half of the full-spectrum models had RPD values below 1.4, further underscoring their limited stability.

In summary, feature wavelength selection reduced spectral redundancy and improved model robustness. While full-spectrum models performed slightly better in certain metrics, models based on characteristic wavelengths offered superior overall performance in terms of accuracy, stability, and computational efficiency, making them more suitable for tannin estimation.

Notably, models constructed using first derivative spectra combined with continuous wavelet transform (CWT) achieved the highest prediction accuracy and stability. This indicates that integrating first derivative preprocessing with CWT markedly enhances the model’s ability to predict tannin content and offers a reliable strategy for improving spectral model performance.
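
For reference, the three evaluation metrics compared above can be computed as in the following sketch. It uses common definitions (RMSE with divisor n, RPD as the sample standard deviation of the reference values divided by RMSE); these are assumptions, and the article's own formulas may differ in detail. The function name and the example values are hypothetical.

```python
import math
import statistics

def regression_metrics(y_true, y_pred):
    """R^2, RMSE and RPD for paired reference and predicted values."""
    residual_ss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = statistics.mean(y_true)
    total_ss = sum((t - mean_true) ** 2 for t in y_true)
    rmse = math.sqrt(residual_ss / len(y_true))
    return {
        "r2": 1.0 - residual_ss / total_ss,
        "rmse": rmse,
        "rpd": statistics.stdev(y_true) / rmse,  # sample SD of reference values / RMSE
    }

# Hypothetical measured and predicted tannin contents (mg/g).
measured = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 16.0]
metrics = regression_metrics(measured, predicted)
print({name: round(value, 3) for name, value in metrics.items()})
```

Under these definitions R² = 1 − n / ((n − 1)·RPD²), so the validation RPD of 2.459 reported in the Conclusion (n = 69) corresponds to R² ≈ 0.832 and the training RPD of 2.904 (n = 102) to R² ≈ 0.880, close to the reported 0.831 and 0.880.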

Figure 9
A comprehensive bar chart comparing the performance (R², RMSE, RPD) of Random Forest models built using full-spectrum data versus models using only selected characteristic wavelengths, across multiple spectral preprocessing methods. It clearly demonstrates the superior accuracy and stability of models based on feature-wavelength selection, with the combined first-derivative and CWT method yielding the best results.

Figure 9. Results of Training and Validation Sets for Walnut Tannin Estimation Model. (a) R full band; (b) R’ full band; (c) lg’R full band; (d) lg’(1/R) full band; (e) R_CWT_29 full band; (f) R’_CWT_29 full band; (g) lg’R_CWT_28 full band; (h) lg’(1/R)_CWT_28 full band; (i) R characteristic band; (j) R’ characteristic band; (k) lg’R characteristic band; (l) lg’(1/R) characteristic band; (m) R_CWT_29 characteristic band; (n) R’_CWT_29 characteristic band; (o) lg’R_CWT_28 characteristic band; (p) lg’(1/R)_CWT_28 characteristic band.

3.6 SHAP-based interpretation of the RF model

Due to the inherent “black-box” nature of machine learning algorithms, assessing the influence of input features plays a crucial role in model interpretation and optimization (Ye et al., 2024). To identify the most influential spectral variables and explain their contributions to the model’s predictions, the SHAP algorithm was employed. Visualization of model interpretability was performed using the SHAP library in Python. Specifically, SHAP values were used to evaluate the contribution of each selected wavelength in the best-performing RF prediction model. As shown in Figure 10A, the top 10 most important features are visualized in a SHAP beeswarm plot. The features are ranked in ascending order based on their cumulative contribution to the model. The most influential spectral regions are located within the 4000–4999 cm⁻¹ and 7000–8999 cm⁻¹ ranges. The horizontal axis represents SHAP values, while the vertical axis displays the features ranked by their overall impact. Figure 10B presents a bar plot of the mean absolute SHAP values for all features, indicating the average contribution of each variable across all predictions. Features with higher mean absolute SHAP values contributed more to the model’s output. Figure 10C shows waterfall plots for two randomly selected samples, illustrating the contribution of individual features to the prediction result. Positive SHAP values indicate an increase in the predicted tannin content, whereas negative values indicate a decrease. The vertical axis ranks the features by cumulative SHAP impact. Blue bars represent features that reduced the prediction, while red bars represent features that increased it.
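
The additive logic behind the waterfall plots (baseline plus per-feature contributions equals the prediction) can be illustrated with an exact Shapley computation on a toy model. This is not the SHAP library and not the fitted RF: `toy_model`, its three input coefficients, and the all-zero background are hypothetical, and absent features are simply replaced by background values, a simplification of the expectation used by SHAP.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, background):
    """Exact Shapley values of each feature for a single sample `x`."""
    n = len(x)

    def value(coalition):
        # Features outside the coalition take their background (reference) value.
        mixed = [x[i] if i in coalition else background[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi

def toy_model(c):
    """Hypothetical tannin predictor on three wavelet coefficients."""
    return 13.0 + 4.0 * c[0] - 2.0 * c[1] + 1.5 * c[0] * c[2]

background = [0.0, 0.0, 0.0]
sample = [1.0, 0.5, 2.0]
phi = shapley_values(toy_model, sample, background)
baseline = toy_model(background)

# Positive contributions raise the prediction above the baseline, negative ones lower it.
print("baseline:", baseline, "contributions:", phi, "prediction:", toy_model(sample))
```

Here the contributions sum to the difference between the prediction and the baseline, which is the property a waterfall plot displays bar by bar. Enumerating all coalitions is only feasible for a handful of features; for tree ensembles the SHAP library provides a dedicated tree explainer that avoids this enumeration.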

Figure 10
This figure utilizes SHAP (Shapley Additive Explanations) to interpret the best-performing RF model. Panel (A) is a beeswarm plot ranking the top 10 most influential features/wavelengths. Panel (B) is a summary bar plot of mean absolute SHAP values for all features. Panel (C) shows waterfall plots for two random samples, detailing how each feature contributed to pushing the final prediction above or below the baseline value.

Figure 10. Feature interpretation of the random forest model using the SHAP algorithm. (A) Beeswarm plot of the top 10 most important features, with red data points indicating high SHAP values and blue data points indicating low SHAP values. (B) SHAP summary chart showing the mean SHAP value of each feature. (C) SHAP waterfall plots for two randomly selected samples. a: lg’(1/R)_CWT_28 full band; b: lg’(1/R)_CWT_28 characteristic band.

4 Discussion

This study investigated the effects of various spectral preprocessing techniques, including mathematical transformations, continuous wavelet transform (CWT), and their combinations, on enhancing spectral sensitivity and improving the predictive accuracy for tannin content. The results demonstrated that reciprocal, logarithmic, and reciprocal-logarithmic transformations did not significantly improve the correlation between spectral variables and tannin content. In contrast, the first derivative transformation markedly enhanced correlations by amplifying subtle variations within the spectral data. These findings align with previous reports by Guo et al. (2025), Li et al. (2024), and Zhang et al. (2023), confirming the efficacy of first derivative processing in spectral pretreatment.

However, compared to wavelet-based methods, conventional mathematical transformations were less effective in suppressing high-frequency noise and managing complex background interference. CWT demonstrated distinct advantages in dimensionality reduction, noise suppression, and feature enhancement. Decomposing the original spectra via CWT led to notable improvements in both model sensitivity and stability, outperforming traditional mathematical preprocessing. This observation is consistent with the work of Hu et al. (2025), who demonstrated that CWT effectively extracted spectral features related to rice leaf SPAD values using a BPNN model based on bior3.3 wavelets. Similarly, Wang et al. (2022) reported that small-scale CWT significantly improved the estimation of nitrogen content in tea leaves, reducing the number of input variables by 99.34% and increasing model accuracy by 11% compared to conventional preprocessing methods.

The characteristic wavelengths identified in this study, notably those within the 4000–5000 cm⁻¹ and 7000–9000 cm⁻¹ ranges, correspond to known NIR absorption regions for phenolic compounds. The former region is often associated with combination bands involving O-H and C-O vibrations, while the latter is typically linked to the first overtone of O-H stretching. These assignments are consistent with the chemical structure of tannins, which are rich in hydroxyl and aromatic moieties. The strong correlation between these spectral features and tannin content underscores the physicochemical plausibility of the selected wavelengths and supports the robustness of the developed prediction model.
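
Because band assignments in the literature are often reported in nanometres rather than wavenumbers, it can help to convert the two regions using λ (nm) = 10⁷ / ν (cm⁻¹). The short helper below is an illustrative convenience with a hypothetical function name, not part of the study's workflow.

```python
def wavenumber_to_wavelength_nm(wavenumber_cm1):
    """Convert a wavenumber in cm^-1 to a wavelength in nm (lambda = 1e7 / nu)."""
    return 1.0e7 / wavenumber_cm1

for low, high in [(4000, 5000), (7000, 9000)]:
    # A higher wavenumber corresponds to a shorter wavelength, so the bounds swap.
    short_nm = wavenumber_to_wavelength_nm(high)
    long_nm = wavenumber_to_wavelength_nm(low)
    print(f"{low}-{high} cm^-1 -> {short_nm:.0f}-{long_nm:.0f} nm")
```

The two regions correspond to roughly 2000–2500 nm and 1111–1429 nm, respectively.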

Overall, the integration of wavelet decomposition, particularly CWT, enhanced both model accuracy and robustness. Among the preprocessing strategies, the combination of the first derivative and CWT proved especially effective. This synergy likely stems from the first derivative’s capacity to capture fine-scale spectral variations, which are subsequently refined and enhanced through the multi-resolution decomposition provided by CWT (Yumiti and Wang, 2022). Consequently, this combined approach yielded superior predictive performance compared to standalone mathematical or wavelet transformations. Furthermore, CWT exhibited high computational efficiency and sensitivity in detecting abrupt changes and localized features within high-dimensional spectral data. By preserving both low- and high-frequency information, CWT contributed to improved modeling performance relative to conventional spectral transformation algorithms, as also evidenced in prior studies (Yumiti and Wang, 2022; Wang et al., 2014).

In the data preprocessing stage, outlier detection and feature selection were based on all samples, without strictly separating the training and validation sets. Although this practice is common in spectral studies with limited samples (Jensen et al., 2008; Ying et al., 2006), it may in theory introduce a risk of information leakage, resulting in optimistic validation results. Future research could adopt more rigorous nested cross-validation or a completely independent validation set to further improve the generalization ability and reliability of the model.
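
The leakage risk described above can be made concrete with a small NumPy experiment on pure noise, where any apparent validation correlation is spurious. This is a sketch under stated assumptions: the data are random placeholders, the helper names (`abs_pearson`, `top_k_bands`) are hypothetical, and selecting the ten most correlated bands is a simplified stand-in for the study's feature selection. A single train/validation split is used here rather than full nested cross-validation.

```python
import numpy as np

def abs_pearson(X, y):
    """|Pearson r| between each column of X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / np.sqrt((Xc**2).sum(axis=0) * (yc**2).sum())

def top_k_bands(X, y, k):
    """Indices of the k columns most strongly correlated with y."""
    return np.argsort(-abs_pearson(X, y))[:k]

rng = np.random.default_rng(1)
n_samples, n_bands, k = 171, 400, 10
X = rng.normal(size=(n_samples, n_bands))   # placeholder coefficients (noise)
y = rng.normal(size=n_samples)              # placeholder target, unrelated to X

order = rng.permutation(n_samples)
train, valid = order[:102], order[102:]

leaky = top_k_bands(X, y, k)                # selection also sees validation samples
clean = top_k_bands(X[train], y[train], k)  # selection restricted to training samples

# Mean |r| of the selected bands, evaluated on the validation samples only.
leaky_r = abs_pearson(X[valid][:, leaky], y[valid]).mean()
clean_r = abs_pearson(X[valid][:, clean], y[valid]).mean()
print(f"selected on all samples: {leaky_r:.3f}; selected on training only: {clean_r:.3f}")
```

On noise data, bands chosen using all samples tend to show a higher apparent validation correlation than bands chosen from the training samples alone, which is the optimistic bias that a nested or fully independent validation design is meant to avoid.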

While the random forest (RF) algorithm was selected for this study due to its established robustness, interpretability, and strong performance in similar spectral applications, we acknowledge that other machine learning approaches—such as gradient boosting, support vector machines, and deep learning architectures—may offer distinct advantages for spectral modeling. Future comparative studies that systematically incorporate a broader range of algorithms could further refine prediction accuracy for tannin content and provide deeper insights for model selection in this field.

In summary, CWT outperformed traditional mathematical transformations in strengthening the correlation between spectral data and tannin content and in enhancing model accuracy. Moreover, combining mathematical transformations with wavelet processing optimized the spectral pretreatment pipeline, leading to improved predictive performance. The application of diverse preprocessing methods for NIR-based tannin estimation establishes a solid foundation for monitoring walnut tannins and opens new avenues for remote sensing-based quantitative trait analysis in agriculture. Future research could integrate optimized spectral transformation techniques with advanced machine learning algorithms and satellite remote sensing data to enable regional-scale monitoring of walnut tannin content, thereby advancing the application of remote sensing technologies in fruit crop research.

5 Conclusion

This study used ‘Wen 185’ walnut kernels as the research material and applied multiple preprocessing strategies, including mathematical transformations, continuous wavelet transform (CWT), and their combination, to enhance spectral data. Based on these preprocessing results, random forest (RF) models were constructed to quantitatively predict tannin content. The first derivative transformation, CWT, and the combination of first derivative with CWT all improved the correlation between spectral data and measured tannin content. Among them, the combination of first derivative and CWT yielded the best model performance. The most effective prediction model was constructed using the characteristic wavelengths of lg′(1/R)_CWT at scale 2⁸, achieving R² values of 0.880 and 0.831 for the training and validation sets, respectively; RMSE values of 1.188 and 1.620; and RPD values of 2.904 and 2.459. These results indicate strong predictive accuracy and robustness. Furthermore, the SHAP algorithm was employed to visualize feature importance and model interpretability. The analysis confirmed that the RF model effectively captured the key wavelengths contributing to tannin prediction, offering a reliable, interpretable approach for estimating tannin content in walnuts.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding author.

Author contributions

QX: Software, Methodology, Writing – original draft. LL: Data curation, Writing – review & editing. YY: Visualization, Investigation, Writing – review & editing. MA: Supervision, Writing – review & editing. YC: Resources, Supervision, Writing – review & editing, Conceptualization. SW: Resources, Writing – review & editing, Conceptualization, Supervision. JQ: Resources, Conceptualization, Supervision, Writing – review & editing. LC: Supervision, Writing – review & editing, Resources. QJ: Writing – review & editing, Resources, Supervision. ZG: Supervision, Writing – review & editing, Conceptualization, Resources, Funding acquisition. RZ: Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was supported by the National Natural Science Foundation of China Project (No. 32160689), the Tianshan Talent Training Program (No. 2022TSYCCX0120), the Tarim University President’s Fund Major Project Cultivation Project (No. TDZKZD202403), the Walnut Full Industry Chain Innovation R&D and Promotion Team “Three Station Chain Cooperation” Walnut Full Industry Chain Industry University Research Development Practice (No. TDZKCX202101), the Southern Xinjiang Key Industry Innovation and Development Support Plan (No. 2022DB022), the Guiding Science and Technology Program Project of the Xinjiang Production and Construction Corps (No. 2024ZD113), the Tarim University President’s Fund Populus euphratica Talent (PhD) Project (No. TDZKBS202419), and the Guiding Plan Projects of the Science and Technology Bureau of Xinjiang Production and Construction Corps (No. 2023ZD102).

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Aoki, H., Kimura, K., Igarashi, K., et al. (2006). Soy protein suppresses gene expression of acetyl-CoA carboxylase alpha from promoter PI in rat liver. Biosci. Biotechnol. Biochem. 70, 843–849. doi: 10.1271/bbb.70.843

Breiman, L. (2001). Random forests. Mach. Learn. 45, 5–32. doi: 10.1023/A:1010933404324

Cai, J. J. (1997). Determination of tannins in fruits using o-phenanthroline colorimetry. Tianjin. Chem. Ind. 11, 40–41. https://doi.org/CNKI:SUN:TJHG.0.1997-03-013.

Cheng, Y. L. (2020). Classification and tannin content detection of wine grapes based on near-infrared hyperspectral imaging (Northwest A&F University). doi: 10.27409/d.cnki.gxbnu.2020.001300

Guan, C., Liu, M. Y., Man, W. D., Zhang, Y. B., and Zhang, Q. W. (2024). Estimation of Spartina alterniflora leaf chlorophyll content based on continuous wavelet and random forest algorithm. Spectrosc. Spectr. Anal. 44, 2993–3000. doi: 10.3964/j.issn.1000-0593(2024)10-2993-08

Guo, Y. P., Wang, X. M., Zhao, F., and Li, P. P. (2025). Hyperspectral inversion of soil salinity in oasis tillage layer based on optimal mathematics and wavelet transform. Trans. Chin. Soc. Agric. Eng. 41, 83–93. doi: 10.11975/j.issn.1002-6819.202407184

Hernes, P. J. and Hedges, J. I. (2000). Determination of condensed tannin monomers in environmental samples by capillary gas chromatography of acid depolymerization extracts. Anal. Chem. 72, 5115–5124. doi: 10.1021/ac991301y

Hu, W. R., Gao, Q. W., Yang, H. B., and Gao, Z. Q. (2025). Estimation of SPAD value in rice based on CWT and BP neural network. Shandong. Agric. Sci. 57, 154–162. doi: 10.14083/j.issn.1001-4942.2025.04.019

Huang, C. F. and Ni, Y. N. (2002). Determination of tannin in food by spectrophotometry. J. Nanchang. Univ. (Nat. Sci). 26, 243–246. doi: 10.3969/j.issn.1006-0464.2002.03.012

Jensen, J. S., Egebo, M., and Meyer, A. S. (2008). Identification of spectral regions for quantification of red wine tannins with Fourier transform mid-infrared spectroscopy. J. Agric. Food Chem. 56, 3493–3499. doi: 10.1021/jf703573f

Li, H., Ainival, A., and Ahmat, M. (1999). Rapid determination method of oil iodine value. Fine. Chem., 26–28. doi: 10.3321/j.issn:1003-5214.1999.03.009

Li, Q., Chen, S., Han, J., Li, B., Wu, L., Li, J, et al. (2025). Unraveling almonds deterioration using whole-cell biosensor coupled with machine learning approaches and SHAP interpretation. Food Chem. 484. doi: 10.1016/j.foodchem.2025.144392

Li, M., Liu, Y., Sun, C., Meng, Y. N., Yang, K. Q., Hou, L. Q, et al. (2009). Progress in research on nutritional value of walnut. Cereal Oil J. China 24, 166–170. doi: 10.20048/j.cnki.issn.1003-0174.2009.06.036

Li, X., Zhang, Y. B., Liu, M. Y., Man, W. D., Kong, D. K., Song, L. J, et al. (2024). Prediction of nitrogen content in wolfberry leaves using hyperspectral reflectance. Ningxia. J. Agric. For. Sci. Technol. 65, 48–54. doi: 10.3969/j.issn.1002-204x.2024.06.011

Li, Y. M., Zhang, L. G., and Zhang, P. C. (2024). Prediction of nitrogen content in wolfberry leaves using hyperspectral reflectance. Ningxia. J. Agric. For. Sci. Technol. 65, 48–54. doi: 10.3969/j.issn.1002-204x.2024.06.011

Lin, D., Li, G., Zhu, Y., Liu, H., and Jiao, Q. (2021). Predicting copper content in chicory leaves using hyperspectral data with continuous wavelet transforms and partial least squares. Comput. Electron. Agric. 187, 106293. doi: 10.1016/j.compag.2021.106293

Liu, J., Li, Y., Liu, J. L., Wang, Y. J., and Yu, Q. X. (2023). Preliminary establishment of an astringency evaluation system of walnut based on tannin content. Anhui. Agric. Sci. Bull. 51, 190–192. doi: 10.3969/j.issn.0517-6611.2023.09.045

Liu, T., Wang, W. Q., Li, Z. M., Qi, Y., Guo, Z. H., Xu, T. Y, et al. Prediction of nitrogen content in rice leaves based on DWT-DE transformation and AHAELM algorithm. Trans. Chin. Soc. Agric. Mach., 1–11.

Lundberg, S. and Lee, S. I. (2017). A unified approach to interpreting model predictions. doi: 10.48550/arXiv.1705.07874

Ma, X. T., Luo, H. P., Gao, F., Wang, C., and X. (2022). Research and application of near-infrared spectroscopy in apple detection. J. Food Saf. Qual. 13, 4219–4227. doi: 10.19812/j.cnki.jfsq11-5956/ts.2022.13.048

Mao, J. H., Zhao, H. Q., Jin, Q., Wang, X. F., Miao, Q. F., Wang, P, et al. (2023). Comparison of hyperspectral inversion methods for heavy metal content in soil of lead-zinc tailings area in Hebei. Trans. Chin. Soc. Agric. Eng. 39, 144–156. doi: 10.11975/j.issn.1002-6819.202307092

Meng, L., Zhang, J., Yang, T., and Wu, L. G. (2022). Visualization of chlorophyll content in tomato leaves based on hyperspectral imaging. Hubei. Agric. Sci. 61, 171–177. doi: 10.14088/j.cnki.issn0439-8114.2022.14.031

Pei, D. and Lu, X. Z. (2011). Chinese Walnut Germplasm Resources (Beijing: China Forestry Publishing House).

Qu, Z. Z. (1980). Pomology: Special Lectures on Fruit Tree Cultivation Vol. 321 (Beijing: Agricultural Publishing House).

Tan, Y. X., Tian, Y. C., Huang, Z. M., Zhang, Q., Tao, J., Liu, H. X, et al. (2023). Aboveground biomass inversion of Kandelia obovata mangrove in Maowei Sea, Beibu Gulf based on XGBoost algorithm. Acta Ecol. Sin. 43, 4674–4688. doi: 10.5846/stxb202201140141

Wang, F., Chen, L. Y., Duan, D. D., Cao, Q., Zhao, Y., Lan, W, et al. (2022). Hyperspectral monitoring of total nitrogen content in fresh tea leaves based on wavelet analysis. Spectrosc. Spectr. Anal. 42, 3235–3242. doi: 10.3964/j.issn.1000-0593(2022)10-3235-08

Wang, Y. C., Li, X. F., Li, L. J., Li, N., Jiang, Q. N., Gu, X. H, et al. (2023). Quantitative inversion of chlorophyll content in pitaya stems and leaves based on discrete wavelet–differential transform algorithm. Spectrosc. Spectr. Anal. 43, 549–556. doi: 10.3964/j.issn.1000-0593(2023)02-0549-08

Wang, Y. C., Yang, G. J., Zhu, J. S., Gu, X. H., Xu, P., Liao, Q. H, et al. (2014). Estimation of organic matter content in northern meadow soils based on wavelet transform and PLS coupled model. Spectrosc. Spectr. Anal. 34, 1922–1926. doi: 10.3964/j.issn.1000-0593(2014)07-1922-05

Xiao, C., Wood, C., Huang, W. X., L' Abbé, M., Sarwar, G., Cooke, G, et al. (2006). Tissue-specific regulation of acetyl-CoA carboxylase gene expression by dietary soy protein isolate in rats. Br. J. Nutr. 95, 1048. doi: 10.1079/BJN20061776

Yang, W. and Qu, X. J. (1989). Determination of tannin in hops by potassium ferricyanide absorbance method. J. Shandong. Agric. Univ. 20, 36–40. https://doi.org/CNKI: SUN : SCHO.0.1989-02-006.

Ye, M., Zhu, L., Liu, X. D., Huang, Y., Chen, P. P., Li, H, et al. (2024). Hyperspectral inversion of soil organic matter content based on CWT, SHAP, and XGBoost. Environ. Sci. 45, 2280–2291. doi: 10.13227/j.hjkx.202304100

Ying, Y. B., Liu, Y. D., and Fu, X. P. (2006). Sugar content prediction of apple using near-infrared spectroscopy treated by wavelet transform. Spectrosc. Spectr. Anal. 26, 63–66. doi: 10.1016/S1003-6326(06)60040-X

Yumiti, M. and Wang, X. M. (2022). Estimation of soil organic matter content based on continuous wavelet transform. Spectrosc. Spectr. Anal. 42, 1278–1284. doi: 10.3964/j.issn.1000-0593(2022)04-1278-07

Zhang, P., Li, J. K., Meng, X. J., Zhang, P., Feng, X. Y., Wang, B. G, et al. (2011). Research on nondestructive measurement of soluble tannin content in astringent persimmon using Vis–NIR diffuse reflectance spectroscopy. Spectrosc. Spectr. Anal. 31, 951. doi: 10.3964/j.issn.1000-0593(2011)04-0951-04

Zhang, X. Q., Li, Z. W., Zheng, D. C., Song, H. Y., and Wang, G. L. (2023). Prediction of brown soil organic matter based on visible–near-infrared hyperspectral stacking generalization model. Spectrosc. Spectr. Anal. 43, 903–910.

Zhong, L., Qian, J. W., Chu, X. Y., Qian, Z. H., Wang, M., Li, J. L, et al. (2023). Monitoring heavy metal pollution in wheat soil using hyperspectral remote sensing. Trans. Chin. Soc. Agric. Eng. 39, 265–270. doi: 10.11975/j.issn.1002-6819.202207160

Keywords: continuous wavelet transform (CWT), near-infrared, random forest (RF), Shapley additive explanations (SHAP), tannins

Citation: Xia Q, Luo L, Yerzati Y, Ahmed MM, Chen Y, Wang S, Qin J, Chen L, Jin Q, Guo Z and Zhang R (2026) Near-infrared prediction of tannin content in walnut kernels using wavelet transform combined with interpretable machine learning models. Front. Plant Sci. 17:1746869. doi: 10.3389/fpls.2026.1746869

Received: 17 November 2025; Revised: 05 January 2026; Accepted: 16 January 2026; Published: 06 February 2026.

Edited by:

Zheli Wang, Hebei University of Economics and Business, China

Reviewed by:

Jaime Cuevas, University of Quintana Roo, Mexico
Zhu Zhou, Zhejiang Agriculture and Forestry University, China

Copyright © 2026 Xia, Luo, Yerzati, Ahmed, Chen, Wang, Qin, Chen, Jin, Guo and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Rui Zhang

†These authors have contributed equally to this work
