- Center of Sustainable Soil Sciences (C3S), College of Agriculture and Environmental Science (CAES), Mohammed VI Polytechnic University (UM6P), Ben Guerir, Morocco
Mid-infrared (MIR) spectroscopy is a powerful, eco-friendly, and cost-effective technique for predicting soil properties. However, its predictive accuracy can be affected by factors such as moisture content, particle size, sensor variability, and baseline noise. To address these limitations, this study investigated the impact of combining various preprocessing techniques with variable selection methods on the performance of partial least squares regression (PLSR) models. Soil samples from the Rhamna region of Morocco were analyzed to estimate key properties, including total nitrogen (TN), total carbon (TC), total organic carbon (TOC), clay, silt, sand, moisture content (MC), pH, phosphorus (P2O5), and cation exchange capacity (CEC). Spectral data were preprocessed using standard normal variate (SNV), Savitzky–Golay smoothing (SG smoothing), first and second derivatives (SG1D and SG2D), and their combinations (e.g., SNV + SG2D). The best-performing preprocessing combinations were then used with three variable selection approaches: interval PLS (iPLS), variable importance in projection (VIP), and selectivity ratio (SR). The results indicated that Savitzky–Golay (SG) derivatives combined with SNV generally improved model performance across most soil properties. In particular, TN prediction improved primarily with the first SG derivative, with R2cv increasing from 0.82 (raw spectra) to 0.88 (SG1D), while RMSEcv decreased from 0.03% to 0.01%. Further improvements were achieved through variable selection, with iPLS providing the most consistent enhancement across properties while using far fewer features than the other methods. Overall, the integration of optimal preprocessing and iPLS variable selection significantly improved the predictive accuracy and robustness of PLSR models for soil property estimation compared with full-spectrum models.
1 Introduction
Mid-infrared (MIR) spectroscopy has emerged as a powerful, non-destructive, eco-friendly, and low-cost soil analysis tool. It offers many advantages, such as high resolution, a wide absorbance range, and strong signals, which make it popular and important in the study of soil environments (1). The combination of MIR data with machine learning, especially partial least squares regression (PLSR), is widely used to predict soil properties (2) and has demonstrated excellent predictive capabilities for various properties such as total carbon, nitrogen, clay, potassium, pH, sand, and silt (3–5).
However, the accuracy of soil spectroscopy models based on infrared spectral data can be affected by several factors. One of them is moisture: (6) found that the moisture level of the studied sample affects model accuracy for different soil properties. Soil particle size was also found to affect model accuracy (7), as do baseline drift and background noise (8). The data source may also matter, because benchtop, mobile, and homemade sensors perform differently in predicting soil properties (9).
These factors vary significantly across soil environments, potentially affecting the robustness and transferability of commonly applied preprocessing strategies. Most existing preprocessing workflows have been developed and validated using datasets from well-studied sources, leaving uncertainties regarding their generalized applicability across diverse and underrepresented soil contexts (10, 11). This highlights the need to evaluate variable selection approaches in combination with optimized preprocessing, especially for datasets originating from less-studied soil regions.
To overcome these challenges, data preprocessing is an efficient tool that can significantly increase the predictability of soil properties using mid-infrared spectral data (12). Different preprocessing methods have been applied to mid-infrared data, such as standard normal variate (SNV), normalization, Savitzky–Golay smoothing, and 1st and 2nd derivatives. These transformations have shown excellent to acceptable results when predicting soil properties using the partial least squares algorithm (13, 14). Likewise, combining pretreatments can increase the accuracy of the predictive model: (15) found that applying normalization after a moving average is the best preprocessing step for MIR data to predict pH, organic carbon, Mg, and moisture content, while applying standard normal variate after a moving average is the best approach to predict phosphorus. Further research by (16) suggests that the combination of normalization and the Savitzky–Golay 1st derivative leads to better partial least squares regression (PLSR) performance.
Preprocessing techniques and their combinations strongly influence model development and behavior; however, variable selection is crucial for further improving the model, because it can provide even greater prediction accuracy in a shorter computing time than the full-spectrum model (17). Various studies have applied variable selection methods, such as interval-PLS (iPLS) and variable importance in projection (VIP), to predict soil properties from infrared spectral data and have reported strong predictive performance (18, 19). Nevertheless, recent work on Moroccan soils indicates that iPLS and VIP mainly contribute to model simplification rather than systematic accuracy improvements (20).
This study aimed to investigate the impact of combining preprocessing (standard normal variate, Savitzky–Golay smoothing, 1st and 2nd derivatives) with iPLS, VIP, and selectivity ratio as variable selection methods on the prediction accuracy of the PLSR model to predict total nitrogen (TN), total carbon (TC), total organic carbon (TOC), clay, silt, sand, moisture content (MC), pH, phosphorus (P2O5), and cation exchange capacity (CEC).
2 Materials and methods
2.1 Soil sample collection and preparation
Fifty soil samples were collected from different locations and depths in the Rhamna region, to assess the impact of combining statistical preprocessing and variable selection methods on the prediction accuracy of some soil properties. Each soil sample was finely ground, sieved, and oven-dried at 39°C for 48 h before analysis.
2.2 The structural framework of the study
The standard procedure for using spectral data to calibrate multivariate models to predict soil properties begins with the construction of a database. To ensure accuracy, one or more preprocessing techniques intended to improve the signals and reduce noise interference can be implemented. Subsequently, relevant variables are identified using suitable variable selection methods. The next stage involves developing a modeling framework and assessing the models that have been built. Statistical metrics, such as the root mean square error (RMSE) of prediction and the coefficient of determination (R2), are used to evaluate the accuracy of these models (21).
This study aimed to examine the impact of preprocessing tools on predictive models; therefore, the procedure outlined below and in Figure 1 was followed. Models generated from raw data and from several preprocessing algorithms (standard normal variate, Savitzky–Golay smoothing, Savitzky–Golay 1st and 2nd derivatives, Savitzky–Golay 1st derivative coupled with SNV, and Savitzky–Golay 2nd derivative coupled with SNV) were evaluated. Furthermore, variable importance in projection (VIP), selectivity ratio (SR), and interval-PLS (iPLS) were used to highlight important variables. The resulting data were then used to generate PLSR models to predict TN, TC, TOC, clay, silt, sand, MC, pH, P2O5, and CEC in Moroccan soils.
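As a minimal sketch of the two evaluation metrics named above, the Python snippet below computes RMSE and R2 from paired reference and predicted values. It assumes that R2 is defined as 1 − SSres/SStot, which the text does not state explicitly; the function names and the sample values are illustrative and do not come from the study.

```python
import math

def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    n = len(reference)
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(reference, predicted)) / n)

def r_squared(reference, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot

# Hypothetical reference and cross-validated values, for illustration only.
y_ref = [0.10, 0.12, 0.08, 0.15, 0.11]
y_cv = [0.11, 0.12, 0.09, 0.14, 0.10]
print(rmse(y_ref, y_cv), r_squared(y_ref, y_cv))
```

When the predicted values come from cross-validation, the same two functions give the RMSEcv and R2cv figures of merit.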
Figure 1. Workflow for evaluating the impact of preprocessing tools and their combinations with variable selection on PLSR models for estimating soil properties.
2.3 Wet chemistry
Soil physical–chemical properties (Table 1) were analyzed in the Soil Testing Laboratories of Mohammed VI Polytechnic University (UM6P): total nitrogen (TN) by the combustion method (ISO 13878); total carbon (TC) by the combustion method (ISO 10694); total organic carbon (TOC) (ISO 10694); sand, silt, and clay by the Bouyoucos method (ISO 13317); soil pH in a 1:5 soil-to-water suspension (ISO 10390); moisture content (MC) by the gravimetric method (ISO 11461); available phosphorus (P2O5) by the Olsen method (ISO 11263); and cation exchange capacity (CEC) by the hexamine-cobalt method (NF ISO 23470).
2.4 Chemometrics analysis
2.4.1 Data processing
Chemometrics is essential for fully exploiting the potential of contemporary instruments (22). In numerous investigations, it has been used to estimate different soil parameters from spectral data, such as near- or mid-infrared data, using various types of models, such as partial least squares regression (PLSR) (2). As indicated by (23), PLSR is the most widely used modeling method. It has proven to be a good mathematical model for predicting various soil properties, such as soil organic carbon, total nitrogen, cation exchange capacity, clay, sand, pH, and other properties (3, 4). Predictions based on unprocessed laboratory spectra of soil samples appear promising for quantitatively estimating soil chemical properties (24). However, the accuracy of this information can be affected by different factors, such as moisture, variations in soil roughness, and soil particle size. Therefore, data preprocessing methods that can enhance spectral information must be implemented (7, 24). Numerous preprocessing techniques, including standard normal variate, Savitzky–Golay smoothing, 1st derivative, and 2nd derivative, have been used in various studies and have demonstrated good prediction capabilities for soil properties (25, 26). The standard normal variate is a simple approach to spectral data normalization that works row by row and is aimed at correcting for light scattering using the formula $x_i^{\mathrm{SNV}} = (x_i - \bar{x}_i)/s_i$, where $x_i$ represents the signal of the $i$th observation, $\bar{x}_i$ is its mean, and $s_i$ is its standard deviation (27).
Savitzky–Golay derivative preprocessing is one of the most extensively used strategies for reducing baseline drift and noise. The first- and second-order derivatives each play a specific role: the first derivative removes only a constant baseline offset, whereas the second derivative is recommended to address both the offset and a linear trend in the baseline (28). The Savitzky–Golay derivative uses the same procedure as the Savitzky–Golay smoothing described by (29), but with an added step. After a polynomial is fitted to a window around the moving point, the derivative of the fitted polynomial is calculated, and its value is used as the derivative estimate for the central point (29).
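The row-wise SNV transform described above can be sketched in a few lines of Python. This is an illustrative example rather than the study's implementation (which used R); the function name `snv`, the use of the sample standard deviation, and the two toy spectra are assumptions.

```python
import statistics

def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum (one row)."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)  # sample standard deviation (n - 1), an assumption
    return [(value - mean) / sd for value in spectrum]

# Two hypothetical spectra that differ only by a constant offset of 0.4.
spectra = [[0.20, 0.35, 0.50, 0.41], [0.60, 0.75, 0.90, 0.81]]
corrected = [snv(row) for row in spectra]
print(corrected)
```

Because each row is centred and scaled with its own mean and standard deviation, the two toy spectra become practically identical after the transform, which is the additive-offset correction SNV is meant to provide.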
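The fit-then-differentiate step can be illustrated with a small pure-Python sketch. It fits a least-squares polynomial in each moving window and returns the derivative at the window centre. This is a sketch under stated assumptions, not the routine used in the study: the helper names are hypothetical, unit spacing between points is assumed, and edge points that lack a full window are simply dropped.

```python
import math

def solve(a, b):
    """Solve a small linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def sg_weights(window, poly_order, deriv):
    """Filter weights giving the deriv-th derivative of the fitted polynomial at the centre."""
    half = window // 2
    design = [[z ** j for j in range(poly_order + 1)] for z in range(-half, half + 1)]
    size = poly_order + 1
    # Normal equations (A^T A) u = e_deriv select the deriv-th polynomial coefficient.
    ata = [[sum(row[i] * row[j] for row in design) for j in range(size)] for i in range(size)]
    unit = [1.0 if j == deriv else 0.0 for j in range(size)]
    u = solve(ata, unit)
    scale = math.factorial(deriv)  # d-th derivative at zero equals d! times coefficient d
    return [scale * sum(uj * aj for uj, aj in zip(u, row)) for row in design]

def sg_filter(signal, window, poly_order, deriv):
    """Apply the filter to every point that has a full window (edges are dropped)."""
    half = window // 2
    weights = sg_weights(window, poly_order, deriv)
    return [sum(w * signal[i + k - half] for k, w in enumerate(weights))
            for i in range(half, len(signal) - half)]

# Toy signal y = x^2: the first derivative is 2x and the second derivative is 2.
signal = [x * x for x in range(20)]
first = sg_filter(signal, window=3, poly_order=1, deriv=1)
second = sg_filter(signal, window=15, poly_order=2, deriv=2)
print(first[:3], second[:3])
```

The two calls reuse the derivative settings reported later in Section 2.5, interpreting the reported "number of windows" as the window width in points, which is an assumption.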
2.4.2 Variable selection
In areas where datasets with many variables are available, variable and feature selection has attracted considerable interest in the research community. The variable selection process has many advantages, such as enhancing data comprehension and visualization, lowering measurement and storage requirements, shortening training and application times, and reducing the impact of high dimensionality to increase prediction accuracy (30). Variable selection techniques such as interval-PLS (iPLS) and variable importance in projection (VIP) have proven effective at increasing model accuracy (31, 32). The selectivity ratio is also beneficial for increasing predictive accuracy by concentrating on a small, highly relevant subset of variables (33).
2.4.3 Variable selection with interval PLS
Interval partial least squares (iPLS) is an approach for finding intervals of variables that are highly important for prediction. The algorithm splits the variables (predictors) into intervals and attempts to find the combination of intervals that gives the highest predictive accuracy. This interval selection can be performed in two ways: in the forward method, intervals are included successively, whereas in the backward method, intervals are excluded successively (34). In this study, forward iPLS was applied in R to the data after selecting the optimal preprocessing.
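The forward variant can be sketched as a greedy search over intervals. The Python example below is only an illustration of the selection logic: `cv_error` stands in for the cross-validated error of a PLS model built on the chosen columns, and here it is replaced by a hypothetical toy function, so no PLS model is actually fitted. The stopping rule (stop when no remaining interval lowers the error) is also an assumption.

```python
def split_into_intervals(n_variables, n_intervals):
    """Split variable indices 0..n_variables-1 into contiguous, near-equal intervals."""
    bounds = [round(k * n_variables / n_intervals) for k in range(n_intervals + 1)]
    return [list(range(bounds[k], bounds[k + 1])) for k in range(n_intervals)]

def forward_interval_selection(intervals, cv_error):
    """Greedy forward selection: add the interval that lowers cv_error most."""
    selected, best_error = [], float("inf")
    remaining = list(range(len(intervals)))
    while remaining:
        trials = []
        for k in remaining:
            columns = [c for i in selected + [k] for c in intervals[i]]
            trials.append((cv_error(columns), k))
        error, k = min(trials)
        if error >= best_error:  # no remaining interval improves the model
            break
        selected.append(k)
        remaining.remove(k)
        best_error = error
    return selected, best_error

# Hypothetical stand-in for a cross-validated PLS error: columns 2, 3, 10, and 11
# carry the signal, and every uninformative column adds a small penalty.
INFORMATIVE = {2, 3, 10, 11}

def toy_cv_error(columns):
    missing = len(INFORMATIVE - set(columns))
    noise = len(set(columns) - INFORMATIVE)
    return missing + 0.1 * noise

intervals = split_into_intervals(n_variables=12, n_intervals=6)
chosen, error = forward_interval_selection(intervals, toy_cv_error)
print(chosen, error)
```

In a real workflow, `cv_error` would fit a PLS model on the candidate columns and return its cross-validated RMSE; the backward variant would start from all intervals and remove them one at a time.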
2.4.4 Variable importance in projection
Variable importance in projection-partial least squares (VIP-PLS) is a multivariate statistical methodology employed to evaluate the significance of individual indicators in influencing an aggregate index. The primary objective of VIP-PLS is to establish a hierarchical ranking of indicators based on their respective degrees of importance to an aggregate index. This facilitates a clear understanding of the relative contributions of the individual indicators (35). The VIP score is calculated using the following equation: $\mathrm{VIP}_j = \sqrt{\dfrac{J \sum_{f=1}^{F} w_{jf}^{2}\, SSY_f}{SSY_{\mathrm{total}} \cdot F}}$, where $w_{jf}$ represents the weight value for the $j$th variable and the $f$th component, $SSY_f$ is the sum of squares of explained variance for the $f$th component, and $J$ is the number of X variables. $SSY_{\mathrm{total}}$ is the total sum of squares explained by the dependent variable, and $F$ is the total number of components (36).
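As a hedged numerical illustration, the snippet below evaluates a VIP score from the quantities defined above, assuming the form VIP_j = sqrt(J · Σ_f w_jf² · SSY_f / (SSY_total · F)), in which the number of components F appears in the denominator as the symbol definitions suggest. The weights and sums of squares are made-up numbers, not values from the study.

```python
import math

def vip_scores(weights, ssy, ssy_total):
    """VIP per variable; weights[j][f] is the weight of variable j on component f."""
    n_variables = len(weights)   # J
    n_components = len(ssy)      # F
    scores = []
    for row in weights:
        weighted = sum(w * w * s for w, s in zip(row, ssy))
        scores.append(math.sqrt(n_variables * weighted / (ssy_total * n_components)))
    return scores

# Hypothetical model with J = 3 variables and F = 2 components.
weights = [[0.6, 0.0],
           [0.8, 0.6],
           [0.0, 0.8]]
ssy = [8.0, 2.0]   # explained sum of squares per component (made up)
vip = vip_scores(weights, ssy, ssy_total=10.0)
print(vip)
```

Variables are then ranked by their score, and those above a chosen threshold are retained; the threshold itself is a modeling choice that is not specified here.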
2.4.5 Selectivity ratio
The selectivity ratio is the ratio of the explained variance to the residual variance for each variable in a dataset, which is calculated after applying a target projection in a multivariate analysis. This ratio is based on the ability of a variable to discriminate between the different groups of samples. A high selectivity ratio indicates that a variable has a strong discriminative advantage (37). The selectivity ratio is defined as follows: $SR_i = SS_{i,\mathrm{explained}} / SS_{i,\mathrm{residual}}$. Here, $SS_{i,\mathrm{explained}}$ represents the portion of the variance of variable $i$ that is explained by the target projection component, that is, the part aligned with the PLS regression direction. In contrast, $SS_{i,\mathrm{residual}}$ denotes the remaining or unexplained variance of that variable after subtracting the target-projected part (the residual variance in the model). Therefore, the ratio quantifies the relevance of each variable's variance to prediction versus residual noise, with higher SR values indicating greater predictive importance (36).
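The ratio can be sketched in Python as follows, assuming the common target-projection construction in which the centred data are projected onto the normalized regression vector. The data matrix and regression vector below are hypothetical, and the function name is illustrative.

```python
import math

def selectivity_ratio(x_centred, b):
    """Explained / residual sum of squares per variable after target projection."""
    norm_b = math.sqrt(sum(v * v for v in b))
    w = [v / norm_b for v in b]                        # normalized regression direction
    t = [sum(xi * wi for xi, wi in zip(row, w)) for row in x_centred]  # target scores
    tt = sum(v * v for v in t)
    ratios = []
    for i in range(len(b)):
        p_i = sum(row[i] * tn for row, tn in zip(x_centred, t)) / tt   # target loading
        ss_explained = tt * p_i ** 2
        ss_residual = sum((row[i] - tn * p_i) ** 2 for row, tn in zip(x_centred, t))
        ratios.append(ss_explained / ss_residual if ss_residual > 0 else math.inf)
    return ratios

# Hypothetical column-centred data (4 samples x 3 variables) and regression vector.
x = [[1.0, 1.0, 2.0],
     [-1.0, -1.0, -2.0],
     [1.0, -1.0, 1.0],
     [-1.0, 1.0, -1.0]]
b = [1.0, 1.0, 0.0]
sr = selectivity_ratio(x, b)
print(sr)
```

In this toy case the third variable has the largest share of its variance aligned with the target direction, so it receives the highest ratio.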
2.5 Spectra collection and preprocessing
Spectral data were acquired at the soil spectroscopy laboratory of the Center of Sustainable Soil Sciences (C3S), Mohammed VI Polytechnic University (UM6P), using a Bruker Tensor II bench-top spectrometer coupled with the diffuse reflectance infrared Fourier transform spectroscopy (DRIFT) technique. Each spectrum was the average of 60 scans per sample, recorded over the 4,000–600 cm−1 range at a resolution of 4 cm−1.
Prediction models were built using the entire FTIR spectra measured for the soil samples. Nine groups of PLS models were established, namely raw spectral data, SNV, SG smoothing (polynomial order = 0 and number of windows = 3), 1st SG derivative (polynomial order = 1, number of windows = 3, and derivative order = 1), 2nd SG derivative (polynomial order = 2, number of windows = 15, and derivative order = 2), SNV coupled with 1st SG derivative, 1st SG derivative coupled with SNV, 2nd SG derivative coupled with SNV, and SNV coupled with 2nd SG derivative data. The optimal number of wavelengths was identified by evaluating multiple configurations, with selection based on predictive accuracy. The models were validated using leave-one-out cross-validation (38), and R2cv and the root mean square error of cross-validation (RMSEcv) were calculated to evaluate the accuracy of each model. Based on these metrics, the best prediction model was selected, and iPLS, VIP, and SR were then applied to it as variable selection techniques. PLSR modeling was conducted in the R environment using the mdatools package (39). Spectral preprocessing was performed using prospectr (40), and variable selection techniques were implemented using the plsVarSel package (41).
To test whether the optimal preprocessing combined with variable selection improved the model performance, a paired, non-parametric Wilcoxon signed-rank test (α = 0.05) was applied to the cross-validation metrics (R2cv and RMSEcv), comparing the raw-data models with the models built using the optimal preprocessing combined with the best variable selection approach (42, 43). The tests were performed collectively across all 10 soil properties.
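For a sample as small as 10 paired values, the signed-rank test can be computed exactly by enumerating all sign assignments. The Python sketch below illustrates this; it is not the routine used in the study, and the paired R2cv values are hypothetical placeholders rather than reported results.

```python
from itertools import product

def wilcoxon_signed_rank(before, after):
    """Exact two-sided paired Wilcoxon signed-rank test for small samples."""
    diffs = [a - b for a, b in zip(after, before) if a != b]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average rank for tied magnitudes
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = sum(ranks) / 2
    observed = abs(w_plus - centre)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** len(ranks)

# Hypothetical R2cv values for 10 properties (not the study's results).
r2_raw = [0.80, 0.85, 0.75, 0.79, 0.81, 0.77, 0.78, 0.65, 0.58, 0.45]
r2_opt = [0.92, 0.93, 0.91, 0.88, 0.94, 0.90, 0.91, 0.90, 0.66, 0.76]
w_plus, p_value = wilcoxon_signed_rank(r2_raw, r2_opt)
print(w_plus, p_value, p_value < 0.05)
```

When all 10 differences share the same sign, only 2 of the 1,024 sign patterns are as extreme as the observed one, so the smallest attainable two-sided p-value for 10 pairs is 2/1024.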
3 Results and discussion
3.1 Raw data
Figure 2 illustrates the FTIR spectra of the soil samples for the raw data and after each preprocessing technique. The raw spectra show the original features of all the studied samples, which can be divided into four regions. The first, from 4,000 cm−1 to 2,500 cm−1, is characterized by single-bond stretching (C–N, C–H, O–H); the second, located between 2,500 and 2,000 cm−1, represents triple bonds (C≡N). Double bonds are found in the third region, which extends from 2,000 cm−1 to 1,500 cm−1. The last region, between 1,500 cm−1 and 400 cm−1, is defined as the fingerprint zone (C–C, C–O, C–N) (44). As shown in Table 2, the prediction model built using PLSR on raw data yielded good predictions for TN (R2cv = 0.82), TC (R2cv = 0.88), clay (R2cv = 0.81), silt (R2cv = 0.82), and MC (R2cv = 0.8), and acceptable predictions for sand (R2cv = 0.79), TOC (R2cv = 0.78), pH (R2cv = 0.66), and P2O5 (R2cv = 0.6), whereas CEC was poorly predicted (R2cv = 0.46).
Figure 2. FTIR spectra of the soil samples before preprocessing and after applying different statistical pretreatment methods (SNV, SG smoothing, Savitzky–Golay 1st and 2nd derivatives, SNV coupled with Savitzky–Golay 1st derivative, Savitzky–Golay 1st derivative coupled with SNV, Savitzky–Golay 2nd derivative coupled with SNV, and SNV coupled with Savitzky–Golay 2nd derivative).
Table 2. Figures of merit for the prediction of total nitrogen (TN), total carbon (TC), total organic carbon (TOC), clay, silt, sand, moisture (MC), pH, phosphorus (P2O5), and cation exchange capacity (CEC) using mid-infrared (MIR) data and preprocessed data with the application of iPLS, VIP, and SR.
3.2 Preprocessed data
As shown in Figure 2, preprocessing changes the appearance of the spectra. After SNV, the spectra retain a similar shape but become tightly grouped and follow a consistent baseline, with reduced inter-sample variation, showing that SNV scales the spectra to a common range while preserving their features. SG smoothing is expected to reduce noise while preserving the peaks; however, it had no visible effect here, because the smoothed spectra are nearly indistinguishable from the raw spectra.
For the SG 1st derivative, changes in absorption are highlighted and the peaks become more pronounced, which helps to distinguish individual spectral features. The positions of the absorption changes are easier to identify, and the baseline of the derivative spectra tends to be flatter and closer to zero, although the spectra are noisier. The 2nd derivative exhibited the same changes, except that it corrected the baseline drift more effectively and exaggerated some data points, making the spectra noisier than those of the 1st derivative.
When the 1st derivative is applied after SNV, the noise increases compared with the raw data, but the peaks are more pronounced, the baseline is corrected, and a cleaner and more detailed representation of spectral variations is obtained. The reverse procedure (applying SNV to the 1st derivative) gives spectra with similar properties, but applying the 1st derivative to the SNV-corrected spectra is more effective in reducing the variability.
After applying the SNV to the 2nd derivative, the baseline was corrected with highlighted peaks. When applying the 2nd derivative to the SNV, this preprocessing yielded a more detailed spectrum with improved resolution. The slope differences and baseline drift were corrected compared to the raw data, while the previous preprocessing was still better at correcting the baseline drift. Small peaks were highlighted by low-frequency noise.
3.3 Soil property prediction
The analysis involved 12 models (raw data, SNV, SG smoothing, SG1D, SG2D, SNV + SG1D, SG1D + SNV, SG2D + SNV, SNV + SG2D, iPLS, VIP, and SR) for each of the 10 studied soil properties (TN, TC, TOC, clay, silt, sand, MC, pH, P2O5, and CEC). The predictive accuracy of each model was assessed through R² and RMSE, while the optimal preprocessing to which variable selection was applied was chosen using two criteria: predictive accuracy (R² and RMSE) and model complexity.
3.3.1 Total nitrogen
The results of predicting TN using PLS and different preprocessing methods showed that the Savitzky–Golay 1st derivative was the preprocessing method that yielded the best prediction, with an R2cv of 0.88 and an RMSEcv of 0.01%. The Savitzky–Golay 2nd derivative, SNV coupled with SG1D, SG1D coupled with SNV, and SNV with SG2D also yielded good predictions that exceeded the raw data prediction, with R2cv ranging between 0.85 and 0.88 and RMSEcv between 0.02% and 0.03%. SNV and SG smoothing yielded good predictions, but these were lower than those of the raw data (R2cv < 0.8 and RMSEcv = 0.03%).
3.3.2 Total carbon
None of the preprocessing techniques improved the prediction of total carbon; all generated an R2cv below that of the raw data, except Savitzky–Golay smoothing, which yielded a similar R2cv (0.88) but an RMSEcv (0.28%) higher than that of the raw data. The inconsistent performance observed for total carbon using raw MIR spectra likely reflects fundamental spectroscopic limitations rather than deficiencies in model calibration. Total carbon includes both organic and inorganic carbon fractions, which exhibit different MIR spectral behaviors. Organic carbon is mainly associated with broad and chemically informative absorption features, whereas inorganic carbonates show more specific bands that may overlap with organic signals (45, 46).
Consequently, raw spectra may already capture the dominant organic carbon information, whereas common preprocessing techniques can unintentionally suppress or distort these broad features. This likely explains why preprocessing did not improve and, in some cases, reduced the predictive performance compared with the raw spectra.
3.3.3 Total organic carbon
The combination of Savitzky–Golay’s 1st derivative and SNV yielded good predictions for total organic carbon (R2cv = 0.81, RMSEcv = 0.13%), with a better calibration performance of SNV applied to SG1D (SG1D + SNV). Savitzky–Golay’s 2nd derivative and SG2D applied to SNV resulted in the same predictive performance as the raw data models (R2cv = 0.78, RMSEcv = 0.14%). Conversely, SNV, SG smoothing, and SG2D + SNV generated acceptable results (0.76 < R2cv < 0.77, 0.14% < RMSEcv < 0.15%) but did not show any improvement in predictive performance (lower than the raw data).
3.3.4 Clay
The results of predicting clay content using PLS and various preprocessing methods highlighted that SG1D + SNV was the optimal preprocessing method with the highest R2cv (0.86) and lowest RMSEcv (7.04%). SNV, SG1D, SG2D, SNV + SG1D, SG2D + SNV, and SNV + SG2D also exhibited good predictive performance, with an R2cv higher than and an RMSEcv lower than the raw data. SG smoothing demonstrated an R2cv similar to raw data (0.81) but higher RMSEcv (8.25%).
3.3.5 Silt
With an R2cv of 0.86 and an RMSEcv of 3.72%, applying the Savitzky–Golay 2nd derivative after SNV (SNV + SG2D) produced the highest accuracy for silt content prediction using PLS. Savitzky–Golay smoothing had a predictive performance similar to that of the raw data, with an R2cv of 0.81 and a slight difference in the RMSEcv (4.27%). The SG1D, SNV + SG1D, SG1D + SNV, and SG2D + SNV techniques also raised R2cv to 0.86 and lowered RMSEcv to 3.8%. In contrast, SNV and SG2D did not improve the predictive accuracy of the model and yielded inferior values (R2cv = 0.62 and RMSEcv = 4.07% for SNV; R2cv = 0.6 and RMSEcv = 3.74% for SG2D).
3.3.6 Sand
As shown in Table 2, the results of the sand content prediction using PLS and multiple preprocessing methods showed that the SNV + SG2D combination yielded the highest R2cv (0.86) and the lowest RMSEcv (8.87%). The Savitzky–Golay 1st and 2nd derivatives, SNV + SG1D, SG1D + SNV, and SG2D + SNV also performed well and better than the raw data, with an RMSEcv of 9.78%, 9.48%, 9.36%, 9.4%, and 9.28%, respectively, and an R2cv of 0.83 for SG1D and 0.84 for the other preprocessing methods. SNV did not improve the prediction, giving results similar to those of the raw data (R2cv = 0.79, RMSEcv = 10.78%). In contrast to all the previous preprocessing methods, SG smoothing lowered R2cv to 0.78 and raised RMSEcv to 11.04%, suggesting that smoothing is unsuitable for the prediction of sand in this context.
3.3.7 Moisture content
All the derivative preprocessing methods and their combinations with SNV improved the prediction accuracy of the moisture content, with R2cv ranging between 0.8 and 0.85 and RMSEcv between 0.68% and 0.7%. The highest R2cv (0.85) and lowest RMSEcv (0.67%) were generated by SG1D + SNV, ranking it as the best preprocessing method to predict MC in this context. SNV alone could not enhance the model's predictive accuracy and delivered results similar to those of the raw data (R2cv = 0.8, RMSEcv = 0.79%). In contrast, SG smoothing increased the RMSEcv to 0.82% and decreased the R2cv to 0.78.
3.3.8 pH
The performance metrics showed that SG smoothing and SG2D + SNV decreased the predictive performance of the model compared to the raw data model, with R2cv values of 0.61 and 0.58 and RMSEcv values of 0.39 and 0.41, respectively. The SG1D + SNV also decreased the predictive performance, even though this preprocessing yielded a similar R2cv (0.66) to the raw data but a higher RMSEcv (0.37), which led to a higher predictive error. In contrast, SNV + SG2D generated a similar RMSEcv but a higher R2cv (0.67). The remaining preprocessing methods (SNV, SG1D, SG2D, and SNV + SG1D) improved the model performance by lowering the RMSEcv and increasing the R2cv, suggesting SNV + SG1D as the optimal preprocessing method for predicting pH (R2cv = 0.73, RMSEcv = 0.32).
3.3.9 Phosphorus
Overall, none of the preprocessing methods improved, or even maintained, the performance of the raw data model, except for SNV and SG smoothing, which showed better performance and nearly identical results (R2cv = 0.62 and RMSEcv = 24.29 mg/kg for SNV; R2cv = 0.61 and RMSEcv = 24.74 mg/kg for SG smoothing). The low MIR prediction accuracy obtained for soil phosphorus agrees with previous studies reporting unreliable spectroscopic prediction of phosphorus due to its weak association with MIR-active soil components (47).
3.3.10 Cation exchange capacity
In contrast to phosphorus, all preprocessing methods improved the prediction accuracy of the CEC model, except for SG smoothing, which yielded results similar to those of the raw model (R2cv = 0.46, RMSEcv = 6.3 meq/100 g). The results also show that the best preprocessing for CEC prediction using PLSR is SG2D + SNV, with the highest R2cv (0.59) and the lowest RMSEcv (5.52 meq/100 g). The moderate predictive performance obtained for CEC can be explained by the limitations of MIR spectroscopy. The cation exchange capacity is not spectrally active, and its prediction relies on indirect correlations with clay mineralogy and soil organic matter rather than direct absorption features. Reference (48) explicitly highlights this limitation, showing that accurate MIR-based CEC prediction is strongly dependent on the dataset size and modeling strategy.
Overall, the results presented in Tables 2 and 3 suggest that the application of Savitzky–Golay (SG) preprocessing, either alone or in combination with standard normal variate (SNV), improves the model performance for most soil properties. However, the order of application (applying SNV to the derivative or applying the derivative to the SNV-corrected spectra) plays a crucial role in decreasing or increasing the model performance. This combination yielded strong predictive performance for most soil parameters, indicating that it was the most effective preprocessing technique tested, except for TN, TC, and P2O5. SNV, Savitzky–Golay derivatives, and their combinations offered the best results for different soil properties and for different regression techniques (partial least squares, support vector regression, cubist, and convolutional neural networks) in a similar study by (49). In a study by (50) on in situ and indoor monitoring of total nitrogen at different depths, the optimal preprocessing combinations for the best-performing models (both indoor and in situ) consistently included standard normal variate and derivative preprocessing. Other studies have found other optimal preprocessing combinations to predict soil properties; however, their optimal combination always includes derivative pretreatment, which highlights its role in improving model accuracy (26, 51).
3.4 Preprocessed data with variable selection
3.4.1 Interval partial least square variable selection
The R2cv and RMSEcv of the models generated by combining the optimal preprocessing with iPLS indicated a high level of prediction ability for most soil parameters, with an R2cv above 0.90 for TN, TC, TOC, silt, sand, MC, and pH, and an R2cv of 0.77 for CEC and 0.67 for P2O5, as illustrated in Figure 3. The resulting RMSEcv was lower than that obtained with the raw data and with the optimal preprocessing alone. Figure 4 illustrates that the selected intervals varied among soil properties. Sand required the fewest intervals (40), whereas P2O5 required the most (150). The remaining properties, TN, TC, TOC, clay, silt, MC, pH, and CEC, required 45, 55, 110, 88, 60, 50, 66, and 70 intervals, respectively. These results indicate that the combination of iPLS and optimal preprocessing can improve the predictive accuracy of soil properties. Table 2 and Figure 3 show that the predictive performance improved for nearly all properties with this methodological integration. These values are among the highest accuracies obtained in this study, demonstrating the robustness of the optimal preprocessing + iPLS method.
Figure 3. Reference vs. predicted values of different PLSR models for the prediction of total nitrogen (TN), total carbon (TC), total organic carbon (TOC), clay, silt, sand, moisture (MC), pH, phosphorus (P2O5), and cations exchange capacity (CEC) after the application of the optimal preprocessing and interval-PLS.
Figure 4. Selected intervals of total nitrogen (TN), total carbon (TC), total organic carbon (TOC), clay, silt, sand, moisture (MC), pH, phosphorus (P2O5), and cation exchange capacity (CEC) after the application of Savitzky–Golay 1st derivative and iPLS.
3.4.2 Variable importance in projection
Improved predictive accuracy for the studied soil attributes was also achieved by applying the variable importance in projection variable selection method to the optimal preprocessing, except for clay, moisture content, and phosphorus, which did not show any improvement using this technique. As shown in Table 2, VIP showed excellent predictive accuracy for TC with an R2cv of 0.91 and an RMSEcv of 0.24%, and good predictive accuracy for TN, TOC, silt, and sand, with an R2cv ranging between 0.82 and 0.89. In comparison, pH and CEC exhibited acceptable accuracy with an R2cv of 0.75 and an RMSEcv of 0.31 for pH and 4.32% for CEC.
3.4.3 Selectivity ratio
Compared with VIP and iPLS, SR was the least effective technique for improving the predictive accuracy of soil properties, as it did not show any improvement for total nitrogen, moisture content, clay, and pH. SR improved the predictive accuracy for total organic carbon and silt, with R2cv and RMSEcv values close to those of VIP. The same was observed for total carbon, except that the R2cv and RMSEcv values were close to those of iPLS. Sand, P2O5, and CEC were improved using SR, with R2cv values of 0.88, 0.63, and 0.73 and an RMSEcv of 8.28%, 24.01%, and 4.43%, respectively.
Overall, combining optimal preprocessing with variable selection improved the accuracy of the developed models by increasing R2cv and decreasing model error. Among the variable selection techniques, iPLS outperformed both VIP and SR. iPLS improved the metrics of all properties, whereas VIP and SR failed to improve certain properties, particularly moisture content. In addition, the improvement obtained with iPLS was larger than that of the other techniques (lowest RMSEcv and highest R2cv). In terms of model complexity, as illustrated in Figure 5, SR and VIP selected wider wavenumber ranges, whereas iPLS selected narrower intervals, resulting in less complex models. iPLS selected few intervals for some properties, such as total carbon, but for others, such as pH, a larger number of intervals was needed to improve accuracy. Table 4 lists the intervals selected by iPLS and the number of selected wavenumbers for each property.
Spectral interval selection using iPLS was conducted on the preprocessed data. Because preprocessing modifies the physical meaning of individual wavenumbers, the selected intervals were used to optimize predictive performance and reduce model complexity rather than for direct chemical interpretation, which should be approached with caution when feature selection is applied (52). The proportion of selected MIR wavenumbers differed substantially among soil properties, highlighting contrasting levels of spectral redundancy and information density. MC required the largest fraction of the spectrum (68.74%), indicating that predictive information for this property is distributed across broad MIR regions. In contrast, P2O5 and clay were characterized by very limited spectral selections (3.15% and 5.50%, respectively), suggesting that their predictive signals are concentrated within narrow intervals. Intermediate selection levels were observed for organic-related properties (TN, TC, TOC), pH, and CEC, indicating a balance between spectral coverage and reduced model complexity.
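As a quick arithmetic check, the reported percentages can be converted into approximate wavenumber counts using the 2,380-point spectrum stated in the Table 4 caption. The counts below are obtained by rounding and are not copied from Table 4; the helper name `approx_count` is hypothetical.

```python
TOTAL_WAVENUMBERS = 2380  # initial set size stated in the Table 4 caption

# Percentages of the spectrum retained by iPLS, as reported in the text.
selected_pct = {"MC": 68.74, "P2O5": 3.15, "clay": 5.50}

def approx_count(pct, total=TOTAL_WAVENUMBERS):
    """Approximate number of wavenumbers behind a reported percentage."""
    return round(pct / 100 * total)

counts = {prop: approx_count(pct) for prop, pct in selected_pct.items()}
for prop, n in counts.items():
    print(f"{prop}: about {n} of {TOTAL_WAVENUMBERS} wavenumbers")
```

Under this assumption, MC retains roughly 1,636 wavenumbers, whereas P2O5 and clay retain roughly 75 and 131, which makes the contrast in model size explicit.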
Figure 5. Selected MIR wavenumber ranges by variable selection techniques (iPLS, VIP, and SR). A technique name written in red indicates that the technique did not yield any improvement.
Table 4. Mid-infrared wavenumber ranges selected by iPLS for total nitrogen (TN), total carbon (TC), total organic carbon (TOC), clay, silt, sand, moisture (MC), pH, phosphorus (P2O5), and cation exchange capacity (CEC), from an initial set of 2,380 MIR wavenumbers (3,997 cm−1–600 cm−1).
To assess whether the improvement in model performance after variable selection was statistically significant, a two-sided Wilcoxon signed-rank test was conducted on the cross-validated R2 values before and after applying the iPLS (α = 0.05). The test revealed a significant increase in predictive accuracy following iPLS variable selection (V = 55, p = 0.0029), confirming that the observed improvement in the R2cv values was statistically significant. This supports the effectiveness of iPLS in enhancing the PLSR model performance.
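The mechanics of the test can be illustrated with a short, dependency-free sketch of an exact two-sided Wilcoxon signed-rank test on paired values. The R2cv values below are hypothetical and are not the study's results, and the function name `signed_rank_test` is an assumption. The statistic V is taken as the sum of the ranks of the positive differences, so V = 55 is the largest value possible for ten pairs and occurs when every pair improves. The exact p-value printed here is not expected to match the reported one, which depends on how the software handles ties and approximations.

```python
from itertools import product

def signed_rank_test(before, after):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (V, p), where V is the sum of ranks of the positive differences.
    Zero differences are dropped and tied |differences| share average ranks.
    """
    diffs = [round(a - b, 10) for b, a in zip(before, after)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    v = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2
    # Null distribution: every rank is positive or negative with probability 1/2.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        s = sum(r for r, keep in zip(ranks, signs) if keep)
        if abs(s - center) >= abs(v - center) - 1e-12:
            extreme += 1
    return v, extreme / 2 ** n

# Hypothetical R2cv values for ten properties, before and after selection.
r2_before = [0.82, 0.85, 0.80, 0.70, 0.78, 0.84, 0.75, 0.72, 0.55, 0.68]
r2_after = [0.91, 0.93, 0.92, 0.74, 0.90, 0.92, 0.91, 0.91, 0.67, 0.77]

v, p = signed_rank_test(r2_before, r2_after)
print(f"V = {v:g}, exact two-sided p = {p:.4f}")
# prints: V = 55, exact two-sided p = 0.0020
```

Enumerating all sign patterns is practical only for small samples such as the ten properties considered here; statistical packages switch to a normal approximation for larger samples or when ties are present.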
The application of iPLS in the present study boosted model accuracy and reduced model complexity. In a related study (34), iPLS performed comparably to other variable selection methods (principal variables, forward stepwise selection, and recursively weighted regression) and improved the prediction model relative to full-spectrum partial least squares; it also offers a graphical output that enhances data interpretation. The results of this study show that combining preprocessing with variable selection improves the predictive accuracy of the model, and that choosing the optimal preprocessing and the optimal variable selection technique further improves predictive performance by lowering the model error. The studied techniques (iPLS, VIP, and SR) are not commonly used in soil spectroscopy, which opens opportunities for further research and application, including work on model robustness and on other preprocessing and variable selection techniques.
4 Conclusion
This study demonstrates that pairing mid-infrared spectroscopy with preprocessing and variable-selection strategies can substantially improve partial least squares regression (PLSR) models for routine soil diagnosis. This confirms the hypothesis that preprocessing techniques, especially Savitzky–Golay derivatives combined with standard normal variate, together with variable selection approaches such as interval partial least squares, improve the predictive capability of the models.
With the iPLS pipeline, cross-validated coefficients of determination reached approximately 0.90 for TN, TC, TOC, silt, sand, MC, and pH (pH R2cv = 0.91 with RMSEcv = 0.19), while CEC and P2O5 remained moderate (R2cv = 0.77 and 0.69, respectively), highlighting both the promise and limitations of MIR-based prediction in this dataset.
All models were evaluated with leave-one-out cross-validation, and the improvement obtained by combining optimal preprocessing with variable selection was assessed using a Wilcoxon signed-rank test on the cross-validated R2 values (α = 0.05). The test confirmed that the improvements achieved through iPLS variable selection were statistically significant (p = 0.0029), reinforcing the reliability of the optimized modeling approach. Beyond accuracy, iPLS reduces model complexity by focusing on compact, property-specific spectral intervals, which eases deployment and can support interpretation when approached with caution. This study identifies the best-performing combinations and sets the stage for future research on additional options. Combining MIR spectroscopy with chemometric tools has strong potential to improve soil property prediction, helping make soil management more precise and efficient in both agricultural and environmental monitoring.
5 Limitations, future perspectives, and recommendations
This study was designed around a small, locally homogeneous dataset collected in the Rhamna region of Morocco and measured using a single benchtop MIR instrument under controlled, oven-dry laboratory conditions. Although leave-one-out cross-validation (LOOCV) and a Wilcoxon signed-rank test provided internal evidence of improvement, the lack of an external prediction set or fully nested resampling means that the reported gains may be optimistic when transferred to other soils, moisture states, particle size distributions, or sensor platforms.
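To make the evaluation scheme concrete, the sketch below runs leave-one-out cross-validation and computes R2cv and RMSEcv. It is a minimal illustration under stated assumptions: a one-predictor least-squares line stands in for the PLSR model, the data are invented, the function names are hypothetical, and R2cv is computed as 1 − SSE/SST, which is one common convention and not necessarily the definition used in this study.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def loocv_metrics(x, y):
    """Return (R2cv, RMSEcv) from leave-one-out predictions."""
    preds = []
    for i in range(len(x)):
        # Refit the model with sample i held out, then predict sample i.
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(slope * x[i] + intercept)
    mean_y = sum(y) / len(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst, math.sqrt(sse / len(y))

# Hypothetical spectral feature (x) and reference values (y).
x = [0.10, 0.15, 0.22, 0.30, 0.34, 0.41, 0.47, 0.55]
y = [1.1, 1.4, 2.1, 2.6, 3.1, 3.4, 4.2, 4.6]

r2cv, rmsecv = loocv_metrics(x, y)
print(f"R2cv = {r2cv:.3f}, RMSEcv = {rmsecv:.3f}")
```

Because each held-out sample is predicted by a model that never saw it, these metrics are less optimistic than calibration statistics, but any tuning performed outside the loop (for example, choosing the preprocessing or the selected intervals on the full dataset) can still inflate them.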
Finally, the model performance remained moderate for some targets (P2O5 and CEC), the outlier influence was non-negligible for CEC, and precision estimates for wet-chemistry references were unavailable, limiting our ability to benchmark the RMSE against analytical uncertainty.
Future work should prioritize broader and more diverse spectral libraries spanning soil types, texture classes, mineralogies, and field moisture conditions, coupled with either independent external validation or repeated nested cross-validation to de-bias performance estimates. Methodologically, population-based and stability-oriented selection schemes such as Monte Carlo uninformative variable elimination (MC-UVE), recursive weighted PLS (rPLS), competitive adaptive reweighted sampling (CARS), and the variable iterative space shrinkage approach (VISSA) can be explored, alongside alternative learners such as support vector regression (SVR), random forests, and Cubist.
Derivatives, alone or combined with SNV, are often beneficial, but the order of application matters. It is recommended to embed all model-selection steps (preprocessing choices, the number of latent variables, and variable selection) within the cross-validation procedure, and to report the selected wavenumbers together with both accuracy (R2cv/RMSEcv) and uncertainty relative to reference-method precision. For properties that remain challenging (P2O5, CEC), it is also recommended to prioritize targeted sample augmentation, moisture/particle size stress tests, and robust models.
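The point about ordering can be checked numerically. The sketch below applies SNV and a Savitzky–Golay first derivative in both orders to a single spectrum and shows that the results differ. The spectrum is invented, the function names are hypothetical, and the 5-point window with a quadratic polynomial is an assumed setting, not the configuration used in this study.

```python
import math

def snv(spectrum):
    """Standard normal variate: centre and scale a spectrum by its own mean and SD."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / sd for v in spectrum]

# Savitzky-Golay first-derivative coefficients for a 5-point window and a
# quadratic polynomial at unit spacing: (-2, -1, 0, 1, 2) / 10.
SG1D_COEFFS = (-0.2, -0.1, 0.0, 0.1, 0.2)

def sg_first_derivative(spectrum):
    """SG first derivative; the two points at each edge are dropped."""
    half = len(SG1D_COEFFS) // 2
    return [
        sum(c * spectrum[i + k - half] for k, c in enumerate(SG1D_COEFFS))
        for i in range(half, len(spectrum) - half)
    ]

# Hypothetical absorbance values, for illustration only.
spectrum = [0.52, 0.55, 0.61, 0.70, 0.66, 0.58, 0.57, 0.63, 0.74, 0.71]

snv_then_sg = sg_first_derivative(snv(spectrum))
sg_then_snv = snv(sg_first_derivative(spectrum))
max_gap = max(abs(a - b) for a, b in zip(snv_then_sg, sg_then_snv))
print(f"largest difference between the two orders: {max_gap:.3f}")
```

The difference arises because the derivative removes constant offsets by itself, so SNV followed by the derivative only rescales the derivative by the SD of the original spectrum, whereas the reverse order re-centres and rescales the derivative spectrum.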
Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author contributions
RM: Investigation, Software, Validation, Conceptualization, Writing – review & editing, Writing – original draft, Formal analysis, Methodology, Visualization. MG: Methodology, Investigation, Conceptualization, Writing – review & editing, Validation, Software. IB: Formal analysis, Project administration, Visualization, Data curation, Resources, Validation, Software, Methodology, Supervision, Funding acquisition, Writing – review & editing, Investigation, Conceptualization.
Funding
The author(s) declared that financial support was not received for this work and/or its publication.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
1. Wu Y. The application of mid-infrared spectroscopy in the soil properties. In: Chen S, editor. 2nd International Conference on Materials Chemistry and Environmental Engineering (CONF-MCEE 2022). 40 (SPIE, ONLINE, United States (2022). doi: 10.1117/12.2646217
2. Barra I, Haefele SM, Sakrabani R, and Kebede F. Soil spectroscopy with the use of chemometrics, machine learning and pre-processing techniques in soil diagnosis: Recent advances–A review. TrAC Trends Analytical Chem. (2021) 135:116166. doi: 10.1016/j.trac.2020.116166
3. Baumann P, Lee J, Frossard E, Schönholzer LP, Diby L, Hgaza VK, et al. Estimation of soil properties with mid-infrared soil spectroscopy across yam production landscapes in West Africa. SOIL. (2021) 7:717–31. doi: 10.5194/soil-7-717-2021
4. Hume R, Marschner P, Mason S, Schilling RK, Hughes B, Mosley LM, et al. Measurement of lime movement and dissolution in acidic soils using mid-infrared spectroscopy. Soil Tillage Res. (2023) 233:105807. doi: 10.1016/j.still.2023.105807
5. Johnson J-M, Vandamme E, Senthilkumar K, Sila A, Shepherd KD, Saito K, et al. Near-infrared, mid-infrared or combined diffuse reflectance spectroscopy for assessing soil fertility in rice fields in sub-Saharan Africa. Geoderma. (2019) 354:113840. doi: 10.1016/j.geoderma.2019.06.043
6. Marakkala Manage LP, Greve MH, Knadel M, Moldrup P, de Jonge LW, and Katuwal S. Visible-near-infrared spectroscopy prediction of soil characteristics as affected by soil-water content. Soil Sci Soc Amer J. (2018) 82:1333–46. doi: 10.2136/sssaj2018.01.0052
7. Barra I, El Moatassem T, and Kebede F. Soil particle size thresholds in soil spectroscopy and its effect on the multivariate models for the analysis of soil properties. Sensors. (2023) 23:9171. doi: 10.3390/s23229171
8. Fu C-B, Cao S, and Tian A-H. High-precision soil Ni content prediction model using visible near-infrared spectroscopy coupled with recurrent neural networks. Sensors Mater. (2024) 36:5019. doi: 10.18494/SAM5199
9. Mokere R, Ghassan M, and Barra I. Soil spectroscopy evolution: A review of homemade sensors, benchtop systems, and mobile instruments coupled with machine learning algorithms in soil diagnosis for precision agriculture. Crit Rev Analytical Chem. (2024) 55(7):1–20. doi: 10.1080/10408347.2024.2351820
10. Ng W, Winowiecki LA, Karari V, Weullow E, Ateku DA, Vågen TG, et al. Exploring mid-infrared spectral transfer functions for the prediction of multiple soil properties using a global dataset. Soil Sci Soc Amer J. (2024) 88:1234–47. doi: 10.1002/saj2.20697
11. Viscarra Rossel RA, Shen Z, Ramirez-Lopez L, Behrens T, Shi Z, Wetterlind J, et al. An imperative for soil spectroscopic modelling is to think global but fit local with transfer learning. Earth-Sci Rev. (2024) 254:104797. doi: 10.1016/j.earscirev.2024.104797
12. Reeves JB. Near- versus mid-infrared diffuse reflectance spectroscopy for soil analysis emphasizing carbon and laboratory versus on-site analysis: Where are we and what needs to be done? Geoderma. (2010) 158:3–14. doi: 10.1016/j.geoderma.2009.04.005
13. Knox NM, Grunwald S, McDowell ML, Bruland GL, Myers DB, Harris WG, et al. Modelling soil carbon fractions with visible near-infrared (VNIR) and mid-infrared (MIR) spectroscopy. Geoderma. (2015) 239–240:229–39. doi: 10.1016/j.geoderma.2014.10.019
14. Theophile T. Infrared spectroscopy: life and biomedical sciences. Rijeka, Croatia: BoD – Books on Demand (2012). doi: 10.5772/2655
15. Kandpal LM, Munnaf MA, Cruz C, and Mouazen AM. Spectra fusion of mid-infrared (MIR) and X-ray fluorescence (XRF) spectroscopy for estimation of selected soil fertility attributes. Sensors. (2022) 22:3459. doi: 10.3390/s22093459
16. Tiecher T, Moura-Bueno JM, Caner L, Minella JPG, Evrard O, Ramon R, et al. Improving the quantification of sediment source contributions using different mathematical models and spectral preprocessing techniques for individual or combined spectra of ultraviolet–visible, near- and middle-infrared spectroscopy. Geoderma. (2021) 384:114815. doi: 10.1016/j.geoderma.2020.114815
17. Li H, Wang J, Zhang J, Liu T, Acquah GE, Yuan H, et al. Combining variable selection and multiple linear regression for soil organic matter and total nitrogen estimation by DRIFT-MIR spectroscopy. Agronomy. (2022) 12:638. doi: 10.3390/agronomy12030638
18. Cañasveras Sánchez JC, Barrón V, Del Campillo MC, and Viscarra Rossel RA. Reflectance spectroscopy: a tool for predicting soil properties related to the incidence of Fe chlorosis. Span J Agric Res. (2012) 10:1133. doi: 10.5424/sjar/2012104-681-11
19. dos Santos GLAA, Besen MR, Furlanetto RH, Crusiol LGT, Rodrigues M, Reis AS, et al. Spectral method for liming recommendation in oxisol based on the prediction of chemical characteristics using interval partial least squares regression. Remote Sens. (2022) 14:1972. doi: 10.3390/rs14091972
20. Ghassan M, Mokere R, Beniaich A, and Barra I. Mid-infrared spectral insights into soil hydraulic, physical, and chemical properties across four Moroccan regions. Geoderma Regional. (2025) 43:e01028. doi: 10.1016/j.geodrs.2025.e01028
21. Wadoux AMJ-C, Malone B, Minasny B, Fajardo M, and McBratney AB. Soil Spectral Inference with R: Analysing Digital Soil Spectra Using the R Programming Environment. Cham: Springer International Publishing (2021). doi: 10.1007/978-3-030-64896-1
22. Esteban M, Ariño-Blasco MC, and Díaz-Cruz JM. Chemometrics in electrochemistry. In: Comprehensive Chemometrics. Cham, Switzerland: Elsevier (2020). p. 1–31. doi: 10.1016/B978-0-12-409547-2.14622-0
23. Ahmadi A, Emami M, Daccache A, and He L. Soil properties prediction for precision agriculture using visible and near-infrared spectroscopy: A systematic review and meta-analysis. Agronomy. (2021) 11:433. doi: 10.3390/agronomy11030433
24. Franceschini MHD, Demattê JAM, Kooistra L, Bartholomeus H, Rizzo R, Fongaro CT, et al. Effects of external factors on soil reflectance measured on-the-go and assessment of potential spectral correction through orthogonalisation and standardisation procedures. Soil Tillage Res. (2018) 177:19–36. doi: 10.1016/j.still.2017.10.004
25. Vasques GM, Grunwald S, and Sickman JO. Comparison of multivariate methods for inferential modeling of soil carbon using visible/near-infrared spectra. Geoderma. (2008) 146:14–25. doi: 10.1016/j.geoderma.2008.04.007
26. Vestergaard R-J, Vasava HB, Aspinall D, Chen S, Gillespie A, Adamchuk V, and Biswas A. Evaluation of optimized preprocessing and modeling algorithms for prediction of soil properties using VIS-NIR spectroscopy. Sensors. (2021) 21:6745. doi: 10.3390/s21206745
27. Barnes RJ, Dhanoa MS, and Lister SJ. Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. Appl Spectrosc. (1989) 43:772–7. doi: 10.1366/0003702894202201
28. Brown CD, Vega-Montoto L, and Wentzell PD. Derivative preprocessing and optimal corrections for baseline drift in multivariate calibration. Appl Spectrosc. (2000) 54:1055–68. doi: 10.1366/0003702001950571
29. Ezenarro J, Schorn-García D, Busto O, and Boqué R. ProSpecTool: A MATLAB toolbox for spectral preprocessing selection. Chemometr Intelligent Lab Syst. (2024) 247:105096. doi: 10.1016/j.chemolab.2024.105096
30. Guyon I and Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. (2003) 3:1157–82.
31. Colombo C, Palumbo G, Di Iorio E, Sellitto V, Comolli R, Stellacci A, et al. Soil organic carbon variation in alpine landscape (Northern Italy) as evaluated by diffuse reflectance spectroscopy. Soil Sci Soc America J. (2014) 78:794–804. doi: 10.2136/sssaj2013.11.0488
32. Novacoski EJ, Caetano IK, Melquiades FL, Genú AM, Reyes Torres Y, González-Borrero PP, et al. Spectroscopic based partial least-squares models to estimate soil features. Microchem J. (2022) 180:107617. doi: 10.1016/j.microc.2022.107617
33. Farrés M, Platikanov S, Tsakovski S, and Tauler R. Comparison of the variable importance in projection (VIP) and of the selectivity ratio (SR) methods for variable selection and interpretation. J Chemometr. (2015) 29:528–36. doi: 10.1002/cem.2736
34. Nørgaard L, Saudland A, Wagner J, Nielsen JP, Munck L, Engelsen SB, et al. Interval partial least-squares regression (iPLS): A comparative chemometric study with an example from near-infrared spectroscopy. Appl Spectrosc. (2000) 54:413–9.
35. Mukherjee R, Sengupta D, and Sikdar SK. Selection of sustainable processes using sustainability footprint method. In: Computer Aided Chemical Engineering, vol. 36. Amsterdam, The Netherlands: Elsevier (2015). p. 311–29.
36. Farrés M, Platikanov S, Tsakovski S, and Tauler R. Comparison of the variable importance in projection (VIP) and of the selectivity ratio (SR) methods for variable selection and interpretation. J Chemometr. (2015) 29:528–36. doi: 10.1002/cem.2736
37. Rajalahti T, Arneberg R, Berven FS, Myhr KM, Ulvik RJ, Kvalheim OM, et al. Biomarker discovery in mass spectral profiles by means of selectivity ratio plot. Chemometr Intelligent Lab Syst. (2009) 95:35–48. doi: 10.1016/j.chemolab.2008.08.004
38. Shao J. Linear Model selection by cross-validation. J Am Stat Assoc. (1993) 88:486–94. doi: 10.1080/01621459.1993.10476299
39. Kucheryavskiy S. mdatools: multivariate data Analysis for Chemometrics. Vienna, Austria: CRAN (The Comprehensive R Archive Network), R Foundation for Statistical Computing (2024).
40. Stevens A and Ramirez-Lopez L. prospectr: Miscellaneous Functions for Processing and Sample Selection of Spectroscopic Data. Vienna, Austria: CRAN (The Comprehensive R Archive Network), R Foundation for Statistical Computing (2024).
41. Liland KH. plsVarSel: Variable Selection in Partial Least Squares. 0.9.13. Vienna, Austria: CRAN (The Comprehensive R Archive Network), R Foundation for Statistical Computing. (2016). doi: 10.32614/CRAN.package.plsVarSel.
42. Zhao Y, Yu J, Shan P, Zhao Z, Jiang X, and Gao S. PLS Subspace-Based calibration transfer for near-infrared spectroscopy quantitative analysis. Molecules. (2019) 24:1289. doi: 10.3390/molecules24071289
43. Zheng X, Nie B, Du J, Rao Y, Li H, Chen J, et al. A non-linear partial least squares based on monotonic inner relation. Front Physiol. (2024) 15:1369165. doi: 10.3389/fphys.2024.1369165
44. Hynes A, Scott DA, Man A, Singer DL, Sowa MG, Liu KZ, et al. Molecular mapping of periodontal tissues using infrared microspectroscopy. BMC Med Imaging. (2005) 5:2. doi: 10.1186/1471-2342-5-2
45. McCarty G, Reeves J, Reeves V, Follett R, and Kimble J. Mid-infrared and near-infrared diffuse reflectance spectroscopy for soil carbon measurement. Soil Sci Soc Amer J. (2002) 66:640–6. doi: 10.2136/sssaj2002.6400
46. Mirzaeitalarposht R and Kambouzia J. Development of mid-infrared spectroscopic feature-based indices to quantify soil carbon fractions. Eurasian Soil Sc. (2020) 53:73–81. doi: 10.1134/S1064229320010111
47. Terra FS, Demattê JAM, and Viscarra Rossel RA. Spectral libraries for quantitative analyses of tropical Brazilian soils: Comparing vis–NIR and mid-IR reflectance data. Geoderma. (2015) 255–256:81–93. doi: 10.1016/j.geoderma.2015.04.017
48. Ng W, Minasny B, Jeon SH, and McBratney A. Mid-infrared spectroscopy for accurate measurement of an extensive set of soil properties for assessing soil functions. Soil Secur. (2022) 6:100043. doi: 10.1016/j.soisec.2022.100043
49. Haghi RK, Pérez-Fernández E, and Robertson AHJ. Prediction of various soil properties for a national spatial dataset of Scottish soils based on four different chemometric approaches: A comparison of near infrared and mid-infrared spectroscopy. Geoderma. (2021) 396:115071. doi: 10.1016/j.geoderma.2021.115071
50. Barthès BG, Kouakoua E, Clairotte M, Lallemand J, Chapuis-Lardy L, Rabenarivo M, et al. Performance comparison between a miniaturized and a conventional near infrared reflectance (NIR) spectrometer for characterizing soil carbon and nitrogen. Geoderma. (2019) 338:422–9. doi: 10.1016/j.geoderma.2018.12.031
51. Heil K and Schmidhalter U. An evaluation of different NIR-spectral pre-treatments to derive the soil parameters C and N of a humus-clay-rich soil. Sensors (Basel). (2021) 21:1423. doi: 10.3390/s21041423
Keywords: chemometrics, data preprocessing, mid-infrared, soil diagnosis, soil spectroscopy, variable selection
Citation: Mokere R, Ghassan M and Barra I (2026) Soil spectroscopy improves mid infrared soil property prediction through optimized preprocessing and variable selection. Front. Soil Sci. 6:1760011. doi: 10.3389/fsoil.2026.1760011
Received: 05 December 2025; Revised: 30 December 2025; Accepted: 05 January 2026; Published: 28 January 2026.
Edited by:
Songchao Chen, Zhejiang University, China
Reviewed by:
Shengchang Huai, University of Liège, Belgium
Vimalashree H, University of Agricultural Sciences, Bangalore, India
Copyright © 2026 Mokere, Ghassan and Barra. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Reda Mokere, reda.mokere@um6p.ma; Issam Barra, Issam.barra@um6p.ma