SABO-ILSTSVR: a genomic prediction method based on improved least squares twin support vector regression

In modern breeding practices, genomic prediction (GP) uses high-density single nucleotide polymorphisms (SNPs) markers to predict genomic estimated breeding values (GEBVs) for crucial phenotypes, thereby speeding up selection breeding process and shortening generation intervals. However, due to the characteristic of genotype data typically having far fewer sample numbers than SNPs markers, overfitting commonly arise during model training. To address this, the present study builds upon the Least Squares Twin Support Vector Regression (LSTSVR) model by incorporating a Lasso regularization term named ILSTSVR. Because of the complexity of parameter tuning for different datasets, subtraction average based optimizer (SABO) is further introduced to optimize ILSTSVR, and then obtain the GP model named SABO-ILSTSVR. Experiments conducted on four different crop datasets demonstrate that SABO-ILSTSVR outperforms or is equivalent in efficiency to widely-used genomic prediction methods. Source codes and data are available at: https://github.com/MLBreeding/SABO-ILSTSVR.


Introduction
With the decreasing cost of high-throughput sequencing data, genomic prediction (GP) emerges as a novel breeding approach, using high-density single nucleotide polymorphisms (SNPs) to capture associations between markers and phenotypes, thereby enabling prediction of genomic estimated breeding values (GEBVs) at an early stage of breeding (Meuwissen et al., 2001).Compared with conventional breeding methods, such as phenotype and marker-assisted selection, GP greatly shortens generation intervals, reduces costs, and enhances the efficiency and accuracy of new variety selection (Heffner et al., 2010).
From the proposal of the concept of genomic prediction to the present, a multitude of models have emerged.Early models primarily focused on improving best linear unbiased prediction (BLUP), such as ridge regression-based best linear unbiased prediction (rrBLUP) (Henderson, 1975) and genomic best linear unbiased prediction (GBLUP) (VanRaden, 2008), etc.In addition, researchers have proposed various Bayesian methods, including BayesA and BayesB (Meuwissen et al., 2001), BayesC (Habier et al., 2011) and BayesLasso (Park and Casella, 2008), Bayesian ridge regression (BayesRR) (da Silva et al., 2021), BSLMM(Zhou et al., 2013).Moreover, bayesian methods generally exhibit higher prediction accuracy than GBLUP in the majority of cases (Rolf et al., 2015).However, the Markov Chain Monte Carlo (MCMC) steps involved in parameter estimation for Bayesian methods can significantly increase computational costs.With advancements in high-throughput sequencing technologies, the increasing dimensionality of genotype data poses new challenges for GP models.To address this problem, some researchers have begun employing regularization term to mitigate the overfitting problem, such as ridge regression (Ogutu et al., 2012), Lasso (Usai et al., 2009), elastic net (Wang et al., 2019).Meanwhile, machine learning (ML) methods such as support vector regression (SVR) (Maenhout et al., 2007;Ogutu et al., 2011), random forest (RF) (Svetnik et al., 2003), gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost) (Chen and Guestrin, 2016) and light gradient boosting machine (LightGBM) (Ke et al., 2017), have made great performance in genomic prediction methods.With the development of deep learning (DL), researchers have also combined it with genomic prediction models, such as DeepGS proposed by Ma et al. (Ma et al., 2018), based on convolutional neural networks (CNN), and DNNGP proposed by Wang et al. (Kelin et al., 2023) for application in multi-omics, which have achieved better performance compared with other classic models.However, genotype data for most species exhibit high-dimensional small-sample characteristics, leading models often to fail to learn effective features from the training data.Moreover, most GP models contain a large number of parameters and there are significant differences in genotype data among different species, leading to a tedious parameter tuning process for each species, significantly increasing breeding costs.Therefore, enhancing the prediction performance of GP models and reducing the complexity of parameters tuning is of crucial importance for shortening generation intervals and reducing breeding costs.
This study explores a machine learning model, least squares twin support vector regression (LSTSVR), to address above problem.LSTSVR, proposed by Zhong et al. (2012), is a regression model that integrates the ideas of least squares method and twin support vector machine (TSVM) (Jayadeva et al., 2007).LSTSVR contains a kernel function; when linearly inseparable data exists in the original input space, it can become linearly separable after being mapped into a higher-dimensional feature space through an appropriate kernel function (Kung, 2014).LSTSVR improves the computational efficiency during the training process of traditional SVR models by introducing the least squares paradigm to replace the ε-insensitive loss function in SVR, thereby transforming the originally nonlinear optimization problem into an easier-to-solve system of linear equations, offering more stable performance.Simultaneously, adopting a two sets of support vectors for regression enhances the model's learning capability and robustness (Huang et al., 2013).However, as a general-purpose regression model, LSTSVR still has certain limitations when dealing with various high-dimensional, smallsample genotype datasets.
To better address model overfitting and complex parameter tuning when the number of genotype samples is far less than the number of SNPs markers (Crossa et al., 2017;Tong and Nikoloski, 2020) this study has made improvements and optimizations to LSTSVR.Firstly, inspired by Lasso regularization, a penalty term was introduced to constrain model complexity on LSTSVR.Concurrently, in order to reduce the complexity stemming from model parameter tuning, the subtraction average-based optimizer (SABO) (Trojovský and Dehghani, 2023) was adopted to perform parameter optimization on the LSTSVR model.By combining the Lasso regularization-based ILSTSVR with the efficient optimization of SABO, this study successfully developed a genomic prediction model named SABO-ILSTSVR.To validate the effectiveness of SABO-ILSTSVR, comparative experiments were conducted using SABO-ILSTSVR on four different species datasets (maize, potato, wheat, and brassica napus) against commonly used genomic prediction models (LightGBM, rrBLUP, GBLUP, BSLMM, BayesRR, Lasso, RF, SVR, DNNGP).The results demonstrate that SABO-ILSTSVR exhibits equivalent or superior performance compared with widely-used genomic prediction methods.Finally, in order to reduce the difficulty of using the model, this study provides an easy-to-use python-based tool for breeders to use conveniently.

Dataset
Four different crops datasets are used in this study, including potatoes, wheat, maize, and Brassica napus.The following provides detailed descriptions of the genotype and phenotype data for each dataset.
Potato dataset (Selga et al., 2021) is derived from a total of 256 cultivated varieties across three locations in northern and southern Sweden.Over 2000 SNPs markers used for genomewide prediction were obtained from germplasm resources at both the Centro Internacional de la Papa (CIP, Lima, Peru) and those in the United States.According to Selga et al. (2021), this number of SNPs is already sufficient for predict genomic estimated breeding values (GEBVs) without loss of information.In this dataset, the total weight of tubers serves as the phenotype data.
Wheat dataset (Crossa et al., 2014) originates from the Global Wheat Program at CIMMYT, comprising information on 599 wheat lines.The project carried out numerous experiments across various environmental settings, with the dataset divided into four core environments according to distinct environmental parameters.Average grain yield (GY) serves as the phenotypic trait data within this dataset.It contains 1,279 SNPs markers, which were acquired following the removal of those with minor allele frequencies below 0.05 and the estimation of missing genotypes utilizing samples from the genotype edge distribution.
Maize dataset (Crossa et al., 2014) originates from CIMMYT's maize project, comprising 242 maize lines and 46,374 SNP markers.The project encompasses multiple phenotypic data points, and we use the most significant yield-related traits as our phenotype data for this study.
Brassica napus dataset (Kole et al., 2002) is part of the MTGS package.The dataset comprises 50 lines derived from two varieties, 100 SNP markers, and phenotype information on flowering days across three distinct time periods (flower0, flower4, flower8).

SABO-ILSTSVR model
The SABO-ILSTSVR model integrates the subtraction-based average optimizer (SABO) and an improved LSTSVR method.Its overall framework is depicted in Figure 1, followed by an elaboration of the model.

LSTSVR
Least squares twin support vector regression (LSTSVR) is an improved twin support vector regression (TSVR) (Peng, 2010;Shao et al., 2013) model by Yang et al. (Lu Zhenxing, 2014).TSVR is an extension of the regression model derived from twin support vector machine (TSVM) (Jayadeva et al., 2007), suitable for addressing continuous value prediction problems, which accomplishes this by determining two regression functions, shown as follows, where Eq. ( 1) determines ε-insensitive down-bound function f 1 and ε-insensitive up-bound function f 2 , w 1 , w 2 are weight vectors, b 1 , b 2 are bias term.These can be obtained by solving the following quadratic programming problems (QPPs), where C 1 , C 2 are the positive penalty parameters, ε 1 , ε 2 are up-and down-bound parameters, ξ 1 , ξ 2 are slack vectors, e is a vector of ones with appropriate dimensions.The result of the final regression function is decided by the mean of upper and lower bound functions, as Eq. ( 4), In the spirit of LSTSVM, Yang et al. apply least squares method to TSVR.TSVR finds the optimal weight vector and bias terms by solving two QPPs, whereas LSTSVR transforms the original TSVR problem into two systems of linear equations for solution, which is typically faster and more stable than directly solving q QPPs, with the loss in accuracy being within an acceptable range.For LSTSVR model, the inequality constraints of ( 2) and (3) are replaced with equality constraints as follows, In formula ( 5) and ( 6), the square of L2-norm of slack variable ξ 1 , ξ 2 is used, instead of L1-norm in ( 2) and (3), which makes constraint ξ 1 ≥ 0, ξ 2 ≥ 0 redundant, so the following formulas is obtained, (7) and ( 8) are two unconstrained QPPs, hence the solutions for w and b can be directly obtained by setting the derivatives to zero, as Eq. ( 9), where u w b , H [X e], then, we have Eq.( 10) and Eq. ( 11), Illustration of SABO-ILSTSVR model frameworks.
Frontiers in Genetics frontiersin.org thus, the

Improved LSTSVR (ILSTSVR)
When applying LSTSVR to handle high-dimensional genotype datasets with small samples, the model is prone to a high risk of overfitting.This is due to the fact that the model may overly fit noise and feature details in the training set, leading to decrease generalization performance on new samples and thereby affecting the effectiveness and reliability of the predictive results.Therefore, this study adds a Lasso regularization term for the weight parameter w in the LSTSVR framework.For linear problems, the function of ILSTSVR is as follows, where C 1 , C 2 , C 3 , C 4 , ε 1 , ε 2 are positive penalty parameters, ξ 1 , ξ 2 are slack variables, e is a vector of ones of appropriate dimensions.For the non-differentiability with L1 regularization and the convenience of calculations, we assume w 1 α*w 2 2 , then ( 12) and ( 13) can be converted into as follow, where α 1 , α 2 are vectors of appropriate dimensions, * represents element-wise multiplication.Expand (14) yields, let the derivatives of ( 16) with respect to w 1 and b 1 respectively be zero, we obtain Eq. ( 17) and Eq. ( 18), composed in matrix form as Eq. ( 19), where D 1 diag(α 1 ) T diag(α 1 ).Similarly, the following result Eq. ( 20) can be obtained through ( 15), where D 2 diag(α 2 ) T diag(α 2 ).Because of the assumption w 1 α p w 2 2 , we set an initial α 0 , and calculate the final w and b through the iterative formula that updates alternately, as Eq. ( 21) To make the process clear, the computational process is summarized as Table 1.
Then we obtain Eq. ( 22), For nonlinear problems, kernel functions typically provide a good solution, and here we have chosen RBF as the kernel function for our model.The formula of RBF as Eq. ( 23), where σ is kernel parameter, the value has a significant impact on prediction performance, easily leading to overfitting or underfitting.We map the training data through K(X, X T ) into a highdimensional reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950), obtaining matrix H. Thus, we obtain the function for the nonlinear problem as Eq. ( 24) and Eq. ( 25), Similarly, we can obtain Eq. ( 26) and Eq. ( 27), the final nonlinear model as Eq. ( 28),

Optimizing ILSTSVR using SABO
For the non-linear ILSTSVR model, the choice of kernel parameter σ for RBF has a significant impact on prediction performance.Moreover, in this study, we set C 1 C 2 and C 3 C 4 for the ILSTSVR parameters C 1 , C 2 , C 3 , C 4 as given in ( 12) and (13).Parameter optimization plays an important role in machine learning, as selecting appropriate parameters can significantly enhance a model's predictive capability and accuracy.Different combinations of parameters may lead to vastly different performances of the model on both training and testing data.Furthermore, parameters affect the model complexity and learning capacity.By adjusting them, we can better strike a balance between overfitting and underfitting (Young et al., 2015).Although grid search can find the global optimal solution, it will result in enormous computational resource consumption and neglect the correlations among parameters (Vincent and Jidesh, 2023).Instead, we use the recently proposed SABO for parameter optimization, which updates the positions of population members in the search space using subtraction averages of individuals, characterized by strong optimization capability and fast convergence rates (Moustafa et al., 2023).
The basic inspiration for the design of the SABO is mathematical concepts such as averages, the differences in the positions of the search agents, and the sign of difference of the two values of the objective function.The idea of using the arithmetic mean location of all the search agents (i.e., the population members of the tth iteration), instead of just using, e.g., the location of the best or worst search agent to update the position of all the search agents (i.e., the construction of all the population members of the (t + 1)th iteration), is not new, but the SABO's concept of the computation of the arithmetic mean is wholly unique, as it is based on a special operation " − v ", called the v-subtraction of the search agents B from the search agent A, which is defined as Eq. ( 29): where v is a vector of the dimension m, the operation "p" represents the Hadamard product of the two vectors, F(A) and F(B) are the values of the objective function of the search agents A and B, respectively, and sign is the signum function (Trojovský and Dehghani, 2023).
In the proposed SABO, the displacement of any search agent X i in the search space is calculated by the arithmetic mean of the v-subtraction of each search agent X j , j = 1,2, . . ., N, from the search agent X i .Thus, the new position for each search agent is calculated using (30).
where X new i is the new proposed position for the ith search agent X i , N is the total number of the search agents, and r i is a vector of the dimension m.Then, if this proposed new position leads to an improvement in the value of the objective function, it is acceptable as the new position of the corresponding agent, according to (31) where F i and F new i are the fitness function values of the search agents X i and X new i , respectively.Similar to other optimization algorithms, the primary positions of the search agents in the search space are randomly initialized using (32).
where x i,d is the dth dimension of X i , N is the number of search agents, m is the number of decision variables, r i,d is a random number in the interval [0, 1], and lb d and ub d are the lower and upper bounds of the dth decision variables, respectively (Trojovský and Dehghani, 2023).
Here, we define the fitness function for SABO as MSE function shown in (34 The parameter corresponding to the smallest fitness function value obtained through iteration is the optimal combination of parameters for ILSTSVR.Finally, train the model according to the obtained parameters combination.

Performance evaluation
To evaluate the prediction performance of GEBVs by the model, while avoiding the problem that Pearson correlation coefficient fails to measure the distance between true and predicted values, we adopt both Pearson correlation coefficient and MSE as evaluation metrics for the relationship between predicted and true values.Furthermore, Algorithm 1: An iterative algorithm to solve the L1 regularization problem in ( 14) and ( 15) where y and y′ represent the true values and predicted values respectively, cov(y, y′) denotes the covariance of vectors y and y′, σ y and σ y′ are the standard deviations of vector y and y′ respectively.Mean squared error (MSE) measures the degree of difference between predicted values and actual values, and is defined as follows,

Comparison of SABO-ILSTSVR with the base model
To validate the effectiveness of ILSTSVR, this section compares SABO-ILSTSVR with some methods prior to its improvement on four datasets (potato, wheat, maize and brassica napus).The results on the potato dataset (Figure 2) show that the SABO-ILSTSVR exhibits a 4% increase in Pearson correlation coefficient and a 2% decrease in MSE compared with SABO-LSTSVR.Furthermore, when contrasted with SABO-SVR, the SABO-ILSTSVR demonstrates a 9% improvement in the Pearson correlation coefficient and a 6% decrease in MSE.
The results on other datasets are shown in Table 2. On the wheat dataset, SABO-ILSTSVR improves the Pearson correlation coefficient in four environments by an average of 2%, 2%, 8%, and 2% and reduces the MSE in four environments by an average of 1%, 3%, 1% and 1%, respectively, compared with SABO-SVR and SABO-LSTSVR.On the maize dataset, SABO-ILSTSVR improves the Pearson correlation coefficient by an average of 6% and reduces the MSE by 1%, respectively, compared with SABO-SVR and SABO-LSTSVR.On the brassica napus dataset, SABO-ILSTSVR improves the Pearson correlation coefficient in three traits by an average of 11%, 6%, and 24% and reduces the MSE in three traits by an average of 17%, 1% and 12%, respectively, compared with SABO-SVR and SABO-LSTSVR.

Comparison of SABO-ILSTSVR with other methods
Considering the complexity of the genetic architecture, this study employs real data to evaluate the prediction performance of models, including datasets from public available sources for maize, wheat, potato, and Brassica napus.And we performed standardization on the phenotype data of all datasets.Due to the small sample sizes in the adopted datasets, random sampling errors may be relatively substantial, leading to decreased model prediction accuracy, reduced statistical power of tests, and difficulty in obtaining stable and reliable statistical inferences (Bengio and Grandvalet, 2004).Consequently, a ten-fold cross-validation is applied to each dataset in this study, with the average of the results over ten iterations used to represent the ultimate prediction performance of the models.

Wheat dataset
Similarly, prediction performance comparisons were conducted for SABO-ILSTSVR, LightGBM, rrBLUP, GBLUP, BSLMM, BayesRR, Lasso, RF, SVR, DNNGP on the wheat dataset.As shown in Figure 4, the DNNGP model exhibits the highest Pearson correlation coefficients in predicting yield under environments env1 and env2 of the wheat dataset, whereas its performance in env3 and env4 is inferior to that of other models.In contrast, our proposed SABO-ILSTSVR model demonstrates higher Pearson correlation coefficients and lower mean squared errors for yield data across all four environments compared with the other models.Specifically, SABO-ILSTSVR achieves the best performance in env3 and env4 and is second only to DNNGP in env1 and env2.The reason for the differing performance may be due to their prediction performance varying across different agroclimatic regions.Performance of prediction for various models on the potato dataset in terms of Pearson correlation coefficients and MSE.

Maize dataset
The detailed information on SNPs and phenotypes in the maize dataset has been described in the Materials and Methods section.Unlike the potato and wheat datasets, this section performs ten replicates ten-fold cross-validation separately for SABO-ILSTSVR, LightGBM, rrBLUP, GBLUP, BayesRR, Lasso, RF, SVR, DNNGP on the maize dataset.Comparison was not conducted with BSLMM due to its unstable results.The Pearson correlation coefficients of the ten results are represented (A) in Figure 5, while the mean squared errors (MSEs) of the ten results are averaged and depicted as (B).As shown in Figure 5, Lasso exhibits the lowest prediction performance, whereas DNNGP, despite having the highest Pearson correlation coefficient prediction performance, displays significantly large variations across the ten runs, resulting in elongated bars in the boxplot.In contrast, SABO-ILSTSVR has a stable Pearson correlation coefficient prediction performance and exhibits the lowest MSE.

Brassica napus dataset
Similarly, on the brassica napus dataset, ten replicates ten-fold cross-validation was performed for SABO-ILSTSVR, LightGBM, rrBLUP, GBLUP, BSLMM, BayesRR, Lasso, RF, SVR, DNNGP, respectively.The Pearson correlation coefficients of the ten results are represented by (A) in Figure 6, while the mean squared errors (MSEs) after averaging are depicted by (B).As shown in (A) of Figure 6, SABO-ILSTSVR exhibits the highest Pearson correlation coefficient and lowest MSE prediction performance on flower0 and flower8, whereas BSLMM achieves the highest Pearson correlation coefficient prediction performance on flower4.The prediction performance of the SABO-ILSTSVR model is slightly lower than that of the GBLUP and BSLMM model but higher than that of the other comparative models on flower4.

Discussion
In this study, we integrate Least Squares Twin Support Vector Regression (LSTSVR) with Lasso regularization, constructing a GP model named ILSTSVR.We use the SABO optimization algorithm to effectively optimize the parameters of the model.To address the number of genotype samples is far less than the number of SNPs markers, we introduce a Lasso regularization term into LSTSVR.By the unique feature selection property of Lasso, it can effectively shrink the coefficients of non-key features to zero, achieving parameter sparsity and effectively preventing overfitting.Meanwhile, to cope with the potential nonlinear relationships in genotype data, we adopt the radial basis function (RBF) kernel, mapping the raw data into a high-dimensional space to attain linear separability.
Considering the differences in genotype data among various species may lead to distinct optimal parameters, we used the SABO optimization algorithm to automatically tune the parameters of the ILSTSVR model.To validate the effectiveness of this model, we conducted evaluations on multiple datasets spanning potato, maize, wheat, and Brassica napus, and compared its prediction performance against a series of widely-used models such as LightGBM, rrBLUP, GBLUP, BSLMM, BayesRR, Lasso, RF, SVR, DNNGP.The results showed that the SABO-ILSTSVR model demonstrated outstanding prediction performance on the potato dataset, outperforming other benchmark models.In the wheat and brassica napus datasets containing multiple phenotypic traits, our model consistently exhibited higher prediction accuracy for most traits compared with other models.The box plot analysis of the maize and brassica napus dataset further revealed the robustness of the SABO-ILSTSVR model's predictions.
In our further exploration of the future, confronted with the high-dimensional challenges of genomic data, we will delve deeper into how to efficiently perform feature extraction (Burges, 2009).With the continuous decline in sequencing costs, large-scale genomic sequencing of samples is poised to become a reality, the increased sample size may offer more favorable application for DL models (Khan et al., 2020;Yang et al., 2024).We will investigate innovative DL models within the field of breeding.

Output
by-item update of α 1 , α 2 t t + 1 until α 1 , α 2 convergence Frontiers in Genetics frontiersin.orgwe use ten-fold cross-validation to assess the model's performance.The original dataset is divided into ten equally-sized (or nearly equal) folds; for each fold, it serves as the validation set, while the remaining nine folds constitute the training set.Training a model using the training set, and its performance is evaluated using the validation set.After completing all ten iterations, the average of the performance measures obtained from each validation set is taken, thereby yielding an overall assessment of the model's performance.The Pearson correlation coefficient (PCC) is used to measure the strength and direction of the linear relationship between two continuous variables, and is defined as follows, ρ y,y′ cov y, y′ σ y σ y′

FIGURE 2
FIGURE 2Performance of Pearson correlation coefficient and MSE prediction for SABO-SVR, SABO-LSTSVR, and SABO-ILSTSVR on the potato dataset.

FIGURE 4
FIGURE 4Performance of prediction for various models on the wheat dataset.Pearson correlation coefficient (A) and MSE (B) metrics of four phenotypes (env1, env2, env3 and env4), as evaluated through ten-fold cross-validation.

FIGURE 5
FIGURE 5Performance of prediction for various models on the maize dataset.(A) Pearson correlation coefficients of maize yield traits, represented by box plots, after ten replicates ten-fold cross-validation.(B) Average MSE of maize yield traits after ten replicates ten-fold cross-validation.

FIGURE 6
FIGURE 6 Performance of prediction for various models on the brassica napus dataset.(A) Pearson correlation coefficients of three phenotypes (flower0, flower4 and flower8), represented by box plots, after ten replicates ten-fold cross-validation.(B) Average MSE of three phenotypes after ten replicates ten-fold cross-validation.

TABLE 1
Iterative algorithm to solve the L1 regularization problem.

TABLE 2
Prediction performance in ten-fold cross-validation for each trait in wheat, maize and brassica napus datasets.