Integrating and optimizing genomic, weather, and secondary trait data for multiclass classification

Modern plant breeding programs collect several data types, such as weather, images, and secondary or associated traits, besides the main trait (e.g., grain yield). Genomic data are high-dimensional and often overwhelm smaller data types when naively combined to explain the response variable. There is a need for methods that can effectively combine data types of differing sizes to improve predictions. Additionally, in the face of changing climate conditions, there is a need for methods that can effectively combine weather information with genotype data to better predict the performance of lines. In this work, we develop a novel three-stage classifier to predict multi-class traits by combining three data types: genomic, weather, and secondary trait. The method addresses several challenges in this problem, such as confounding, differing sizes of data types, and threshold optimization. The method was examined in different settings, including binary and multi-class responses, various penalization schemes, and class balances. Our method was then compared to standard machine learning methods, such as random forests and support vector machines, using various classification accuracy metrics, with model size used to evaluate the sparsity of the model. The results showed that our method performed similarly to or better than the machine learning methods across various settings. More importantly, the classifiers obtained were highly sparse, allowing for a straightforward interpretation of relationships between the response and the selected predictors.


MULTI-CLASS CLASSIFICATION
One-vs-all (OVA) is the most straightforward approach to binarizing a K-class problem into separate binary classification sub-problems, such that each sub-problem uses a binary classifier to discriminate the kth class from the other K − 1 classes. Thus, if we have K classes, we require K binary classifiers. To predict the class of a new observation, we predict the probability that it belongs to class k using the kth classifier for each k = 1, . . . , K and assign the class associated with the largest probability.
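The OVA decision rule above can be sketched in a few lines. This is a minimal illustration of only the aggregation step, assuming the K binary classifiers have already produced per-class probability estimates; the class labels and probabilities below are made up.

```python
# One-vs-all (OVA) aggregation: given the probability that an observation
# belongs to class k from the k-th binary classifier, assign the class
# with the largest probability.

def ova_predict(prob_by_class):
    """prob_by_class: dict mapping class label k -> estimated P(y = k)
    from the k-th binary classifier. Returns the argmax label."""
    return max(prob_by_class, key=prob_by_class.get)

# Example: K = 3 classifiers score a new observation (illustrative values).
probs = {"low": 0.20, "medium": 0.65, "high": 0.15}
print(ova_predict(probs))  # prints "medium"
```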
One-vs-one (OVO) is another approach to creating binary classification sub-problems, in which each sub-problem builds a discriminant to separate one pair of classes. If we have K classes, there are K(K − 1)/2 pairs, and hence we require K(K − 1)/2 binary classifiers. When classifying a new observation, we predict the class using each binary classifier and assign the class that is most frequent among the K(K − 1)/2 predictions.
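The OVO majority-vote step can likewise be sketched directly. Only the vote aggregation is shown; the pairwise classifiers themselves are assumed, and the example votes are illustrative.

```python
# One-vs-one (OVO) aggregation: each of the K(K-1)/2 pairwise classifiers
# emits one predicted label; the final class is the most frequent vote.
from collections import Counter

def ovo_predict(pairwise_votes):
    """pairwise_votes: list of labels, one per pairwise binary classifier."""
    return Counter(pairwise_votes).most_common(1)[0][0]

# With K = 3 classes there are 3 pairwise classifiers:
votes = ["A", "A", "C"]   # (A vs B) -> A, (A vs C) -> A, (B vs C) -> C
print(ovo_predict(votes))  # prints "A"
```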

Binary Classification
For binary classification, we evaluate the performance of models using metrics such as overall accuracy (Acc), true positive rate (TPR), and true negative rate (TNR). Overall accuracy is defined as the number of correctly classified observations out of the total number of observations in the data set. While it is a simple and intuitive metric, overall accuracy does not distinguish between the false positive and false negative types of error. In other words, it does not differentiate between a 1 being classified as a 0 and a 0 being classified as a 1. In situations where false positive and false negative errors have differing consequences, accuracy may not capture the important characteristics. Further, overall accuracy does not reflect actual performance in the presence of class imbalance. Suppose we had a binary response with 995 observations in class 1 and 5 in class 0. A model can obtain an overall accuracy of 99.5% simply by classifying all observations as 1. Metrics such as TPR and TNR were therefore defined to decompose the overall accuracy into per-class accuracy and alleviate the problems described above. In binary classification, the TPR, also known as sensitivity or recall, is the proportion of actual 1s that are classified as 1s. The TNR, also known as specificity, is the proportion of actual 0s that are classified as 0s. Thus, TPR and TNR let us evaluate the performance of the model for each class.
Using the confusion matrix in Fig. S2, overall accuracy (Acc), TPR, and TNR can be defined as

Acc = (TP + TN)/N,  TPR = TP/(TP + FN),  TNR = TN/(TN + FP),

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. N is the total number of observations, which is also the sum of TP, TN, FP, and FN. Several other metrics have been developed for classification, and detailed reviews can be found in Tharwat (2020) and Grandini et al. (2020).
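These definitions translate directly into code. The sketch below computes all three metrics from the confusion-matrix counts and reproduces the imbalance example from the text (995 ones, 5 zeros, a model that predicts all 1s), showing how a high accuracy can hide a TNR of zero.

```python
# Acc, TPR, and TNR computed from confusion-matrix counts.
def classification_metrics(tp, tn, fp, fn):
    n = tp + tn + fp + fn          # total observations
    acc = (tp + tn) / n            # overall accuracy
    tpr = tp / (tp + fn)           # true positive rate (sensitivity, recall)
    tnr = tn / (tn + fp)           # true negative rate (specificity)
    return acc, tpr, tnr

# Imbalance example from the text: predict every observation as 1.
# All 995 ones are TPs; all 5 zeros become FPs.
acc, tpr, tnr = classification_metrics(tp=995, tn=0, fp=5, fn=0)
# acc = 0.995, tpr = 1.0, tnr = 0.0 -- high accuracy masks total failure on class 0
```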

BINARY TRAIT IMPLEMENTATION
Days to Maturity (DM) is a continuous trait that quantifies the number of days from planting until the plant is mature enough to produce seeds. To implement the binary classification algorithm, we discretized the DM trait into a binary variable, with 0 indicating low and 1 indicating high values. We used quantiles to separate the two classes as cleanly as possible: all observations in the bottom 25th percentile were coded as zeros and those in the top 25th percentile as ones. Changing the choice of percentiles would change the separation between the classes and impact the predictive ability of any classification method.
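A minimal sketch of this quantile-based discretization, assuming illustrative DM values and a crude rank-based percentile cutoff: observations at or below the 25th percentile become 0, those at or above the 75th percentile become 1, and the middle half is excluded.

```python
# Discretize a continuous trait into a binary class using quantile cutoffs.
# The cutoff computation is a simple rank-based approximation, not the
# authors' exact procedure.

def discretize_by_quantiles(values, lower=0.25, upper=0.75):
    ranked = sorted(values)
    lo = ranked[int(lower * (len(ranked) - 1))]   # ~25th percentile value
    hi = ranked[int(upper * (len(ranked) - 1))]   # ~75th percentile value
    labels = []
    for v in values:
        if v <= lo:
            labels.append(0)        # bottom quartile -> class 0 (low)
        elif v >= hi:
            labels.append(1)        # top quartile -> class 1 (high)
        else:
            labels.append(None)     # middle observations are excluded
    return labels

dm = [88, 95, 101, 110, 120, 97, 105, 92]   # hypothetical DM values
print(discretize_by_quantiles(dm))
```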
In order to evaluate the model thoroughly, we investigated its performance across several levels of class imbalance in the binary response. Class balance refers to a situation where the numbers of 1's and 0's in the response variable are approximately equal. In contrast, class imbalance refers to cases where one class is more prevalent than the other. An imbalanced response can significantly degrade the predictive power of models, and so is an important consideration. Hence, we studied the stability of our method under increasing levels of imbalance. We examined the model's performance under the following five ratios of 1's to 0's: 50-50, 65-35, 75-25, 85-15, and 95-05. Here, 50-50 refers to the case where the classes are balanced, while 95-05 refers to the most extreme imbalance, where 95% of the observations are 1's and 5% are 0's. We randomly sampled 280 observations to create the datasets with different class ratios. For each class ratio, we created 20 replications and averaged the performance across the replications to avoid any bias due to the random sampling of the observations. We split the 280 observations into training, optimization, and validation data sets in the ratio of 120/80/80, respectively. The training data set was used in the second stage, i.e., the main modeling step, to build the predictive model and obtain the coefficients associated with each predictor. The optimization data set was used in the third stage of modeling to find the optimal threshold. Finally, the validation set was a new, previously unseen data set used to assess the model's performance: the coefficients from the training set and the optimal threshold from the optimization set were combined to classify observations in the validation set. This approach ensures that no information is shared between the different stages of the three-stage method proposed in this work, which maximizes the generalizability of the model when predicting the class of a new, previously unseen observation.
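The sampling-and-split procedure above can be sketched as follows. The sizes (280 observations, 120/80/80 split) follow the text; the pools of class-1 and class-0 indices, the function name, and the seeding are illustrative assumptions.

```python
# Build one replication: sample 280 observations at a target class ratio,
# then split into training / optimization / validation sets of 120/80/80.
import random

def make_replication(ones_idx, zeros_idx, p_ones, n=280, seed=0):
    rng = random.Random(seed)                 # seed per replication (assumed)
    n_ones = round(n * p_ones)                # e.g. p_ones=0.75 -> 210 ones
    sample = rng.sample(ones_idx, n_ones) + rng.sample(zeros_idx, n - n_ones)
    rng.shuffle(sample)
    # Disjoint splits: stage 2 (training), stage 3 (threshold optimization),
    # and a held-out validation set.
    return sample[:120], sample[120:200], sample[200:280]

ones = list(range(1000))         # indices of class-1 observations (simulated)
zeros = list(range(1000, 1600))  # indices of class-0 observations (simulated)
train, opt, val = make_replication(ones, zeros, p_ones=0.75)
# len(train), len(opt), len(val) == 120, 80, 80; the three sets are disjoint
```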
We evaluated the effect of penalization applied to the weather and genomic data types by comparing performance and model size against a baseline model, G + E + P, that had no penalty applied to any of the data types, denoted by M0. We also investigated case M1, denoted PenG + PenE + P, in which all the secondary traits were included in the final model, since the secondary traits were so few to start with. The main proposed model, PenG + PenE + PenP, denoted by M2, has penalties applied to all three data types. Both the random forest and support vector machine models were fit on the entire predictor data set with no penalization of any sort. Table S1 summarizes all the model settings, including the notation used to represent the models in the results section.

RESULTS FOR BINARY TRAIT
We compared our proposed three-stage method to the RF and SVM methods. The proposed method had two primary objectives: improve predictive power and reduce model size. The model size refers to the number of variables with non-zero coefficients in the final model. The smaller the model size, the more interpretable the relationships are between the predictors and the binary response. In order to evaluate the predictive power, we employed TPR, TNR, and overall accuracy for a comprehensive study. First, we evaluated the models based on overall accuracy. All the proposed models showed an improvement in performance over the ML models in the balanced class setting. The baseline model (M0) had an accuracy of 0.73. The penalized versions (M1 and M2) showed an improvement over the baseline model with an accuracy of 0.75. In contrast, RF had only a 0.68 accuracy, while the SVM model had the worst accuracy at 0.66. Thus, the proposed main model (M2) showed a substantial improvement of roughly 12% in prediction accuracy over the ML methods. M1 and M2 had comparable performance with increasing class imbalance in the response. The results can be seen in Fig. S3. One thing to note is the failure of every prediction model in the extreme imbalance case of 95-05. In this situation, all models had a very "high" accuracy score of close to 0.95. However, this is misleading, since all models simply predict every observation as 1; consequently, they end up with an accuracy of 0.95 because 95% of the observations are indeed in class 1. Across the board, the standard errors associated with the average accuracy, TPR, and TNR were on the order of 10^-3 to 10^-4 and hence are not presented here.
The overall classification accuracy is an unreliable metric in the presence of class imbalance, as clearly evidenced in our results. Thus, we also examined the TPR and TNR values for the models in each class-balance setting. These results can be viewed in Figs. S4 and S5. For the TPR results in the balanced case, we again saw M1 and M2 significantly outperforming the ML methods. Interestingly, M1 and M2 had comparable or marginally better TPR values than RF and SVM across all class-balance settings.
Both M1 and M2 also showed improvement over the ML methods in TNR for the balanced class, with values of 0.73 as opposed to 0.64 for SVM and 0.65 for RF. However, the ML methods did have more stable performance in the presence of class imbalance. In the extreme imbalance case, all methods had equally poor performance, with all TNR values close to 0 and RF with a TNR of exactly 0.
Equally crucial to the predictive performance of the models was model interpretability. Our methods showed a vast improvement over the ML methods in model size. SVM used all predictors for building the model and performing classification, while the RF model used several thousand predictors. RF had over 3500 variables in the balanced case, destroying any potential for interpreting relationships or identifying key variables to classify the binary response. The baseline model (M0) had 250 variables across the different class balance settings. Penalization (M1 and M2) provided a massive improvement in model size over M0. The M1 model used 30.9 variables on average in the balanced class setting, while the M2 model used 26.3 variables on average. Thus, the penalized models required as few as 15% of the variables used by the baseline model. More importantly, the penalized models M1 and M2 had the best performance among all five models despite using the fewest variables. Fig. S6 visualizes the number of variables used by each model across the different class balance settings.
Looking deeper into the number of variables used by each model, we saw that none of the M0, M1, or M2 models selected any genomic variables directly. There may have been some indirect genomic effect through the secondary trait residuals. All the variation in the response was explained by the secondary traits and the weather variables. In contrast, RF relied heavily on genomic information, as evidenced by the ratio of selected genomic variables to the total number of variables. Across every class balance setting, over 90% of the variables selected by RF were genomic variables. Refer to Figs. S7 and S8 for further details. All the detailed results are summarized in Table S2.

Figure S1. Linear regression between environmental means and mean wind speed for the eight environments. The points represent environments, and the blue line represents the best-fit line from the linear regression.

Figure S10. Bar plot for RF model sizes partitioned by data type, averaged over the 20 replications within each class balance setting for the multi-class trait.

Table S2. Summary of results for the five different models across the five different class-balance settings for the binary trait. The performance was measured using overall accuracy (Acc), true positive rate (TPR), true negative rate (TNR), and model size (MDS). The five models and their notations are as follows: G + E + P (M0), PenG + PenE + P (M1), PenG + PenE + PenP (M2), support vector machine (SVM), and random forest (RF).