This article was submitted to Plant Genomics, a section of the journal Frontiers in Genetics
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Modern plant breeding programs collect several data types such as weather, images, and secondary or associated traits besides the main trait (e.g., grain yield). Genomic data is high-dimensional and often overcrowds smaller data types when naively combined to explain the response variable. There is a need to develop methods able to effectively combine data types of differing sizes to improve predictions. Additionally, in the face of changing climate conditions, there is a need for methods that effectively combine weather information with genotype data to better predict the performance of lines. In this work, we develop a novel three-stage classifier to predict multiclass traits by combining three data types—genomic, weather, and secondary trait. The method addresses various challenges in this problem, such as confounding, differing sizes of data types, and threshold optimization. The method was examined in different settings, including binary and multiclass responses, various penalization schemes, and class balances. It was then compared to standard machine learning methods such as random forests and support vector machines using various classification accuracy metrics, with model size used to evaluate sparsity. The results showed that our method performed similarly to or better than the machine learning methods across settings. More importantly, the classifiers obtained were highly sparse, allowing for a straightforward interpretation of the relationships between the response and the selected predictors.
Modern plant breeding programs are collecting an increasing amount of data of various types from several sources such as multiple secondary phenotypic traits (other than the main trait of interest), high-throughput phenotyping data, weather data, hyperspectral images, and different types of omics data such as genomics, transcriptomics, proteomics, metabolomics,
Early attempts at integration of data types for GS show promising results (
Another underdeveloped area of research involves developing models for non-Gaussian phenotypic traits. Crop yield, a continuous trait (can be modeled with a Gaussian distribution), is often the most important trait that plant breeding programs want to improve. Some other continuous traits that breeders focus on include quality (shape, size, and other aesthetic qualities), time to maturity, plant height, and seed weight. Sometimes, breeders are also interested in improving categorical phenotypic traits such as resistance to drought or salinity, susceptibility to disease, and days to maturity or flowering. While extensive literature covers the prediction of continuous traits, there is limited literature developing GS models for classification.
Since the seminal work of (
When the number of categories in the response is large, and the data follows an approximately normal distribution, treating the response as normal may be reasonable. However, if the number of categories is small and there is a well-defined ordering among them, the normality approximation leads to biased estimates of the mean and the variance components (
Unfortunately, GLMMs are not directly implementable in GS due to the high dimensionality of the genomic data, where the number of predictors is far greater than the number of observations. Bayesian GLMM approaches were proposed to address the high-dimensionality problem and the multicollinearity issue prevalent in genomic data. Some popular methods include BayesA (
High-dimensional prediction is the most prevalent challenge in the GS problem and has been an active area of research. Feature selection is a common strategy to reduce the dimensionality to perform such predictions. Feature selection also helps remove irrelevant features and improve the interpretability of the final model. Variable selection methods such as forward selection, backward selection, and best-subset selection were popular but performed poorly in high-dimensional data (
The FIRST and STORM methods are regression methods that were developed for a continuous response.
Additionally, global climate change and extreme environmental changes are an inescapable reality of the day, presenting an escalating challenge to food production worldwide. Resilience and adaptability among crops are essential to ensure food security. Resilience refers to the stability of the yield in the face of extreme weather conditions, while adaptability refers to how crops react to changes in environments (
To summarize, this paper focused on three challenges to improve GS: integrating multiple data types, optimizing weather data to improve forecasting, and developing methods for high-dimensional forecasting for categorical phenotypic traits. We developed a three-stage method to integrate three data types (secondary traits, weather data, and genomic data) of differing dimensionality to predict a binary categorical phenotypic trait. The main goal of our method is to integrate genomic, weather, and secondary trait information to classify a trait of interest which can be modeled as a binary variable. Binary variables have two classes, such as diseased or not, fraud or not, resistant or not,
The rest of the paper is organized as follows. First, we present a short overview of the relevant literature that motivated our proposed three-stage method. Our proposed method for binary traits is presented in the Materials and Methods section. Next, we describe the strategy that allows our method to handle multiclass traits, along with a description of how FW regression was implemented. This section ends with a discussion of the metrics used to evaluate the classification ability of the methods for binary as well as multiclass traits. The details of the real data set used to demonstrate our methods are described next. Then, we present the results of the method, compare them to other standard methods in terms of performance, and finally conclude with a discussion and future directions.
Penalized approaches (
Another issue was that the genomic and weather variables also impacted the secondary traits. Before considering all the variables in the model building, the effect of genomic and weather variables had to be removed from the secondary traits and their true intrinsic effects had to be obtained. Isolating the intrinsic effect of the secondary traits also ensured that the potential effect of genomic or weather variables was not mistakenly ascribed to the secondary traits. Separating the effects also allowed for a simpler and cleaner interpretation of variable importance and relationships between the response and the explanatory variables.
In the first stage of the modeling, we computed the intrinsic effect of the secondary traits devoid of the weather and genomic effects. In order to remove the genomic and weather effects, the weather variables were first regressed on each of the secondary traits to obtain the residuals. Then, the genomic variables were regressed on the secondary traits to obtain another set of residuals. In the second stage of the modeling, we used a training data set to build a logistic regression classifier combined with a penalized forward-selection scheme to include phenotypic residuals into the model before allowing weather variables and finally allowing genomic variables. Through the iterative process of the forward selection scheme, only the most influential predictor was selected to enter the model at each step. The third and final stage of the proposed method concentrated on improving the classification through a threshold search process. Traditionally, a threshold of
In this section, we build on the overview presented above by describing all the pertinent details to implement the proposed method for GS. Let the binary main trait of interest be represented by
The first stage of the method involved evaluating the two sets of residuals of the secondary traits by removing the effects of weather and genomic covariates. A penalized regression model was used to compute the residuals of each
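The residualization in this first stage can be sketched as follows. This is a minimal illustration assuming a ridge penalty; the array names (`W`, `G`, `S`), their dimensions, and the synthetic data are chosen for the example and are not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-ins for the three data types (sizes are illustrative).
rng = np.random.default_rng(0)
n = 200
W = rng.normal(size=(n, 400))                         # daily weather covariates
G = rng.integers(0, 3, size=(n, 1000)).astype(float)  # SNP dosages (0/1/2)
S = rng.normal(size=(n, 6))                           # six secondary traits

def residualize(X, traits, alpha=1.0):
    """Regress each secondary trait on X with a ridge penalty and
    return the residuals (the trait signal not explained by X)."""
    resid = np.empty_like(traits)
    for j in range(traits.shape[1]):
        fit = Ridge(alpha=alpha).fit(X, traits[:, j])
        resid[:, j] = traits[:, j] - fit.predict(X)
    return resid

S_res_weather = residualize(W, S)   # weather effects removed
S_res_genomic = residualize(G, S)   # genomic effects removed
```

Ridge handles the p > n setting of the weather and SNP blocks directly, which is why a penalized rather than ordinary least-squares fit is needed here.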
After obtaining the intrinsic effect of the secondary traits, represented by the two sets of residuals obtained in step one, we moved on to the second stage of the method. Here, the penalized forward selection was implemented to ensure that the secondary trait residuals were entered first into the model, followed by the weather covariates, and finally followed by the genomic variables. This approach ensured that higher-dimensional data types did not crowd out the smaller-dimensional data types and improved how the variables included in the model explained the variability in the response. We used a logistic classifier as the model structure for its simplicity and ease of interpretability. The probability mass function (PMF) of the
Using the PMF function defined in Eq.
Here,
The likelihood associated with the logistic regression was intractable, and hence NR iterative methods were used to obtain the coefficients associated with the predictors. In this section, we present the algorithm used to initialize and update the coefficients in each iteration of the NR method, as well as the stopping criterion. The algorithm was initialized by setting
1. Update each
2. The NR updates start with a step size
3. Stop the iteration process when no variable is selected.
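The selection scheme above can be sketched as a greedy loop. The sketch below substitutes a plain (unpenalized) logistic fit scored by log-loss for the penalized NR updates, so the `forward_select` helper and its `tol`/`max_steps` parameters are illustrative assumptions rather than the authors' implementation. The property it does reproduce is that earlier data-type blocks are given the chance to contribute before later, larger blocks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def forward_select(blocks, y, max_steps=10, tol=1e-4):
    """Greedy forward selection over ordered data-type blocks
    (e.g., [secondary-trait residuals, weather, SNPs]): each step
    adds the single best predictor, preferring earlier blocks."""
    selected, X_sel = [], np.empty((len(y), 0))
    best_loss = log_loss(y, np.full(len(y), y.mean()))  # intercept-only baseline
    for _ in range(max_steps):
        best = None
        for b, X in enumerate(blocks):
            for j in range(X.shape[1]):
                if (b, j) in selected:
                    continue
                Xc = np.column_stack([X_sel, X[:, j]])
                m = LogisticRegression(max_iter=500).fit(Xc, y)
                loss = log_loss(y, m.predict_proba(Xc)[:, 1])
                if best is None or loss < best[0]:
                    best = (loss, b, j)
            if best is not None and best[0] < best_loss - tol:
                break  # an earlier block already offers an improvement
        if best is None or best[0] >= best_loss - tol:
            break      # no candidate improves the fit: stop
        best_loss, b, j = best
        selected.append((b, j))
        X_sel = np.column_stack([X_sel, blocks[b][:, j]])
    return selected
```

Because the scan over later blocks stops as soon as an earlier block yields an improving candidate, a secondary trait that explains the response will always enter before a weather or genomic variable.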
In the proposed algorithm, there were five hyperparameters that need to be tuned, {
One of the primary objectives of this work was to find the most sparse models that had the best classification performance. While the relationships between the selected secondary trait variables and the response or the genomic variables and the response were easy to interpret, the relationship between the weather covariates and the response was more challenging to understand. Data on four weather variables were collected daily over the entire growing season of 100 days, yielding 400 weather covariates. The variables were maximum temperature (Tmax), minimum temperature (Tmin), wind speed (WS), and rainfall (Precip). Weather covariates that led to the best classifier were selected without regard for the interpretation of the individual covariates chosen. For instance, suppose the three weather covariates chosen were WS at day 45, Tmax at day 18, and Tmin at day 72. There is no insight into what these individual days mean for a breeder or a farmer and how to determine the practical significance of these covariates.
An alternate approach could be to select windows of time when the selected set of weather covariates have the most impact on the main trait. For instance, suppose we select day 18 to day 25 as the window of time in the season. Then, we could include all daily weather covariates for those days in our threestage model to understand the impact of the window on the response. Such windows of time allow for greater interpretability of the impact of weather conditions during the growing season on the final plant production. Further, extreme weather conditions in the identified windows of time could help forecast potential losses, and farmers could take actions to mitigate losses, if possible.
Plant breeding programs often collect data on several weather variables such as wind speed, humidity, daylight hours, temperature,
Suppose the main trait can be expressed as follows:
The time window mean of weather variables is represented as follows:
We found the optimal time window for the weather variables as a preprocessing step. Hence, we included all the weather variables within the optimal time window and did not perform any feature selection for this data type in the modeling stage. We set
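As an illustration of the window search described above, the sketch below grid-searches the start day of a fixed-width window and scores each candidate by the strongest absolute correlation between the window means and the trait. The window width and the correlation criterion are assumptions for the example, not the paper's exact objective.

```python
import numpy as np

def window_means(weather, start, end):
    """Per-variable means over days [start, end); weather has shape
    (n_obs, n_days, n_vars)."""
    return weather[:, start:end, :].mean(axis=1)

def best_window(weather, y, width=7):
    """Slide a fixed-width window over the season and keep the start
    day whose window means correlate most strongly with the trait."""
    n_days = weather.shape[1]
    best_start, best_score = 0, -np.inf
    for s in range(n_days - width + 1):
        M = window_means(weather, s, s + width)
        corr = np.corrcoef(np.column_stack([M, y]).T)[-1, :-1]
        score = np.abs(corr).max()
        if score > best_score:
            best_start, best_score = s, score
    return best_start, best_start + width
```

Once the window is chosen as a preprocessing step, all daily covariates inside it enter the model directly, with no further feature selection on the weather block.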
Predicting a response with two classes is known as a binary classification problem. Some popular supervised learning methods for binary classification include logistic regression, naive Bayes, decision trees, random forests, support vector machines, and neural networks (
Binarization strategies are prevalent because they allow simpler ways to form decision boundaries separating the two classes. Binarization allows for a greater choice of classification algorithms because almost every classification algorithm such as logistic regression, SVM, neural networks,
The OVO method has better predictive ability than the OVA in general (
In this study, we extended our methods to deal with a categorical response with three classes by opting for the OVA binarization. Since the number of classes was three, both OVO and OVA required the same number of binary classifiers: three. The objective of our proposed method was multiclass classification for an ordinal categorical response with any number of classes. Hence, we chose the OVA approach because of computational frugality as well as generalizability. Further details about OVO and OVA can be found in section 2 of the Supplementary Material.
For a
Using the OVA approach for a threeclass problem, we obtained a set of three probabilities for each observation corresponding to the probability that the observation belonged to each of the three classes. The predicted class was simply the class with the highest probability for each observation, referred to as the maximum probability approach.
For three-class classification, we obtained three probabilities for each observation coming from the three binary classifiers. Class assignments depended only on comparing these probabilities and picking the class with the maximum probability. Thus, unlike traditional binary classification, the class assignment was independent of any threshold. Hence, we dropped the threshold search step from the method proposed in the binary classification algorithm and used the rest of the method as the classifier for multiclass classification.
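A minimal sketch of the OVA maximum-probability rule, using plain logistic regression as the per-class binary classifier in place of the proposed three-stage classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ova_fit_predict(X_train, y_train, X_test):
    """One-vs-all: fit one binary classifier per class and assign each
    test observation to the class with the highest probability."""
    classes = np.unique(y_train)
    probs = np.column_stack([
        LogisticRegression(max_iter=500)
        .fit(X_train, (y_train == c).astype(int))   # class c vs. rest
        .predict_proba(X_test)[:, 1]                # P(class = c)
        for c in classes
    ])
    return classes[probs.argmax(axis=1)]
```

The `argmax` over the three per-class probabilities is exactly the maximum probability approach, so no threshold enters the assignment.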
We tried to improve the classification by searching for optimal thresholds between class 1 vs. rest and class 3 vs. rest. The idea was to partition the probability space into three regions corresponding to the three classes, based on these two thresholds. However, the approach had poor forecasting performance, especially for class 2. A possible reason was that three completely separate classifiers were used, each with its own set of coefficients associated with the predictors, leading to three sets of linear predictors with significant overlaps in the predictor space. Thus, this approach was not feasible, and the best performance was observed when using the maximum probability approach.
Binary classification metrics can be modified to make them suitable for multiclass problems. For multiclass classification, the performance was assessed using the following metrics: overall accuracy, macro true positivity rate (TPR), and macro true negativity rate (TNR). Overall accuracy is a metric that remains the same between binary and multiclass problems. It is defined as the total number of correctly classified observations out of the total number of observations in the data set. In binary classification, TPR and TNR are easily defined because there is one positive and one negative class. However, these metrics need to be modified when there are multiple classes. Instead, TPR and TNR can be computed for each class, and then the weighted mean could be found to compute the weighted macro TPR and weighted macro TNR. The weights are determined by the proportions of each class in the training data set. Using the sample multiclass confusion matrix in
A representative confusion matrix representing the true class vs. predicted class for “class 2” in a multiclass classification: true positive (TP), false positive (FP), true negative (TN), and false negative (FN).
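The weighted macro rates can be computed directly from per-class one-vs-rest counts; in the sketch below, the class labels and weights are illustrative:

```python
import numpy as np

def weighted_macro_rates(y_true, y_pred, classes, weights):
    """Weighted macro TPR and TNR from per-class one-vs-rest counts;
    weights are the class proportions in the training data."""
    tprs, tnrs = [], []
    for c in classes:
        pos, hit = (y_true == c), (y_pred == c)
        tp, fn = np.sum(pos & hit), np.sum(pos & ~hit)
        tn, fp = np.sum(~pos & ~hit), np.sum(~pos & hit)
        tprs.append(tp / (tp + fn))   # sensitivity for class c
        tnrs.append(tn / (tn + fp))   # specificity for class c
    w = np.asarray(weights, dtype=float)
    return float(w @ tprs), float(w @ tnrs)
```

With equal weights this reduces to the plain macro average; unequal training proportions shift the average toward the majority class.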
We used a chickpea data set to evaluate the performance of the proposed model and contrast it with standard machine learning methods such as random forest (RF) and support vector machines (SVM). The data set contained phenotypic, weather, and genomic SNP data. The phenotypic and weather data were collected at four locations, namely the International Center for Agricultural Research in the Dry Areas (ICARDA)—Amlaha, the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT)—Patancheru, Junagadh Agricultural University (JAU)—Junagadh, and the Rajasthan Agricultural Research Institute (RARI)—Durgapura, over two seasons, 2014–15 and 2015–16. Genotypic information was available for
Over 15 phenotypic variables were collected across the eight environments. However, only seven of those were available in all eight environments, and we selected only these as the primary and secondary phenotypic traits for the analysis. The seven phenotypic variables were days to flowering (DF), plant height (PLHT), basal primary branch (BPB), basal secondary branch (BSB), days to maturity (DM), pods per plant (PPP), and 100-seed weight (100SDW). This work focused on days to maturity (DM) as the primary trait of interest; the remaining six traits form the secondary trait data type.
The second data type in this study was the weather data collected daily over the entire growing season of 100 days. Depending on the location, the growing season was between October and April. Several weather variables were collected at each location; however, only four variables were commonly present across the four locations for each day. The variables were maximum temperature, minimum temperature, wind speed, and rainfall. Since the four weather traits were collected daily over 100 days, there were a total of 400 weather variables that form the weather data type.
Genomic single-nucleotide polymorphism (SNP) data were available for the 749 lines. We randomly selected 10,000 markers for each line to form the genomic data type. To summarize, there were six secondary traits, 400 weather traits, and 10,000 SNPs as the final set of predictor variables. We considered a three-class DM variable as the response for the multiclass classification demonstration and also implemented the proposed three-stage method for a binary trait, whose implementation and results can be found in section 3 of the Supplementary Material.
For the multiclass implementation, we created a three-class categorical response from the continuous DM trait by discretizing it into three levels corresponding to low, average, and high days to maturity. All observations in the bottom 25th percentile were denoted as class 1 and those in the top 25th percentile as class 3. We also took a 25% band of observations centered around the mean of the continuous trait and denoted it as class 2. The discretization of the continuous response is depicted in
Visualization depicting the process of discretizing a continuous trait (Days to Maturity—DM) into a multiclass trait. All observations in red are denoted as class 1, green as class 2, and blue as class 3.
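A sketch of this discretization is below. Note that the middle band here is the central 25% quantile band, which coincides with a band centered on the mean only for roughly symmetric traits, so the exact band construction should be treated as an assumption of the sketch.

```python
import numpy as np

def discretize_dm(dm):
    """Discretize a continuous trait: bottom quartile -> class 1,
    top quartile -> class 3, central 25% quantile band -> class 2.
    Observations falling outside all three bands keep label 0."""
    q25, q75 = np.quantile(dm, [0.25, 0.75])
    lo, hi = np.quantile(dm, [0.375, 0.625])   # central 25% band
    cls = np.zeros(len(dm), dtype=int)
    cls[dm <= q25] = 1
    cls[dm >= q75] = 3
    cls[(dm >= lo) & (dm <= hi)] = 2
    return cls
```

Because the three bands are disjoint, roughly a quarter of the observations (those between the quartiles but outside the central band) are left out of the three classes.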
The performance of the multiclass methods was evaluated using three different class-balance settings: 33-34-33, 40-40-20, and 10-80-10. 33-34-33 refers to the balanced class setting, while 10-80-10 refers to the extreme imbalance setting where 80% of the observations are in class 2 and 10% in each of the other two classes. We randomly sampled 280 observations to create the data sets with different class ratios. For each class ratio, 20 replications were created, and the performance was averaged across the replications to avoid sampling bias.
We decomposed the multiclass problem into a set of binary classification problems using the OVA approach. As discussed earlier, threshold searching methods were not employed due to this OVA approach. Thus, the 280 observations were split into a train and test set in the ratio of 200/80. The training data set was used for the penalized logistic regression modeling step, and the test data set was used to evaluate the performance of the models.
The penalization effect was evaluated with the help of a baseline model
Summary of models assessed in this paper for a multiclass categorical trait.
Model  Notation 
G + E + P  MC0  0  0  0 
PenG + PenE + P  MC1  0  varying  varying 
PenG + PenE + PenP  MC2  varying  varying  varying 
PenG + FWE + P  MT1  0  0  varying 
PenG + FWE + PenP  MT2  varying  0  varying 
Support Vector Machines  SVM  –  –  – 
Random Forest  RF  –  –  – 
In our research, we focused on several aspects of developing models to improve breeding and aid the selection process. The focus was to integrate three different data types with differing dimensions for the purpose of predicting phenotypes that can be categorized. In addition, we wanted to see whether a subset of the weather information is sufficient to achieve the same prediction accuracy as we can reach with the full set of weather information.
First, the proposed models were evaluated in terms of overall accuracy, and we also included TPR and TNR. Seven classification models (MC0, MC1, MC2, MT1, MT2, SVM, RF) were evaluated and compared. We wanted to see whether the penalization-based methods and the methods that incorporate the reduced amount of weather information could outperform machine learning techniques. We also wanted to determine whether model performance was influenced by the rate of imbalance in the data. All the proposed models showed performance similar to the ML models for the balanced class and the medium imbalance (40-40-20) settings, with all models having an overall accuracy of ∼.55 in both. The penalization-based models (MC0, MC1, and MC2) had overall accuracy similar to the optimal-time-window-based models (MT1 and MT2) for these two settings. However, in the extreme imbalance setting of 10-80-10, MT1 and MT2 had significantly worse accuracies of .57 and .59 compared to the .80 accuracies of the penalization-based and ML models. It is important to note that both the ML models and the penalization models ended up predicting all observations as class 2 in the 10-80-10 case and hence achieved an accuracy of 80%, matching the proportion of class 2 in that setting. Weighted macro TPR and TNR metrics provide better insight into a model's predictive ability in the presence of imbalance, and hence we looked at these metrics next. Across the board, the standard errors associated with the averages of the overall accuracy, mTPR, and mTNR were on the order of 10^{-3} to 10^{-4} and hence are not presented here. Overall accuracy results for all the models can be seen in
Bar plot comparing the seven different classification models based on overall classification accuracy averaged over the 20 replications within each class balance setting for the Days to Maturity (DM) trait with three classes. The seven models and their notations are as follows: G + E + P (MC0), PenG + PenE + P (MC1), PenG + PenE + PenP (MC2), PenG + FWE + P (MT1), PenG + FWE + PenP (MT2), support vector machine (SVM), and random forest (RF).
For the balanced class setting, the weighted macro TPR values for the proposed models were around .78, while the ML models had values around .58. This represents a 20% higher TPR value for our proposed models. On the other hand, macro TNR values for the proposed models were similar to those of the ML methods in the balanced case. We saw similar trends for the medium imbalance case as well. However, in the extreme imbalance case, the ML methods outperformed the proposed methods in terms of the weighted macro TPR metric. In contrast, MT1 and MT2 had ∼25% higher macro TNR than the penalization-based and ML models. These results can be viewed in
Bar plot comparing the seven different classification models based on the weighted macro true positivity rate (TPR) averaged over the 20 replications within each class balance setting for the Days to Maturity (DM) trait with three classes. The seven models and their notations are as follows: G + E + P (MC0), PenG + PenE + P (MC1), PenG + PenE + PenP (MC2), PenG + FWE + P (MT1), PenG + FWE + PenP (MT2), support vector machine (SVM), and random forest (RF).
Bar plot comparing the seven different classification models based on the weighted macro true negativity rate (TNR) averaged over the 20 replications within each class balance setting for the Days to Maturity (DM) trait with three classes. The seven models and their notations are as follows: G + E + P (MC0), PenG + PenE + P (MC1), PenG + PenE + PenP (MC2), PenG + FWE + P (MT1), PenG + FWE + PenP (MT2), support vector machine (SVM), and random forest (RF).
Model interpretability was one of the primary objectives of this study, along with the evaluation of the models' predictive power. Interpretability can be assessed through the complexity of the model used: generally, including more predictors results in a more complex model that is harder to interpret. Our methods showed a tremendous improvement over the ML methods in model size. Across all class balance settings, our method had close to 90% fewer predictors in the final models compared to the ML models. Between the penalization-based and the optimal-window-based approaches, the penalized methods had the smaller model sizes, which complements what we observed with the binary trait results (presented in the supplementary materials). MC1 and MC2 had the smallest models consistently across the class settings. Just as with the binary trait, the penalization-based methods MC0, MC1, and MC2 did not include any genomic variables in the final model, while MT1 and MT2 did. Refer to
Bar plot comparing the proposed models to random forest based on model size averaged over the 20 replications within each class balance setting for the Days to Maturity (DM) trait with three classes. The six models and their notations are as follows: G + E + P (MC0), PenG + PenE + P (MC1), PenG + PenE + PenP (MC2), PenG + FWE + P (MT1), PenG + FWE + PenP (MT2), and random forest (RF).
Summary of results for the seven different models across the three different class-balance settings. The performance was measured using overall accuracy (Acc), weighted macro true positivity rate (mTPR), weighted macro true negativity rate (mTNR), and model size (MDS). The seven models and their notations are as follows: G + E + P (MC0), PenG + PenE + P (MC1), PenG + PenE + PenP (MC2), PenG + FWE + P (MT1), PenG + FWE + PenP (MT2), support vector machine (SVM), and random forest (RF).
Balance  Model  Acc  mTPR  mTNR  MDS 

33—34—33  MC0  .57  .78  .78  240.5 
MC1  .57  .78  .79  83.8  
MC2  .57  .79  .79  28.5  
MT1  .56  .78  .78  476.9  
MT2  .55  .78  .78  459.9  
SVM  .55  .55  .77  –  
RF  .58  .58  .79  3191.6  
40—40—20  MC0  .55  .76  .72  238.8 
MC1  .56  .76  .73  60.3  
MC2  .56  .76  .73  45.2  
MT1  .56  .77  .76  460.3  
MT2  .55  .77  .74  458.6  
SVM  .54  .54  .73  –  
RF  .56  .56  .74  3443.8  
10—80—10  MC0  .80  .67  .21  253.2 
MC1  .80  .67  .21  24.3  
MC2  .80  .67  .21  48.7  
MT1  .59  .70  .54  475.9  
MT2  .57  .70  .53  468.4  
SVM  .80  .80  .23  –  
RF  .80  .80  .32  1013.4 
One of the recent challenges in plant breeding, and especially in terms of accelerating genetic gain, is how to utilize all the data that are collected. With high-throughput technologies, information is collected on the molecular level, on the different environments including weather information, and on factors that describe the different management practices. This system of diverse information has the potential to connect the different genotypes in different environments and explain how they might be related and ultimately benefit the prediction of traits of interest.
This paper proposed a novel three-stage classification method to improve multiclass classification when combining multi-type data. More specifically, we developed a classification method for binary and multiclass responses using secondary trait data, weather covariates, and marker information. We used the one-versus-all (OVA) binarization approach (
The improved performance of the proposed model compared to the ML methods can be attributed to the manner in which the stages of the proposed method were constructed. First, by isolating the intrinsic effect of the secondary traits, we reduced confounding. Reducing confounding helped separate the effect of each data type on the response and improved the independence between data types; it is well known that collinearity and highly correlated predictors significantly harm prediction and classification efforts. Second, by controlling the order in which data types entered the model, we gave the secondary traits the best chance of being selected in the final model. This process ensured that weather and genomic variables were not selected unless they significantly enhanced the model, and that the secondary traits were not ignored merely because of their low dimensionality. We further reinforced this effect through the choice of
One of the key strengths of the proposed method is its modular nature. In the first stage, we used ridge-based penalized regression to extract the intrinsic effect of the secondary traits devoid of weather and genomic effects. The ridge penalty can be substituted for any other penalties such as LASSO, adaptive LASSO, elastic net, adaptive elastic net,
There are several immediate extensions that we can foresee. In this work, we considered 10,000 randomly selected SNPs as the genomic data type. Surprisingly, none of these SNPs were selected in the final models based on the proposed method, which could be a result of the random selection process of these SNPs from a much larger pool of available SNPs. Alternatively, it could also be attributed to measures employed to keep model sizes as small as possible such as the choice of penalties and
There is limited literature (
Finally, while we presented this method in an agronomic setting, the three-stage model proposed can be implemented in any problem involving combining data types of differing sizes. For example, we believe that it could be very useful in the area of precision medicine where the main trait is a risk of disease or reaction to a medication. The secondary traits could be other physiological measurements from the subjects, the medium-dimensional data could be the lifestyle and behavioral characteristics, and the high-dimensional data type could be the genomic marker information.
The data analyzed in this study is subject to the following licenses/restrictions: The data set used in this study is available upon request. Requests to access these datasets should be directed to Reka Howard,
VM conceptualized the work, analyzed the data, and summarized the results. DJ conceptualized the study, edited the paper, and supervised the project. RH conceptualized the study, edited the paper, and supervised the project. All authors have read and agreed to the published version of the paper.
The authors are thankful to Hari D. Upadhyaya for their contribution to the planning of the study and for generating quality phenotypic data at ICRISAT, Patancheru.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
The Supplementary Material for this article can be found online at: