A data-driven crop model for biomass sorghum growth process simulation

Chang, Yanbin; Ni, Zheng; Panelo, Juan S.; Kemp, Joshua; Salas-Fernandez, Maria G.; Wang, Lizhi

doi:10.3389/fpls.2025.1617775

ORIGINAL RESEARCH article

Front. Plant Sci., 13 November 2025

Sec. Crop and Product Physiology

Volume 16 - 2025 | https://doi.org/10.3389/fpls.2025.1617775

Yanbin Chang¹

Zheng Ni¹

Juan S. Panelo²,³

Joshua Kemp²

Maria G. Salas-Fernandez²*

Lizhi Wang⁴,⁵

¹School of Industrial Engineering and Management, Oklahoma State University, Stillwater, OK, United States
²Department of Agronomy, Iowa State University, Ames, IA, United States
³Horticultural Sciences Department, University of Florida, Gainesville, FL, United States
⁴Department of Bioengineering, George Mason University, Fairfax, VA, United States
⁵Department of Systems Engineering and Operations Research, George Mason University, Fairfax, VA, United States

Accurate simulation of crop growth processes for predicting final yield is critical for optimizing resource management, particularly in regions with variable climates and limited resource availability. This paper proposes a novel data-driven crop model to simulate phenotypic changes during biomass sorghum growth. The model integrates a detailed physiological framework for sorghum development—tracking how phenotypes are determined by genotype, environment, management practices, and their interactions—with data-driven techniques to calibrate genotypic parameters using experimental data. Results demonstrate that the model achieves accurate biomass production predictions and successfully disentangles the effects of environmental and management factors on phenotypic development, even with limited data. This model enhances the accuracy and applicability of biomass sorghum growth and yield prediction models, offering valuable insights for precision agriculture.

1 Introduction

Sorghum (Sorghum bicolor (L.) Moench) is a versatile C4 drought-resistant and nutritionally valuable crop, integral to food security and biofuel production around the world (Wang et al., 2008; Xiong et al., 2019; Silva et al., 2022). Among different sorghum types, biomass sorghum has emerged as a resource capable of accumulating over 20 tn/ha of dry matter (Salas-Fernandez and Kemp, 2022) for forage and bioenergy production. In addition, bioenergy sorghums are beneficial toward greenhouse gas mitigation (Olson et al., 2012).

Biomass yield can be influenced by environmental factors, management practices (Olson et al., 2013) and, given its polygenic nature, genotypic variability (Breitzman et al., 2019; Habyarimana et al., 2020; Singh et al., 2025). Key biomass-related traits such as stem diameter, plant height (Salas Fernandez et al., 2017), flowering time (Habyarimana et al., 2020), and carbon partitioning (Boatwright et al., 2022) have been the focus of attention to dissect the complexity of biomass yield. The study of biomass-related traits led to the development of genetic resources including the Bioenergy Association Panel, the Carbon-Partitioning Nested Association Mapping panel and the Photoperiod Sensitive Panel (PSP) (Brenton et al., 2016; Yu et al., 2016; Boatwright et al., 2021). These populations enabled studying biomass-related traits with germplasm relevant to the production system, integrating growth dynamics with high throughput phenotyping (Panelo et al., 2024) and crop modeling strategies (Panelo et al., 2025).

Accurate simulation of the biomass sorghum growth process is pivotal for predicting the final yield and optimizing resource management strategies, particularly in areas susceptible to climate variability and resource constraints (Zha et al., 2010; Biazin et al., 2012; Kugedera et al., 2022). Reliable yield predictions are essential for optimizing agronomic interventions, resource allocation, and supply chain logistics. Consequently, researchers have explored various modeling approaches, ranging from process-based crop simulations to data-driven models, to address this challenge.

Process-based crop models have been widely used to predict sorghum yield by explicitly integrating various physiological processes, environmental factors, and management practices. SORKAM, introduced by Rosenthal et al. (1989), broke ground by modeling daily canopy development and adjusting carbon partitioning based on organs’ demands. This sink-source foundation was brought into the Decision Support System for Agrotechnology Transfer (DSSAT) framework, introducing CERES-Sorghum (Virmani et al., 1989). In CERES-Sorghum, radiation-use efficiency drives daily biomass production that is then distributed to leaves, stems, and grain according to stage-specific coefficients, whereas genotypic coefficients drive mostly crop phenology (White et al., 2015). Continuous updates in the CERES-Sorghum model improved routines for leaf area development and biomass partitioning, boosting predictive skill by up to 20% (White et al., 2015), while experiments with larger rooting depths have successfully identified management practices for sweet sorghum (Lopez et al., 2017). The Agricultural Production Systems sIMulator (APSIM) is another radiation use efficiency-based model, including a sorghum module that has been optimized for integration with plant breeding (Hammer et al., 2010). This crop growth model has been effective for simulating genetic diversity in sorghum across environments (MacCarthy et al., 2009; Chimonyo et al., 2016; Truong et al., 2017; Yang et al., 2021; Tirfessa et al., 2023). Advancements in high-throughput phenotyping allowed integrating remotely sensed leaf area index (LAI) and vegetation indices like NDVI with both CERES and APSIM to correct state variables, thus improving predictive ability under varying climatic conditions (Masjedi et al., 2018; Della Nave et al., 2022; Kivi et al., 2023). Generally, process-based models like APSIM and DSSAT describe processes on a fine-scale temporal basis (Jones et al., 2003; Holzworth et al., 2014). 
However, calibrating the parameters used by process-based models represents a challenge as it requires resource-intensive field experiments in a range of environments (He et al., 2017).

Data-driven models aim to build a mathematical relationship between the input data and the output, unlike process-based models, which rely on known physiological mechanisms (Roberts et al., 2017). Jiang et al. (2004) developed an artificial neural network using back-propagation algorithms to enhance crop yield prediction accuracy. Over a decade later, several deep neural network-based models were developed to ingest daily weather grids, layered soils, and genotype markers to untangle the genotype by environment (G×E) interactions driving yield. Khaki and Wang (2019) predicted maize yields for new hybrids planted in unseen locations by learning the complex G×E interactions from historical trials, while Shook et al. (2021) integrated genotype information with weather variables to improve soybean yield prediction. Khaki et al. (2020) improved generalization by introducing a combined convolutional and recurrent neural network (CNN-RNN) framework, which extracts spatio-temporal features from weather and soil data to capture latent G×E patterns. This hybrid CNN-RNN outperformed random forests and linear models (Khaki et al., 2020).

Statistical regression modeling is another data-driven method, which can take advantage of weather and remote sensing data. County-scale weather regressions could achieve notable accuracy in maize yield forecasting (Conradt et al., 2016), and satellite-derived vegetation indices, weather, soil, and location data could explain soybean yield variation (Chen et al., 2019). Similar methods using NDVI time series have also been applied to wheat yield estimation (Duan et al., 2017). These studies highlight that well-structured regression models can provide robust, interpretable predictions, especially when paired with remote sensing and meteorological inputs. As with process-based models, integrating high-throughput phenotyping imagery from unmanned aerial vehicles (UAVs) with advanced machine learning improves precision. Varela et al. (2021) demonstrated that high temporal resolution UAV imagery can capture growth dynamics in biomass sorghum by extracting time-series features such as canopy development rates. Their model utilized dynamic and time-point specific image-derived features to predict biomass accumulation, highlighting the benefit of monitoring crop progress over time. Integration of UAV-based data with deep learning algorithms sharpened predictive performance, as the fine-scale, high-resolution data from UAVs better capture crop health and stress status throughout the growing season (Masjedi et al., 2019; Khaki et al., 2021; Wang and Crawford, 2021; Wang et al., 2023). The data-driven modeling approach has two major limitations. First, the black-box structure between input and output layers makes the results less interpretable, since the model can build relationships in the data that ignore known assumptions (Alibabaei et al., 2022; Drees et al., 2024). Second, the model performance is highly sensitive to data quantity and quality, posing challenges when applying the model with insufficient or noisy data (Jabed and Murad, 2024; Miftahushudur et al., 2025).

Researchers have recently attempted to integrate traditional process-based crop growth models with data-driven modeling techniques to gain both accuracy and interpretability. One popular route treats the simulated state variables as engineered features, fed into gradient-boosting or bagged-tree ensembles (Feng et al., 2020; Shahhosseini et al., 2021). These studies demonstrated how output variables from APSIM, such as phenology and soil moisture, can serve as engineered features in machine learning frameworks, reducing prediction errors in wheat (Feng et al., 2020) and maize (Shahhosseini et al., 2021). Similar integrations in soybean (Corrales et al., 2022) and maize (Zhang et al., 2021) improved the prediction performance by combining environmental data with crop growth model outputs into linear regression models. These integrative models tend to be more transparent, since the process-based component ties predictions to biophysical crop responses, and the data-driven component can quantify feature importance. A second route builds neural emulators that approximate the entire CERES or APSIM parameter surface. Some field-focused studies (McCormick et al., 2021; Xiao et al., 2022; Droutsas et al., 2022; Cunha et al., 2023) reinforced that coupling data-driven and process-based techniques provides more interpretable agronomic adjustments under climate adaptation scenarios. Likewise, Li et al. (2023), Gallear (2023), and Chang et al. (2023) report that machine learning emulators of crop models enabled faster simulations and more efficient scenario analyses. These tools facilitate real-time exploration of “what-if” management decisions and provide interpretable outputs. Overall, these studies highlight that integrating knowledge from the process-based modeling domain with the flexibility of machine learning results in more accurate and data-efficient models that are also transparent and actionable, advancing decision making for breeders, agronomists, and farmers.

This paper presents a novel data-driven crop model for biomass sorghum growth simulation. The model integrates a descriptive sorghum growth framework—tracking phenotypic responses to genotype, environment, and management (G×E×M) interactions—with data-driven calibration of genotypic parameters from experimental data. Unlike conventional process-based models that treat genotypes as fixed inputs, our approach explicitly disentangles G×E×M effects on phenotypes during the sorghum growth stage by parameterizing genetic properties for each genotype. This methodology streamlines the calibration of complex coefficients inherent in process-based models and reduces reliance on uncertain parameters derived from field experiments, which are often confounded by G×E×M interactions. Additionally, our modular framework adapts to data availability, eliminating the need for predetermined datasets or assumptions about missing information. This adaptability stands in contrast to traditional models, which require extensive data imputation prior to implementation. To the best of our knowledge, this paper presents the first attempt to merge a crop model and a data-driven model to address biomass sorghum yield prediction.

2 Materials and methods

In this section, we first present the input data used in this study, then demonstrate the sorghum growth model used in the data-driven crop model approach, followed by the training approach.

2.1 Sorghum data

Sorghum phenotypic data were collected from field trials conducted in 2021 and 2022 at the Iowa State University Agricultural Engineering and Agronomy farm, in Boone, IA. The experiments were conducted using a randomized complete blocks design with two replications, with a planting rate of 12 pl/m² and 70 cm inter-row spacing. The trials evaluated the Photoperiod Sensitive Panel, which includes 270 photoperiod sensitive (PS) sorghum genotypes (Yu et al., 2016). PS sorghum requires a daylength shorter than 12 hours and 20 minutes for flowering (Rooney and Aydin, 1999), and is primarily cultivated for biomass production. Its extended vegetative stage in temperate and subtropical climates results in higher total dry biomass of leaves and stems compared to other sorghum types during later growth phases (Rooney and Aydin, 1999; Hao et al., 2014). The dataset includes phenotypic records for 11 time points during the growing season (22–145 days after planting), along with genotype and management data. Although the exact sampling dates differed by one or two days, the scheduled measurement points were 22, 36, 43, 50, 57, 64, 71, 78, 85, 110, and 145 days after planting. Phenotypic measurements included dry biomass weights of stems and leaf blades. Each fraction comprises the total biomass of the main culm and, where present, the tillers. Dry biomass was recorded after drying the samples at 60 °C until constant weight. Additionally, management records were collected, containing information on planting date, harvest date, and stand count (plant population density).

2.2 Weather data

To comprehensively account for the effect of weather on sorghum growth, weather data were retrieved from the Iowa Environmental Mesonet (Herzmann and Wolt, 2020), which includes an automated weather station at the farm where the experiments were performed. The variables obtained were air temperature, relative humidity, solar radiation, precipitation, wind speed, evapotranspiration, soil temperature (at depths of 4, 12, 24, and 50 inches), and soil volumetric water content (at depths of 12, 24, and 50 inches).

2.3 Sorghum growth model

Our sorghum growth model is designed based on the available data previously described. It has a customized module structure adapted to the modeling granularity provided by the available data. Although the crop model includes a grain component, PS sorghum does not produce grain in temperate environments. In this paper, we focus on tracking the phenotype during the sorghum growth process, with particular emphasis on total biomass weight, which is determined by the dry weight of leaves and stems. Based on these considerations, we constructed a sorghum growth model using the module structure shown below (Figure 1). More detailed definitions and equations are illustrated in Supplementary Presentation 1.

Figure 1

Diagram of the Sorghum Growth Module showing interactions between genotype, environment, and management factors. Inputs include environment (air and soil conditions), genotype, and management practices. The growth module processes stress, tillering, growth, water, photosynthesis, phenology, transpiration, maintenance, and respiration. Outputs are phenotype traits: leaf weight, stem weight, and plant height.

Figure 1. Architecture of the proposed sorghum growth process simulation model.

● Stress: heat and cold stresses based on air temperature and root temperature are considered. Air temperature stress can affect the leaves and stems, whereas root temperature stress affects only the roots.

● Tillering: tillers have their own leaf, stem, grain, and root systems.

● Growth: the model updates leaf weight, root weight, root length, stem weight, stem height, grain weight, after considering maintenance and growth respiration. Root length and stem height do not decrease due to irreversible cell expansion and lignification. Under carbon deficit (maintenance > growth), biomass is remobilized from existing pools (weight decline) while maintaining structural dimensions, reducing tissue density. When growth biomass is replenished, the model first restores tissue density before allocating to new growth.

● Water: water can be stored and transported in the xylem of the main crop and tillers. Plant water uptake is influenced by root system efficiency and xylem transport capacity, while stand count affects water availability through competition among neighboring plants. Soil water volume is supplied as input data and is therefore treated as an external input, independent of plant activity.

● Photosynthesis: the daily biomass accumulation is determined by light, water, leaf, and phloem capacity constraints; radiation interception is modeled as a function of stand count.

● Phenology: the model has four transition points for sorghum growth: 1) planting, 2) vegetative stage, 3) bloom and grain filling, and 4) harvest. For PS sorghum, the phenology module omits stage 3 (bloom and grain filling).

● Transpiration: air temperature, humidity, evaporation, and wind affect transpiration.

● Maintenance: the photosynthate consumed for maintenance and senescence of the leaf, root, stem, and grain is determined by organ weight and stress.

● Respiration: respiration consumes photosynthate and provides energy for plant growth activity.
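
The growth rule described above (weight may decline under a carbon deficit while height is preserved, and density is restored before new growth) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `Stem` class, the `reference_density` parameter, and all numeric values are hypothetical, and the actual equations are those in Supplementary Presentation 1.

```python
from dataclasses import dataclass


@dataclass
class Stem:
    """Hypothetical stem state: dry weight (g) and height (cm)."""
    weight: float
    height: float


def update_stem(stem: Stem, net_carbon: float, reference_density: float) -> Stem:
    """Apply one day's net carbon balance (g) to a stem.

    net_carbon <= 0 models a carbon deficit: weight is remobilized but
    height is kept, so tissue density (weight / height) falls.
    net_carbon > 0 first restores density to reference_density (g/cm),
    and only the remainder extends the stem.
    """
    if net_carbon <= 0:
        new_weight = max(stem.weight + net_carbon, 0.0)
        return Stem(new_weight, stem.height)  # height never decreases

    # Carbon needed to bring the existing structure back to full density.
    density_gap = max(reference_density * stem.height - stem.weight, 0.0)
    to_density = min(net_carbon, density_gap)
    to_growth = net_carbon - to_density

    new_weight = stem.weight + net_carbon
    new_height = stem.height + to_growth / reference_density
    return Stem(new_weight, new_height)


# A 10 g deficit lowers weight only; a later 15 g surplus first refills
# the 10 g density gap, then converts the remaining 5 g into 10 cm of height.
stressed = update_stem(Stem(100.0, 200.0), -10.0, reference_density=0.5)
recovered = update_stem(stressed, 15.0, reference_density=0.5)
```

Keeping height and weight as separate state variables is what allows density to vary; a model that derived height from weight could not represent the deficit case.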

Our data-driven crop model approach can separate input data, output data, genotype-specific properties, intermediate variables, and output variables. In contrast, other crop models such as APSIM and DSSAT use parameters that are jointly determined by genotype and environment interactions. Instead of using growing degree days (GDD) as the threshold for growth stage transitions and biomass partition ratios, we define a growing degree unit (GDU) in eq. (S2), which is similar to GDD but is determined by hourly temperature. Because it is computed on an hourly scale, the GDU can capture weather fluctuations that daily averages may obscure. We also define a growing phenology unit (GPU) in eqs. (S3) and (S4), which is calculated from normalized temperature and normalized solar radiation to determine the growth stage. The data-driven crop model approach calibrates the parameters using data rather than using predetermined coefficients. This difference provides advantages, including compatibility with state-of-the-art data-driven calibration algorithms and adaptability to breeding algorithms.
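
To illustrate why a thermal unit computed from hourly temperatures can differ from one based on the daily mean, the sketch below compares the two on a single day. The exact GDU definition is given in eq. (S2) and is not reproduced here; the capped-linear response, the base temperature of 10 °C, and the upper limit of 30 °C are assumptions chosen only for illustration.

```python
def thermal_response(temp_c: float, base: float = 10.0, upper: float = 30.0) -> float:
    """Assumed response: degrees above `base`, capped at `upper` (not eq. S2)."""
    return max(min(temp_c, upper) - base, 0.0)


def daily_gdd(hourly_temps: list[float]) -> float:
    """Degree days computed from the daily mean temperature."""
    mean_temp = sum(hourly_temps) / len(hourly_temps)
    return thermal_response(mean_temp)


def hourly_gdu(hourly_temps: list[float]) -> float:
    """Degree days accumulated hour by hour, then averaged over the day."""
    return sum(thermal_response(t) for t in hourly_temps) / len(hourly_temps)


# A day with a cold night (4 °C for 12 h) and a warm afternoon (24 °C for 12 h).
temps = [4.0] * 12 + [24.0] * 12
print(daily_gdd(temps))   # mean is 14 °C -> 4.0
print(hourly_gdu(temps))  # cold hours contribute 0, warm hours 14 -> 7.0
```

Under these assumptions the daily mean hides the fact that half of the day was below the base temperature, so the two measures disagree by a factor of almost two.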

2.4 Training approach

To demonstrate the effectiveness of the data-driven crop approach, we applied the sorghum growth model to the dataset described previously. Computational experiments were conducted using Python on the High Performance Computing Center at Oklahoma State University with dual Intel “Skylake” 6130 CPUs 192 2.1GHz and 96 GB RAM. The data-driven training method is illustrated below.

The missing data in the weather dataset were imputed using the k-nearest neighbors (kNN) method (Fu et al., 2019; Hamzah et al., 2021), which is widely used to handle missing values in crop-yield prediction studies. Genotypic parameters are seeded with values taken from APSIM 7.10 (Holzworth et al., 2014) together with expert-derived bounds, providing a biologically plausible starting point that speeds convergence. Following the workflow in Algorithm 1, we trained the model on one experimental year and validated it on the other. The calibration of genotypic parameters g∗ uses the relative root mean square error (RRMSE) as the performance metric, which scales the classic RMSE by the mean observed value, making it easier to compare across traits and years. The heuristic algorithm applies an iterative search: 1) randomly select n parameters from the N-dimensional vector g∗, assigning higher sampling probability to parameters with greater local sensitivity; 2) for each selected parameter, propose two new values, one incremented and one decremented by the current step size; 3) evaluate the RRMSE for each proposed g∗ vector; and 4) update the optimal g∗ if a proposal yields a better RRMSE, otherwise retain the current solution and proceed to the next iteration. The search terminates when the time limit is reached or when the RRMSE falls below a predefined tolerance. The function GetRRMSE calculates the RRMSE of dry biomass from the given training set. The RRMSE can be calculated as:

$$\mathrm{RRMSE} = \frac{\sqrt{\frac{1}{n} \sum_{i = 1}^{n} (x_{i, t} - \hat{x}_{i, t})^{2}}}{\frac{1}{n} \sum_{i = 1}^{n} x_{i, t}}$$

where,

● $n$, the sample size.

● $x_{i, t}$, the observed total dry biomass weight of leaves and stems in sample $i$ on day $t$.

● $\hat{x}_{i, t}$, the predicted total dry biomass weight of leaves and stems in sample $i$ on day $t$.
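
The RRMSE definition above translates directly into code. The following is a minimal sketch; the function name `rrmse` and the sample values are illustrative and are not taken from the study's dataset.

```python
import math


def rrmse(observed: list[float], predicted: list[float]) -> float:
    """Relative RMSE: RMSE divided by the mean of the observed values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    n = len(observed)
    mse = sum((x - x_hat) ** 2 for x, x_hat in zip(observed, predicted)) / n
    mean_observed = sum(observed) / n
    return math.sqrt(mse) / mean_observed


# Made-up biomass values (g) for three samples on one day.
observed = [100.0, 120.0, 80.0]
predicted = [90.0, 130.0, 80.0]
print(f"{rrmse(observed, predicted):.3f}")  # 0.082
```

Because the error is divided by the mean observation, a value of 0.082 reads as an 8.2% error regardless of whether the trait is measured in grams or kilograms.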

Algorithm 1. Heuristic Algorithm for Tuning Genotypic Parameters.
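
The sketch below follows the four search steps described in Section 2.4. It is one possible reading of that description, not the authors' implementation: the function and argument names, the fixed step size, the iteration cap (used here in place of a wall-clock time limit), the clipping of proposals to the parameter bounds, and the toy objective standing in for GetRRMSE are all assumptions.

```python
import random
from typing import Callable, Sequence


def tune_parameters(
    g_init: Sequence[float],
    bounds: Sequence[tuple[float, float]],
    get_rrmse: Callable[[list[float]], float],
    sensitivity: Sequence[float],
    n_select: int = 2,
    step: float = 0.1,
    max_iters: int = 1000,
    tolerance: float = 0.05,
    seed: int = 0,
) -> tuple[list[float], float]:
    """Local search over genotypic parameters (hypothetical interface)."""
    rng = random.Random(seed)
    best = list(g_init)
    best_err = get_rrmse(best)
    for _ in range(max_iters):
        if best_err < tolerance:
            break
        # Step 1: sample parameter indices, favoring sensitive ones.
        # Sampling is with replacement here, a simplification.
        chosen = set(rng.choices(range(len(best)), weights=sensitivity, k=n_select))
        for idx in chosen:
            lo, hi = bounds[idx]
            base = list(best)
            # Step 2: propose an incremented and a decremented value.
            for delta in (step, -step):
                candidate = list(base)
                candidate[idx] = min(max(base[idx] + delta, lo), hi)
                # Step 3: evaluate the proposal.
                err = get_rrmse(candidate)
                # Step 4: keep it only if it improves on the incumbent.
                if err < best_err:
                    best, best_err = candidate, err
    return best, best_err


# Toy objective standing in for GetRRMSE: mean distance to a known optimum.
target = [0.5, 2.0, 1.0]


def toy_rrmse(g: list[float]) -> float:
    return sum(abs(a - b) for a, b in zip(g, target)) / len(target)


g_star, err = tune_parameters(
    g_init=[0.0, 0.0, 0.0],
    bounds=[(0.0, 3.0)] * 3,
    get_rrmse=toy_rrmse,
    sensitivity=[1.0, 1.0, 1.0],
)
```

In the study, each objective evaluation would require running the growth model over the training series, so the number of evaluations per iteration (two per selected parameter) is the main cost driver of this kind of search.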

3 Results

In this section, we demonstrate the training and test strategies and results of our data-driven crop model, followed by additional noteworthy findings.

3.1 Phenotypic data

Figure 2 displays box plots of leaf and stem dry weights for the two field trials carried out in Boone, IA. Each pair of leaf and stem box plots represents the phenotypic data collected on the same date. By the end of the growing season, stem biomass showed larger values and greater variability than leaf biomass. Both experiments displayed a consistent increase in biomass accumulation, with differences between years. Notably, the leaf fraction showed a lower biomass at the end of the season in 2022 than at the same sampling point in 2021.

Figure 2

Box plots illustrating dry biomass weight for stems and leaves across the years 2021 and 2022. Panels A and B display stem biomass, while panels C and D show leaf biomass. Each panel includes sampling dates from June to November, with means and standard deviations indicated. Stems in 2021 and 2022 show increasing biomass over time, with greater values in 2022. Leaves follow a similar trend, but with lower biomass compared to stems. Outliers are depicted as individual points above and below each box plot.

Figure 2. Distribution of dry biomass for stems (blue) and leaves (red) from field trials in Boone, IA, in 2021 and 2022. Panels correspond to 11 scheduled sampling points at 22, 36, 43, 50, 57, 64, 71, 78, 85, 110, and 145 days after planting; calendar dates occasionally differed by 1–2 days. Boxes show the interquartile range (IQR) with the median line; whiskers extend to 1.5×IQR; circles denote observations outside the whiskers. Each observation represents one plant per organ at that date. To aid interpretation, the last three sampling points in each panel are annotated with the sample mean and standard deviation. Figure 2 (A) contains stems records in 2021, (B) contains stems records in 2022, (C) contains leaves records in 2021, (D) contains leaves records in 2022.

3.2 Environmental characterization

In 2021, conditions for planting and crop establishment were excellent. In 2022, although temperatures were initially higher than in 2021, there was a very heavy rain after planting, which had a negative impact on seed germination. Figure 3 summarizes weather conditions during the 2021 and 2022 growing seasons. Notably, mean temperatures in October 2022 were significantly lower than those during the same period in 2021. This temperature anomaly aligns with observed reductions in leaves dry biomass weights near mid-October in 2022.

Figure 3

Three-panel line graphs depict mean temperature, humidity, and radiation from June to November 2021 and 2022. Panel A shows mean temperatures with 2021 in solid red and 2022 in dashed red lines. Panel B shows humidity trends with 2021 in solid blue and 2022 in dashed blue lines. Panel C presents solar radiation with 2021 in solid orange and 2022 in dashed orange lines. Planting and harvest dates are marked with vertical lines.

Figure 3. Weather data for 2021 and 2022: The weather data were retrieved from the Iowa Environmental Mesonet (Herzmann and Wolt, 2020) and were collected by the automated weather station near the trial field at the Iowa State University Agricultural Engineering and Agronomy farm, in Boone, IA. The planting and harvest dates in 2021 (2022) were 5/27 (5/30) and 10/17 (10/26), respectively. (A) Temperature, (B) humidity, (C) radiation.

3.3 Training and test results

After masking missing values, the refined dataset consisted of 265 genotypes, with each genotype having two replicates per year at varying stand counts (plants per square meter). This resulted in 530 series of leaves and stems dry biomass weight measurements annually. We conducted two train-test experiments: (1) training on 2021 data and testing on 2022 data, and (2) training on 2022 data and testing on 2021 data. Figures 4 and 5 summarize two sample results for the same genotype after applying the data-driven crop model to the training and test datasets.

Figure 4

Plot (A–D) evaluating sorghum biomass predictions with 2021 training and 2022 testing for genotype ID 156510. X-axis: Days Since Planting. Y-axis: Leaf/Stem Dry Biomass (g). Legends show observed leaf and stem (scatter) and predicted lines: solid stem, dashed leaf. (A) Train 21 b1; Avg. RRMSE 17.43%; Stand Count 14.84. (B) Train 21 b2; Avg. RRMSE 32.08%; Stand Count 19.57. (C) Test 22 b1; Avg. RRMSE 19.17%; Stand Count 8.17. (D) Test 22 b2; Avg. RRMSE 22.62%; Stand Count 9.03. Predicted stem increases to ~125 g and leaf to ~35 g by ~140 days. “b1/b2” denote biological replicates.

Figure 4. Sample result 1 (Training with 2021 Data): This figure shows the model’s performance when trained on 2021 records for genotype ID 156510 and tested on the same genotype’s 2022 data. Scatter points represent observed leaf and stem dry biomass weights across the growing season, while solid and dashed lines indicate the predicted stem and leaf biomass weights, respectively. Labels “b1” and “b2” denote biological replicate numbers in the field trial. The stand counts reflected in the four observed data series are also included in each small title. The upper (A, B) subplots summarize training results for block 1 and 2 in 2021, and the lower (C, D) subplots illustrate test performance for block 1 and 2 in 2022.

Figure 5

Plot (A–D) evaluating sorghum biomass predictions with 2022 training and 2021 testing for genotype ID 156510. X-axis: Days Since Planting. Y-axis: Leaf/Stem Dry Biomass (g). Legends show observed leaf and stem (scatter) and predicted lines: solid stem, dashed leaf. (A) Train 22 b1; Avg. RRMSE 19.17%; Stand Count 8.17. (B) Train 22 b2; Avg. RRMSE 22.62%; Stand Count 9.03. (C) Test 21 b1; Avg. RRMSE 17.43%; Stand Count 14.84. (D) Test 21 b2; Avg. RRMSE 32.08%; Stand Count 19.57. Curves rise from zero to 120 g stem and 35 g leaf by approximately 140 days. “b1/b2” denote biological replicates.

Figure 5. Sample result 2 (Training with 2022 Data): This figure shows the model’s performance when trained on 2022 records for genotype ID 156510 and tested on the same genotype’s 2021 data. Scatter points represent observed leaf and stem dry biomass weights across the growing season, while solid and dashed lines indicate the predicted stem and leaf biomass weights, respectively. Labels “b1” and “b2” denote biological replicate numbers in the field trial. The stand counts reflected in the four observed data series are also included in each small title. The upper (A, B) subplots summarize training results for block 1 and 2 in 2022, and the lower (C, D) subplots illustrate test performance for block 1 and 2 in 2021.

In Figures 4 and 5, scatter points represent observed leaves and stems dry biomass weights across the growing season, while solid and dashed lines denote predicted stems and leaves dry biomass, respectively. Labels “b1” and “b2” indicate the replication number in the randomized complete block design used in the field trial, and stand counts vary across the four observed data series. Training results (upper subplots) generally exhibit lower Relative Root Mean Square Errors (RRMSEs) compared to test results (lower subplots), a common outcome as models are optimized for training data. The results also indicate that the data-driven crop model can provide an accurate prediction of sorghum dry biomass production with unseen weather data.

Training RRMSEs were similar across experiments (approximately 20%), whereas test RRMSEs were significantly higher (Table 1), suggesting potential overfitting. We also conducted an additional experiment training the model on combined 2021 and 2022 data; results are presented in the final row of Table 1. To further analyze parameter behavior, Supplementary Figure S1 illustrates distinct probability density curves for 56 parameters under three training scenarios: (1) 2021 dataset, (2) 2022 dataset, and (3) combined dataset.

Table 1

Table 1. Training and test performance for different training sets.

3.4 Changing stand counts

In this subsection, we conducted a series of simulations to identify the optimal stand count for maximizing biomass production. The stand counts in the training data have a mean of 14.56 pl/m² with a standard deviation of 3.83 pl/m². The simulations were conducted with genotypic parameters calibrated using data from both 2021 and 2022, assuming the 2021 weather conditions and soil moisture levels. Figure 6 compares simulated biomass yields (red line) against observed 2021 and 2022 field data (blue dots). The highest simulated biomass yield occurred at approximately 25 pl/m², with dry biomass production reaching 3.2 kg/m². The sudden drop in shoot biomass around 30 pl/m² is likely due to environmental conditions not represented in the training data. While a comprehensive optimal density analysis will require further field validation to confirm these outputs, the present density tests still yield valuable insights and underscore the model’s potential for prescriptive analysis despite limited training data.

Figure 6

Scatter plot showing shoot dry biomass weight in kilograms per square meter versus stand count in plants per square meter. Blue dots represent 2021 and 2022 records, and a red line indicates predicted dry biomass production. Biomass weight increases with stand count initially, reaching a peak before declining after 25 plants per square meter.

Figure 6. Dry biomass under different stand counts: The blue dots represent the total shoot dry biomass (in kilograms per square meter), calculated from the observed final shoot dry biomass per plant and the stand counts. The red line indicates the simulated shoot dry biomass under varying densities but the same growing environment in 2021.
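
The conversion used for the observed points in Figure 6, from per-plant biomass to biomass per unit area, is simple arithmetic. The sketch below shows it; the function name is illustrative, and the per-plant value of 128 g is not a reported measurement but the value implied by the simulated peak of 3.2 kg/m² at 25 pl/m².

```python
def shoot_biomass_per_area(per_plant_g: float, stand_count: float) -> float:
    """Convert shoot dry biomass per plant (g) to kg per square meter."""
    return per_plant_g * stand_count / 1000.0


# Simulated peak: 3.2 kg/m² at 25 pl/m² implies 3200 / 25 = 128 g per plant.
print(shoot_biomass_per_area(128.0, 25.0))  # 3.2
```

This also makes the density trade-off explicit: area-based yield keeps rising with stand count only as long as per-plant biomass falls more slowly than density increases.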

3.5 Changing planting and harvest dates

We conducted a series of tests to evaluate whether the original planting and harvest dates were optimal under 2022 weather conditions, using parameters calibrated with data from both years. The original planting and harvest dates for the 2022 trial were May 30 and October 26, respectively. As shown in Figure 7, these dates were suboptimal. Shifting the planting date 1–2 days earlier and the harvest date 8–9 days earlier would maximize yield. The simulated peak shoot dry biomass is about 9% higher than the original value, with most of the increase contributed by the leaves. This adjustment aligns with the weather patterns illustrated in Figure 3, where earlier harvesting would have avoided the severe cold stress observed in late October. Cold stress during this period can accelerate leaf senescence, leading to significant dry biomass loss.

Figure 7

Heatmap showing predicted dry biomass weight per plant, varying by planting and harvest date deviations from May 30, 2022, and October 26, 2022, respectively. Green suggests higher biomass, red indicates lower. A black square highlights a specific box that shows the observed dry biomass weight without dates change. Biomass weight ranges from 155 to 190 grams per plant.

Figure 7. Yield under varying planting and harvesting dates. Values on the horizontal and vertical axes indicate the number of days of deviation from the actual planting date (May 30) and harvesting date (October 26) in 2022.
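The date-shift experiment amounts to a grid search over planting and harvest offsets. The sketch below shows one way to structure such a search. Only the two baseline dates come from the text; `toy_simulate`, its cold-stress cutoff, and its penalty weight are hypothetical placeholders for the calibrated model and do not reproduce the values in Figure 7.

```python
from datetime import date, timedelta
from itertools import product

PLANTING = date(2022, 5, 30)   # original planting date (from the text)
HARVEST = date(2022, 10, 26)   # original harvest date (from the text)


def grid_search(simulate, planting_offsets, harvest_offsets):
    """Evaluate every offset pair and return (dp, dh, score) of the best one."""
    best = None
    for dp, dh in product(planting_offsets, harvest_offsets):
        planting = PLANTING + timedelta(days=dp)
        harvest = HARVEST + timedelta(days=dh)
        score = simulate(planting, harvest)
        if best is None or score > best[2]:
            best = (dp, dh, score)
    return best


def toy_simulate(planting, harvest):
    """Hypothetical stand-in for the crop model (assumption, not the paper's).

    Rewards a longer season and penalizes each day harvested after an
    invented cold-stress cutoff date.
    """
    season_days = (harvest - planting).days
    cold_days = max(0, (harvest - date(2022, 10, 20)).days)
    return 1.0 * season_days - 3.0 * cold_days


if __name__ == "__main__":
    offsets = range(-10, 11)  # days of deviation from the original dates
    dp, dh, score = grid_search(toy_simulate, offsets, offsets)
    print(f"planting offset {dp:+d} d, harvest offset {dh:+d} d, score {score}")
```

An exhaustive grid is practical here because each axis spans only a few dozen candidate days; a real analysis would replace `toy_simulate` with a run of the calibrated model under the recorded weather.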

4 Discussion

Our data-driven crop model for biomass sorghum demonstrated robust predictive performance, achieving an average Relative Root Mean Square Error (RRMSE) of approximately 20% across experiments trained on the 2021 and 2022 datasets. The model’s training performance is comparable to contemporary crop biomass prediction frameworks (Roy Choudhury et al., 2021; Servia et al., 2022), but it underperforms relative to yield prediction models in agricultural applications (Jégo et al., 2012; Xu et al., 2020; Roy Choudhury et al., 2021; Khaki et al., 2021; Dhillon et al., 2023; Chang et al., 2023). Moreover, elevated RRMSE values in the test results suggest potential overfitting, likely attributable to the limited data available for each genotype. Nevertheless, the model’s accurate prediction of leaf dry biomass trends after day 120 in both years demonstrates its capacity to distinguish genotypic and environmental influences. By isolating the impacts of genotype, environment, and management, the model offers actionable insights for both descriptive analysis and prescriptive agricultural optimization.
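For reference, RRMSE is commonly computed as the root mean square error divided by the mean of the observations, expressed as a percentage. The sketch below assumes that common definition, which is not restated in this section; the function name and the sample values are illustrative only and are not taken from the study's data.

```python
import math


def rrmse(observed, predicted):
    """Relative RMSE in percent: 100 * RMSE / mean(observed).

    Assumes the common definition; the paper's exact formula may differ.
    """
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    mean_observed = sum(observed) / len(observed)
    return 100.0 * math.sqrt(mse) / mean_observed


if __name__ == "__main__":
    # Made-up per-plant biomass values (g), for illustration only.
    obs = [100.0, 200.0, 300.0]
    pred = [110.0, 190.0, 310.0]
    print(f"RRMSE = {rrmse(obs, pred):.1f}%")
```

Normalizing by the observed mean makes errors comparable across traits and datasets with different magnitudes, which is why the comparison with other biomass and yield models is stated in relative terms.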

Yields at various stand counts can provide critical insights for farmers seeking to maximize profits. Our results indicate that higher stand counts do not ensure increased biomass production, a finding consistent with prior studies (Turgut et al., 2005; Snider et al., 2012; Adams et al., 2015; May et al., 2015; Mahmood et al., 2015; Xuan et al., 2015; Tang et al., 2018). While the literature suggests that biomass production for sorghum is typically maximized at 10–20 pl/m², our simulated optimum exceeds this range (Snider et al., 2012; Adams et al., 2015; May et al., 2015; Xuan et al., 2015; Tang et al., 2018). This discrepancy may be attributed to idealized assumptions in our model, such as those regarding soil moisture and nutrient availability, which could elevate the optimal stand count. Due to current data limitations, our model does not incorporate seed or labor costs during the sorghum growing process. However, we emphasize that the model’s flexible framework allows for seamless integration of these variables once additional data become available, enabling future analyses with alternative objective functions (e.g., cost-benefit optimization).

The results of the planting and harvest date adjustment tests suggest that harvesting 8 days earlier would improve yield. These tests illustrate the potential value of data-driven crop models for prescriptive analysis, which would not be possible without their ability to separate the genotypic and environmental effects on crop yield. Separating these influences is a crucial feature that enables the data-driven crop model to provide useful recommendations and insights for optimizing crop planting practices. Note that the current model has limited capability to capture certain real-world risks associated with earlier or later planting and harvesting. These include poor emergence due to cold soil temperatures, insect damage linked to delayed planting, and frost risks resulting from late harvesting.

By parameterizing the genotypic properties, our model circumvents calibration challenges inherent to conventional process-based approaches. The proposed data-driven crop model can fundamentally distinguish genotypic and environmental effects on crop yield, which can unlock valuable prescriptive potential. In addition to producing explainable and insightful results, the parameters from our model are transferable to other environments, whereas the genotype parameters for other process-based crop models may need to be recalibrated when the same varieties are grown in different environments (Adnan et al., 2019; Chang et al., 2023; Shawon et al., 2024; Wallach et al., 2025). Such advantages could empower farmers to optimize planting schedules using weather forecasts, reducing reliance on costly field trials for parameter recalibration. Simulations that combine weather forecasts with our model could help farmers choose sowing dates that favor seed germination, promoting even crop emergence and biomass accumulation and allowing the crop to take advantage of a favorable growing season. If conditions appear unfavorable, the model can recommend delaying planting or scheduling a second sowing to maximize yield. Likewise, toward the end of the growing period, integrating our model with real-time forecasts could alert farmers to the risk of a killing frost, enabling timely harvest and preventing the potential biomass and sugar losses that can follow a sudden cold snap. Our data-driven model has modular flexibility, allowing adaptation to data availability without requiring imputation or assumptions for missing inputs. This adaptability streamlines model development for diverse datasets.

The proposed data-driven model has limitations. The modular structure for one crop species is not easily transferable to another, as each crop has unique physiological properties that require a carefully redesigned framework suited to that species’ biology and growth processes. In addition, performance depends heavily on the quality and quantity of input data. Furthermore, some key management practices, such as irrigation, fertilization, and tillage methods, are absent from the current version of the data-driven crop model.

The results of applying this data-driven model in biomass sorghum could lead to additional data-integration strategies. First, results from our model may provide insightful information that can be readily adapted to other sorghum types as well. Second, UAV and remote sensing data could be incorporated into the model to provide a more comprehensive framework for crop growth. Third, other phenotypic data such as leaf temperature and root depth can be integrated within the data-driven crop model to achieve more reliable simulation and yield prediction results. Furthermore, the data-driven modeling framework could be applied to more crop species and even more complex systems.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

YC: Writing – review & editing, Writing – original draft. ZN: Writing – review & editing. JP: Writing – review & editing. JK: Writing – review & editing. MS-F: Writing – review & editing. LW: Writing – review & editing.

Funding

The author(s) declare financial support was received for the research and/or publication of this article. This work was partially supported by NSF and USDA (#1830478 and #2021-67021-35329) and the Plant Sciences Institute at Iowa State University. MS-F was supported by the United States Department of Agriculture, National Institute of Food and Agriculture (grant number IOW05768).

Acknowledgments

The authors are grateful to the Editor and Reviewers for their feedback that helped improve the quality of the manuscript. We also thank Dr. Phil Alderman for helpful discussions about data and computational experiments.

Conflict of interest

LW is a co-founder of Crop Convergence LLC.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2025.1617775/full#supplementary-material

References

Adams, C. B., Erickson, J. E., Campbell, D. N., Singh, M. P., and Rebolledo, J. P. (2015). Effects of row spacing and population density on yield of sweet sorghum: Applications for harvesting as billets. Agron. J. 107, 1831–1836. doi: 10.2134/agronj14.0295