Inside out: transforming images of lab-grown plants for machine learning applications in agriculture

Introduction Machine learning tasks often require a significant amount of training data for the resultant network to perform suitably for a given problem in any domain. In agriculture, dataset sizes are further limited by phenotypical differences between two plants of the same genotype, often as a result of different growing conditions. Synthetically-augmented datasets have shown promise in improving existing models when real data is not available. Methods In this paper, we employ a contrastive unpaired translation (CUT) generative adversarial network (GAN) and simple image processing techniques to translate indoor plant images to appear as field images. While we train our network to translate an image containing only a single plant, we show that our method is easily extendable to produce multiple-plant field images. Results Furthermore, we use our synthetic multi-plant images to train several YoloV5 nano object detection models to perform the task of plant detection and measure the accuracy of the model on real field data images. Discussion The inclusion of training data generated by the CUT-GAN leads to better plant detection performance compared to a network trained solely on real data.


. Introduction
Machine learning (ML) tasks are often limited by the availability and quality of training data for a given model (LeCun et al., 2015;Goodfellow et al., 2016). To enable ML-based applications in agriculture-such as the automatic detection and classification of plants or crop health monitoring, say-ultimately requires large quantities of labeled image data with which to train deep neural networks (DNNs) (Lobet, 2017;Liakos et al., 2018;Wäldchen et al., 2018;Lu et al., 2022). It is the present lack of such data, and the challenge of generating it, that may ultimately limit the broad application of such techniques across the immense variety of crop plants. Lobet (2017), for example, emphasize that this process "is hampered by the difficulty of finding good-quality ground-truth datasets." Similar sentiments are echoed in general reviews (Liakos et al., 2018;Wäldchen et al., 2018;Lu et al., 2022), as well as in publications on specific applications such as weed detection (Binch and Fox, 2017;Bah et al., 2018;Bosilj et al., 2018) and high-throughput phenotyping (Fahlgren et al., 2015;Singh et al., 2016;Gehan and Kellogg, 2017;Shakoor et al., 2017;Tardieu et al., 2017;Giuffrida et al., 2018). The difficulty of generating or collecting plant-based image data for agricultural applications is further exacerbated by the many differences in growing .
conditions and physical dissimilarities between any two plants, even for those belonging to the same genotype. Covering a wide variety of phenotypes with a sufficient volume of labeled training data is a task of massive scope. This challenge is further impeded by the requirement of expert knowledge that is often necessary to accurately label plant data (for example, when it comes to the distinction between oats and wild oats) (Beck et al., 2021). Image transformation and synthesis through the use of generative adversarial networks (GANs) is gaining interest in agriculture as a means to expedite the development of large-scale, balanced and ground-truthed datasets (Lu et al., 2022). GANs were originally used for creating synthetic MNIST digits, human faces, and other image types (Goodfellow et al., 2014) and have proven successful for a wide variety of quite remarkable image translation problems, including transforming a horse to/from a zebra, a dog to/from a cat, and a summer scene to/from a winter scene (Isola et al., 2016;Zhu et al., 2017;Park et al., 2020). In the agricultural domain, GANs have been applied to areas such as plant localization, health, weed control, and phenotyping (Lu et al., 2022). Some specific examples that demonstrate the promise of GANs in agriculture include the improvement of plant image segmentation masks (Barth et al., 2018), disease detection in leaves (Zeng et al., 2020;Cap et al., 2022), leaf counting (Giuffrida et al., 2017;Zhu et al., 2018;Kuznichov et al., 2019), modeling of seedlings (Madsen et al., 2019), and rating plant vigor .
The contribution of this work is not an automated task or application to solve a particular agricultural problem, but rather the development of a method to translate real indoor images of plants to appear in field settings. As such, this work serves as an important precursor to the development of such applications. This approach can in principle be used to create bespoke, labeled data sets to support the training needs of a wide variety of ML tasks in agriculture. For example, appropriately constructed collages of two or more species on a soil background could ultimately be used to synthesize very large numbers of images of crop plants interspersed with weeds, resulting in labeled, ground-truthed datasets suitable for developing ML models for automated weed detection. For this initial study, however, testing is limited to simpler demonstrations of object detection.
The impetus for this work comes from our previous development of an embedded system for the automated generation of labeled plant images taken indoors (Beck et al., 2020). Here, a camera mounted a computer controlled gantry system is used to take photographs of plants against blue keying fabric from multiple positions and angles. Because the camera and plant positions are always known, single-plant images can be automatically cropped and labeled. In addition to this, we have collected outdoor images of plants and soil (Beck et al., 2021), which at present must be cropped and annotated by hand. Belonging to both datasets are four crop species: canola, oat, soybean, and wheat. The presence of these plants in both datasets provides the opportunity for outdoor image synthesis through image-to-image translation via GANs (Isola et al., 2016). Our goal, then, is to create fully labeled training datasets that are visually consistent with real field data. This procedure eliminates the need for manual labeling of outdoor grown plants (which is time-consuming and prone to error) while being scalable and adaptable to new environments (e.g., different soil backgrounds, plant varieties, or weather conditions). The creation of one's own datasets could improve the accuracy of plant detection and other models in a real field setting. This paper is structured as follows. Section 2 provides an overview of the GAN architecture used in our image translation experiments and describes the construction process for the GAN training datasets. Section 3 presents visual results for several single-plant translation experiments and discusses the benefits and limitations of each training dataset. Section 4 describes our method for producing augmented outdoor multi-plant images with automated labeling. Section 5 presents plant detection results achieved using a YoloV5 nano model trained on our augmented datasets. Section 6 concludes the paper and discusses potential extensions to the image synthesis methods.

. Materials and methods
Many GAN architectures require the availability of paired data for training. In our case, an image pair would consist of a plant placed in front of a blue screen and an identical plant, in the same location of the image, placed in soil. Such image pairs are difficult to obtain in large volumes and instead we focus on GAN architectures that can train on unpaired data (Zhu et al., 2017), such as the examples shown in Figure 1, which are taken from our indoor and outdoor plant datasets.
Some GANs, such as CycleGAN, consist of two-sided networks that not only translate an image from one domain to another but perform the reverse translation as well (Zhu et al., 2017). For our specific problem, we are interested in single-directional translation and only consider one-sided networks to reduce training duration and model sizes.
As a result, the approach taken here follows that of contrastive unpaired translation (CUT) as presented in Park et al. (2020). For this, one considers two image domains X and Y (with samples ⊆ Y) and seeks to find a function that takes a sample from the domain X and outputs an image that can plausibly come from the distribution Y.
Generative adversarial networks typically consist of two separate networks that are trained simultaneously. The generator G learns the mapping G : X → Y and the discriminator D is trained to differentiate between the real images y of domain Y and the fake images G(x) =ŷ produced by the generator. Note that we refer to the real and fake images of domain Y as y andŷ, respectively. The discriminator returns a probability in [0.0, 1.0] that the input image came from the distribution Y. Effectively, the generator is trained to produce images that fool the discriminator by minimizing the adversarial loss (Goodfellow et al., 2014;Isola et al., 2016;Giuffrida et al., 2017;Zhu et al., 2017Zhu et al., , 2018Park et al., 2020) (1) A visual overview of a GAN structure is given in Figure 2. The generator is composed of two networks that are applied sequentially to an image. The first half, the encoder G enc , receives the input and constructs a feature stack, primarily through downsampling operations. The second half, the decoder G dec , takes a feature stack and constructs a new image, through up-sampling operations. Here, we have that G(x) = G dec (G enc (x)) =ŷ (Park et al., 2020).
Contrastive unpaired translation is a GAN architecture that enables one-sided image-to-image translation (Park et al., 2020). In addition to the adversarial loss, which is dependent on the networks G and D, CUT provides the PatchNCE loss and feature network H. The PatchNCE loss is used both to retain mutual information between the input image x and outputŷ as well as to enforce the identity translation G(y) = y (Park et al., 2020). The feature network is defined as the first half of the generator, the encoder, plus a multi-layer patchwise (MLP) network with two layers and is used to encode the input and output images into feature tensors. Patches from the output imageŷ are sampled, passed to the feature network, and compared to the corresponding (positive) patch from the input as well as N other (negative) patches from the input image. The process is shown in Figure 3 (Park et al., 2020).
Since G enc is used to translate a given image, its feature stack is readily available, with each layer and spatial position corresponding to a patch in the input image. We select L layers from the feature map and pass each layer through the patchwise network to produce features {z l } = {H l (G l enc (x))} L where G l enc is the output of the l-th layer. The feature z s l then represents the s-th spatial location of the l-th layer and z S\s l represents all other locations. The PatchNCE loss is given by Park et al. (2020) as where the sums are taken over all desired layers l and spatial locations s within each layer, where N is the number of negative samples and the temperature τ = 0.07 scales the magnitude of penalties on the negative samples (Park et al., 2020). The overall loss used for network training is

FIGURE
Visual representation of an image translating GAN. The discriminator learns to di erentiate between the real images from domain Y and the translated images from the generator output. The discriminator and generator weights are updated through backpropagation of the gradients from the discriminator output according to the first and second terms in the adversarial loss, respectively. The image is adapted from the Google Developers Website ( ) (https://developers.google.com/machine-learning/gan/gan_structure).

FIGURE
A visual demonstration of the features extracted from an input image x and output imageŷ (Park et al., ). Corresponding patches are sampled from both images as well as N other patches from the input image. These patches are used to calculate the PatchNCE loss in Equation ( ) (Park et al., ). In the example presented here, high similarity between positive patches is desired to retain the shape of the animal head, while allowing re-coloring of the fur. Conversely, one should not expect to retain similarities between the head and other parts of the body. The image is taken from Park et al. ( ).
where the weighing factors λ X = λ Y = 1 when using the default CUT options (Park et al., 2020). The first PatchNCE loss term is used to retain mutual information between the input image x and outputŷ. The second is used as an identity loss term, where translating the input image y should produce y as the output. The CUT training scripts provide several network definitions for the generator, discriminator, and feature network. For the work in this paper, we use the default options where the generator is a ResNet-based architecture that consists of 9 ResNet blocks between up and down-sampling layers and the discriminator is a 70 × 70 PatchGan network that originates from the work of Isola et al. (2016).
The remainder of this section provides an overview of the datasets used to train several CUT generators.

. . Target domain-Outdoor image dataset
Generator training requires a dataset of outdoor images to represent the target domain Y for image translation. Our outdoor dataset was constructed by sampling from the 540,000 available images in the outdoor image database (Beck et al., 2021). The images are individual frames from videos taken with a camera mounted to a tractor while traveling through a field. The fullresolution images (2208x1242 px) often contain several plants with unknown locations and must be cropped by hand to obtain singleplant photos suitable for image translation. Manual cropping is a time-consuming process and, in general, limits our overall training dataset size. For initial experiments, 64 hand-cropped single-plant field images were used to construct the target dataset. Largerscale experiments that include image translation of several species were also performed with 512 single-plant field images. Table 1 summarizes the relevant parameters for all training datasets. Example multi-and cropped single-plant outdoor images are shown in Figure 4.
The outdoor plant datasets used for model training contain several single-plant images of canola, oat, soybean, and wheat. The cropping bounds are varied to provide data with plants of several sizes relative to the image, this helps to prevent the generator from increasing or decreasing the size of the plant in the image during translation. Similarly, plant locations (e.g., center, top-left, bottomright, etc.) are varied across the single-plant images to minimize positional drift. Additional images sampled from this dataset can be seen in Figures

. . Input domain-Cropped lab image dataset
The first dataset to be used as the input domain X to the generator consists of several indoor single-plant images with a blue screen background. These images are provided by the EAGL-I system, as described in Beck et al. (2020). EAGL-I employs a GoPro Hero 7 camera mounted on a movable gantry capable of image capture from positions that vary in all three spatial dimensions. Additionally, the gantry includes a pan-tilt system to provide different imaging angles. For the purpose of this work, we attempt only to translate top-down images, thus we include images only where the camera is perpendicular to the floor, within a range of ±10 • .
Similar to the previous section, the full-resolution images (4000 × 3000 px) contain several plants and must be cropped to obtain single-plant photos for training. However, in this case, both the . /frai. . Color-corrected composites 1 64 64 Soybean 10-40 32 0.50 0.85 Figure 10 Color-corrected composites 2 512 512 All Varies 128 0.50 0.85 Figure 11 For the Cropped Lab dataset, background replacement is not applicable since composite images are not created. Furthermore, the plant always makes up the entire image, so the scale is always 1.00. The Color-Corrected Composites 2 dataset contains each of the four available species; canola, oat, soybean, and wheat. These plants have minimum-maximum age ranges of 10-40, 0-365, 10-40, and 0-365 days, respectively. plant and camera positions within the setting are known, and loose bounding boxes are found through geometric calculation. Tighter bounding boxes are obtained through an algorithm given in Appendix 1 (Supplementary material), avoiding the need for hand-cropping. Figure 5 shows an example of both a multi-and single-plant lab image. To construct our dataset for generator input, we use 64 singleplant lab images as shown in Figure 5B. It is expected that a translated image would not only contain a plant with characteristics that reflect the outdoor plant domain, but also contain soil that has been generated in place of the blue screen background.

. . Input domain-Composite image dataset
As an alternative to the single-plant indoor photos of Section 2.2, image processing techniques can be used to remove the blue screen in a single-plant image and replace it with a real image of soil. These images, henceforth known as composite images, ideally require minimal translation to the image background and instead allow the generator to primarily translate plant characteristics. Such a network would be beneficial as it provides the user with the ability to influence the appearance of the background, even after passing through the generator.
Soil backgrounds are randomly sampled from a set of images to provide sufficient variation in the training data. These backgrounds are hand-cropped from real outdoor data, so the soil is visually consistent with our target domain. Typically, 32 soil backgrounds were used for datasets composed of 64 composites. See Table 1 for the exact number of backgrounds used to construct each dataset.
Given an indoor multi-plant image and soil background, the steps for composite formation are listed below. This process is fully-automated through scripts written in Python that utilize the OpenCV and NumPy libraries for image processing. Example images that depict each step are shown in Figure 6.  px. Random padding gives the effect of random placement of the plant within the composite. 5. Combine the resized plant image with a randomly selected soil background.
In many cases, differences in lighting between indoor and outdoor images leave little contrast between the plant and soil background in a composite image. In preliminary generator training experiments, insufficient contrast could lead to the generator translating plant leaves to appear as soil and constructing a fully-synthetic plant in a different region of the image. Such an occurrence is detrimental for creating synthetic object detection data as one loses the ability to accurately provide a bounding box for the plant. In the datasets described below, color-correction is used during composite formation to match the plant color with real field plants and increase the contrast between the plant and soil to allow the generator to maintain semantic information after domain translation. Color-correction techniques such as histogram matching were explored but were not used in favor of simple mean matching since one would still expect the generator to adjust plant color. The chosen color-correction process is described as follows.
Given an (M × N × 3) CIELAB single-plant image x and the associated (M × N) binary mask m, we want to correct the pixels belonging to the plant to have color that is consistent with the real outdoor plants in the domain Y and leave all other pixels unchanged. To do this, we first determine the average value for the k-th channel of x, only where the mask is true, given bȳ Note that Equation (5) is simply the weighted average for all pixels of a given channel. The weight of each pixel is given by the value of the corresponding pixel in the binary mask, which have values 0 and 1 for false and true, respectively. We now construct our color-corrected image x ′ , where the individual pixels have values andx ′ k is the desired average value for the k-th channel of x ′ . We found that choosing the CIELAB channels to have average valuesx ′ L = 170,x ′ A = 100, andx ′ B = 160 provides a suitable plant color. Note that in the definition of x ′ ijk , one can assign pixel values that exceed the valid range [0,255]. In this case, one must also constrain x ′ ijk to the same range. In general, color-correction leads to a composite image where the plant has enhanced lightness, greenness, and yellowness. An example of a composite image with and without color-correction is shown in Figure 7.
For small-scale translation experiments, our input domain consists of 64 (optionally color-corrected) composite images. Larger-scale experiments use datasets composed of 512 composites. The method and parameters used to construct each dataset are summarized in Table 1.

. Single-plant image translation results
In this section, we provide single-plant translation results for several GANs trained on the datasets described in Section 2. Training hyperparameters are consistent for all models, where we train for a total of 400 epochs and use a dynamic learning rate that begins to decay linearly to zero after the 200th epoch.

. . Cropped lab dataset
Training a generator on the Cropped Lab dataset allows us to directly translate indoor plants to outdoor, without the intermediate step of creating composite images. This dataset contains 64 indoor and outdoor images of soybean (128 total), aged 10-40 days. Twenty additional indoor photos of soybean from the same age range unseen during the training process compose   the distribution V for qualitative evaluation of the model. Figure 8 shows translation results for this generator. Referring to both the training and evaluation images, the generator appears to be capable of image translation between these two domains with little error. Details such as leaf color, leaf destruction, and soil are suitably added in each translated image. Leaf veins, which are greatly pronounced in some indoor images, are removed in the translated images, reflective of the images from the outdoor domain. Despite the ability to translate indoor plants to appear as true outdoor plants, the greatest drawback to a generator trained on these images is the inability to control the image background in the translated images. As discussed in Section 4, this is problematic for the construction of images with multiple plants, where one wants consistency in the background throughout the image.

. . Composites dataset
Training a generator on the Composites dataset allows us to translate uncorrected composite images to outdoor-appearing plants. Here, we expect that the background in the output image remains consistent with the input. This dataset contains 64 composite and outdoor images of soybean (128 total), aged 10-40 days. Twenty additional composite photos of soybean from the same age range unseen during the training process compose the distribution V for qualitative evaluation of the model. Figure 9 shows translation results for this generator.
Referring to both the training and evaluation images, the generator is capable of image translation between these two domains. Similar to the previous section, plant details such as leaf color, destruction, and veins are suitably added/removed in Frontiers in Artificial Intelligence frontiersin.org . /frai. . the translated images. In general, the image backgrounds are consistent before and after translation. Small changes, such as the addition/removal of straw or pebbles in the background, are seen in the translated images. However, these details are sufficiently small so as not to have any negative effect during the construction of a synthetic multi-plant image.

. . Color-corrected composite datasets
Training a generator on the Color-Corrected Composite datasets allows us to translate color-corrected composite images to outdoorappearing plants. Similar to the previous section, we expect that the background in the output image remains consistent with the input. The first generator in this section is trained on a dataset that contains 64 color-corrected composite and outdoor images of soybean (128 total), aged 10-40 days. Twenty additional color-corrected composite photos of soybean from the same age range unseen during the training process compose the distribution V for qualitative evaluation of the model. Figure 10 shows translation results for this generator. Additional translation results for generators trained on similar datasets of different species are shown in Appendix 2 (Supplementary material).
Referring to both the training and evaluation images, the generator is capable of image translation between these two domains. Similar to the previous section, plant details are suitably added or removed in the translated images, and the image background is consistent before and after translation. Additionally, little-to-no positional drift is seen between plants in the input and output images, indicating a model such as this could be used for multiple-plant image synthesis.
We now turn our focus to translating images of additional plant species. The Color-Corrected Composites 2 training dataset contains 128 color-corrected composite and 128 field images of each of the four available species: canola, oat, soybean, and wheat (1,024 images total). Twenty additional color-corrected composites for each species compose the distribution V for qualitative evaluation of the model. Image translation results are given in Figure 11.
From the above translation results, it can be seen that our method is easily extendable to additional species. Translated plants have no noticeable positional drift and background appearance is consistent between the input and output. A model such as the one providing image translations in Figure 11 could be used to construct synthetic multiple-plant images in a field setting with multiple plant species while retaining the original bounding boxes to enable the production of object detection data.

. Multiple-plant image translation results
The results in the previous section suggest that a generator trained to translate composite images would be suitable for multiplant image construction. Here, one expects to be able to place several synthetic plants within a real soil background to produce plausible outdoor images of multiple plants. Domain translation by the GAN should produce outdoor plant images with lightness similar to that seen typically in our outdoor data, as well as sufficient blurring of the plant relative to the real background. Using a singleplant translation generator, multi-plant images are constructed via the following algorithm. The algorithm makes use of existing full-scale soil images (see Figure 12A for an example) and indoor .
/frai. .  single-plant images with their binary masks (see Figures 6A2, B2 for examples). Algorithm 1 effectively creates several composites, translates, and places them back into the full-scale image. Instead of selecting composite backgrounds from a pre-defined set, small sections are randomly cropped from 20 larger images. Figure 12 provides an example of composite background selection from a large-scale soil image.
. /frai. . To improve generator performance when translating images with these new backgrounds, we contruct an additional training dataset with 512 new color-corrected composite images and train a new CUT-GAN. Here, the composite backgrounds are randomly selected sub-sections of the large-scale backgrounds, as opposed to randomly selecting from the set of 128 small backgrounds used in the dataset Color-Corrected Composites 2. This training dataset uses the same 512 images as Color-Corrected Composites 2 for the target domain Y. Table 2 summarizes the parameters for constructing our new dataset.
In line 11 of Algorithm 1, composite images are color-corrected following translation. This is a necessary step to help ensure a consistent background of our multi-plant images after translation. Background correction becomes especially useful if one chooses an image background that differs from our training data. If one were to train a model to detect plants in our synthetic images without correction, it is possible that the model learns to locate plants through inconsistencies in the background where the translated plant is placed. Background color-correction is used as an attempt to mitigate this risk. Additional processing can be done to improve the joining of the translated composite with the background, but is not shown in this article.
The procedure for background correction follows similarly to that of color-correction for composite images. However, rather than only correcting the pixels belonging to the plant, we correct all pixels in the translated image by adding a uniform offset to all pixels of a given channel. This offset is found by calculating the difference in average values for each RGB channel in the composite and translated images, only for pixels in a small region outside the plant bounding box. In general, for background correction, we assume that the translated plant does not exceed the bounding box. However, small patches of green outside the bounding box (see Figure 13B) lead to little difference in the result.
To achieve background correction, three inputs are required: the (M × N × 3) composite single-plant image x, (M × N × 3) translated single-plant imageŷ, and the (M × N) binary mask m. Example images are given in Figure 13.
The offset d k for the k-th channel of the corrected imageŷ ′ is calculated as Equation (7) can equivalently be considered as the difference in weighted averages for the input and output images where the weight of all pixels within the bounding boxes is zero and one elsewhere. The pixels ofŷ ′ are then given bŷ constrained to the range [0, 255]. Figure 14 shows the effect of translating a single-plant image with and without applying background correction afterwards. Figure 15 demonstrates the difference between multi-plant images with and without background correction. In general, the translation capabilities of the generator are invariant to the position of plants in a given image. As a result, there is no restriction on placement, aside from avoiding overlapping plants. Our approach provides the option to place plants randomly, or by alignment into rows, as they would be in a production field.
. /frai. . The dataset contains each of the four available species; canola, oat, soybean, and wheat. These plants are aged 10-40, 0-365, 10-40, and 0-365 days, respectively. Each composite image possesses a unique background that comes from randomly cropping small sections from 20 large soil images.  Additionally, the minimum and maximum scales of the plants relative to the background can be chosen by the user. Datasets can be made to contain plants of all species or individual plants can be selected. Examples of labeled synthetic multi-plant images are given in Figure 16. From the multi-plant images presented here, one can see that the position of each plant is maintained after translation, suggesting that such data could be useful for training a plant detection network.
. Results of plant detection using augmented datasets In general, object detection is a machine learning task that refers to the process of locating objects of interest within a particular image (Zhao et al., 2019). Models capable of object detection typically require training data consisting of a large number of images as well as the location and label of all objects within said images, usually given by bounding boxes. Such data is often produced by manual annotation, a procedure that is both time-and cost-intensive, especially as the number of objects within an image increases (Guillaumin and Ferrari, 2012;Ayalew, 2020). As a result, object detection is an excellent test case of our image transformation methods.
As proof of concept that our synthetic multi-plant images can be beneficial for detection of real plants, we trained several YoloV5 (Redmon et al., 2016;Bochkovskiy et al., 2020) nano object detection models on various augmented datasets (Jocher et al., 2022). The nano is chosen in particular as it has the fewest number of trainable parameters in comparison to all other YoloV5 networks. This is most desirable for our training datasets which contain few classes and little variability in the data itself. In all cases the network is trained to locate canola, oat, soybean, and wheat. However, we are not attempting to find a solution to a classification problem, so these four species are grouped into a single class named plant. Three primary training dataset types are described below, and each includes the random placement of non-overlapping plants onto an image background. The size and location of the single plant .
/frai. .  image embedded within the background are used as the ground truth bounding boxes for our training data. The first training dataset contains 80,000 blue screen images with color-corrected plants randomly placed throughout Figure 17A). The dataset is split into training, validation, and testing sets with proportions 80% (64,000 images), 10% (8,000 images), and 10% (8,000 images), respectively. Note that no background replacement or GAN is used, so the differences between the training data and real field data are significant. Hence, we expect the performance of this network to be poor in general. For future references this dataset is referred to as the Baseline dataset.
The Composite dataset contains 80,000 multi-plant colorcorrected composite images (see Figure 17B) with the same 80% (64,000 images), 10% (8,000 images), and 10% (8,000 images) split as above. Here we include a real soil background, but no GAN is used to individually translate each plant. Since this dataset is more similar to real data than the baseline we expect to see improved performance on an evaluation set composed of real images.
The third dataset used to train the network consists of 80,000 multi-plant GAN images (see Figure 17C) split identically as above. This training data is created through the multi-plant GAN procedure previously described in this section. Here we expect network performance to be the greatest, since the training and target datasets are most similar. For future references this dataset is named GAN.
An additional dataset, known as the Merged dataset, is composed as the union of both the Composite and GAN datasets. As such, this training dataset consists of 160,000 total images for the network with half of the images including the usage of a GAN for plant translation.
With several plant detection models trained on various datasets, our models are evaluated on 253 additional real multi-plant images of canola and soy. These images were excluded during the training process and ground truth bounding boxes are determined by hand. The evaluation metrics used are precision, recall, and mean average precision (mAP). Before defining the metrics, one must first consider the intersection over union (IoU), given by Everingham et al. (2010): where Area of Overlap refers to the area of overlap between the ground truth and predicted bounding boxes and Area of Union is the total area from joining the bounding boxes. Whether a bounding box prediction is considered to be successful is dependent on the IoU threshold, where we consider any prediction leading to an IoU greater than the threshold to be correct. Now, with IoU being used to determine if a model prediction is correct, the metrics precision and recall are defined by Everingham et al. (2010) as where n denotes the number of classes and the outer sum leads to a mean precision for all classes. Note that since we group our species into a single class we have n = 1 and the outer sum is presented here only for verbosity. The . /frai. .  notation mAP@0.5 denotes the evaluation of mAP using an IoU threshold of 0.5, consistent with the Pascal VOC evaluation metric (Everingham et al., 2010). Alternatively, mAP@0.5:0.95 denotes the average mAP value for all IoU thresholds in the range [0.50, 0.95], in steps of 0.05, which is identical to the evaluation metric for the COCO dataset challenge (Lin et al., 2014). Sampled visual results for the performance of each network are given in Figure 18, numerical results are provided by Table 3, including both the mAP@0.5 and mAP@0.5:0.95 metrics.
From both Figure 18 and Table 3 it is clear that the baseline model shows no capacity to detect plants. This is expected as the training data is vastly different from the real images through which our models are evaluated. Both the Composite-and GAN-trained models show good ability to locate plants in the evaluation images, however the GANtrained model performs better according to both the mAP@0.5 and mAP@0.5:0.95 metrics. The combination of the Composite and GAN datasets led to the greatest performing network in terms of our metrics. However, this could come as a result .
/frai. . The training dataset denotes the images used to train the network for which we are evaluating, not the set from which the evaluation images are taken. of being exposed to twice the number of training images. In general, the Yolo models trained on the Composite, GAN, and Merged datasets all appear capable of plant detection on the provided images.

. Conclusion
The contribution of this work is an image translation process through which one can produce artificial images in field settings, using images of plants taken in an indoor lab. The method is easily extendable to new plant species and settings, provided there exists sufficient real data to train the underlaying GAN. The construction of one's own augmented datasets enables further training of neural networks, with applications to in-field plant detection and classification. More importantly, this work is a first step in demonstrating that plants grown in growth chambers under precise and fully-controlled conditions can be used to easily generate large amounts of labeled data for developing machine learning models that operate in outdoor environments. This work has the potential to significantly improve and accelerate the model development process for machine learning applications in agriculture.
Future work will consist of mimicking outdoor lighting conditions via controllable led-based growth-chamber lights, leading to more variety in the input data and hopefully more realistic synthetic data. Additional work will focus on using the approach presented here to develop infield plant-classification models, as well as developing outdoor datasets and machine learning models for other problem domains (such as disease detection). This will include an investigation into whether classification tasks require their own species-specific GAN for the image translation process.

Data availability statement
The datasets presented in this study can be found in online repositories. All the data for our article is available from the online portal we have created to share our data. This portal is hosted by the Digital Research Alliance of Canada, and access can be obtained by contacting the authors.

Author contributions
AK: conceptualization, data curation, formal analysis, investigation, methodology, software, validation, visualization, writing-original draft, review, and editing. PS: data curation, formal analysis, investigation, software, validation, visualization, and writing-original draft. CH: conceptualization, formal analysis, funding acquisition, methodology, project administration, resources, supervision, writing-original draft, review, and editing. MB: data curation, formal analysis, methodology, supervision, and writing-review and editing. CB: formal analysis, funding acquisition, project administration, resources, supervision, and writing-review and editing. All authors contributed to the article and approved the submitted version. . /frai. .