High Resolution, Annual Maps of Field Boundaries for Smallholder-Dominated Croplands at National Scales

Mapping the characteristics of Africa’s smallholder-dominated croplands, including the sizes and numbers of fields, can provide critical insights into food security and a range of other socioeconomic and environmental concerns. However, accurately mapping these systems is difficult because there is 1) a spatial and temporal mismatch between satellite sensors and smallholder fields, and 2) a lack of high-quality labels needed to train and assess machine learning classifiers. We developed an approach designed to address these two problems, and used it to map Ghana’s croplands. To overcome the spatio-temporal mismatch, we converted daily, high resolution imagery into two cloud-free composites (the primary growing season and subsequent dry season) covering the 2018 agricultural year, providing a seasonal contrast that helps to improve classification accuracy. To address the problem of label availability, we created a platform that rigorously assesses and minimizes label error, and used it to iteratively train a Random Forests classifier with active learning, which identifies the most informative training sample based on prediction uncertainty. Minimizing label errors improved model F1 scores by up to 25%. Active learning increased F1 scores by an average of 9.1% between first and last training iterations, and 2.3% more than models trained with randomly selected labels. We used the resulting 3.7 m map of cropland probabilities within a segmentation algorithm to delineate crop field boundaries. Using an independent map reference sample (n = 1,207), we found that the cropland probability and field boundary maps had respective overall accuracies of 88 and 86.7%, user’s accuracies for the cropland class of 61.2 and 78.9%, and producer’s accuracies of 67.3 and 58.2%. An unbiased area estimate calculated from the map reference sample indicates that cropland covers 17.1% (15.4–18.9%) of Ghana. 
Using the most accurate validation labels to correct for biases in the segmented field boundaries map, we estimated that the average size and total number of fields in Ghana are 1.73 ha and 1,662,281, respectively. Our results demonstrate an adaptable and transferable approach for developing annual, country-scale maps of crop field boundaries, with several features that effectively mitigate the errors inherent in remote sensing of smallholder-dominated agriculture.


Code and Data Availability
The repositories containing the data, code, and manuscript source files used to produce this paper are listed below:
• activemapper: Repository containing code and derived data used to write this paper.
• imager: Repository containing code for the image processing software.
• labeller: Repository containing code for the labelling platform.
• learner: Repository containing code for the active learning component.
• segmenter: Repository containing code for the segmentation algorithm.
• mappingafrica.io: Project website linking to a web map displaying the field boundary maps.
The web map is currently undergoing redevelopment, and will be hosted at a new URL with links to download the maps. In the interim, map data are available on request.

Map reference system
The map reference system used in our mapping approach had three levels. At the coarsest level, Ghana was divided into 16 mapping zones, or areas of interest (AOIs; Figure S1A), that were used to create AOI-specific mapping models. Each AOI comprised 400 to 777 adjacent tiles. These tiles were used to create the seasonal image composites, and were defined within a 0.05 degree grid (Figure S1B), with each tile numbered to correspond to the larger 1 degree grid cell that it sits within (dotted lines in Figure S1B). Training and reference labels were created within a 0.005 degree grid nested within each tile (Figure S1C). Each tile therefore contains 100 grid cells, and there are 400 tiles per 1×1 degree cell. The smallest AOIs consist of a single 1×1 degree cell; these fall in the center of the country (AOIs 5, 8, 11, and 14). AOIs falling along Ghana's boundaries were created by combining tiles from 1×1 degree cells that straddled Ghana's border with tiles from the closest degree cells fully contained within Ghana (e.g. AOI 1), with the exception of AOI 16, which comprised tiles from three partial 1×1 degree cells along Ghana's coast.
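The nesting of these three grid levels can be verified with simple arithmetic; this Python sketch only restates the ratios given above:

```python
DEGREE_CELL = 1.0    # degrees; the coarse grid used to number tiles
TILE = 0.05          # degrees; the compositing tile size
LABEL_CELL = 0.005   # degrees; the training/reference label grid

# Tiles nest 20 x 20 within a 1 x 1 degree cell
tiles_per_degree_cell = round(DEGREE_CELL / TILE) ** 2
# Label grid cells nest 10 x 10 within a tile
cells_per_tile = round(TILE / LABEL_CELL) ** 2

print(tiles_per_degree_cell, cells_per_tile)  # 400 100
```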

Variation in lengths of temporal window used
The typical window for compositing dry season imagery was December 2018 to February 2019, but in the cloudiest regions (AOIs 10, 11, 13, 14, and 16) we started the dry season window in November, to ensure a sufficient density of images for compositing.

Assessing composite quality
The rubric presented in Table S1 was used to assess the quality of the seasonal image composites. The imagery was evaluated by examining its Raster Foundry (Azavea 2020) overlays within a labeller instance set up for this purpose.
Figure S1: The reference system used in this mapping approach, including A) numbered areas of interest (AOIs) that define the minimum mapping geography (solid black lines; dotted lines indicate boundaries of the 1 degree grid), B) the 0.05 degree tiles used for compositing imagery, and C) the 0.005 degree resolution reference grid used for collecting training data and for distributed computing.

Digitizing tools
To minimize the risk of topological errors, the labelling platform's polygon digitizing tools prevent drawing that results in self-intersections or overlaps between adjacent polygons. Upon submission, the PostGIS ST_MakeValid function is applied to each polygon's geometry to clean any remaining topological errors before insertion into the database.

Mapping rules
Labellers followed the set of rules described in the next two sections when interpreting and digitizing field boundaries.

Interpreting and mapping imagery
• Map the fields that are visible in the PlanetScope images, not the fields that you see in the basemap.
• As the first choice, map the field outlines as they appear in the dry season PlanetScope images. If a field is not clearly visible in the dry season image, but is visible in the growing season image, then the next choice is to map it in the growing season image.
• Use all available images (PlanetScope scenes and base map) to help you interpret what is and is not a field. Toggling back and forth between the different image sources can help identify and classify the field boundaries.
• Map fields that intersect the mapping target, making sure to finish the entire boundary.

Choosing what to map
• Draw polygons around fields that look like they contain crops that are planted and harvested during a single year (or occasionally slightly longer, in the case of crops such as sugarcane). These fields may look recently ploughed or harvested, or contain actively growing crops.
• Do not draw polygons around fields that are under tree crops, such as orchards, woodlots, or other planted forest types.
• Do not draw polygons around fields that look like they are annual croplands, but are overgrown and haven't been planted for a few years. These are possibly fallows or abandoned fields.
• Do not draw polygons if you cannot tell whether a piece of land is an annual crop field or if it is perhaps another type of land cover (e.g. bare land, or a young orchard).
• Do not draw any polygons if there are no annual crop fields to map.

Assessing label accuracy
For each accuracy assessment assignment, the labeller's maps are scored against training reference polygons digitized by one of the map supervisors (e.g. Estes or Ye). A proportion of training reference sites were of the non-cropland class and thus had no corresponding polygons.
As described in the main text, the score for a particular assignment i is calculated as the weighted sum of five accuracy metrics:

S_i = β_0 I + β_1 O + β_2 F + β_3 E + β_4 C

Where β_0−4 are the weights assigned to each accuracy metric, which sum to 1. For the current production version, β_0−4 were assigned values of 0.4, 0.2, 0.2, 0.1, and 0.1, respectively.
I is the "inside the box" accuracy, defined as the proportion of area correctly mapped within the 0.005 degree resolution grid cell (I_c) over the total "inside the box" area for that cell (I_t): I = I_c / I_t.
O is the "outside the box" accuracy, the proportion of field area correctly mapped outside the grid cell (O_c) over the total "outside the box" region (O_t, the region within the bounding box of the worker's polygons but not within the 0.005 degree resolution grid cell): O = O_c / O_t.
F is the fragmentation accuracy, defined as the number of matched worker polygons (N_m, the number of the worker's polygons that have at least 50% of their area overlapping a reference polygon) over the total number of worker polygons (N_t): F = N_m / N_t.
E is the average edge accuracy over all pairs of matched worker and reference polygons; the edge accuracy for a single pair is defined as the length of correctly mapped edges (L_c, the portion of a worker polygon's boundary that falls within a three-pixel buffer of the matched reference polygon's boundary) over the total boundary length of the matched reference polygon (L_t): E = L_c / L_t.
C is the categorical accuracy, i.e. the proportion of area correctly labelled with the field category within the intersection of worker and reference polygons (T_c) over the total intersecting area (T_t): C = T_c / T_t.
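A minimal Python sketch of the weighted score, assuming the weights β are assigned to the metrics in the order they are defined above (I, O, F, E, C); the production scoring is implemented in the platform's internal tooling, and the metric values below are hypothetical:

```python
# Weights for the five accuracy metrics, summing to 1 (production values
# from the text, assumed to follow the order of the metric definitions)
BETAS = {"I": 0.4, "O": 0.2, "F": 0.2, "E": 0.1, "C": 0.1}

def assignment_score(metrics, betas=BETAS):
    """Weighted sum of the five accuracy metrics for one assignment.

    `metrics` maps each metric name to its value in [0, 1], e.g. the
    fragmentation metric F = N_m / N_t.
    """
    return sum(betas[k] * metrics[k] for k in betas)

# Hypothetical metric values for one assignment
score = assignment_score({"I": 0.9, "O": 0.8, "F": 1.0, "E": 0.6, "C": 1.0})
print(round(score, 2))  # 0.88
```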

Consensus labelling
As described in the main text, the formula used for creating a consensus label is:

P(θ|D) = Σ_i P(W_i|D) P(θ|D, W_i)    (1)

Where θ represents the true cover type of a pixel (field or not field), D is the workers' labelling of that pixel, and W_i is an individual worker. Looking in greater detail at this equation, the first term on the right-hand side, P(W_i|D), is the "prior" for worker i for the current site, based on their history of scores from accuracy assessment assignments. The second term, P(θ|D, W_i), is the probability that the actual class of the pixel in the current assignment is the class that worker i says it is, which is either 0 or 1. There are four possible values for this second term:

P(θ = field | D_i = field, W_i) = 1    (2)
P(θ = non-field | D_i = field, W_i) = 0    (3)
P(θ = non-field | D_i = non-field, W_i) = 1    (4)
P(θ = field | D_i = non-field, W_i) = 0    (5)

Where equations 2 and 4 represent true positives and true negatives, respectively, equation 3 is a false positive, and equation 5 is a false negative.
Coming back to the first term, the calculation of the prior probability can be re-expressed as:

P(W_i|D) = P(W_i) · L̂_i    (6)

Where L̂_i is the maximum likelihood for worker i, derived from the Bayesian information criterion (BIC):

BIC = k ln(n) − 2 ln(L̂)    (7)

In which n is the sample size, k is the number of parameters to estimate, and L̂ is the maximized likelihood function. In this case, we are only interested in one parameter (the label that maximizes the likelihood function), thus the BIC becomes:

BIC = ln(n) − 2 ln(L̂)    (8)

After rearranging, we have:

L̂ = exp[(ln(n) − BIC) / 2]    (9)

Which is the worker maximum likelihood, and can be computed as:

L̂_i = ½ [P(D_i = field | θ = field) + P(D_i = non-field | θ = non-field)]    (10)

Where:

P(D_i = field | θ = field) = TP_i / (TP_i + FN_i)    (11)
P(D_i = non-field | θ = non-field) = TN_i / (TN_i + FP_i)    (12)

Equations 11 and 12 are producer's accuracies, thus the maximum worker likelihood is equivalent to the worker's average producer's accuracy.
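Taking the worker likelihood to be the average of the two producer's accuracies, as described above, it can be computed from a worker's accumulated confusion-matrix counts; the counts below are hypothetical:

```python
def worker_likelihood(tp, fp, fn, tn):
    """Average producer's accuracy across the field and non-field classes,
    used as the worker's maximum likelihood in the consensus weighting."""
    pa_field = tp / (tp + fn)        # producer's accuracy, field class
    pa_nonfield = tn / (tn + fp)     # producer's accuracy, non-field class
    return (pa_field + pa_nonfield) / 2

# Hypothetical history: 8 true positives, 1 false positive,
# 2 false negatives, 9 true negatives
likelihood = worker_likelihood(tp=8, fp=1, fn=2, tn=9)
print(likelihood)  # (0.8 + 0.9) / 2 = 0.85
```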
The other component of equation 6, P(W_i), is the worker's average score over their m accuracy assessment assignments:

P(W_i) = (1/m) Σ_{j=1..m} S_j    (13)

Thus equation 6 combines two measures of worker accuracy, 1) the worker's overall average accuracy score, multiplied by 2) their average producer's accuracy, to create a weight for their individual maps at the given site. Equation 1 becomes:

P(θ|D) = Σ_i P(W_i) L̂_i P(θ|D, W_i) / Σ_i P(W_i) L̂_i    (14)

With P(θ|D, W_i) being either 0 or 1, and the denominator normalizing the worker weights so that they sum to 1. In labelling, if the consensus result for a pixel is P(θ = field|D) > 0.5, then we assign that pixel to the field category, otherwise to the non-field category.
After creating the consensus label, the degree of confidence in the resulting label value is measured by the Bayesian risk:

R = 1 − C, if L = field
R = C, if L = non-field

Where C is the consensus probability that a given pixel is a field (P(θ = field|D)), and L is the consensus label (i.e. non-field if C < 0.5, field if C > 0.5) for that pixel. The risk values across an entire sample site can be processed in two ways to provide useful information about the confidence in the consensus label for that site. The first is a simple average of all risk values in the site, where the slope of the risk varies depending on whether L is field or non-field (Figure S2). The closer the value is to 0, the lower the risk that L is mislabelled, while values approaching 1 indicate increasing risk of mislabelling. A second approach is to calculate the proportion of pixels having risk values that exceed a certain threshold.
Figure S2: Bayesian risk values (Y axis) for consensus values (X axis) ranging from 0 to 1 (0 indicates no consensus that a pixel falls into the field class, 1 indicates complete consensus) for field and non-field consensus labels.
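A minimal Python sketch of the risk and the two site-level summaries (the piecewise form is our reading of Figure S2; the production implementation is in rmapaccuracy):

```python
def bayesian_risk(consensus, label):
    """Risk that the consensus label is wrong: the consensus probability
    mass assigned to the class the label did NOT choose."""
    return 1.0 - consensus if label == "field" else consensus

def site_summary(risks, threshold=0.2):
    """Two site-level summaries: the mean risk across all pixels, and the
    proportion of pixels whose risk exceeds a threshold."""
    mean_risk = sum(risks) / len(risks)
    prop_high = sum(r > threshold for r in risks) / len(risks)
    return mean_risk, prop_high

# Strong consensus that a pixel is a field -> low risk
r = bayesian_risk(0.9, "field")
mean_risk, prop_high = site_summary([0.1, 0.2, 0.3])
```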
Accuracy assessment and consensus label generation are performed using rmapaccuracy, an R package internal to the labelling platform. The package relies on spatial data classes and operations provided by the sf (Pebesma 2018) and raster (Hijmans 2020) packages.

Example
To provide an example of this approach in practice, imagine two workers, A and B. In this scenario, worker A thinks that the given pixel falls within a field, and worker B thinks it is not a field. First, we calculate the weight for each worker from their history of scores and producer's accuracies, and then plug these weights into the full equation, which yields P(θ = field|D) = 0.621. Since 0.621 > 0.5, we label the pixel a field.
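The weighting scheme in this example can be sketched in Python. The worker histories below are hypothetical illustration values, not the figures from the worked example above:

```python
def consensus_field_probability(workers):
    """Consensus probability that a pixel is a field.

    Each worker is a dict with:
      score: average accuracy-assessment score, P(W_i)
      producers_accuracy: average producer's accuracy (worker likelihood)
      label: the class the worker assigned ('field' or 'non-field')
    Each worker's weight is score * producers_accuracy, and the consensus
    is the normalized weighted vote for the field class.
    """
    weights = [w["score"] * w["producers_accuracy"] for w in workers]
    field_votes = sum(wt for wt, w in zip(weights, workers)
                      if w["label"] == "field")
    return field_votes / sum(weights)

# Hypothetical worker histories
workers = [
    {"score": 0.80, "producers_accuracy": 0.85, "label": "field"},      # worker A
    {"score": 0.70, "producers_accuracy": 0.75, "label": "non-field"},  # worker B
]
p_field = consensus_field_probability(workers)
label = "field" if p_field > 0.5 else "non-field"
```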

Segmentation
An overview of the two inputs to, and the outputs from, the segmentation algorithm is shown in Figure S3.
Figure S3: An overview of the inputs and key outputs from the five-step segmentation algorithm.

Accuracy assessment
We designed and implemented a map accuracy assessment protocol following procedures summarized by Stehman and Foody (2019). This entailed creating a map reference sample, which required first designing the sample, and then designing how the sample response would be collected.

Map reference sample design
We employed a stratified design for collecting the map reference sample, using the segmented field boundaries to define the cropland/non-cropland strata. To create the sample, we first extracted the centroids of each of the sample grid cells in Ghana. We then intersected the centroid points with the field segments, assigning a class of 1 (cropland) where points intersected a field, and 0 (non-cropland) where they did not. We removed from this set of points all those that corresponded to model training, validation, or training reference sites, and extracted a random sample from both the cropland and non-cropland points. To determine the sample size of each, we specified a desired confidence interval using the following formula (Stehman and Foody 2019):

n = z² p(1 − p) / d²

Where n is the sample size, z is the standard normal deviate for the desired confidence level (1.96 for 95%), p is the estimated probability (or mapped class accuracy), and d is the margin of error (1/2 the confidence interval width). We selected a d value of 0.03 and assumed that the user's accuracy of the field class would be 0.75 and that of the non-cropland class would be 0.8, returning sample sizes of 800 and 683, respectively. The distribution of the resulting map reference sample is shown in Figure S4.
Figure S4: Distribution of the selected map reference sample for the cropland and non-cropland classes. Classes represent the values extracted from the map strata, rather than those assigned during the classification of the sample.
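The reported sample sizes can be reproduced numerically, taking z = 1.96 for a 95% confidence level (an assumption consistent with the reported values):

```python
def sample_size(p, d, z=1.96):
    """Sample size for estimating a proportion p with margin of error d
    at ~95% confidence: n = z^2 * p * (1 - p) / d^2, rounded to the
    nearest integer."""
    return round(z ** 2 * p * (1 - p) / d ** 2)

n_crop = sample_size(p=0.75, d=0.03)     # assumed user's accuracy, cropland
n_noncrop = sample_size(p=0.80, d=0.03)  # assumed user's accuracy, non-cropland
print(n_crop, n_noncrop)  # 800 683
```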

Response design
We collected the map reference sample on a separate instance of the labelling platform set up for this purpose. For the sampling unit, we selected a rectangular polygon of ~0.1 ha (0.0002866° resolution). This polygon was centered on the centroid of each grid cell selected for the map reference sample, and was presented as the target grid in the labelling platform. Four classes were established for the validation: cropland, non-cropland, uncertain but likely cropland, and uncertain but likely non-cropland. The latter two classes were designed to capture information related to swidden dynamics, following the rationale that uncertainty and time since last cropping are likely to be positively correlated. This uncertainty also captures information about the inherent difficulty of the mapping task. Samples were collected by visually interpreting the overlays of PlanetScope composites, following the same interpretation protocols used by the labelling team, with the exception that the polygons placed were square and of 0.0002866° resolution.
Two supervisors (Ye and Estes), who were not involved in collecting training and validation labels, collected the map reference sample by evaluating the PlanetScope composites at each site to determine its class membership. The following set of rules was followed in collecting the sample:
1. When determining the class corresponding to the initial location of the target grid, if:
• More than half of the target falls within what appears to be a clear arable crop field, then classify it as cropland;
• More than half falls in what is clearly not a field, then classify it as non-cropland;
• More than half falls in a location where it is harder to tell whether it is cropland or non-cropland, determine whether it is more likely a crop field or not a crop field, and then assign either uncertain but likely cropland or uncertain but likely non-cropland.

2. After determining the class, if:
• The target polygon is contained entirely within a single clear class, then simply digitize a point at the center of the target box, assign the appropriate class label, and complete the assignment;
• The target polygon spans more than one class, then digitize a square polygon exactly aligned with the initial target, choose the correct class label, and then move the new polygon to the nearest location where it can be contained entirely within the assigned class.
After collecting the sample, the geometries were further refined by converting the points (sites where the target did not have to be shifted) to polygons of the same 0.0002866° resolution, and then the full set of map reference polygons was used to extract the classified values from both the cropland probability and vectorized field boundary maps. Accuracies were assessed for the entire country, and within several zones consisting of different groupings of AOIs or agroecozones.

Map reference label uncertainty
The size of the collected map reference sample was 1,207, with 1,036 samples collected and interpreted by one observer (Ye) and 171 collected by a second observer (Estes). To evaluate the uncertainty inherent in defining the map reference labels, the pair mapped 23 common sites, showing an overall level of agreement of 87% and a Spearman Rank Correlation of 0.76.
Although this overlap between observers was limited, the map reference classification scheme provided two additional measures of uncertainty: the classes uncertain but likely cropland and uncertain but likely non-cropland, which the reference labeller chose when they could not state with high confidence whether the site was cropland or not cropland. Of the 487 map reference samples identified as cropland, 22% fell into the lower confidence category, as did 11.5% of the 720 non-cropland samples. Across both classes, the reference labellers had lower confidence in 15.7% of the sites labelled.
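The overall lower-confidence proportion follows from the per-class counts (counts reconstructed here from the rounded percentages in the text, so they are approximate):

```python
n_crop, n_noncrop = 487, 720
low_conf_crop = round(0.22 * n_crop)          # ~107 lower-confidence cropland sites
low_conf_noncrop = round(0.115 * n_noncrop)   # ~83 lower-confidence non-cropland sites
share = (low_conf_crop + low_conf_noncrop) / (n_crop + n_noncrop)
print(round(share * 100, 1))  # 15.7
```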

Accuracy assessment zones
The zones used to assess regional variation in map accuracy are illustrated in Figure S5.

Image catalog and quality
The total number of PlanetScope images available for download from the Planet API (PlanetTeam 2018) per tile for each of the two major seasons is shown in Figure S6.
The image quality assessment was conducted by two separate observers (L. Song and Q. Zhang). Each observer assessed the composites for each season at all 50 of the randomly selected tiles (see main text) following the defined criteria (Table S1). To calculate the score for a given tile for each season, we summed the ranking across the four categories for each observer, rescaled the values, and then calculated the mean tile score across both observers by season. The mean difference between the two observers was -0.057 (sd = 0.081) and the mean absolute difference was 0.063 (sd = 0.076), thus one observer scored tiles about 6 percent lower than the other observer, on average.
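To make the scoring procedure concrete, here is a Python sketch that assumes each of the four rubric categories is ranked on a 1-3 scale (an assumption for illustration; the actual scale is defined in Table S1):

```python
def tile_score(rankings, r_min=1, r_max=3):
    """Sum one observer's rankings across the four rubric categories,
    then rescale the total to [0, 1]."""
    assert len(rankings) == 4
    lo, hi = 4 * r_min, 4 * r_max
    return (sum(rankings) - lo) / (hi - lo)

def mean_tile_score(obs_a, obs_b):
    """Mean rescaled score across the two observers for one tile/season."""
    return (tile_score(obs_a) + tile_score(obs_b)) / 2

# Hypothetical rankings for one tile from each observer
score = mean_tile_score([2, 2, 2, 2], [3, 3, 3, 3])
```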

Mapping cropland probabilities with active learning
The number of active learning iterations per AOI was three for all but four AOIs. AOIs 10 and 14 stopped after one and two iterations, respectively, as they started with high initial validation accuracies (>83%) and showed little subsequent improvement. The models for these two AOIs were thus trained with 600-700 labels. AOI 15 was run for 4 iterations (900 samples), while AOI 3 underwent a second active learning cycle because the model produced during the first cycle was inaccurate (see next section). In this second run, 300 initial training sites randomly selected within the AOI were used, followed by 2 subsequent active learning iterations, resulting in a training sample of 500.
Figure S6: The density of images available through the Planet API for the five month growing season in 2018 and the subsequent three month dry season (excluding the month of November), shown in relation to the AOI boundaries.
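The uncertainty-based selection at each iteration can be illustrated with a common uncertainty-sampling rule (a sketch only; the production logic resides in the learner repository): choose the candidate sites whose predicted cropland probability is closest to 0.5.

```python
def most_uncertain(probabilities, n):
    """Return the indices of the n candidate sites whose predicted
    cropland probability is closest to 0.5 (most uncertain first).
    These are the next sites sent to the labellers."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:n]

# Four candidate sites; the model is least sure about sites 1 and 3
next_sites = most_uncertain([0.10, 0.49, 0.95, 0.60], n=2)
print(next_sites)  # [1, 3]
```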

Labelling
The locations of training, training reference, and validation points are shown in Figure S7. In AOI 3, the initial active learning cycle resulted in low accuracy because the northern part of the AOI shows low contrast between fields and the surrounding vegetation in the dry season. Training the model with the initial 500 samples resulted in large commission errors in this part of the AOI, thus we ran a second active learning cycle that began with an initial random draw of 300 training sites within this AOI (blue points in Figure S7A).
Training and validation sample collection effort was divided across 20 labellers, with a core group of 13 who completed more than 1,000 assignments each (Figure S8). As each training/validation task was undertaken by 4 separate labellers, 34,014 sets of labels were made. Each labeller digitized an average of 2,001 training/validation assignments.
These results include those from re-labelling the initial training sites in Cluster 2 (AOIs 7-9, 12, and 15). This was done because we discovered a small spatial offset in the original image composites, which we corrected by reprocessing the images. We replaced the image overlays in the instances for this cluster (Figure S7A), and the labelling team mapped these sites a second time during the production run in late 2019. The resulting labels were used to initially train the models for AOIs 9, 12, and 15.

Model performance
The differences in accuracy, AUC, and F1 between the active learning process and the random retraining at each iteration for AOIs 1, 8, and 15 are shown in Figure S9. Small differences due to random variation in the Random Forests models explain the non-zero differences at iteration 0, when both models were trained with the same label set.
The lower accuracy of actively versus randomly trained models in earlier iterations was caused by results for AOI 15, where active learning accuracy was 8.37 percent lower than random training after iteration 1 (see Figure S9). In comparison, iteration 1 active learning accuracies were 2.88 and 0.45 percent higher than random training for AOIs 1 and 8, respectively. Accuracy under active learning for AOI 15 exceeded randomized training after 4 iterations.
Figure S8: The A) number of training/validation and accuracy assessment assignments completed by each labeller, and B) the distributions of quality scores at training reference sites for each labeller (means indicated by X in the boxplots). Labellers' identities are anonymized.

The impact of training data error
Two measures of label quality, the average quality score per labeller and the Bayesian risk of labels, were calculated to provide proxy measures of label error. Labeller quality was scored 9,389 times against 98 unique training reference sites, with each labeller assessed an average of 552 times, at a rate of 1 training reference site for every 3.62 training sites mapped. The mean of each labeller's average accuracy score was 0.71 (range 0.6 to 0.85; Figure S10).
The average Bayesian Risk of each training and validation site is shown in Figure S11, and the distribution of risk values per AOI and the initial training clusters in Figure S12. The three initial clusters include the second mapping of Cluster 2 ( Figure S7A).
The average Bayesian risk was 0.122 for training labels and 0.127 for validation labels. Risk was highest in the northern AOIs (AOIs 1-6; Figures S11-S12), at 0.157 for training and 0.173 for validation labels, and lowest in the southwestern AOIs (AOIs 10, 11, 13, 14, and 16; training risk = 0.079; validation risk = 0.065). Label risk in the central-southeastern AOIs (AOIs 7-9, 12, and 15) was slightly lower (training = 0.127; validation = 0.136) than in the north. Labeller experience also appeared to reduce risk, which we observed during the relabelling of the 500 initial random sites in this cluster; the mean risk of the updated labels was 0.055, compared to 0.172 for the original labels.
Probability images resulting from Random Forests models trained with labels generated under three different labeling strategies are illustrated in Figure S13. These included consensus labels, and those individual labels that were likely to be the most and least accurate for each training site. Label accuracy was based on the mean score of each labeller against the training reference sites ( Figure S10), as assessed when labelling a given AOI. These images were created for a single tile in AOI 1.
The association between the average label risk per AOI and several model performance metrics, assessed using Spearman Rank Correlation, is shown in Table S3.
Figure S9: The percent difference in performance metrics per iteration for AOIs 1, 8, and 15 (grey lines, with numbers indicating AOI; the black line indicates the average difference across the three AOIs) when comparing models trained using active learning versus those trained using randomly selected sites. Positive percentages indicate superior performance by active learning, negative percentages the inverse.

Categorical accuracy
In addition to the map accuracies and area estimates calculated per AOI zone (reported in the main text; see Figure S5A), accuracies were also assessed within several different groupings of agroecozones (Figure S5B and Table S4).

Field area and number
The means, medians, and distributions of the average areas of segmented field boundaries over the 100 validation sites in each AOI are compared to the areas of the polygons digitized by the most accurate labeller over the same sites in Figure S14. The same statistics for the average number of segments versus the average number of labelled polygons across validation sites in each AOI are shown in Figure S15.

Field area and number
To examine average field sizes and the total number of fields, the mean segment size per 0.05 degree tile was calculated and mapped, as was the total number of fields per tile (Figure S16).
Figure S13: Cropland probability images produced by Random Forests models trained with A) consensus labels, B) the most accurate individual labels, and C) the least accurate individual labels.
Figure S14: The distributions of the average areas (in hectares) of segmented field boundaries (shown in blue) at the 100 validation sites per AOI, compared with the average areas of field boundaries digitized (shown in red) by the most accurate worker to label each site. Vertical lines indicate the mean and median of each distribution.
Figure S15: The distributions of the average number of segmented field boundaries (shown in blue) at the 100 validation sites per AOI, compared with the average number of digitized polygons (shown in red) by the most accurate worker to label each site. Vertical lines indicate the mean and median of each distribution.
Figure S16: The A) average sizes of fields and B) total number of fields in each 0.05 degree tile, as calculated from the vectorized field boundaries.