Assessment of Bias in Pan-Tropical Biomass Predictions

Above-ground biomass (AGB) is an essential descriptor of forests, of use in ecological and climate-related research. At tree- and stand-scale, destructive but direct measurements of AGB are replaced with predictions from allometric models characterizing the correlational relationship between AGB, and predictor variables including stem diameter, tree height and wood density. These models are constructed from harvested calibration data, usually via linear regression. Here, we assess systematic error in out-of-sample predictions of AGB introduced during measurement, compilation and modeling of in-sample calibration data. Various conventional bivariate and multivariate models are constructed from open access data of tropical forests. Metadata analysis, fit diagnostics and cross-validation results suggest several model misspecifications: chiefly, unaccounted for inconsistent measurement error in predictor variables between in- and out-of-sample data. Simulations demonstrate conservative inconsistencies can introduce significant bias into tree- and stand-scale AGB predictions. When tree height and wood density are included as predictors, models should be modified to correct for bias. Finally, we explore a fundamental assumption of conventional allometry, that model parameters are independent of tree size. That is, the same model can provide predictions of consistent trueness irrespective of size-class. Most observations in current calibration datasets are from smaller trees, meaning the existence of a size dependency would bias predictions for larger trees. We determine that detecting the absence or presence of a size dependency is currently prevented by model misspecifications and calibration data imbalances. We call for the collection of additional harvest data, specifically under-represented larger trees.


INTRODUCTION
Above-ground biomass, AGB, is central to assessments of forest state and change because of its relationship with the carbon cycle, and ecosystem services including net primary production (Field et al., 1998; Pan et al., 2011; Costanza et al., 2014; Martin et al., 2018). The above-ground biomass of a particular tree, at a given point in time, is the result of lifetime cumulative gross primary production, $P_g$, respiration, $r$, and loss, $d$ (Roberts et al., 1993).
To measure the AGB of a tree would require: (i) harvesting it flush with the ground, (ii) removing water through drying, and (iii) measuring its mass via weighing. These destructive measurements are necessarily limited due to their difficulty. Instead, measurements of AGB are replaced at the tree- and stand-scale with estimates predicted from allometrics (Picard et al., 2012). Allometric models exploit the correlational relationships that exist between AGB and more readily measurable tree parameters (e.g., stem diameter, D, tree height, H, and wood density, ρ). These relationships are discovered empirically, from calibration data where AGB has been directly measured via destructive harvest, concurrent with measurement of the predictor variables.
The conventional approach for modeling such calibration data is a combination of ordinary least squares (OLS) linear regression and log-log transformation. OLS is favored because of broad coverage in the wider statistical literature and its long history in the field of allometry since its introduction in the 1900s (Lapicque, 1907; Huxley, 1932). The log-log transformation is undertaken because AGB is usually observed to scale with predictor variables such as D according to a power law (Brown, 1997), and it is a convenient approach for modeling the multiplicative nature of plant growth: variance in AGB normally increases with tree size (Kerkhoff and Enquist, 2009).
Once an allometric model has been constructed from some underlying in-sample calibration data, an out-of-sample prediction of tree AGB is made by inputting the measurements of the predictor variables of that tree into the model. Stand-scale AGB is estimated by summing predictions for every tree inside a particular forest stand.
For tropical forests, these models are most often constructed at the population-scale because of diversity (Gibbs et al., 2007), implying calibration data must represent upwards of 40 000 species (Slik et al., 2015). Occasionally, more constrained models of tropical forests are available, where calibration data were acquired exclusively from either a geographic subsection of the tropics, or from a specific plant taxon (Basuki et al., 2009). However, in the interests of consistency, or because of the general unavailability of these more specific models, pan-tropical models are usually preferred.
A number of pan-tropical models exist (Henry et al., 2013), but a few have become particularly prominent (e.g., Brown et al., 1989; Chave et al., 2005, 2014; Feldpausch et al., 2012), and their subsequent predictions of stand-scale AGB are the cornerstone of multiple activities across the environmental sciences. One example usage of these predictions is to calibrate remotely sensed signals from earth observation instruments, from which regional- and global-scale AGB products are derived (Saatchi et al., 2011; Baccini et al., 2012; Avitabile et al., 2016). Another example is their provision of reference AGB stocks to intergovernmental initiatives on climate change, such as the UN-REDD program (Angelsen et al., 2012).
Total error, then, is the allometric-derived prediction of tree- or stand-scale AGB minus the true (or reference) value of AGB. Total error is the sum of two components: (i) systematic error, which describes predictable non-zero mean offsets from the true value, and (ii) random error, which describes unpredictable zero mean offsets from the true value. Trueness, precision, and accuracy are qualitative terms describing the effect on the performance of a prediction of systematic, random and total error respectively. These qualitative performance characteristics are quantitatively expressed as a bias, standard deviation and uncertainty respectively. That is, the uncertainty of a prediction should account for both systematic and random error.
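The relationship between these terms can be illustrated numerically. The following is a minimal Python sketch, assuming a set of predictions with known reference values. The function name `error_summary` is hypothetical, and the use of root-mean-square error as a stand-in for uncertainty is an illustrative assumption rather than a definition taken from the cited standards.

```python
import statistics


def error_summary(predicted, reference):
    """Summarise prediction error as (bias, standard deviation, RMSE).

    Hypothetical helper for illustration only.
    """
    errors = [p - r for p, r in zip(predicted, reference)]
    bias = statistics.fmean(errors)        # systematic error: mean offset
    sd = statistics.pstdev(errors)         # random error: spread about that mean
    rmse = (bias ** 2 + sd ** 2) ** 0.5    # total error combines both components
    return bias, sd, rmse


# Illustrative numbers only (kg); not taken from the calibration data.
bias, sd, rmse = error_summary([12.0, 8.0, 13.0, 11.0], [10.0, 10.0, 10.0, 10.0])
```

In this toy case the predictions are, on average, offset from the reference (a non-zero bias), and the RMSE exceeds the standard deviation because it also carries the systematic component.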
Error in out-of-sample allometric predictions is potentially introduced during the selection, measurement and modeling of the in-sample calibration data, as well as in the measurement of the out-of-sample data. Possible sources of random error include: (R1) noise in the measurement of the in-sample calibration data, (R2) variance in the subsequently constructed model, which arises from the stochastic nature of plant allometry, and (R3) noise in the measurement of the out-of-sample data. Possible sources of systematic error include: (S1) biased measurement of the in-sample calibration data, (S2) bias introduced by the selected modeling methods, (S3) the possibility that the in-sample data are unrepresentative of the out-of-sample data, and (S4) biased measurement of the out-of-sample data.
The gold standard for quantifying these uncertainties in out-of-sample tree- or stand-scale predictions is direct measurement via destructive harvest. However, across tropical forests, direct measurement at the stand-scale has never been undertaken. Aside from the difficulties associated with large-scale destructive harvest, this is perhaps also due to the so-called "fallacy of misplaced concreteness" (Clark and Kellner, 2012). That is, uncertainty associated with these predictions is often ignored as a result of erroneously deeming them reference measurements, rather than the estimates they are. Indeed, only a small body of literature has considered uncertainty in out-of-sample pan-tropical predictions of AGB (Chave et al., 2004; Molto et al., 2013; Picard et al., 2015a; Réjou-Méchain et al., 2017). As outlined below, the focus of these studies has been on the precision of predictions (i.e., the effect of random error), with particular attention to sources R2 and R3. Chave et al. (2004), using an OLS model constructed from a compilation of pan-tropical calibration data, found relative uncertainty to approximate 5-10 % at the 1 ha stand-scale (Chave et al., 2014), when accounting for source R2 using the standard error of the regression, and R3 using a Taylor series expansion. Réjou-Méchain et al. (2017), using the same model and calibration data, but perturbing the parameters of the model via a Bayesian framework to simulate further error arising from R2, found relative uncertainty to approximate 10 % at the 1 ha stand-scale (Chave et al., 2019). Picard et al. (2015a) considered R2 still further, and cognisant of the multiple, nominally suitable allometric models available, estimated their aggregate variance using Bayesian model averaging, from which relative uncertainty was found to approximate 44 % at the 1 ha stand-scale.
Whilst the contribution to uncertainty by random error has received some attention, the contribution by systematic error has received considerably less. Here, we focus on systematic error in allometric-derived pan-tropical predictions of AGB, with particular attention to sources S1, S2, and S4. That is, we focus on bias introduced during the measurement and modeling of the in-sample calibration data, and measurement of the out-of-sample data.
FIGURE 1 | Definition of the terms used in this paper to describe the concept of error in out-of-sample pan-tropical allometric AGB predictions (ISO 5725 and BIPM definitions) (ISO-5725-1:1994(en), 1994; JCGM-200:2012, 2012). The upper chart, adapted by permission from Springer Nature (Menditto et al., 2007), defines the relationships between error type and associated performance characteristic. The lower plot illustrates the effect on predictions from improving trueness, precision, and accuracy.
Frontiers in Forests and Global Change | www.frontiersin.org | February 2020 | Volume 3 | Article 12
Initially, we undertake a review of the metadata of existing pan-tropical calibration data, to note the measurement methods particular to destructive harvest experiments. We then review the underlying assumptions of OLS modeling necessary to justify unbiased predictions of out-of-sample AGB. Using open access pan-tropical calibration data, we construct several conventional models, and test whether these assumptions are met using various fit diagnostics and statistical tests. We assess and compare the precision, trueness and accuracy of predictions from these models using bootstrapping and cross-validation.
We identify several potential sources of bias, simulate their likely influence, and discuss their implication for OLS predictions of tree- and stand-scale AGB. We make recommendations for quantifying and minimizing bias during the measurement and compilation of calibration data. We also discuss approaches for minimizing bias during the modeling of calibration data.
Finally, we consider one aspect of error source S3: the possibility that in-sample data are unrepresentative of out-of-sample data. We assess whether pan-tropical allometry is independent of tree size; that is, an assumption of conventional pan-tropical allometry is that the model parameters necessary for predicting the AGB of a tree are constant regardless of tree size. This is necessary to consider because calibration datasets are often imbalanced (i.e., the majority of observations will be from small trees, with relatively few large trees) (Duncanson et al., 2015; Jucker et al., 2017). If pan-tropical allometrics are dependent on tree size, these imbalances will introduce a bias into predictions of AGB for under-represented size classes.

Characteristics
In this paper, we consider the Chave et al. calibration dataset (Chave et al., 2014). These open access data are currently the most comprehensive available, compiled from many independent destructive harvest experiments over 7 decades, from 58 sites spanning the tropics, with measurements of AGB (kg), D (m), H (m) and $\rho_b$ (kg m$^{-3}$; dry mass divided by wet volume) obtained from 4004 trees. Figure 2 illustrates these data, presenting a scatter plot of AGB against D, and histograms displaying the distributions of the 4 variables.
It is worth noting two characteristics of the dataset. The first is non-constant variance in AGB: for a given value of D, the range of values taken by AGB is not constant across scale; rather, this range increases as D increases. The second is the non-uniform distribution of these data, with the majority collected from relatively small trees. Some statistics illustrate this: AGB ranges from 1 to 76,064 kg; the median and mean values are 98 and 1,134 kg; the first and third quartiles are 22 and 491 kg; 5.7 and 2.7 % of these data (by stem count) have AGB > 5,000 and 10,000 kg, respectively.
For the measurement of D, the measurement device, point of measurement and buttress treatment are recorded. For the measurement of H, the measurement device and whether measurement was made in situ or post-felling are noted. For the measurement of AGB, the methods for measuring wet mass and the subsequent conversion to dry mass are recorded. Finally, the methods used in each study for the measurement of $\rho_b$ are also recorded.

Ordinary Least Squares Linear Regression
The conventional approach for constructing a model to predict AGB from these calibration data is ordinary least squares (OLS) linear regression. An OLS model takes the form:
$$ y = X\beta + \varepsilon \tag{1} $$
where $y$ is an $n \times 1$ vector of $n$ observations of the dependent variable (e.g., AGB); $\varepsilon$ is an $n \times 1$ vector of unobserved random error in $y$; $X$ is an $n \times p$ design matrix of observations of the predictor variables (e.g., D), where $p$ is the number of included predictor variables plus a constant term; and $\beta$ is a $p \times 1$ vector of the unknown population parameters. The closed-form OLS solution to estimating $\beta$ is minimization of the sum of the squared differences between observations and predictions of the dependent variable:
$$ \hat{\beta} = (X^{\top}X)^{-1}X^{\top}y \tag{2} $$
Although $\varepsilon$ are unobserved, they are estimated by, and then represented by, the residuals of the model fit, $e$, where $e_i = y_i - \hat{y}_i$ $(i = 1, \ldots, n)$. The standard error of the regression, $s$, which is the OLS estimate, $\hat{\sigma}$, of the standard deviation of $\varepsilon$, $\sigma$, and which is itself necessary for frequentist tests of statistical significance (e.g., prediction/confidence intervals), is defined as:
$$ s = \hat{\sigma} = \sqrt{\frac{e^{\top}e}{n - p}} \tag{3} $$
FIGURE 2 | AGB is plotted against D, the dominant predictor variable in most allometric models; it can be seen that variance in AGB is non-constant. The lower histograms present the distributions of the 4 measured variables across these data. It is noted observations are non-uniformly distributed across the range of each variable, with the majority collected from relatively small trees.

Assumptions and Finite-Sample Properties of Ordinary Least Squares
For $\hat{\beta}$ to be an unbiased estimate of $\beta$ (i.e., the expected value of $\hat{\beta}$ is $\beta$, $E[\hat{\beta}] = \beta$), the following three assumptions must be met (Hayashi, 2000):
1. Linearity: a linear relationship exists between $X$ and $y$.
2. Strict exogeneity: the expected mean of $\varepsilon$, conditional on $X$, is zero, which in practice implies that $\varepsilon$ is expected to have an unconditional zero mean, and that $X$ is expected to be uncorrelated with $\varepsilon$:
$$ E[\varepsilon \mid X] = 0 \;\Rightarrow\; E[\varepsilon] = 0, \quad E[X^{\top}\varepsilon] = 0 $$
3. Absence of perfect collinearity: the relationship between the predictor variables is not deterministic, which would prevent the necessary inversion of $X^{\top}X$ in Equation 2.
Finally, one further assumption is sometimes associated with OLS:
6. Normality: the error term is normally distributed:
$$ \varepsilon \mid X \sim \mathcal{N}(0, \sigma^{2} I_n) $$
A normal error term is not required for either $\hat{\beta}$ or $\hat{\sigma}$ to retain unbiased properties, but a non-normal ε potentially invalidates t- and F-tests, or the consistency of tests of model selection, such as the Akaike information criterion, whose underlying likelihood function usually expects a normally distributed ε. However, even in such circumstances, if the model is correctly specified as above (assumptions 1-5), a non-normally distributed ε is often dismissed when n is sufficiently large by invoking the central limit theorem (Pek et al., 2018).

Predictive Modeling
It is now necessary to consider these assumptions in the context of predicting out-of-sample AGB. These assumptions have been derived with the classical application of regression in mind: causal understanding (Shmueli, 2010). That is, $\hat{\beta}$ are interpreted as explaining the effect of the predictor variables on the dependent variable (in this context, the term predictor variable would usually be replaced with independent variable).
Here however, our interest with $\hat{\beta}$ is solely to predict a value of AGB for a given out-of-sample value of X, and to understand the statistical significance of this prediction. By making this fundamental distinction, it is possible to relax or discard several of the above assumptions. For simplicity, for the remainder of this subsection, X is considered to be a single predictor variable, D. For the assumption of strict exogeneity, relevant potential sources of endogeneity include:
1. Omitted variable bias. AGB is not caused by D, but is caused by the aforementioned causal variables (i.e., gross primary production, $P_g$, respiration, $r$, and losses, $d$):
$$ \mathrm{AGB} = f(P_g, r, d) $$
Such that when these causal variables are omitted, and replaced with a non-causal predictor variable, their influence is subsumed into the error term:
$$ \mathrm{AGB} = \beta_0 + \beta_1 D + \varepsilon $$
Whereby if D were correlated with the combination of omitted variables, then D and ε are correlated, which violates the assumption of strict exogeneity.
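2. Systematic error in the measurement of AGB. At its simplest, if a constant bias, c, is present in the measurement of AGB, then the mean of ε is now non-zero:
$$ E[\varepsilon] = c $$
Meaning the intercept, $\beta_0$, is biased:
$$ E[\hat{\beta}_0] = \beta_0 + c $$
3. Errors-in-variables. OLS assumes predictor variables are measured without error. In the case of a single predictor variable, the model is described as:
$$ \mathrm{AGB} = \beta_0 + \beta_1 D + \varepsilon $$
Where the OLS estimate of $\beta_1$ is:
$$ \hat{\beta}_1 = \frac{\mathrm{Cov}(D, \mathrm{AGB})}{\mathrm{Var}(D)} $$
However, suppose D were measured with some random error, η, then the estimate becomes:
$$ \hat{\beta}_1 = \frac{\mathrm{Cov}(D + \eta, \mathrm{AGB})}{\mathrm{Var}(D + \eta)} $$
Meaning the measured D and ε are now correlated via η. This manifests in a downward bias of the estimate of $\beta_1$, which is often termed regression dilution.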
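The size of the dilution in the errors-in-variables case can be made explicit with a short worked derivation. This is a sketch under the additional assumptions (not stated in the text above) that η is independent of both D and ε, and that D is uncorrelated with ε; the symbols $D^{*}$, $\sigma_D^2$ and $\sigma_\eta^2$ are introduced here for illustration.

```latex
\begin{align}
D^{*} &= D + \eta, \qquad \mathrm{Var}(D) = \sigma_D^{2}, \quad \mathrm{Var}(\eta) = \sigma_\eta^{2} \\
\mathrm{Cov}(D^{*}, \mathrm{AGB}) &= \mathrm{Cov}(D + \eta,\; \beta_0 + \beta_1 D + \varepsilon) = \beta_1 \sigma_D^{2} \\
\mathrm{Var}(D^{*}) &= \sigma_D^{2} + \sigma_\eta^{2} \\
\hat{\beta}_1 &\xrightarrow{\;p\;} \frac{\mathrm{Cov}(D^{*}, \mathrm{AGB})}{\mathrm{Var}(D^{*})}
  = \beta_1 \, \frac{\sigma_D^{2}}{\sigma_D^{2} + \sigma_\eta^{2}}
\end{align}
```

Under these assumptions the attenuation factor lies between 0 and 1, so the estimated slope shrinks toward zero as the measurement error variance grows relative to the variance of D.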
The consequences of these various sources of endogeneity differ depending on the application of the model. If the application is explanation, then all three sources bias estimates of β, which we discuss further in the discussion section. If the application is prediction, then systematic error in the measurement of AGB will also persistently bias out-of-sample predictions of AGB. Errors-in-variables do not necessarily bias AGB predictions, although a bias will be present when the measurement error distributions between in- and out-of-sample measurements are inconsistent (Jonsson, 1994; Molto et al., 2013). Omitted variable bias however, which potentially results in the discovery of so-called spurious relationships, can be ignored in predictive models. That is, the influence of the omitted variables on estimates of β will not bias predictions. However, as also discussed later, omitted variable bias profoundly limits the application of the model outside of prediction.
If the purpose of the model is prediction, we can also largely disregard the assumption of no multicollinearity (i.e., significant correlations between the predictor variables are tolerable, provided the correlation is not perfect) (Hyndman and Athanasopoulos, 2018). Also, because the calibration data are comprised of single, independent observations (with the possible exception of $\rho_b$, which we discuss further in the discussion section), the assumption of no autocorrelation can also be disregarded.
Therefore, for OLS-estimated β to have unbiased properties suitable for prediction of AGB, the following assumptions must be met: (A1) a linear relationship exists between the predictor variables and dependent variable, (A2) the unconditional mean of ε is zero, (A3) measurement error in the predictor variables is consistent between in- and out-of-sample data. Further, $\hat{\sigma}$ has unbiased properties, and $\hat{\beta}$ become efficient, when ε is homoscedastic.

Log-Log Transformation
To achieve the required linearity, given the calibration data exhibit a power law relationship in real-space (Figure 2), log-log transformation is necessary, e.g., for a single predictor variable, D:
$$ \ln(\mathrm{AGB}) = \beta_0 + \beta_1 \ln(D) + \varepsilon $$
A further beneficial trait of this transformation, in the context of calibration data where the variance in AGB is non-constant, is the increased likelihood of homoscedastic behavior from the residuals. Once β are estimated, subsequent AGB prediction requires re-transformation of the model to real-space:
$$ \mathrm{AGB} = \exp\left(\beta_0 + \beta_1 \ln(D) + \varepsilon\right) = e^{\beta_0} D^{\beta_1} e^{\varepsilon} $$
That is, in real-space, ε is no longer additive (independent of the predictor variables; scale invariant), but multiplicative (dependent on the predictor variables; relative to scale). A corollary of this re-transformation is that error is described by the log-normal distribution, which does not share the same expectation of the mean with that of the normal distribution. This mismatch introduces a bias, which is usually countered through application of a correction term (Neyman and Scott, 1960), formed using $\hat{\sigma}$, as:
$$ \widehat{\mathrm{AGB}} = \exp\left(\frac{\hat{\sigma}^{2}}{2}\right) \exp\left(\hat{\beta}_0 + \hat{\beta}_1 \ln(D)\right) $$
An implication of employing this correction term is that predictions of AGB are unbiased only when $\hat{\sigma}$ itself is unbiased ($E[\hat{\sigma}] = \sigma$).
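The fit-then-back-transform workflow described above can be sketched in a few lines. The following Python example is illustrative only: it assumes a single predictor, D, uses the closed-form bivariate OLS solution, and assumes the correction term takes the common form exp(s²/2). The function names `fit_loglog` and `predict_agb`, and the toy data, are hypothetical and are not taken from the treeallom package.

```python
import math


def fit_loglog(d, agb):
    """OLS fit of ln(AGB) = b0 + b1 * ln(D); returns (b0, b1, s)."""
    x = [math.log(v) for v in d]
    y = [math.log(v) for v in agb]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # standard error of the regression (p = 2)
    return b0, b1, s


def predict_agb(d, b0, b1, s, correct=True):
    """Back-transform to real-space, optionally applying exp(s^2 / 2)."""
    correction = math.exp(s ** 2 / 2) if correct else 1.0
    return correction * math.exp(b0 + b1 * math.log(d))


# Illustrative, noise-free power law: AGB = 100 * D^2.5 (hypothetical values).
diameters = [0.1, 0.2, 0.4, 0.8]
biomass = [100 * v ** 2.5 for v in diameters]
b0, b1, s = fit_loglog(diameters, biomass)
```

Because the correction multiplies every prediction by the same factor, any bias in s propagates directly into the predicted AGB, which is the implication noted above.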

Considered Model Forms
Here, we explore the fit of various model forms to the pan-tropical calibration data. The chosen selection of bivariate and multivariate models covers a range of complexities, given the predictor variables available in the calibration data. The five considered models comprise two bivariate models, AGB = f(D) and AGB = f(H), and three multivariate models, including AGB = f(D, H, $\rho_b$). Once each model is fitted, we apply several diagnostics and statistical tests to the resulting residuals, e, in an effort to interpret whether the error term, ε, is homoscedastic and normally distributed. Variance of ε is assessed by visually inspecting e plotted against predicted AGB. The Breusch-Pagan and White statistical tests are applied to e to further evaluate variance of ε [null hypothesis: constant variance (homoscedasticity)] (Breusch and Pagan, 1979; White, 1980). The distribution of ε is assessed by comparing the studentised residuals, $e/\hat{\sigma}$, with the expected normal distribution via a Quantile-Quantile plot.
The variance in $\hat{\beta}$ is quantified using confidence intervals. The classical frequentist approach for generating confidence intervals requires an unbiased estimate of σ. However, as previously identified, $\hat{\sigma}$ has unbiased properties only when ε is homoscedastic. As this assumption may not necessarily hold, confidence intervals are instead generated here using a nonparametric bootstrap.
From the calibration data, a random-with-replacement sample is drawn, from which the five OLS models are constructed. Across N draws, confidence intervals about $\hat{\beta}$, at the level α, are estimated for each model using the bias-corrected and accelerated approach (Efron, 1987).
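The resampling loop can be sketched as follows. Note that this Python example deliberately swaps in the simpler percentile interval rather than the bias-corrected and accelerated (BCa) interval used here, and considers only the slope of a bivariate model; the function names and defaults are hypothetical.

```python
import random


def ols_slope(x, y):
    """Closed-form OLS slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def bootstrap_slope_ci(x, y, n_draws=1000, alpha=0.05, seed=0):
    """Percentile (not BCa) bootstrap confidence interval for the slope."""
    if len(set(x)) < 2:
        raise ValueError("need at least two distinct predictor values")
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_draws:
        idx = [rng.randrange(n) for _ in range(n)]  # rows drawn with replacement
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:  # degenerate resample: slope undefined, redraw
            continue
        slopes.append(ols_slope(xs, [y[i] for i in idx]))
    slopes.sort()
    lower = slopes[int(alpha / 2 * n_draws)]
    upper = slopes[int((1 - alpha / 2) * n_draws) - 1]
    return lower, upper
```

Resampling whole rows (rather than residuals) avoids assuming homoscedastic errors, which is the motivation given above for preferring the bootstrap over the classical interval.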

Trueness and Accuracy of Predictions
To assess the closeness of agreement between predicted and observed AGB (accuracy) from these models, given both the random error (often referred to in a modeling context as simply variance) which affects precision, and the systematic error (similarly often referred to as bias) which affects trueness, we use k-fold cross-validation.

Stratified k-fold Cross-Validation
The calibration data are folded (or split) k times, where each fold is a representative subset of the full data. Sequentially iterating through the folds, each of the five considered models is constructed from observations in the unselected folds (training data; k − 1 folds). AGB is predicted by each model for each observation in the selected fold (validation data), and compared with observed AGB.
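One way to obtain folds that each span the size range of the data is sketched below. This Python example assumes stratification by a single variable (e.g., D), achieved by sorting and dealing observations into folds block by block; the actual stratification rule is not specified above, so this is an assumption, and the function names are hypothetical.

```python
import random


def stratified_folds(values, k, seed=0):
    """Assign observation indices to k folds so each fold spans `values`."""
    rng = random.Random(seed)
    order = sorted(range(len(values)), key=lambda i: values[i])
    folds = [[] for _ in range(k)]
    for start in range(0, len(order), k):
        block = order[start:start + k]   # k neighbouring observations by size
        rng.shuffle(block)               # random assignment within the block
        for fold, index in zip(folds, block):
            fold.append(index)
    return folds


def train_validation_splits(folds):
    """Yield (training indices, validation indices) for each fold in turn."""
    for j, validation in enumerate(folds):
        training = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield training, validation
```

Each model would then be fitted on the training indices and evaluated on the validation indices, so that every observation is predicted exactly once by a model that did not see it.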
Prediction error is assessed here using the log of the accuracy ratio (Tofallis, 2015). We deliberately avoid the more widely-used mean absolute percentage error (MAPE) because of its undesirable properties including asymmetric penalty, asymmetric bounds and outlier penalty. Instead, the log of the accuracy ratio exhibits symmetric properties, and is particularly well-suited to an assortment of predictions that could reasonably be expected to span five orders of magnitude. The accuracy ratio of a prediction, Q, is defined as predicted AGB divided by observed AGB:
$$ Q = \frac{\mathrm{AGB}_{\mathrm{pred}}}{\mathrm{AGB}_{\mathrm{obs}}} $$
Where the log of the accuracy ratio is defined as ln(Q).
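To quantify the uncertainty and bias of predictions from each fold, we use two metrics proposed by Morley et al. (2018). First, uncertainty is assessed using the median symmetric accuracy (MSA):
$$ \mathrm{MSA} = 100\left(\exp\left(\operatorname{median}\left(\left|\ln(Q)\right|\right)\right) - 1\right) $$
Which can be readily interpreted as a percentage error. Second, bias is assessed using the symmetric signed percentage bias (SSPB):
$$ \mathrm{SSPB} = 100\,\operatorname{sgn}\left(\operatorname{median}\left(\ln(Q)\right)\right)\left(\exp\left(\left|\operatorname{median}\left(\ln(Q)\right)\right|\right) - 1\right) $$
Which produces a similarly interpretable percentage, whereby a positive or negative sign denotes an over- or under-estimation of the prediction respectively.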
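The two metrics are straightforward to compute from paired predictions and observations. The following Python sketch assumes Q is the ratio of predicted to observed AGB, so that a positive SSPB denotes over-estimation; the function names are hypothetical.

```python
import math
import statistics


def log_accuracy_ratios(predicted, observed):
    """ln(Q) for each prediction, with Q = predicted / observed."""
    return [math.log(p / o) for p, o in zip(predicted, observed)]


def msa(predicted, observed):
    """Median symmetric accuracy, as a percentage."""
    abs_log_q = [abs(v) for v in log_accuracy_ratios(predicted, observed)]
    return 100 * (math.exp(statistics.median(abs_log_q)) - 1)


def sspb(predicted, observed):
    """Symmetric signed percentage bias, as a percentage."""
    m = statistics.median(log_accuracy_ratios(predicted, observed))
    return 100 * math.copysign(1.0, m) * (math.exp(abs(m)) - 1)
```

The symmetry is visible in a simple case: predictions that are uniformly double or uniformly half the observed values both give an MSA of 100 %, with SSPB of +100 % and −100 % respectively.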

Simulating Inconsistent Measurement Error
Inconsistent measurement error in predictor variables between in- and out-of-sample data can be simulated by adding further noise to the in-sample calibration data themselves, e.g.:
$$ H^{*} = H + \eta, \qquad \eta \sim \mathcal{N}(0, \sigma_\eta^{2}) $$
From which N draws of η are made, and subsequent models constructed. The mean values of $\hat{\beta}$ across these N models are those necessary to provide unbiased predictions of AGB when the out-of-sample data are measured with η more measurement error than that present in the calibration data.
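The simulation procedure can be sketched for a single predictor as follows. This Python example is a simplified illustration: it uses a bivariate model in untransformed space and synthetic, noise-free data, whereas the models considered here are multivariate and fitted in log-log space. The function names, defaults and toy values are hypothetical.

```python
import random
import statistics


def ols(x, y):
    """Closed-form OLS for a single predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def mean_params_with_added_noise(x, y, sd_eta, n_draws=200, seed=0):
    """Refit after adding N(0, sd_eta^2) noise to x; average over draws."""
    rng = random.Random(seed)
    fits = []
    for _ in range(n_draws):
        x_noisy = [xi + rng.gauss(0.0, sd_eta) for xi in x]
        fits.append(ols(x_noisy, y))
    b0 = statistics.fmean(f[0] for f in fits)
    b1 = statistics.fmean(f[1] for f in fits)
    return b0, b1


# Synthetic example: y = 1 + 2x exactly (illustrative values only).
x = [i / 10 for i in range(200)]
y = [1.0 + 2.0 * xi for xi in x]
base = mean_params_with_added_noise(x, y, sd_eta=0.0)
noisy = mean_params_with_added_noise(x, y, sd_eta=5.0)
```

In this toy case the averaged slope falls and the intercept rises as noise is added, which is the regression dilution pattern the simulation is designed to expose.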

Tree Size and Allometry
Finally, the independence of β from tree size is considered. That is, all else being equal, if pan-tropical allometry is independent of tree size, $\hat{\beta}$ should remain statistically indistinguishable between models constructed from subsets of data belonging exclusively to either small or large trees. To explore this, a series of subsets are generated from the data that contain sequentially fewer small trees, retaining only those trees with D ≥ 0.1, 0.25, 0.5, 0.75 and 1 m. The variance in these model parameters is then estimated using the aforementioned bootstrapped BCa confidence intervals.

Methods Availability
The source code for these methods, implemented in R, is available in the treeallom package, which is released under the MIT license, and hosted at https://github.com/apburt/treeallom.

Review of the Calibration Data
Across the considered studies, a measuring tape was the most commonly used measurement device, although calipers were occasionally used instead (Table 1). The point of measurement was often not reported, but D was referred to as "diameter-at-breast-height" or "girth-at-breast-height," which is usually assumed as 1.3 m, although historically this has sometimes been considered 4.5 ft (∼1.37 m). Finally, for the treatment of buttresses, two separate approaches were reported: (i) measurement directly above the buttress, and (ii) measurement 0.2 m above.

Measurement of H
Most often, H was measured post-felling using a tape measure (Table 2), although a number of studies measured H pre-harvest (i.e., with the tree in situ). On the resolution to which H was reported, the majority reported to the nearest 0.1 m, although this was occasionally to the nearest 1 m.

Measurement of AGB
For the measurement of wet mass, some studies weighed each tree in its entirety using scales (Martinez-Yrizar et al., 1992; Nelson et al., 1999; Mackensen et al., 2000; Cairns et al., 2003; Burger and Delitti, 2008; Kenzo et al., 2009; Djomo et al., 2010; Niiyama et al., 2010; Ryan et al., 2011; Vieilledent et al., 2012; Colgan et al., 2013; Mugasha et al., 2013; Ngomanda et al., 2014). Other studies mixed direct measurements with indirect measurements from volume estimates derived from diameter and length measurements. Some studies weighed the crown of each tree, but stem wet mass was derived from volume estimates for some or all trees (Edwards and Grubb, 1977; Saldarriaga et al., 1988; Araújo et al., 1999; Ketterings et al., 2001; Brandeis et al., 2006; Nogueira et al., 2008; Alvarez et al., 2012; Goodman et al., 2014). The remaining studies used volume estimates for stem and large branching for some or all trees (Yamakura et al., 1986; Brown et al., 1995; Fromard et al., 1998; Ebuy et al., 2011; Henry et al., 2010). There was variation in the treatment of stumps, with some considering everything flush with the ground (Brandeis et al., 2006), whilst others ignored stump material (Ebuy et al., 2011). Few reported on losses from chainsaw cuts: sometimes woody swarf was weighed (Nogueira et al., 2008), and at other times ignored (Mugasha et al., 2013). No study reported the duration between felling and measurement, or any subsequent water losses. No studies reported applying correction factors to account for either source of loss.
To estimate dry mass (AGB) from wet mass, most often subsamples were gathered from each tree, and their dry-to-wet ratio measured via oven-drying. This was usually undertaken by partitioning the wet mass into pools (e.g., stem, large branches, fine branches, twigs, leaves, and fruit), and taking subsamples from each pool. The type of subsample, the number of pools, and the number of subsamples acquired per pool varied between studies, as did the application of the dry-to-wet ratio (i.e., some derived the mean dry-to-wet ratio across the subsamples that was subsequently applied to total wet mass, whilst others applied the dry-to-wet ratio on a per-pool basis). The temperature at which the subsamples were dried, and their final dry mass reported, varied from 55 °C (Cairns et al., 2003) to 105 °C (Ketterings et al., 2001). Some exceptions to this general approach were the selection of subsamples by height rather than pool (Vieilledent et al., 2012), taking subsamples from only a subsample of the harvested trees (Saldarriaga et al., 1988), and sourcing dry-to-wet ratios from literature (Araújo et al., 1999).
or sometimes $\rho_b$ was not a variable under consideration, but subsequently added to these data during compilation using global databases (Edwards and Grubb, 1977; Yamakura et al., 1986; Fromard et al., 1998; Mackensen et al., 2000; Cairns et al., 2003; Burger and Delitti, 2008; Kenzo et al., 2009; Niiyama et al., 2010; Ryan et al., 2011). For those studies that did measure $\rho_b$, the most common approach was to determine the wet volume from the subsamples (Saldarriaga et al., 1988; Brown et al., 1995; Brandeis et al., 2006; Henry et al., 2010; Vieilledent et al., 2012; Alvarez et al., 2012; Nogueira et al., 2008), although there were variations on this: e.g., only a single subsample from the stem was considered (Nelson et al., 1999), or only subsamples from the stem (Goodman et al., 2014). Other approaches involved taking cores from each tree (Djomo et al., 2010) and combining measurements with literature values (Ketterings et al., 2001). For the measurement of wet volume, the subsamples were usually measured via Archimedes' principle (Goodman et al., 2014), but sometimes via graduated cylinders (Colgan et al., 2013), estimates from geometry (Henry et al., 2010), or a combination (Brown et al., 1995). Similar to the application of the dry-to-wet ratio, $\rho_b$ was sometimes derived from the mean across subsamples, and at other times weighted by pool.
In summary then, measurement protocols were inconsistent between studies for each of the 4 measured variables. This is of course largely unavoidable, given the nature of these data, compiled from multiple independent studies and operators across both a large spatial extent and time-span.

Bivariate Models
The relative strength of the correlation between the predictor variables D and H with AGB is demonstrated by the two bivariate models, with the standard error of the regression from the AGB = f(D) model considerably smaller than that from the AGB = f(H) model.
In the case of the AGB = f(D) model, residual variance decreases with increasing predicted AGB, and a combination of light and heavy tails is observed in the distribution of studentised residuals. Multiple outliers are seen, which likely exert undesirable leverage on $\hat{\beta}$, suggesting robust regression techniques might be more appropriate. The AGB = f(H) model has clear deficiencies: AGB will be underestimated for both short and tall trees.

Multivariate Models
Including additional predictor variables leads to a significant reduction in the standard error of the regression relative to the bivariate models (Figure 4). However, similar to the bivariate AGB = f(D) model, residuals from the three multivariate models are heteroscedastic (Figure 4 and Table 4) and non-normally distributed (Figure 4). Across the multivariate models, residual variance consistently decreases as predicted AGB increases. The distributions of studentised residuals exhibit various combinations of heavy/light tails and bowing.

Cross-Validation
The calibration data were folded 10 times, resulting in ∼400 observations per fold. This might be similar to the stem count encountered in a 1 ha tropical forest stand, so the uncertainty and bias metrics reported here might provide something of an expectation for those at the out-of-sample stand-scale.
Prediction accuracy increased with increasing predictor variable count (Figure 5 and Table 5). Predictions from all 5 models were persistently biased upward (Figure 5 and Table 5). A small reduction in bias was observed when the multivariate models were compared with the AGB = f (D) model. Overall, the minimum observed mean bias was 6 %.
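The two fold-scale metrics can be sketched as follows. This is a minimal illustration, assuming the usual definitions of the median symmetric accuracy (MSA) and the signed symmetric percent bias (SSPB) in terms of the log accuracy ratio ln(ÂGB/AGB). The function names and the example values are hypothetical and are not taken from the calibration data.

```python
import numpy as np

def msa(pred, obs):
    """Median symmetric accuracy (%), from the log accuracy ratio ln(pred/obs)."""
    q = np.log(np.asarray(pred, float) / np.asarray(obs, float))
    return 100.0 * (np.exp(np.median(np.abs(q))) - 1.0)

def sspb(pred, obs):
    """Signed symmetric percent bias (%); positive means over-prediction."""
    m = np.median(np.log(np.asarray(pred, float) / np.asarray(obs, float)))
    return 100.0 * np.sign(m) * (np.exp(np.abs(m)) - 1.0)

# Hypothetical fold: every prediction is 6 % larger than the observation.
obs = np.array([50.0, 120.0, 900.0, 4000.0])
pred = 1.06 * obs
print(round(msa(pred, obs), 1), round(sspb(pred, obs), 1))
```

In a k-fold setting, both functions would be applied to the held-out observations of each fold, and the mean, minimum and maximum of the resulting values summarised per model.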

Inconsistent Measurement Between In- and Out-of-Sample Data
Inconsistent measurement error in predictor variables between in- and out-of-sample data was simulated by adding further error, drawn from normal distributions, to the calibration data. Simulated error added to H had standard deviations, σ η , of 0.25, 0.5, 1, and 2 m. Simulated error added to ρ b had σ η of 2, 50, 75, and 100 kg m −3 .
Large fluctuations are observed in the parameters of the models constructed from these various combinations of added noise (Table 6), which regularly fall outside the 95 % confidence intervals of the base model presented in Figure 4. Additional measurement error in a predictor variable manifests as a downward force on its corresponding parameter (regression dilution), and a variable upward force exerted on the remaining parameters, with a particularly pronounced effect on the intercept, β 0 . That is, as measurement error inconsistency increases, that particular predictor variable has less influence on predicted AGB.
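The mechanism can be sketched on synthetic data. The example below is an illustration only: the data, the single-predictor model ln(AGB) = b0 + b1 ln(H), the number of draws and the clipping of heights at 1 m are all assumptions made so the example is self-contained, and none of them reproduce the simulations reported in Table 6.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Synthetic calibration data (hypothetical values, not the pan-tropical dataset).
H = rng.uniform(5.0, 50.0, n)                      # tree height, m
ln_agb = 1.0 + 2.5 * np.log(H) + rng.normal(0.0, 0.3, n)

def fit(h, y):
    """OLS fit of y = b0 + b1 * ln(h); returns (b0, b1)."""
    X = np.column_stack([np.ones_like(h), np.log(h)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

base = fit(H, ln_agb)
for sd in (0.25, 0.5, 1.0, 2.0):
    # Add extra measurement error to the in-sample predictor and refit.
    # Heights are clipped at 1 m so the logarithm stays defined.
    draws = [fit(np.clip(H + rng.normal(0.0, sd, n), 1.0, None), ln_agb)
             for _ in range(200)]
    b0, b1 = np.mean(draws, axis=0)
    print(f"sd={sd:4.2f} m  b0={b0:.3f} (base {base[0]:.3f})  b1={b1:.3f} (base {base[1]:.3f})")
```

As the added error grows, the mean slope should fall below the base estimate while the intercept rises, which is the pattern of regression dilution described above.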

The Effect of Tree Size on Model Parameters
The AGB = f (D, H, ρ b ) model was constructed from the various considered subsets of the calibration data (these subsets contained sequentially fewer trees, retaining only those with D ≥ 0.1, 0.25, 0.5, 0.75 and 1 m). There is a tendency for the parameters associated with the predictor variables to increase as fewer smaller trees are considered, whilst the intercept parameter decreases (Figure 6). Whilst the changes in these parameters are substantial, the confidence intervals almost always overlap. The confidence intervals themselves rapidly inflate because of the relatively few observations in the larger size-classes.
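The inflation of confidence intervals as smaller trees are removed can be illustrated with a short sketch. It uses synthetic, imbalanced data, a single-predictor model, and a plain percentile bootstrap rather than the BCa intervals reported in Figure 6; the thresholds, sample size and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, imbalanced calibration data (hypothetical): most stems are small.
D = np.exp(rng.normal(np.log(0.15), 0.8, 4000)).clip(0.05, 2.0)   # diameter, m
ln_agb = 8.0 + 2.5 * np.log(D) + rng.normal(0.0, 0.35, D.size)

def slope(d, y):
    """OLS slope of y on ln(d)."""
    X = np.column_stack([np.ones_like(d), np.log(d)])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

def percentile_ci(d, y, n_boot=500):
    """Percentile bootstrap 95 % interval for the slope (simpler than BCa)."""
    idx = rng.integers(0, d.size, (n_boot, d.size))
    boots = [slope(d[i], y[i]) for i in idx]
    return np.percentile(boots, [2.5, 97.5])

widths = {}
for threshold in (0.0, 0.25, 0.5):
    keep = D >= threshold
    lo, hi = percentile_ci(D[keep], ln_agb[keep])
    widths[threshold] = hi - lo
    print(f"D >= {threshold:4.2f} m  n={keep.sum():4d}  slope CI=({lo:.2f}, {hi:.2f})")
```

Because the synthetic relationship has no size dependency, any drift in the fitted slope here is sampling noise; the point of the sketch is only that the interval widens as the sample shrinks and the range of D narrows.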

DISCUSSION
The residuals of each bivariate and multivariate model were heteroscedastic and non-normally distributed. The cross-validation results found the minimum relative uncertainty in fold-scale AGB predictions achieved by these various models was ∼24 %, and that predictions were also persistently upward biased by a minimum of 6 % (∼400 observations per fold). Our analysis suggests that these results are likely symptoms of model misspecification. That is, the models do not account for everything they should.

Inconsistent Measurement Error
It was noted in the methods section that error in the measurement of predictor variables will not necessarily affect the trueness of AGB predictions. For example, if in-sample H is measured with error (H ′ = H + η), then the subsequently constructed model characterizes the relationship AGB = f (H ′ ). That is, the imprecise expectation of H is baked into the OLS estimate of the population parameters. Provided the out-of-sample measurement of H shares this expectation, predicting AGB using these parameters is unproblematic (Jonsson, 1994). However, if the out-of-sample measurement has a different expectation of error, then a systematic error will be introduced.
The key point then, is not the presence of measurement error itself, but the difference in its distribution between in- and out-of-sample measurements. As discussed below, we think that assuming these distributions are approximately consistent for the predictor variables H and ρ b is unjustifiable. Crucially, if it is assumed this difference is negligible (which is the current position of all widely-used pan-tropical allometric models), when it is not, a bias of unknown direction and magnitude will be present in AGB predictions.
[Table caption fragment] The null hypothesis of residuals having constant variance (homoscedasticity) is rejected in both models, and the alternative hypothesis of heteroscedasticity is accepted.

Differences Between In-and Out-of-Sample Measurement Error
In the metadata review it was noted that for the majority of in-sample data, measurement of H was made via tape measure post-felling. It would seem plausible to assume this method provides true and precise measurements, e.g., it would not be unreasonable to speculate η could take a form similar to η ∼ N(0.0 m, 0.5 m). However, out-of-sample measurements of H are made with the tree in situ using clinometers and range finders via either the tangent or sine method. Two prominent studies have explored the accuracy of these measurements in tropical forests. Larjavaara and Muller-Landau (2013) found η to take the average forms η ∼ N(−0.8 m, 6.8 m) and η ∼ N(−4.5 m, 2.3 m) for the tangent and sine methods respectively. Hunter et al. (2013) found η to take the average form η ∼ N(−1.1 m, 4.7 m) for the tangent method.
This would suggest measurement error distributions between in- and out-of-sample measurement of H are significantly different. More problematic, out-of-sample H is often not measured, but replaced with predictions from models (Feldpausch et al., 2011;Sullivan et al., 2018). It is likely that these models share issues similar to those encountered here (e.g., a heteroscedastic error term means E[σ̂] ≠ σ), such that the out-of-sample error structure becomes misleading.
Unlike the measurement of H, few studies have explored measurement error in ρ b , but it would seem reasonable to suggest that obtaining a robust description of the in-sample measurement error distribution is impossible. The metadata review showed the in-sample methods include a variety of direct measurements on subsamples and cores, and acquiring values from global databases. Therefore, the mean in-sample definition of the measurement of ρ b itself is an unknown. That is, the aggregate in-sample measurement of ρ b , which is the expectation of the out-of-sample measurement, is some unmeasurable and unknown composite of these various methods. If the definition of the in-sample measurement is unknown, then the difference in measurement error between in- and out-of-sample measurements is unknown.
With respect to the measurement of D, it was assumed in the simulations that measurement errors were consistent. This is possibly justified as widely-used field guides for tropical forest inventorying are consistent and unambiguous in the definition of the measurement (Marthews et al., 2014). We do acknowledge however, there are reasons why in- and out-of-sample measurement error distributions might be inconsistent. For example, the metadata review identified the use of different measurement devices (e.g., diameter tape and calipers), point of measurement and buttress treatment. Similar to the in-sample measurement of ρ b , this would lead to the mean in-sample definition of the measurement of D being some fusion of these approaches, which cannot be mirrored by a single out-of-sample measurement. There are also possibly human factors at play: the skill and diligence of operators may vary between separate data acquisitions.

Implications of Inconsistent Measurement Error
The question remains then: what are the likely consequences to predictions of tree- and stand-scale AGB from inconsistent error in the measurement of in- and out-of-sample predictor variables? Given the above discussion, we think our worst-case simulations presented in Table 6 provide a particularly conservative insight.
We assumed out-of-sample measurements of H and ρ b were only more imprecise than in-sample measurements (i.e., measurement trueness remained consistent). We assumed these differences were characterized by normally distributed error with 2 m and 100 kg m −3 standard deviation respectively. Under these assumptions, our simulations of the AGB = f (D, H, ρ b ) model found the parameter β 0 to change from 0.821 to 4.009, β 1 from 2.019 to 2.206, β 2 from 0.888 to 0.566, and β 3 from 0.821 to 0.508. Both absolutely and relatively, these changes in population parameters have implications to predictions of AGB.
Absolutely, these differences can be demonstrated by predicting AGB for two hypothetical trees: first a tree with D = 0.1 m, H = 20 m and ρ b = 600 kg m −3 has predicted AGB of 63.2 kg in the original model, and 52.7 kg in the simulated model, a −16.6 % change. Second, a larger tree with D = 1.5 m, H = 40 m and ρ b = 500 kg m −3 has predicted AGB of 23,857.6 kg and 27,976.5 kg respectively, a 17.3 % change. There might therefore be some degree of cancellation when up-scaling to the stand, but this would be both a function of structural composition, and dangerous to assume.
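The direction of these changes can be checked with a short sketch. It assumes the log-linear form ln(AGB) = β0 + β1 ln(D) + β2 ln(H) + β3 ln(ρb) with D and H in m and ρb in kg m−3, and it deliberately omits any correction factor for back-transformation, since the value used is not stated in this section. The printed predictions are therefore somewhat lower than the figures quoted above, and the relative changes differ by a few percentage points, although their signs agree. The function name is hypothetical.

```python
import math

# Parameters reported in the text for ln(AGB) = b0 + b1 ln(D) + b2 ln(H) + b3 ln(rho).
ORIGINAL = (0.821, 2.019, 0.888, 0.821)
SIMULATED = (4.009, 2.206, 0.566, 0.508)   # sd of added error: 2 m and 100 kg m^-3

def agb_uncorrected(params, d, h, rho):
    """Back-transformed prediction (kg) WITHOUT any log-bias correction factor."""
    b0, b1, b2, b3 = params
    return math.exp(b0 + b1 * math.log(d) + b2 * math.log(h) + b3 * math.log(rho))

for d, h, rho in [(0.1, 20.0, 600.0), (1.5, 40.0, 500.0)]:
    a = agb_uncorrected(ORIGINAL, d, h, rho)
    b = agb_uncorrected(SIMULATED, d, h, rho)
    print(f"D={d} m: original {a:,.0f} kg, simulated {b:,.0f} kg, change {100 * (b - a) / a:+.1f} %")
```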
Relatively, there are two scenarios where not accounting for inconsistent measurement error would lead to potentially spurious predictions of AGB change: inter-plot comparison and change detection. To illustrate this, we downloaded some field data from https://forestplots.net/ for 2 plots included in the Global Ecosystem Monitoring network (GEM, http://gem.tropicalforests.ox.ac.uk). These two 1 ha plots (designation: MNG-03 and MNG-04) are in close proximity to one another in l'Arboretum Raponda Walker, Estuaire, Gabon (location: 0.576°, 9.323° and 0.576°, 9.328°). Both plots are moist, lowland, Terra Firme, secondary forests; MNG-03 has a monodominant composition whilst MNG-04 is mixed. MNG-03 has a stem count, basal area, Lorey's height and basal-area-weighted basic density of 436, 47.6 m² ha⁻¹, 39.1 m and 489 kg m⁻³ respectively; MNG-04 has 437, 34.8 m² ha⁻¹, 30.8 m and 605 kg m⁻³ respectively. First, with respect to inter-plot comparison, the original model predicts stand-scale AGB of 579,591 kg and 421,141 kg for MNG-03 and MNG-04 respectively, whilst the simulated model predicts 588,370 kg and 407,950 kg respectively. That is, the original model predicts a 31.7 % difference in AGB between plots, whereas the simulated model predicts a 36.2 % difference.
[Table caption fragment] Format is consistent with Table 3. These results corroborate those from Figure 4 in indicating the residuals of each model are heteroscedastic.
Second, to explore the implications to change detection, we hypothetically assume some changes in the composition of MNG-04 since these data were collected. We assume a uniform increase in D, H, and ρ b of 0.01 m, 2.5 m and 25 kg m −3 respectively per tree. The original and simulated models now predict stand-scale AGB as 489,633 kg and 458,380 kg respectively. That is, the original model predicts a 16.3 % increase in AGB, whilst the simulated model predicts a 12.4 % increase.
FIGURE 5 | Stratified 10-fold cross-validation results for three of the pan-tropical models. For each considered model, per validation fold, the distribution of the log of the accuracy ratio is shown (ln(ÂGB/AGB)). Each fold contains ∼400 observations. Distributions are represented via standard format box-and-whisker. It is observed that the variance of these distributions tends to reduce as additional predictor variables are added. The median value of these distributions is consistently greater than zero, signifying predictions of AGB are generally larger than observed AGB.
[Table caption fragment] Uncertainty and bias of AGB predictions from these models is quantified using the median symmetric accuracy (MSA) and the signed symmetric percent bias (SSPB) respectively. These metrics were generated per fold (∼400 observations per fold), and here the mean, minimum and maximum of the 10 values are reported for each model. It is noted minimum observed mean uncertainty is ∼24 %, and that accuracy generally improves as predictor variables are added to the model. The sign of SSPB is persistently positive, again signifying predictions of AGB are generally larger than observed AGB.
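The percentages quoted for the two MNG plots can be reproduced from the stand-scale totals. The sketch below assumes the inter-plot difference is expressed relative to the mean of the two plots (a symmetric percent difference), and that change is expressed relative to the earlier value; these definitions are assumptions, chosen because they reproduce the quoted figures.

```python
def symmetric_pct_difference(a, b):
    """Absolute difference relative to the mean of the two values (%)."""
    return 100.0 * abs(a - b) / ((a + b) / 2.0)

def pct_change(before, after):
    """Change relative to the earlier value (%)."""
    return 100.0 * (after - before) / before

# Stand-scale AGB (kg) quoted in the text.
original = {"MNG-03": 579_591, "MNG-04": 421_141, "MNG-04 later": 489_633}
simulated = {"MNG-03": 588_370, "MNG-04": 407_950, "MNG-04 later": 458_380}

for name, agb in (("original", original), ("simulated", simulated)):
    diff = symmetric_pct_difference(agb["MNG-03"], agb["MNG-04"])
    change = pct_change(agb["MNG-04"], agb["MNG-04 later"])
    print(f"{name}: inter-plot difference {diff:.1f} %, change {change:.1f} %")
```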

Including Tree Height and Wood Density in Pan-Tropical Allometry
Considering these implications, and given our assertion that these simulations of inconsistent measurement error were conservative, we think careful thought is required on how best to include H and ρ b as predictor variables in pan-tropical allometric models. Across the literature there is a consensus that their inclusion is worthwhile: multivariate models including these variables generally exhibit a smaller standard error of the regression than bivariate D-only counterparts; H and ρ b are therefore correlated with AGB, whilst not perfectly correlated with D. Given that in tropical forests, H and ρ b are often observed to vary for a fixed value of D, it is therefore the expectation that their inclusion as predictors will improve tree- and stand-scale prediction accuracy. Furthermore, it has been demonstrated that at the landscape- and regional-scales, ρ b varies systematically as a response to multiple environmental factors (Baker et al., 2004;Phillips et al., 2019). If ρ b were excluded from pan-tropical allometry, these systematic variations would go undetected in up-scaled predictions of AGB (Mitchard et al., 2014). These benefits of including H and ρ b were reflected in the cross-validation results, whereby the AGB = f (D, H) model yielded 10.0 % less uncertain predictions than the bivariate AGB = f (D) model, and the AGB = f (D, H, ρ b ) improved on this by a further 6.4 %. However, these results do not account for systematic error introduced by inconsistent errors-in-variables (e.g., in the majority of these calibration data, H was measured post-felling with a tape measure). Therefore, the decision to include these variables as predictors is balanced between reducing random error by a known amount, and introducing an unknown amount of systematic error whilst inconsistent errors-in-variables remain unaccounted for.
[Table caption fragment] Model: ln(AGB) = β 0 + β 1 ln(D) + β 2 ln(H ′ ) + β 3 ln(ρ ′ b ) + ε. Column header: σ η (m) of 0.00, 0.25, 0.50, 1.00 and 2.00, where H ′ = H + ε, ε ∼ N(0, σ η ²). Here, the multivariate model, AGB = f(D, H, ρ b ), is considered, whereby additional random error has been added to the in-sample calibration data. The additional noise is included in predictor variables H and ρ b , which is drawn from normal distributions with increasing variance. Per distribution, 10,000 draws were made, and the mean estimate of the parameters is reported in the table. These parameters represent those required for unbiased prediction of AGB when out-of-sample data are measured with σ η more measurement error than the in-sample data.
FIGURE 6 | Are pan-tropical model parameters independent of tree size? The parameters of the multivariate model, AGB = f(D, H, ρ b ), when observations below several D-thresholds are sequentially removed. Bootstrapped BCa 95 % confidence intervals (N = 10,000) are shown for each parameter. It is seen that as smaller trees are removed, the parameters associated with the predictor variables tend to increase, whilst the intercept tends to decrease. However, it is also seen that confidence intervals generally overlap one another.
Given the above discussion makes the case that it would be unjustifiable to assume in- and out-of-sample measurement error distributions in H and ρ b are consistent, this would imply unknown bias is always present in AGB predictions from these conventional multivariate models. We therefore think formal steps are necessary to account for, and minimize, this bias. This action can take two forms: first, the in- and out-of-sample measurement methods become consistent, such that it is assumed the respective measurement error distributions are consistent, or second, inconsistencies are corrected for during modeling.
In the particular case of measuring H, in the above referenced studies of in situ error distributions, it was noted both the tangent and sine method were relatively inaccurate. Significantly, it was also seen that between the two independent studies, the resulting error distributions for the tangent method were different. This might suggest these distributions are not consistent across forest type and/or operator. It would follow then, that the in-sample measurement of H should be made from the more true and precise measurements obtained from a tape measure post-felling. This implies that the in- and out-of-sample measurement methods will be different, and some form of modeling correction is required.
To minimize systematic error introduced by the inclusion of H in pan-tropical models then, we think the following three steps are necessary:
(1) The in-sample data are measured post-felling via tape measure, where measurement error is quantified through repeated measurements, ideally by multiple operators. If calibration data are compiled from multiple individual studies, then those data where H has been measured using other methods must be excluded (e.g., in situ pre-harvest).
(2) Out-of-sample H is measured in situ using the tangent or sine method, whereby measurement error is concurrently quantified, or estimated via known distributions.
(3) The OLS estimators account for the inconsistencies between these two error distributions using either errors-in-variables modeling (Jonsson, 1994), or simulation approaches similar to those used here.
It would seem that the appropriate approach for including ρ b in pan-tropical models whilst minimizing systematic error is a more open question. Firstly, the definition of the measurement of ρ b requires standardization. Because these measurements are currently not standardized, both across and between in- and out-of-sample data, robust quantitative descriptions of measurement errors are unavailable, meaning reliably correcting for inconsistencies and resulting bias is impossible. One approach might be to replace all measurements with values from global databases (Chave et al., 2009), but this requires careful consideration: (i) the measurement methods used to collect the underlying data are themselves likely inconsistent and (ii) errors become autocorrelated.

Is Pan-Tropical Allometry Independent of Tree Size?
An interesting question when considering systematic error in allometric-derived AGB predictions is whether model parameters are independent of tree size. That is, are the population parameters necessary for predicting the AGB of a small tree, the same as those necessary for a large tree? This question has previously been posed by others including Picard et al. (2015b), who found, using calibration data from central Africa, that bivariate power law models did not hold across all size-classes, and that some size dependency existed.
Within these specific calibration data considered here, Ploton et al. (2016) noted a break point, whereby models constructed from calibration data below and above ∼20,000 kg did not share the same population parameters. This was similarly observed in Figure 6: when trees belonging to specific D-classes were sequentially removed, the parameters of the AGB = f (D, H, ρ b ) model changed substantially. But are these changes significant, and if so, is this a detection of size dependency?
As to the first question, it would appear these changes were not statistically significant because the bootstrapped 95 % confidence intervals for each parameter generally overlapped. Whilst the confidence intervals are compact for parameters describing the complete dataset (n = 4004), they quickly expand as the smaller trees are removed. This is inevitable given the non-uniform distribution of these data, where n = 215 and 90 for observations with D ≥ 0.75 m and 1.0 m respectively. So the observed changes in the population parameters were not significant within these particular data, but this does not rule out the existence of a size dependency in the population.

Potentially Biased Measurement of Calibration Data
Even if confidence intervals were not to overlap, attributing changes (or indeed the lack of change) to a size dependency is challenging when the models are potentially misspecified. A further potential misspecification, aside from inconsistent measurement error, is that the unconditional mean error in observation of AGB is possibly non-zero (E(ε) ≠ 0).
The metadata review identified that for the measurement of wet mass (i.e., via weighing), the destructive methods introduce several sources of loss. For example, most studies did not account or correct for losses from chainsaw cuts, or water losses accrued between felling and measurement. Several studies also excluded stump material from measurement.
As shown in the methods section, if a bias, c, were consistent across observations, then only the intercept parameter is biased, E(β̂ 0 ) = β 0 + c. However, if bias in the observation of AGB is correlated with tree size, the effects are more complex, and contaminate all parameters. It would again not seem unreasonable to speculate that if bias is present, this second form is the more likely.
For example, we recently harvested 4 tropical trees in Brazil; we measured the wet mass of the stem by cutting it into manageable sections that were possible to weigh. We also estimated the losses from these cuts by estimating cut volume. The wet masses of the four stems were 3,229, 3,636, 5,097, and 16,780 kg. The cumulative volume-derived wet mass of losses from chainsaw cuts were 28, 41, 58, and 330 kg, respectively. These losses represent ∼0.9, 1.1, 1.1, and 2.0 %, respectively. For these particular trees and measurement methods then, these losses are correlated with tree size.
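The arithmetic behind these percentages is straightforward to reproduce from the reported masses. The sketch below also computes a Pearson correlation between stem mass and relative loss; with only four trees this is purely descriptive and is not an analysis reported in the paper.

```python
stem_wet_mass = [3229, 3636, 5097, 16780]   # kg, as reported in the text
cut_loss = [28, 41, 58, 330]                # kg, as reported in the text

loss_pct = [100.0 * loss / mass for loss, mass in zip(cut_loss, stem_wet_mass)]
print([round(p, 1) for p in loss_pct])

# Pearson correlation between stem mass and relative loss, written out in
# plain Python so the example has no dependencies.
n = len(stem_wet_mass)
mx, my = sum(stem_wet_mass) / n, sum(loss_pct) / n
cov = sum((x - mx) * (y - my) for x, y in zip(stem_wet_mass, loss_pct))
sx = sum((x - mx) ** 2 for x in stem_wet_mass) ** 0.5
sy = sum((y - my) ** 2 for y in loss_pct) ** 0.5
r = cov / (sx * sy)
print(round(r, 2))
```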
Returning then to the original question, we are not trying here to suggest that observations of AGB are necessarily biased; rather that the possibility exists that AGB are biased, and it is also possible bias is correlated with tree size. In order to attribute statistically significant changes in population parameters to a dependency on tree size, it would need to be demonstrated that bias in observations of AGB is negligible.
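The distinction between a constant bias and a size-correlated bias in the response can be illustrated numerically. The sketch below uses synthetic data and a single-predictor log-log model; the bias magnitudes are hypothetical and are expressed on the scale of the response, ln(AGB).

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(np.log(0.05), np.log(2.0), 500)          # ln(D), synthetic
y = 8.0 + 2.5 * x + rng.normal(0.0, 0.3, x.size)         # ln(AGB), synthetic

def ols(x, y):
    """Return (intercept, slope) of the least-squares line."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_true = ols(x, y)
b_const = ols(x, y - 0.02)                 # constant bias c = -0.02 in every observation
b_size = ols(x, y - 0.01 * (x - x.min()))  # bias that grows with tree size

print("unbiased      :", b_true.round(3))
print("constant bias :", b_const.round(3))   # only the intercept moves
print("size-dependent:", b_size.round(3))    # intercept and slope both move
```

With a constant offset the fitted slope is unchanged and the intercept shifts by exactly that offset, whereas an offset that varies with size is absorbed into both parameters.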
It is also noted that the wet mass for a large section of the calibration data was not measured, but instead estimated from volume measurements (indeed the measurement method itself would appear correlated with tree size: volume-derived estimates were often used when it was logistically impracticable to weigh). Expectations of systematic error in these two measurement methods may therefore be inconsistent. Random error would likely also share a disparate expectation, which may offer a partial explanation as to why model residuals were heteroscedastic.

Additional Calibration Data Are Required
Answering the question of whether pan-tropical allometric models are independent of tree size would be of general scientific interest, but more specifically, it is critical to understanding the trueness of AGB predictions. Currently, the above-ground biomass of large trees is predicted from empirical relationships discovered from imbalanced calibration data (e.g., in these considered data the median value of AGB is 98 kg).
In OLS, each observation similarly influences the population parameter estimates when any leverage effects from outliers and influential points are ignored. That is, because of this imbalance, large trees currently have little influence on model parameters. If the allometric relationship is independent of tree size (implicitly, this is the assumption of current widely-used pan-tropical allometric models), then this is of little concern, but conversely, if the relationship is size dependent, predictions of AGB for the larger size-classes are biased.
To answer these questions requires the collection of more calibration data. Specifically, these new data need to be gathered from larger trees. If these data are to supplement existing data, it is more beneficial to acquire a small number of observations from larger trees, than a large number of observations from smaller trees. Indeed, adding further small trees to these calibration data will only further reduce the influence of larger trees on the OLS estimators. Additional data from larger trees will also reduce the size of confidence intervals in model parameters constructed solely from the larger trees.
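The consequence of imbalance under a hypothetical size dependency can be sketched as follows. The example is noise-free and entirely synthetic: it assumes, purely for illustration, that the log-log slope steepens above D = 0.5 m and that only 50 of 1,000 trees exceed that diameter. It is not evidence that such a dependency exists.

```python
import numpy as np

# Synthetic, noise-free illustration (hypothetical numbers): the log-log slope
# steepens above ln(D) = ln(0.5), and only 50 of 1,000 trees are that large.
ln_d = np.concatenate([np.linspace(np.log(0.05), np.log(0.5), 950),
                       np.linspace(np.log(0.5), np.log(2.0), 50)])
knot = np.log(0.5)
ln_agb = 8.0 + 2.0 * ln_d + 0.8 * np.maximum(ln_d - knot, 0.0)

X = np.column_stack([np.ones_like(ln_d), ln_d])
beta = np.linalg.lstsq(X, ln_agb, rcond=None)[0]
resid = ln_agb - X @ beta            # observed minus predicted, log scale

largest = np.argmax(ln_d)
print("fitted slope:", round(beta[1], 3))
print("ratio predicted/observed AGB, largest tree:", round(float(np.exp(-resid[largest])), 3))
```

Because the fit is dominated by the small trees, the single fitted slope stays close to the small-tree slope, and the AGB of the largest tree is under-predicted.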
Of course, in the wider context of considering whether out-of-sample data are adequately represented by in-sample data, size is only one contributing factor. Another key consideration is the geographical representation of the sample, given that allometries are geographically variable (Henry et al., 2013). These additional large trees then, would ideally be uniformly collected from across the tropics (Banin et al., 2012;Gorgens et al., 2019;Shenkin et al., 2019).
As an aside to the comment that each observation will similarly influence the OLS estimate of the population parameters, it would therefore not be sufficient to argue that a particular allometric model is suitable for predicting the AGB of a particular type of tree (e.g., a large tree or from specific geography/species), just because observations from that type are present in the calibration data, if those data are overwhelmed by observations from other types.
The form of the OLS model informs where to focus efforts in quantifying measurement error in these new data. That is, a random error term is included for observations of AGB, so the vital characteristic in the measurement of AGB is trueness, with precision a secondary concern. Whereas for observation of the predictor variables, no error term is present, meaning both characteristics of the measurement are of equal importance.
A caveat to this comment on precision in AGB, given the previous discussion on heteroscedasticity, is that we have not considered in this paper the implication of a heteroscedastic error term to predictions of AGB. It was noted in the methods section that most widely-used pan-tropical models employ a correction factor that includes σ̂ when re-transforming predictions from log- to real-space. However, a heteroscedastic error term means E(σ̂) ≠ σ, which presumably biases AGB predictions.
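A minimal sketch of this caveat follows, assuming the commonly used correction factor exp(σ̂²/2) for back-transformation (the exact factor used by any particular model is not stated here) and assuming, for illustration only, that the residual standard deviation shrinks linearly with predicted ln(AGB). All values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
ln_pred = rng.uniform(3.0, 9.0, 200_000)            # fitted ln(AGB), synthetic
sigma_i = 0.6 - 0.05 * (ln_pred - 3.0)              # residual sd shrinks with size (assumed)
ln_obs = ln_pred + rng.normal(0.0, sigma_i)         # heteroscedastic residuals
agb_obs = np.exp(ln_obs)

sigma_hat = np.std(ln_obs - ln_pred)                # single pooled estimate
pooled = np.exp(ln_pred + sigma_hat ** 2 / 2.0)     # one factor for every tree
per_tree = np.exp(ln_pred + sigma_i ** 2 / 2.0)     # factor matched to each tree

print("pooled factor,   predicted/observed total:", round(pooled.sum() / agb_obs.sum(), 3))
print("per-tree factor, predicted/observed total:", round(per_tree.sum() / agb_obs.sum(), 3))
```

Under these assumptions the pooled factor over-corrects the large trees that dominate the total, so the summed prediction exceeds the summed observation, whereas a factor matched to each tree's residual variance recovers the total closely.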

A Note on Causality
Finally, we conclude with a comment on causality. Throughout the paper we have been careful to distinguish between prediction and explanation. In the methods section we acknowledged that the models constructed here are endogenous: the assumption of strict exogeneity was violated by omitted variable bias.
In the introduction section it was noted the causes of above-ground biomass are lifetime cumulative gross primary production, respiration and loss. Omitting these causal variables has a fundamental implication: it would be spurious to infer from these models that D, H and ρ b cause AGB. That is, if the D of a particular tree has changed over time, the AGB = f (D) model predicts a change in AGB proportional to D^2.580, but it does not explain it.
This distinction means care must be taken with causal interpretations of allometric-derived AGB predictions. Examples of spurious causal claims might be inter-plot comparisons, where the differences in structural composition between two stands [i.e., (D, H, ρ b ) A − (D, H, ρ b ) B ] is proposed as the explanation for their difference in predicted stand-scale AGB; or intra-plot change detection studies, where growth/death/recruitment between surveys [i.e., (D, H, ρ b ) A ] is proposed as the explanation for change in predicted stand-scale AGB.
The models used in this paper then, are only for the purpose of prediction. For that reason, we are comfortable with the various multivariate forms considered here that might stand accused of being a form of data dredging (Sileshi, 2014). Given that these models have no theoretical grounding, and provided they will only be used for prediction, we see no obvious reason such forms, or even more exotic forms, should not be considered, provided that the precision, trueness and accuracy of their AGB predictions are well-understood.
In conclusion, we constructed various conventional bivariate and multivariate models for predicting above-ground biomass from open access pan-tropical calibration data. We found the residuals of each model were heteroscedastic and non-normally distributed. Stratified k-fold cross-validation found the minimum uncertainty in fold-scale predictions from these models to be 24 %, and that predictions were persistently biased upward by 6 % (∼400 observations per fold). These results are likely symptoms of model misspecification: in particular, that the models do not account for inconsistent measurement error in predictor variables between in- and out-of-sample measurements. Through simulation, we showed how even a conservative degree of inconsistent measurement error can potentially lead to both absolute and relative bias in tree- and stand-scale AGB predictions. We presented the case that whilst including H and ρ b as predictor variables in pan-tropical models alongside D increased prediction precision, their inclusion introduces a bias of unknown size and direction when inconsistent measurement error remains unaccounted for. We suggested several measurement and modeling approaches to formally compensate for this bias whilst retaining the predictive benefits of these variables. Finally, we asked the question of whether pan-tropical allometric model parameters are independent of tree size. Our analysis indicates that potential model misspecifications and imbalanced calibration data currently prevent finding a definitive answer. This can only be addressed with additional calibration data, specifically from larger trees.