Intra- and inter-rater reliability in log volume estimation based on LiDAR data and shape reconstruction algorithms: a case study on poplar logs

Forkuo, Gabriel Osei; Borz, Stelian Alexandru

doi:10.3389/frsen.2025.1506838

ORIGINAL RESEARCH article

Front. Remote Sens., 12 September 2025

Sec. Lidar Sensing

Volume 6 - 2025 | https://doi.org/10.3389/frsen.2025.1506838

Intra- and inter-rater reliability in log volume estimation based on LiDAR data and shape reconstruction algorithms: a case study on poplar logs

Gabriel Osei Forkuo

Stelian Alexandru Borz*

Department of Forest Engineering, Forest Management Planning and Terrestrial Measurements, Faculty of Silviculture and Forest Engineering, Transilvania University of Brasov, Brașov, Romania

Producing reliable log volume data is an essential feature in an effective wood supply chain, and LiDAR sensing, supported by portable platforms, is a promising technology for volume measurements. Computer-based algorithms like Poisson interpolation and Random Sampling and Consensus (RANSAC) are commonly used to extract volume data from LiDAR point clouds, and comparative studies have tested these algorithms for accuracy. To extract volume data, point clouds require several post-processing steps, while their outcome may depend largely on human input and operator decision. Despite the increasingly number of studies on accuracy limits, no paper has addressed the reliability of these procedures. This raises at least two questions: (i) Would the same person, working with the same data and using the same procedures get the same results? And (ii) How much would the results deviate when different people process the same data using the same procedures? A set of 432 poplar logs placed on the ground and spaced about 1 m apart, was scanned by a professional mobile LiDAR scanner in groups; the first 418 logs were then individually scanned using an iPhone-compatible app, with the remainder being excluded from this part of the study due to field time constraints and all the logs were manually measured to get the reference biometric data. Three researchers with different experiences processed the datasets produced by scanning twice, following a protocol that included shape reconstruction and volume calculation using Poisson interpolation and RANSAC algorithm for cylinders and cones. The intra- and inter-rater reliability were evaluated using a comprehensive array of statistical metrics. The results show that the most reliable estimates correlate with a greater experience. The Cronbach’s alpha metric at the subject level was high, with values of 0.902–0.965 for the most experienced subject, and generally indicated moderate to excellent intra-rater reliabilities. Moreover, working with Poisson interpolation and RANSAC cylinder shape reconstruction, respectively, indicated a moderate to excellent reliability. For the Poisson interpolation algorithm, the Intraclass Correlation Coefficient (ICC) ranged from 0.770 to 0.980 for multi-log datasets, and from 0.924 to 0.972 for single log datasets. For the same type of input datasets, the ICC varied between 0.761 and 0.855 and from 0.839 to 0.908 for the RANSAC cylinder, and from 0.784 to 0.869 and 0.843 to 0.893 for the RANSAC cone shape reconstruction algorithms, respectively. These values indicate a moderate to excellent inter-rater reliability. Similar to Cronbach’s alpha, the Root Mean Square Error (RMSE) was related in magnitude to the ICC. The results of this study indicate that, for improved reliability and efficiency, it is essential to automate point cloud segmentation using advanced machine learning and computer vision algorithms. This approach would eliminate the subjectivity in segmentation decisions and significantly reduce the time required for the process.

1 Introduction

In the wood supply chain, measurement of forest-related features such as standing trees, tree lengths and logs, is essential for practice and science. For practice, accurate estimates of the volume are important for wood transaction on the market (Davis et al., 2001; Gregory et al., 2003; Malinen et al., 2006; Moskalik et al., 2022), while for science, accurate estimates are required to support modelling and comparison, run wider system analyses (Janák, 2012), and generally to provide the data supporting informed decisions for management (Davis et al., 2001; Rauscher, 2005; Lauri et al., 2008).

Log measurement by manual means has long been used in forestry, but it requires important resources (Janák, 2007; 2012; Li et al., 2015; de Miguel-Díez et al., 2022; Niţă and Borz, 2023; Purfürst et al., 2023), and may generate bottlenecks within the supply chain (Berendt et al., 2021; de Miguel-Díez et al., 2021). Along with the introduction of Forestry 4.0 concepts in the wood supply chain (Feng and Audy, 2020; He and Turner, 2021), digitalization of operations and transactions has gained a lot of momentum (Feng and Audy, 2020; Müller et al., 2019), pushing the science and industry in exploring alternative ways of collecting wood biometric data. Various proximal sensing technologies supported by mobile platforms equipped with apps integrating the latest technologies, such as augmented reality and computer vision, were tested to determine how well they perform in obtaining reliable estimates on the main biometrics of the logs (Knyaz and Maksimov, 2014; Jodłowski et al., 2016; Kruglov, 2016; Chatzopoulos et al., 2017; Mehrentsev et al., 2019; Pasztory et al., 2019; Berendt et al., 2021; Panagiotidis and Abdollahnejad, 2021; Borz et al., 2022a; Purfürst et al., 2023). In this regard, several tests were carried out by dedicated studies to compare the log volume estimates produced by digital technologies to those obtained by water immersion (Ljubojević et al., 2011; Hohmann et al., 2017; de Miguel-Díez et al., 2022), or by manual measurement (Niţă and Borz, 2023).

Based on the most recent findings, the use of LiDAR-based platforms in collecting tree- and log-based biometric data seems to be a viable and feasible alternative due to the accuracy provided (Hyyppä et al., 2008; Chen, 2015; White et al., 2016; Beland et al., 2019; Alvites et al., 2022; de Miguel-Díez et al., 2022). In addition, the resources spent to procure the data, such as measurement time (Borz and Proto, 2022), were comparable to those of conventional log measurement methods. Moreover, LiDAR-based log measurement may remove important inconveniences of the manual log measurement methods, which typically relate to the safety and ergonomics of the work, environmental efficiency, labor shortage, skills and knowledge required, density of data sampling, and integration into technologies holding data flow capabilities along the supply chain.

In terms of safety and ergonomics, digital methods were shown to expose less their operators to postural risks (Borz et al., 2022b), which may also contribute to a less physical effort required when measuring; in addition, digital measurement does not require a direct contact with the work object, and therefore, it can prevent work-related accidents. Although the platforms used to collect data, such as smartphones, may require a wide range of resources and more complex manufacturing processes, by their material weight could be less resource-intensive compared to those required for manual measurement, which is typically done by a tape and a forest caliper. Frequently, such platforms are also multi-purpose devices storing the apps for different tasks in one place, hence the multi-function allocation from an environmental impact point of view. An important challenge that forest sector faces today is labor shortage (Proteau, 2008; Tsioras, 2010; 2012; Šporčić et al., 2024), with many choosing to work in other industrial sectors; this creates important problems and threatens the security of wood supply, at least at the operational level which, for log measurement, typically employs people that hold the knowledge and skills required to grade the logs and to record the data needed for volume estimates (Cown, 2005; Thomas and Bennett, 2017; TraitLab, 2024). Then, the estimates produced by manual methods are as good as the density used in sampling the data used to produce them, which is typically low; LiDAR measurement, on the other hand, can represent more accurately the objects in the three-dimensional space (White et al., 2016; Beland et al., 2019; Alvites et al., 2022), accounting for those variations that may make the difference in volume estimates. Last but not least, the advancements in computer vision and deep learning is likely to provide the tools needed for an automatic documentation and transfer of the data along the supply chain, removing the need for manual input of the data into a dedicated system (Gingras and Charette, 2017; Feng and Audy, 2020; Morin et al., 2020; He and Turner, 2021).

To extract meaningful data, currently the LiDAR point clouds require a workflow composed of several post-processing tasks. Most of these still require the human intervention over the point clouds to prepare them for the extraction of useful information. Software packages such as CloudCompare or Open 3D are equipped with such functionalities and are the main choices when one relies on open-source alternatives. In the CloudCompare app (CloudCompare, 2024), for instance, which is commonly used now in the related science, these functionalities include a manual segmentation, noise removal using a filtering tool, normalization using a given model, and shape reconstruction by various algorithms (Schnabel et al., 2007; Girardeau-Montaut, 2015; Girardeau-Montaut, 2016; Kazhdan et al., 2020; Panagiotidis and Abdollahnejad, 2021). As such, there can be a high subjectivity in the estimates, which may primarily come from the way in which the points assigned to logs are segmented out from a cloud containing all the data (i.e., log and the background). Typically, a human processor of such data, accounts for what the eyes can see in the cloud such as shapes and, when available, the color of the points. Assuming that a human processor would have to repeat the processing steps on the same cloud twice or more times, there is a degree of uncertainty in estimates as coming from the same source of processing error, since it is difficult for a human to make exactly the same cloud segmentation decision each time. Then, assuming that the same cloud would be processed by different persons, there is a degree of subjectivity that comes from how different people see and make decisions on what points should be included in processing following the segmentation.

Among the main concepts of digitization is that of a technology being sufficiently reliable for a given task (Wang et al., 2019a; Panagiotidis and Abdollahnejad, 2021) which comes from our understanding of systems design and analysis (Wasson, 2006). Since sourcing and processing data are integrant workflow steps, one would expect to get the same, readily usable estimates at the end of the process. If this would not be possible, then it is important to see how much such estimates may deviate due to the intra- or inter-person subjectivity, which are commonly known as intra- and inter-rater reliability, respectively (Gwet, 2001; 2008; Bonnet, 2023).

This study tried to quantify the reliability of log volume estimates sourced from LiDAR data, by employing the concepts of intra- and inter-rater variability and reliability. LiDAR point clouds are commonly sourced from mobile platforms such as smartphones or mobile professional scanners. This study employed both types of platforms because those from the first category are typically used for individual logs, while those from the second category are more effective for handling groups of logs. Therefore, the latter platforms produce increasingly crowded point clouds as the number of logs increases. Besides, different people can have different experience with the point cloud processing tasks and, as such, it is important to assess if significant deviations arise from this factor. The study aimed to answer the following questions: i) is the inter-person experience with point cloud processing—defined herein as an individual’s prior practical engagement, training, and familiarity with point cloud processing software (specifically CloudCompare) and related data manipulation tasks—a factor that may cause significant deviations in log volume estimates?, ii) are there significant deviations in log volume estimates related to the subjectivity of the same person when segmenting the point clouds?, and iii) which type of point clouds are likely to produce the most stable estimates: single or multiple log point clouds?

2 Materials and methods

2.1 Study site

The point clouds required by this study were collected in the Southern part of Romania, in the forests managed by the Regional State Directorate of Dolj, which is under the authority of National Forest Administration – RNP Romsilva. The Romanian forests are highly diverse (Ioras et al., 2009; Nicolescu, 2022) and distributed on altitudinal layers, starting from mountainous pure Norway spruce stands and ending with plain and meadow poplar forests (Toader and Dumitru, 2005; Stăncioiu et al., 2018). The Southern part of Romania is mostly bordered by the Danube River and poplar forests have an important share in the area. Data was collected by scanning with a ZEB-REVO Portable Scanning System and an iPhone 13 Pro Max (Figure 1), which was done at two yards (located 43°51′00″ N 23°06′53″ E and 43°51′20″ N 23°11′43″ E, respectively) that concentrated logs from several harvesting areas located nearby.

Figure 1

Two images and a flowchart demonstrate a log scanning process using an iPhone. Image (a) shows a person with a vest walking beside logs in a dirt area. Image (b) depicts the person stepping over logs. The flowchart outlines steps: starting scanning, initializing the sensor, maintaining distance, walking a loop, capturing geometry, varying orientation, and returning to start. A legend identifies symbols as Starter/Terminator and Process.

Figure 1. Examples from the field data collection activity showing (a) scanning of log groups by a ZEB-REVO Portable Scanning System (left), (b) individual log scanning by an iPhone 13 Pro Max (right), and (c) a schematic diagram illustrating the typical closed-loop scanning path for the iPhone around an individual log, maintaining an approximate 1m distance and varying sensor orientation.

The characteristics of the poplar logs used in this study revealed a diverse range of biometric attributes. The logs exhibited a length varying from 1.96 to 9.07 m, with an average value of 5.49 ± 1.23 m. At the small end, the diameter ranged from 11.50 to 68.00 cm, yielding an average value of 33.36 ± 8.93 cm. The large end diameter varied from 16.00 to 78.00 cm, averaging 41.43 ± 11.47 cm. Meanwhile, the diameter measured at the midpoint of the log varied from 15.50 to 67.00 cm, averaging 35.59 ± 8.62 cm.

2.2 Data collection and processing

A total of 432 logs were scanned using a ZEB-REVO Portable Scanning System (GeoSLAM Ltd and Ruddington, 2017). This handheld mobile LiDAR scanner, which uses Simultaneous Localization and Mapping (SLAM) technology for geo-referencing, was employed to capture 3D point cloud data of logs arranged in groups. The logs within each group were placed on the ground with an approximate spacing of 1 m between them. A total of 22 groups were considered, containing 14 to 26 logs (about 20 per group, on average). From this set, 418 logs were individually scanned using an iPhone 13 Pro Max equipped with its integrated LiDAR sensor and running the “3D Scanner App” (Laan, 2021a). This setup allowed for the direct generation of 3D point clouds from the LiDAR data. The iPhone 13 Pro Max performed adequately for capturing the geometry of individual logs at close range, providing sufficient point density for subsequent analysis. The remaining 14 logs, while scanned as part of a group with the ZEB-REVO system, were not scanned individually with the iPhone due to time limitations and were therefore excluded from the analyses pertaining to single-log datasets.

The scanning procedures and protocols used for the two mobile LiDAR platforms were comparable to those used in previous studies (Borz and Proto, 2022; Niţă and Borz, 2023). The two mobile LiDAR platforms differ in their sensing distances, and the typical workflows and algorithms used to reconstruct objects from the collected data depend on the reconstruction technology employed. The 3D Scanner App (Laan, 2021a) was pre-installed on the iPhone prior to field data collection. The app’s free version is currently available for download (Laan, 2021a). The 3D Scanner App facilitates real-time LiDAR scans and point cloud computation (Laan, 2021b; Gharge and Ali, 2024). The iPhone scanning process operated in a closed loop, commencing at one end of each log and involving data collection from approximately 1 m by moving forward and backward around the log at low speed (Figure 1, right). This was done while orienting the sensing devices toward the log from various positions and angles, as guided by the app’s interface. For this device and the associated app, the scans were conducted at medium density (MD), allowing the point clouds to effectively represent the logs. After scanning each log, an ID was assigned in the app, and the results were saved to the device’s memory before the operator proceeded to the next log. In contrast, the ZEB-REVO Portable Scanning System was used for scanning groups of logs (14–26 logs per group). The process involved walking around the logs and scanning them from approximately 2 m, starting and finishing at the same point, on a horizontally leveled platform. This device enables scanning at higher distances (up to 30 m), collects substantial amounts of data as spatially referenced point clouds, and is suitable for large-area scans (GeoSLAM Ltd and Ruddington, 2017). After each scanning session and based on external control, the device automatically processed and saved the point cloud data onto a USB stick (Forkuo and Borz, 2023). Both scanning platforms can output data in 3D file formats that are quite similar; specifically, these include files containing point clouds and images or media captured during scanning (Laan, 2021b). However, in this study, the point clouds captured by both platforms were exported in the office in a binary format specifically designed for storing LiDAR data (.LAS) (Laan, 2021b). Specifically, the point clouds from the iPhone equipped with the 3D Scanner App were exported at medium density, with the Z-axis oriented upwards, and in LAS format in meters with color settings.

Data collection, pre-processing and processing workflows are described in Figure 2 for the LiDAR-based data. The method for processing data and estimating log volumes in CloudCompare consists of several key steps, each designed to enhance the quality of the point cloud data and facilitate accurate shape reconstruction. The parameters used in the process were those described in Niţă and Borz (2023). Accordingly, noise was eliminated from the point cloud data to improve overall clarity and accuracy, using a k-Nearest Neighbors (kNN) algorithm specifically configured with a value of 60. This parameter aids in identifying and removing points that do not conform to the overall structure of the data, thereby enhancing the integrity of the subsequent analyses (Dong et al., 2023). Following noise removal, normalization of the point cloud was conducted to ensure consistent orientation and scaling, using a Least-Squares Method (LSM) with a quadric shape (Rusu et al., 2009), in combination with an automatically generated Octree structure and the same kNN value set at 60. Normalization is deemed very important in aligning the data for accurate comparative analysis and model fitting (Lin et al., 2022). Once normalization was completed, shape reconstruction was performed to accurately estimate the geometries of the logs, consisting of several sub-processes, starting with Poisson Surface Reconstruction. In this phase, the Poisson interpolation method was employed, with parameters set as follows: octree depth was set to 12, boundary to 3, samples per node to 6, point weight to 0, and 8 threads were used for parallel processing. This method is robust when reconstructing complex geometries from the point cloud, effectively capturing fine details (Kazhdan et al., 2006; Kazhdan et al., 2020). After the Poisson reconstruction, the RANSAC (Random Sample and Consensus) algorithm was used to fit cylindrical shapes to the point cloud data representing the logs, requiring a minimum of 500 support points for the algorithm to establish a reliable cylinder model, while the remaining RANSAC parameters were automatically adjusted to optimize the fitting process (Fischler and Bolles, 1981). Finally, a similar RANSAC approach was applied to fit conical shapes to the data, maintaining the same minimum support requirement of 500 points while using automatically configured parameters to enhance fitting accuracy. Collectively, these steps enabled the effective processing of LiDAR data in CloudCompare software, resulting in log volume estimates to five digits. The relevant inputs and outputs for data processing are shown in Figure 3. The initial point clouds sourced by both platforms were stored in a Google Drive data repository, along with a Microsoft Excel datasheet to support the collection of metadata for the processing effort.

Figure 2

Flowchart illustrating data processing steps for estimating log volumes. It includes phases: data collection using Zeb Revo and iPhone apps, downloading, preprocessing, segmentation, noise removal, normalization, shape reconstruction using Poisson and RANSAC methods, and volume measurement with corresponding parameters and tools like Cloud Compare.

Figure 2. Description of the workflow used to collect, pre-process, process the data, and to get the log volume estimates.

Figure 3

Two flowcharts of a 3D modeling process. The first chart at the top shows a sequence starting with colored scans labeled

Figure 3. Steps used in data processing by CloudCompare. Legend: (a) – raw point cloud of a ZEB-REVO scan, (b) – raw point cloud of a iPhone scan, (c) – an example of a product after segmentation and noise removal, (d) – an example of a product after normalization, (e) – product after Poisson interpolation, (f) – product after RANSAC cylinder reconstruction, (g) – product after RANSAC cone reconstruction, (h) – an example of volume estimation (cone volume by RANSAC), (i) – volume display and visualization.

Reference data was collected as well, by manual measurements using a forestry tape and a caliper. Log diameters and lengths were measured to the nearest centimeter. Typically, this involved systematic measurements of diameters at a 0.5 m interval along the log, coupled with measurements of the log end and mid diameters and the length of the last segment of the log if it was shorter than 0.5 m. The manual measurements were then recorded on paper sheets. The concept used for manual measurement and volume estimation is fully described in Niţă and Borz (2023), as a part of the Hypercube 4.0 project (Hypercube 4.0, 2021). The reference volume for all logs was calculated based on the detailed manual field measurements (diameters at 0.5 m intervals along the log, end diameters, and total length), using established forestry formulas, primarily the section-wise truncated cone method for the highest accuracy, as detailed in Niţă and Borz (2023).

The Excel database included, among others, the metadata of each log such as the ID, group to which the log belonged when the case, platform used to collect the data, name of the person that processed the data, as well as the log volume estimates produced by the manual measurement. In addition, data fields such as the date of processing, time spent in processing, and volume estimates were included according to the workflows shown in Figures 2, 3. Repository of point clouds included a folder structure to contain all the intermediary processing products starting from the initial point clouds and ending with those used for volume estimation.

Data processing was performed by three persons with different experience (hereafter called subjects) who volunteered for the study. Subject 1 possessed substantial experience with CloudCompare and point cloud processing tasks, defined as over 2 years of consistent use in various research projects and formal training. In contrast, Subjects 2 and 3 were beginners, whose primary exposure consisted of an introductory tutorial and practice exercises provided specifically for this study before commencing data processing. To support the learning of the basic steps, a video tutorial developed in the framework of the Hypercube 4.0 (Hypercube 4.0, 2021) was used as a guiding reference. The subjects were provided with any other information and guidance as they asked for it in advance of the study, and they were allowed to experiment with some data as a training exercise.

2.3 Experimental design and data analysis

Volume data was used for comparison of differences and for checking the intra- and inter-rater reliability. To do so, the reference data coming from manual measurements was codded according to the method used to estimate the volume, and named hereafter RVHuber, RVSmalian, RVCil, and RVCone. These were used only to show the differences that can arise from the choice of measurement and estimation method applied to manual data. For the assessments done to check the intra- and inter-rater variability, the datasets were codded by abbreviations showing the shape reconstruction algorithm used, platform used to source the point clouds, processing attempt (repetition), and a number from 1 to 3 to designate the subject that carried out the processing tasks. Figure 4 shows the system used to code the two datasets: the dataset collected with Zeb Revo scanner and the dataset collected with iPhone.

Figure 4

A table comparing processing algorithms and sourcing platforms for various subjects and attempts. It lists

Figure 4. Description of the system used to code the datasets for intra- and inter-rater reliability analysis. The “Code” column shows the abbreviation for each dataset, where “VP” is Volume by Poisson, “VRCi” is Volume by RANSAC Cylinder, “VRCo” is Volume by RANSAC Cone; “ZR” denotes ZEB-Revo sourced data, “I3D” denotes iPhone 3D Scanner App sourced data; the number “1” or “2” immediately following the platform code indicates the first or second processing attempt (repetition) by the subject; “S1,” “S2,” or “S3” in the “Subject” part of the code refers to Subject 1 (experienced), Subject 2 (beginner), or Subject 3 (beginner), respectively. The left panel shows codes for datasets sourced by the ZEB Revo scanner, and the right panel for those sourced by the iPhone.

Subject 1 (coded as S1) was the individual with more experience with the use of Cloud Compare and LiDAR data, while subjects 2 and 3 were at their first experience with such tasks. Each subject worked twice with each dataset. In other words, the data was processed from end-to-end by each subject two times. This allowed to measure the intra-rater reliability (commonly known as internal consistency), which was measured using Cronbach’s alpha (Johnson, 2021; Revicki, 2023). This metric describes to what extent the measurements remain consistent over repeated trials ran under identical conditions, and the data is said to be reliable if the experiment yields consistent results on the same measure (Ferketich, 1990; Revicki, 2023). In this study, “identical conditions” refers primarily to the cloud segmentation step, as the input point cloud data was identical in each trail by the same subject. The parameters for subsequent steps, such as denoising and shape reconstruction, were also kept consistent as per the defined protocol. However, by different decisions on point selection, the outcomes of segmentation may affect the outcomes of the subsequent processing steps, therefore the volume estimates. Accordingly, if the decisions made on the points to include by segmentation led to identical point clouds for the following processing steps, then it is likely that the volume estimates would be identical and, as such, the results considered as being consistent between the trials. This part of the experiment led to the comparison of 18 datasets, coming from the platform used to collect the data (2), shape reconstruction algorithm used to process it (3) and the number of subjects (3).

The inter-rater reliability was measured using the inter-class correlation coefficient (ICC), which was adapted to the type of experiment conducted, as there are several classes and types of experiments (Bobak et al., 2018; Ten Hove et al., 2022). ICC assesses the reliability of ratings by comparing the variability of different ratings of the same observation against the variation of all ratings and observations. A class 2 experiment was setup in which a number of raters (subjects herein) are selected at random from a population of raters, and the selected raters rate all the observations (Koo and Li, 2016; Shieh, 2016). This part of the experiment used the data coming from same shape reconstruction procedure and platform used to collect it, but compared between each two possible subjects. This resulted in a number of 72 comparisons.

Analysis of the data was carried out in Microsoft Excel equipped with the Real Statistics (Zaiontz, 2023) add-in. Real Statistics is a tool that extends the capabilities of Microsoft Excel in statistical analysis and which provides dedicated functionalities for reliability assessments such as Cronbach’s alpha and ICC. The standard functionalities as included in Real Statistics were used in this study for analysis. Before running the analyses, the data on volume estimates was paired so as to remove those pairs which i) were commented by a given subject as unreliable in estimates, ii) had at least one data point missing in the pair due to various reasons and iii) had data that was showing contrasting differences which were less likely to occur solely due to a bad decision in segmenting the point clouds. A numerical description of the size of initial and compared datasets is provided in Supplementary Material (Supplementary Table S1).

The resulting datasets were also subjected to an advanced quantitative assessment of the differences. Error metrics such as the bias (BIAS), mean absolute error (MAE) and root mean squared error (RMSE) were computed for each dataset to characterize the deviation in volume estimates. These were used in conjunction with Cronbach’s alpha and ICC to characterize the reliability of estimates.

Bias (BIAS) is calculated as the average error, representing the difference between predicted (measured) and actual (reference) values (Willmott and Matsuura, 2005). It indicates whether the predictions are systematically higher or lower than the observed values. By definition, a positive bias value suggests underestimation, while a negative bias indicates overestimation. Bias is frequently used to assess also the accuracy of predictive models, particularly in fields such as forecasting and regression analysis (Hyndman and Koehler, 2006). Mean Absolute Error (MAE) is computed as the average of the absolute differences between actual and predicted values, providing a straightforward assessment of prediction accuracy by averaging the absolute errors (Willmott and Matsuura, 2005; Hyndman and Koehler, 2006). MAE is preferred in many analyses due to its reduced sensitivity to outliers, especially when compared to RMSE (Hyndman and Koehler, 2006). It is also commonly used in regression analysis and model evaluation to measure the average magnitude of prediction errors (Hyndman and Koehler, 2006). Root Mean Squared Error (RMSE) is calculated as the square root of the average of the squared differences between actual and predicted values (Chai and Draxler, 2014; Hodson, 2022). By squaring the differences, RMSE gives greater weight to larger errors (differences), thereby making it more sensitive to outliers compared to MAE (Armstrong, 2001; Chai and Draxler, 2014). RMSE offers a measure of the standard deviation of prediction errors and is also widely used in regression analysis, forecasting, and model validation to assess prediction accuracy (Li, 2017).

Cronbach’s alpha is derived by correlating the score for each scale item with the total score for each observation and comparing it with the variance of all individual item scores (Cronbach, 1951). This statistic measures the internal consistency or reliability of a set of items, where higher values indicate greater reliability (Tavakol and Dennick, 2011). Cronbach’s alpha is widely used in survey research and psychometrics to assess the reliability of scales and questionnaires (Tavakol and Dennick, 2011). It is particularly valuable in the process of designing and testing new survey or assessment instruments (Frost, 2022), and it is also well fitted to quantitative continuous data. The intraclass correlation coefficient (ICC) is calculated using variance components obtained from an analysis of variance (ANOVA) (Koo and Li, 2016). It measures the ratio of variance between groups to the total variance, thereby assessing the reliability of measurements or ratings for groups or clusters (Shrout and Fleiss, 1979). Higher ICC values indicate greater reliability and are commonly used in reliability studies, particularly in the context of repeated measurements or ratings by different observers (McGraw and Wong, 1996).

The complete statistical workflow included a test to check the normality of data, estimation of intra- and inter-rater reliability, and the estimation of BIAS, MAE and RMSE. Signed differences, absolute differences and squared differences between the observations of a given dataset used for comparison were included in the test for normality in data, and the results were presented in an aggregated form by considering i) the differences that occurred over all logs measured in the field by the manual method by taking one of the volume estimation method as a reference, ii) the intra-rater reliability, characterized qualitatively by bivariate plots showing the magnitude of differences, iii) the intra-rater reliability characterized, quantitatively by the Cronbach’s alpha, BIAS, MAE and RMSE, and iv) the inter-rater reliability, characterized quantitatively by the ICC, BIAS, MAE and RMSE.

3 Results and discussion

3.1 Agreement of log volume estimates based on manual measurement

The results of normality check indicated that none of the variables could be assumed to come from a normal distribution (data not shown herein). The data reported in Figure 5 indicate several degrees of agreement between the log volume estimates based on manually measured data. The methods that used finer sampling of the diameters produced more concordant results as shown by the trend equations included in the figure, a result that is consistent with previous findings (e.g., Niţă and Borz, 2023).

Figure 5

Scatter plot comparing RVCone (cubic meters) with three other measurements: RVHuber, RVSmalian, and RVCil, each represented by different markers (yellow diamonds for RVHuber, red triangles for RVSmalian, and green circles for RVCil). Lines of best fit with corresponding equations and R-squared values are shown in the legend. The grid is marked in increments of 0.1 along both axes.

Figure 5. Agreement of log volume estimates based on the manual measurement methods. Legend: RVHuber - volume computed by Huber’s formula, RVSmalian - volume computed by Smalian’s formula, RVCil - volume computed by 0.5 m cylinders, RVCone - volume computed by 0.5 m truncated cones.

The analysis of different volume estimation methods highlights their tendencies to either overestimate or underestimate the log volumes. In this study, Smalian’s formula tends to overestimate volume when compared to the truncated cone (RVCone) method, as it is evident from the scatter plot showing that the trend line for RVSmalian is positioned above the diagonal. The overestimation arises because Smalian’s formula calculates volume based on the average of the cross-sectional areas at the ends of the log, which can be particularly inaccurate for tapering logs (de León and Uranga-Valencia, 2013; Li et al., 2015; Ahmad et al., 2020). In contrast, Huber’s formula tends to underestimate volume compared to the reference cone method used in this study, as the scatter plot indicates that the trend line for RVHuber appears slightly below the diagonal, expressed by the trend equation. This underestimation is a result of Huber’s calculation of volume based on the cross-sectional area at the midpoint of the log, which can lead to inaccuracies in cases of significant tapering or irregularities along the log (de León and Uranga-Valencia, 2013; Li et al., 2015; Ahmad et al., 2020). However, the reference cylinder method (RVCil), which calculated volume based on manually measured diameters at 0.5 m intervals, yielded estimates that showed extremely high agreement (R² = 1.000) with the reference volumes derived from the section-wise truncated cone formula using the same detailed 0.5 m interval measurements. This high concordance, effectively a near 1:1 relationship, is due to the dense and comparable diameter sampling used in both the RVCil calculation and the most detailed reference method (Niţă and Borz, 2023). This method’s success is due to the diameter sampling procedure used, which was denser compared to that applied for Smalian’s and Huber’s formulae, an approach that better fitted the taper of the logs and had less residuals compared to the estimation of the logs’ segments volumes by truncated cones since the reference diameters used were the same.

Overall, these findings align with existing literature (e.g., Niţă and Borz, 2023), emphasizing the need to acknowledge inherent biases in various volume estimation methods based on their underlying assumptions and calculations, which is essential for selecting the appropriate method for accurate volume estimation in forestry and timber management.

3.2 Intra-rater reliability of log volume estimates

Figure 6 shows the bivariate plots of comparisons ran between the trials (repetitions) of the same subject processing the same dataset. The logic of this qualitative assessment is that, if there would be no differences between the estimates of a given dataset coming from separate trials, then all data points would distribute over the 1:1 identity line, which was not the case for any of the analyzed datasets. As a fact, there were important differences which seemed to depend on the level of experience with point cloud processing, type of input data (single log or group of logs) and, most importantly, the method used for shape reconstruction.

Figure 6

Eighteen scatter plots arranged in a three-by-six grid, comparing different volume measurements in cubic meters for three subjects. Each row represents a different subject, labeled “Subject1” in green, “Subject2” in yellow, and “Subject3” in red. Each plot shows varying degrees of correlation between paired variables, with data points marked as circles in colors matching their subject label.

Figure 6. Intra-rater reliability described by bivariate plots. Legend: the color-coded bivariate plots represent different subjects.

Shape reconstruction using Poisson interpolation provided the highest intra-rater reliability, irrespective of the subject (Figure 6); however, it showed different degrees of reliability when the type of dataset was considered for the same subject. For instance, working with point clouds of single logs was more reliable compared to that of working with groups of logs, even if the Zeb Revo point clouds are often described as being accurate (Bauwens et al., 2016; Dewez et al., 2017; Sammartano and Spanò, 2018; Warchoł et al., 2023), hence there is a difference between accuracy and reliability. Then, intra-rater reliability depends on experience, as it can be seen when looking at the inter-subject plots of the same compared datasets. These results may have several explanations. For instance, Poisson surface reconstruction is a method used for shape reconstruction that is effective at reconstructing smooth surfaces from point clouds (Kazhdan et al., 2006; 2020); it can fill in gaps in the data to create a complete 3D model. One of the main advantages of Poisson interpolation is its resilience to data noise, as it considers all points at once without resorting to heuristic spatial partitioning or blending (Kazhdan et al., 2006; Kazhdan et al., 2020). However, Poisson interpolation may introduce artifacts if the point cloud is noisy or has outliers (Kazhdan et al., 2006; 2020). Additionally, it requires tuning parameters for the best results, which can be time-consuming and may require expertise (Kazhdan et al., 2006; Kazhdan et al., 2020).

RANSAC (Random Sample Consensus) cylinder reconstruction is another method used for shape reconstruction (Fischler and Bolles, 1981; Raguram et al., 2008; Cavalli et al., 2023); it is robust against outliers in the data, making it effective for modeling objects that are cylindrical in shape (Raguram et al., 2008; Niebles and Krishna, 2017). RANSAC cylinder reconstruction is relatively easy to implement and can handle a moderate percentage of outliers without significant computational cost (Bolles and Fischler, 1981; Niebles and Krishna, 2017). However, this method may not perform well with non-cylindrical shapes or if the cylinder axis is not well-defined by the data points (Niebles and Krishna, 2017). Additionally, while efficient, the method can become computationally expensive if the share of outliers is high (Niebles and Krishna, 2017). RANSAC cone reconstruction is suitable for objects with a conical shape and can handle noisy data points (Schnabel et al., 2007). It can be adapted to different shapes and sizes of cones, making it versatile for various applications (Schnabel et al., 2007; Niebles and Krishna, 2017). However, similar to RANSAC cylinder reconstruction, it may struggle with shapes that do not conform to an ideal cone (Niebles and Krishna, 2017). The accuracy of the reconstruction can also be sensitive to the parameters chosen for the RANSAC algorithm (Niebles and Krishna, 2017). Looking at the data outputted by these two shape reconstruction methods (Figure 6), it is evident that they had different degrees of disagreement. In addition, magnitude of disagreement seemed to increase as a function of log size, particularly for those logs that had an estimated volume which was higher than 0.5 m³.

The relevant quantitative measures of the intra-rater reliability are provided in Figure 7. As measured by Cronbach’s alpha at the subject level, the highest intra-rater reliability was that when working with data inputs on single logs and reconstructing the log shapes by Poisson interpolation. This was consistent among the subjects, although Cronbach’s alpha had different magnitudes among the subjects and assessments. At the dataset level, single log data and reconstruction by Poisson interpolation returned the best result for the first subject, with a maximum Cronbach’s alpha of 0.998. Next in line was the multi-log dataset of which the volume estimates were produced by Poisson reconstruction by the first subject, which returned a Cronbach’s alpha of 0.987, and the third and fourth best inter-rater reliabilities were found for the second and third subject based on single log data and Poisson reconstruction.

Figure 7

Table presenting datasets for three subjects with calculated metrics: BIAS, MAE, RMSE, and Cronbach's alpha values. Datasets are color-coded based on consistency levels: highest (green), lowest (red), and medium (yellow) for Cronbach's alpha: 0.965, 0.902, and 0.926, respectively. Each subject’s results are compared across reference and compared datasets.

Figure 7. Intra-rater reliability measured by difference metrics and Cronbach’s alpha. Legend: reference and compared datasets are described in Figure 4, BIAS – bias, MAE – mean absolute error, RMSE – root mean squared error. Note: in green, yellow, orange and red, are provided the first, second, third and fourth-best results of all comparisons for the Cronbach’s alpha. The order of reliability is given from green to red for the compared datasets in terms of Cronbach’s alpha; for example green, yellow and red highlighting indicate the first-, second- and third-ranked intra-rater reliabilities for each subject, respectively. The rightmost column classifies the intra-rater reliability at subject level by the mean values of Cronbach’s alpha.

In general, the internal consistencies measured as the average values of Cronbach’s alpha at the subject level were high, with values of 0.902–0.965. However, the best intra-rater reliability was found for the first (experienced) subject, being followed by the third and the second subjects. The significance of these values can be understood by referring to the classifications of Cronbach’s alpha values. Cronbach’s alpha is a measure of internal consistency, which indicates how closely related a set of items are as a group (Cortina, 1993; Johnson, 2021). It is considered a measure of scale reliability (Tavakol and Dennick, 2011; Johnson, 2021). According to general guidelines, a Cronbach’s alpha value above 0.7 is considered acceptable, values above 0.8 are considered good, and values above 0.9 are considered excellent (Tavakol and Dennick, 2011; Johnson, 2021). In this study, the values between 0.902 and 0.965 indicate an excellent internal consistency, suggesting that the items used in the measurement are highly correlated and reliably measure the same construct. High values of Cronbach’s alpha imply that the measurements are consistent across different items, and the responses are reliable (Streiner, 2003; Tavakol and Dennick, 2011; Johnson, 2021). This level of internal consistency is important in ensuring the validity and reliability of the measurements, especially in research settings where accurate and consistent data are essential (Cortina, 1993; Taber, 2018; Johnson, 2021; Frost, 2022). In the context of this study, the high values of Cronbach’s alpha indicate that the log measurements taken by the subjects were generally consistent and reliable. The fact that the first (experienced) subject had the highest intra-rater reliability suggests that experience may play an important role in achieving more consistent and reliable estimates. This finding aligns with previous research that highlights the importance of experience and training in improving measurement accuracy and reliability (Hobbs-Murphy et al., 2024).

Also, the differences measured with BIAS, MAE and RMSE metrics were generally consistent with the assessments done by Cronbach’s alpha. For instance, the RMSE kept the same order of magnitude depending on the value of Cronbach’s alpha, with lower magnitudes in RMSE corresponding to higher values of Cronbach’s alpha, which was only partly true for MAE, and inconsistent for the BIAS, since this metric is signed and shows the direction of under- or over-estimation. The consistency among the outputs of BIAS, MAE, RMSE, and Cronbach’s alpha can be understood by examining how each metric interacts with the data. When the internal consistency is high, as indicated by Cronbach’s alpha, the errors in the measurements are likely to be smaller and the estimates more consistent (Goforth, 2024). This is reflected in lower values of RMSE and MAE. RMSE, in particular, is sensitive to larger differences, and its lower values correspond to higher values of Cronbach’s alpha, indicating that the measurements are more accurate, reliable and consistent (Singh, 2022; Goforth, 2024).

3.3 Inter-rater reliability of log volume estimates

Inter-rater reliability was measured by the intra-class correlation (ICC) and was complemented with the confidence intervals and the difference metrics, as shown in Figures 8–10. Figure 8 shows the results grouped by the Poisson interpolation algorithm, while Figures 9, 10 show the results based on RANSAC interpolation as cylinders and cones. Irrespective of the dataset, on average, the lowest ICC was consistently found for the multi-log datasets (Figures 8–10).

Figure 8

A table comparing datasets with columns for bias, mean absolute error (MAE), root mean square error (RMSE), and intraclass correlation coefficient (ICC). Highlighted rows show the lowest inter-rater reliability with an average ICC of 0.869 in red and the highest in green with 0.955.

Figure 8. Inter-rater reliability for the Poisson shape reconstruction algorithm, measured by difference metrics and intra-class correlation (ICC). Legend: reference and compared datasets are described in Figure 4, BIAS – bias, MAE – mean absolute error, RMSE – root mean squared error. ConfLower and ConfUpper stand for the lower and upper confidence thresholds of the ICC. Note: in green, yellow, orange and red, are provided the first, second, third and fourth best results of all comparisons for the ICC. The order of reliability is given from green to red for the compared datasets in terms of ICC; for examplegreen, yellow and red highlighting indicate the first-, second- and third-ranked inter-rater reliabilities for each dataset. The rightmost column classifies the inter-rater reliability at dataset level by the mean values of ICC.

Figure 9

A table comparing datasets with columns for bias, MAE, RMSE, ICC, and confidence intervals. Highlighted cells indicate important differences. Side notes show lowest inter-rater reliability with average ICC of 0.810, in pink, and highest inter-rater reliability with average ICC of 0.873, in green.

Figure 9. Inter-rater reliability for the RANSAC cylinder shape reconstruction algorithm, measured by difference metrics and intra-class correlation (ICC). Legend: reference and compared datasets are described in Figure 4, BIAS – bias, MAE – mean absolute error, RMSE – root mean squared error. ConfLower and ConfUpper stand for the lower and upper confidence thresholds of the ICC. Note: in green, yellow, orange and red, are provided the first, second, third and fourth best results of all comparisons for the ICC. The order of reliability is given from green to red for the compared datasets in terms of ICC, such as the green, yellow and red were the first, second and third ranked inter-rater reliabilities for each dataset. The rightmost column classifies the inter-rater reliability at dataset level by the mean values of ICC.

Figure 10

A table comparing reference and compared datasets with columns for BIAS, MAE, RMSE, ICC, and confidence intervals. Colored cells indicate different reliability levels, with a note highlighting the lowest inter-rater reliability with an average ICC of 0.822 and the highest with an average ICC of 0.871.

Figure 10. Inter-rater reliability for the RANSAC cone shape reconstruction algorithm, measured by difference metrics and intra-class correlation (ICC). Legend: reference and compared datasets are described in Figure 4, BIAS – bias, MAE – mean absolute error, RMSE – root mean squared error. ConfLower and ConfUpper stand for the lower and upper confidence thresholds of the ICC. Note: in green, yellow, orange and red, are provided the first, second, third and fourth best results of all comparisons for the ICC. The order of reliability is given from green to red for the compared datasets in terms of ICC, such as the green, yellow and red were the first, second and third ranked inter-rater reliabilities for each dataset. The rightmost column classifies the inter-rater reliability at dataset level by the mean values of ICC.

However, there was an order of magnitude in differences between the shape reconstruction algorithms used, with Poisson interpolation being ranked the first. For the Poisson interpolation algorithm, the ICC varied between 0.770 and 0.980 when working with multi-log datasets, and from 0.924 to 0.972 when working with single log datasets. For the same types of input datasets, it varied between 0.761 and 0.855 and from 0.839 to 0.908 for RANSAC cylinder, and from 0.784 to 0.869 and 0.843 to 0.893 for RANSAC cone shape reconstruction algorithms, respectively, indicating moderate to excellent inter-rater reliabilities. Similar to Cronbach’s alpha, RMSE was related in magnitude with ICC, with low values of RMSE being associated to high values in ICC.

The ICC plays an important role in delineating the reliability of measurements, with values indicating substantial reliability across various analyses (Chinn, 1991; Gisev et al., 2013; Koo and Li, 2016; Bruton et al., 2000). However, as with other reliability coefficients, there is no universally accepted level of reliability for the ICC (Bruton et al., 2000). The ICC ranges from 0 to 1, with values closer to one indicating a higher reliability. Chinn (1991) suggests that any measure should have an ICC of at least 0.6 to be considered useful, thus reinforcing the relevance of our findings. The ICC is particularly valuable when comparing the repeatability of measurements across different units, as it is a dimensionless statistic (Bruton et al., 2000). It is most effective when three or more sets of observations are collected, either from a single sample or from independent samples, according to Bruton et al. (2000). For instance, an average ICC of 0.900 was obtained from all the comparisons using the 3D Scanner App, whereas the average value of comparisons over the Zeb Revo data was much lower (0.833), indicating different degrees of reliability, and placing the comparisons on 3D Scanner App data in the category of excellent reliability (Bruton et al., 2000; Gisev et al., 2013). However, it is important to acknowledge certain limitations outlined by Rankin and Stokes (1998) that can render the ICC unsuitable for use in isolation. Specifically, when the ICC is applied to data from a diverse group of individuals exhibiting a wide range of the measured characteristic, the reliability may appear higher than when it is applied to a group with a narrow range of the same characteristic (Bruton et al., 2000). This underscores the need for caution when interpreting ICC values, particularly in contexts involving heterogeneous populations.

To answer the first question of the study, the inter-person experience is an important factor affecting the reliability of volume estimates. This is supported by both, the internal consistency of the assessments as well as by the intra-class correlation assessments, by which the second and third subjects, which were less experienced, performed poorer. Indeed, the differences in reliability were co-factored by the type of datasets used for segmentation, but still, there are important trends in data supporting the effect of experience on reliability. The question on whether the differences between the assessments are important and the assessments are still ranked as highly reliable is worth pursuing further. For instance, the experienced subject working on single log datasets achieved an internal consistency close to 1, meaning excellent reliability. However, the bias was positive, accounting for 0.010 m³, mean absolute error for 0.016 m³ and the root mean squared error for 0.027 m³, figures which some may interpret as big differences between the assessments.

Based on both the internal consistency and deviation metrics, the degree of deviation due to a person’s subjectivity may vary significantly, which answers to the second question of the study. Experienced persons are likely to come to more accurate, consistent and reliable estimates, which is less likely for unexperienced persons, therefore experience may be a co-factor in subjectivity when segmenting point clouds. In addition, the degree of reliability is also affected by the type of input dataset in question, with single log datasets generally improving the reliability of estimates.

The answer to the third question seems to be the most evident based on both types of reliability assessments. For the same person working on the same type of input dataset and with the same shape reconstruction method, the results indicate that working with single log input datasets produce the most reliable estimates. This is further supported by the results of inter-person assessment, consistently showing better outcomes when working with single log datasets. This may come from the fact that a given person can work more focused when the point clouds are not crowded and, as such, the geometry of the points is easier to understand by the human brain. Form our experience, working with point clouds of single logs can also prevent error in segmentation, and rendering the point clouds in natural colors may help deciding more accurately the boundaries of segmentation, which is a feature that can assist less experienced subjects.

Looking again at the data, it seems now that a new question may come up in relation to intra-rater reliability. By the Poisson interpolation, it seems that reliability of the estimates was consistently higher, irrespective of the type of dataset used and experience level of the subject. This should be explored in the future to check the degree at which the agreement and reliability of estimates may be related to the shape reconstruction method used since this study considered only the segmentation step of the process as being a subjective one.

Ideally, reliability figures should be built based on very large datasets containing both a high variability in log biometrics as well as in species and in the environment used for measurements, which could not be accommodated by this study. In addition, having much more subjects to make the assessments would contribute to a better understanding of the reliability of estimates, as well as to more representative figures under the statistical point of view. It is difficult, however, to run larger experiments, mainly due to the resources needed to accommodate them. As an example, the speed of processing depends largely on the architecture and power of the computers used to run the data processing steps. For instance, the experienced subject processed the data for both data processing attempts from 12 November 2023 to 9 February 2024, where the segmentation step took in between 2 and 3 min for one observation, meaning that for the data samples used in the study, this processing step alone took close to 60 h for one person. In addition, segmentation alone was the most time-consuming processing step (data not shown herein). Furthermore, while the data collection time was found to be comparable to conventional methods (Borz and Proto, 2022), future studies could explore alternative observation strategies to further enhance efficiency. For example, optimizing the scanning path or employing automated mobile platforms could significantly reduce the time required for data acquisition in the field, complementing the gains made by automating post-processing tasks.

Based on the findings of this study regarding reliability and the resources required to perform specific steps, there is a potential to improve both the efficiency and reliability. One suggested approach is to transfer the decision-making process for cloud segmentation to the computer. This would be the primary strategy to minimize the accuracy gap between novice and experienced operators, as it would effectively remove the human subjectivity identified as a key factor in our results (Hobbs-Murphy et al., 2024). By automating segmentation, the reliability of volume estimates would no longer depend on an operator’s experience level. As an intermediate step, the development of semi-automated tools and standardized, guided workflows within processing software could also help reduce inconsistencies. However, the most robust solution lies in leveraging advanced machine learning and computer vision algorithms (Guo et al., 2020; Sarker et al., 2024). Most likely, automated decisions will rely on the architecture of the point clouds and probably will use the color information contained in them. To do so, effective machine learning and computer vision algorithms will be required (e.g., Guo et al., 2020; GitHub, Inc, 2024; Sarker et al., 2024) since, for similar applications, several machine learning and computer vision algorithms have been proposed to perform point cloud data segmentation and shape reconstruction (Guo et al., 2020; Wang et al., 2020; Fang et al., 2023; Sarker et al., 2024). For example, PointNet (Qi et al., 2017) and DGCNN (Wang et al., 2019b) are state-of-the-art deep learning models for point cloud segmentation (GitHub, Inc, 2024). PointNet, for instance, takes a point cloud as the input and processes it directly with MLPs (multi-layer perceptron) layers, followed by a max-pooling layer for implicit global feature extraction (Qi et al., 2017), while DGCNN uses dynamic graph convolution to capture local geometric structures (Wang et al., 2019a). Another example is the 3D Convolutional Neural Network (3D CNN), that extracts features from the point cloud and then classifies each point on the basis of those features (Lee et al., 2022; Wang et al., 2018). More recently, advanced models that improve the results for point cloud segmentation by using mask-based learning approaches include MaskNet (Sarode et al., 2020) and MaskNet++ (Zhou et al., 2022). Besides, PCN (Point Completion Network) (Yuan et al., 2018) is a novel learning-based method for shape completion. Unlike previous shape reconstruction algorithms that usually require structural assumptions (such as symmetry) or annotations (such as semantic classes) regarding the underlying shape, PCN directly processes raw point clouds with no prior knowledge of the underlying shapes (Yuan et al., 2018). This reinforces that the outlook for AI techniques in the field of point cloud data processing is promising, with advancements in these deep learning models paving the way for more accurate and efficient point cloud segmentation and shape reconstruction. This class of computational approaches would likely solve the problem of accurate, consistent and reliable log segmentation from background data of point clouds.

4 Conclusion

Intra- and inter-rater reliability of log volume estimates based on LiDAR point clouds depends on experience of the person running the processing steps, in particular point cloud segmentation. Although the intra- and inter-rater figures indicate a moderate to excellent reliability, one must consider also other metrics such as those characterizing the deviation of the results. In other words, the magnitude of reliability, as indicated by figures such as the bias, mean absolute error and mean squared error, needs to be defined and categorized to serve the end users of the data. The potential to arrive at the same estimates lies in the way in which some of the point cloud processing steps are taken, particularly in the decision on how to segment the data characterizing the logs from the surrounding environment. Automation of these data processing steps may lead to a higher effectiveness in both reliability and resources spent.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement

Written informed consent was not obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article because the person identifiable in photo is an author of this study.

Author contributions

GF: Data curation, Formal Analysis, Investigation, Resources, Software, Visualization, Writing – original draft, Writing – review and editing. SB: Conceptualization, Data curation, Formal Analysis, Funding acquisition, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing – original draft, Writing – review and editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This work received funding from Romanian Ministry of Education and Research, CNCS – UEFISCDI, project number PN-IV-P8-8.1-PRE-HE-ORG-2023-0141, and project number PN-IV-P8-8.1-PRE-HE-ORG-2024-0186, within PNCDI IV.

Acknowledgments

The Authors acknowledge the support of the Department of Forest Engineering, Forest Management Planning and Terrestrial Measurements for supporting this study. Data collection and part of data processing tasks were supported by the Hypercube 4.0 Project - Moving Wood Measurement Towards a new Dimension (https://sites.google.com/view/hypercube40). The Authors would like to thank to Subject 2 and Subject 3 for their voluntary involvement in the study.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frsen.2025.1506838/full#supplementary-material

References

Ahmad, S. S. S., Mushar, S. H. M., Shari, N. H. Z., and Kasmin, F. (2020). A Comparative study of log volume estimation by using statistical method. Educ. - J. Sci. Math. Technol. 7 (1), 22–28. doi:10.37134/ejsmt.vol7.1.3.2020