A New Pipeline to Automatically Segment and Semi-Automatically Measure Bone Length on 3D Models Obtained by Computed Tomography

The characterization of developmental phenotypes often relies on the accurate linear measurement of structures that are small and require laborious preparation. This is tedious and prone to errors, especially when repeated for the multiple replicates that are required for statistical analysis, or when multiple distinct structures have to be analyzed. To address this issue, we have developed a pipeline for characterization of long-bone length using X-ray microtomography (XMT) scans. The pipeline involves semi-automated algorithms for automatic thresholding and fast interactive isolation and 3D-model generation of the main limb bones, using either the open-source ImageJ plugin BoneJ or the commercial Mimics Innovation Suite package. The tests showed the appropriate combination of scanning conditions and analysis parameters yields fast and comparable length results, highly correlated with the measurements obtained via ex vivo skeletal preparations. Moreover, since XMT is not destructive, the samples can be used afterward for histology or other applications. Our new pipelines will help developmental biologists and evolutionary researchers to achieve fast, reproducible and non-destructive length measurement of bone samples from multiple animal species.


INTRODUCTION
Skeletal measurement is the pillar of many research applications, such as developmental studies on limb patterning (Summerbell, 1977;Galloway et al., 2009) and growth (Rosello-Diez et al., 2017;Marchini and Rolian, 2018), main axis segmentation (Casaca et al., 2014;Wong et al., 2015), evolutionary studies (Sears et al., 2006;Sheth et al., 2012;Kherdjemil et al., 2016), disease modeling Li et al., 1999;Rowe et al., 2018), adult phenotyping of mutant models (Boskey et al., 2003), etc. Whereas clinical musculoskeletal research often uses non-destructive imaging as routine (Cheng and Wang, 2018), fundamental evolutionary and development (evo-devo) studies often rely on differential staining of bone and cartilage (the so-called ex vivo skeletal preparations) (Rigueur and Lyons, 2014;Mead, 2020) and subsequent two-dimensional (2D) imaging for quantitative comparisons of the models of interest. Despite being broadly used, the skeletal preparation technique is ridden by several disadvantages. First, it is a destructive technique in the sense that the samples cannot be used for further histological or molecular applications. Second, it involves lengthy staining and clearing of cadavers, followed by laborious and damage-prone dissection of the skeletal elements of interest, in their preparation for imaging. Third, accurate measurements depend heavily on the imaged sample being positioned as flat as possible; otherwise, the apparent length will be shorter than the real one due to parallax error. As a result, measurements are often prone to user error and require multiple measurements to calculate standard error. These limitations prompted us to seek alternative methods to measure bone length in a fast and reliable way, without destroying the sample.
X-ray microtomography (XMT) is a non-destructive imaging modality that uses radiographic projections taken at multiple angles to reconstruct two-dimensional tomograms (literally, slice images) whose pixel values represent the X-ray attenuation coefficient at each point in the imaged object (Elliott and Dover, 1982). It is common to arrange the tomograms into a stack to produce a three-dimensional (3D) image (Elliott and Dover, 1982;Christiansen, 2016;du Plessis et al., 2017). We reasoned that since XMT can be used to image undissected samples, it would allow us to scan multiple samples relatively fast, with the advantage of preserving their integrity in case they are needed for further processing. Moreover, computer-based image processing would in principle allow us to maximize the automation of the subsequent 3D reconstruction and measurements. Methods based on manual landmarking and measurement of the 3D models have already been developed, but we wanted to eliminate the human interaction component as much as possible. Our main goal, in summary, was to develop a pipeline to scan multiple whole-animal samples in a batch, and bulk-process the scans to extract linear measurements of the bones of interest. Minimal user intervention was the most important requirement, both to enable its use as a workhorse method in skeletal development labs, but also to eliminate any potential unconscious bias in the process. Within this general goal, we established three objectives: (1) to identify standard conditions (i.e., combination of scan resolution and analysis parameters) that yield low inter-batch variability; (2) to obtain a versatile pipeline that could be applied with minimal variation to a range of developmental stages; (3) to achieve enough precision to detect even small phenotypes, such as the 5-10% bone-length differences we have previously described with some of our models (Rosello-Diez et al., 2017. In XMT, the ability to independently analyze distinct tissues relies on their accurate separation through so-called segmentation (Bouxsein et al., 2010;Weissheimer et al., 2012). Since bone has a high mean atomic number and linear X-ray attenuation coefficient compared to other body tissues, it generates strong contrast in X-ray imaging modalities such as XMT and can be readily segmented through thresholdbased methods where grayscale values determine what is bone tissue and what is background (Campbell and Sophocleous, 2014). There are several modalities of segmentation. Manual segmentation involves the manual selection of the areas of interest section by section, and is therefore quite laborious and subjective, thus prone to user error (Rathnayaka et al., 2011). Semi-automated methods, on the other hand, use algorithms like edge detection (Rathnayaka et al., 2011) and/or local differences in gray values (Zhang et al., 2010) with some user input for initial parameters. Another common method is automated segmentation, whereby image-processing algorithms are used to segment elements of interest with minimal to nouser interaction (Šajn et al., 2007;Okada et al., 2008;Heidrich et al., 2013;Yiannakas et al., 2016). Algorithm-based automatic segmentation, however, requires the user to have programming knowledge and a thorough understanding of mathematical algorithms related to the image processing software being used (Rathnayaka et al., 2011). Deep learning-based segmentation methods may out-perform simple thresholding, but at a high cost for initial training (Galvez-Hernandez et al., 2021).
There are a wide range of software solutions that can analyze XMT data in the form of digital imaging communications in medicine (DICOM) files to segment a variety of high-contrast tissues like lungs (Weissheimer et al., 2012;Reynisson et al., 2015), liver (Okada et al., 2008;Huhdanpaa et al., 2011) and bone (Rios et al., 2014;Mehadji et al., 2019;Taghizadeh et al., 2019). After some pilot testing of both open-source and commercial solutions, we settled on the Mimics Innovation Suite (Materialize, Leuven, Belgium) as the one that most readily suited our needs. Mimics has been previously benchmarked against other programs like Syngo , OsiriX (Reynisson et al., 2015) and ITK-snap (Weissheimer et al., 2012), and some of its key features are its flexibility, ease of use, sensitive and controlled segmentations (Weissheimer et al., 2012;Reynisson et al., 2015) and the possibility to integrate Python scripting modules to further extend its automation capabilities.
Here we present a semi-automated analysis pipeline for the fast and robust characterization of long-bone length, using two solutions that can be adopted by non-experts: (1) Python scripting and segmentation tools of the commercially available software package Mimics; (2) a standardized pipeline in the ImageJ plugin BoneJ. We report the advantages and caveats of each method.

Animal Experiments
Mouse embryo and pup samples were retained as residual tissues from other experiments in the Rosello-Diez lab, approved by the Animal Ethics Committee at Monash University (protocol 17048). Wild-type E17.5 samples were obtained from Asmu:Swiss crosses. P7 samples consisted of tTA-negative littermates (phenotypically wild-type) obtained from crosses of females containing the left-lateral plate mesoderm specific Pitx2-Cre (Shiratori et al., 2006) and a cartilage-specific Col2a1-tTA (Rosello-Diez et al., 2018) with males bearing a Tigre Dragon−DTA allele (Ahmadzadeh et al., 2020). P3 samples consisted of pups obtained from crosses of females containing the leftlateral plate mesoderm specific Pitx2-Cre, a cartilage-specific Col2a1-rtTA (Posey et al., 2009) and 1 copy of an Egr1 null allele (JAX#012924) with males bearing a Tigre Dragon−DTA allele (Ahmadzadeh et al., 2020). Control and experimental animals were separated based on rtTA genotype, regardless of the presence of the Egr1 null allele. Doxycycline hyclate (Sigma, 0.5 mg/ml in the drinking water, with 0.5% sucrose to increase palatability) was given to the pregnant female from E12.3 to E13.8. Noon of the day the vaginal plug was detected was considered E0.5.

Samples Collection and Fixation
Mouse embryos/pups were decapitated, the skin, internal organs and adipose tissue were removed. The samples were then fixed in 4% paraformaldehyde overnight at 4 • C and then stored in PBS until ready for XMT scans.

X-ray Microtomography Scans
A Siemens Inveon PET-SPECT-CT Small Animal scanner in CT modality was used for all experiments. Parameters: 20-and 40µm resolution, 360 projections at 80 kV, 500 µA, 600 ms exposure with a 500 ms settling time between projections. Binning was applied to vary resolution with 2 × 2 for 20 µm and 4 × 4 for 40 µm scans and data was reconstructed using a Feldkamp algorithm. The samples (beheaded embryo and pup bodies) were placed in supine position over custom-fitted foam bedding, so that the limbs were not in contact with any hard surface. Scan time for each sample is roughly 10 min (plus 30 min of set up per imaging session). However, scan time can depend on the scan resolution, number of projections, exposure time and settling time between projections.

Data Processing
Data was reconstructed using a Feldkamp algorithm and further converted to DICOM files using Siemens software.

Note on File Size Limits
Current versions of Mimics (v24) have a file size limit of 256 TB (Materialize, email communication July 2021). In ImageJ the pixel limit for a 3D image stack is 2 62 (i.e., 2 31 in z and 2 31 in xy, about 4.6 EB in total.) because of how the image slices are arranged and the pixels indexed with signed 32-bit integers. These limits are still well above the typical datasets obtained from XMT instruments, which are in the order of 1-10 GB. As a reference, our typical 20-µm datasets are 5-7 GB.

BoneJ Software and Pipeline
FIJI was used to develop the analysis pipeline. See section "Results" for an overview; each step in the process is outlined in detail below.

Data Loading
The entire image set is loaded into memory, as many of the downstream actions requires to have the entire image set in memory. A virtual stack is possible as the initial action, if it is necessary to crop or reduce the size of the image, if the image is too large for the computer being used. When the crop/size reduction is performed, FIJI will load the entire image stack into main memory.

Segmentation
This method employs the segmentation method of maximum entropy. Segmentation divides the image into multiple parts (at least two, usually). This yields the objects of interest, and everything else. Some methods, such as Otsu's three-class, can produce more than two layers. The choice of maximum entropy in this case is based on making the method as widely applicable as possible. Maximum entropy is implemented in ImageJ as a method to maximize the inter-class entropy (between the selected objects and everything else in the image). This involves the average gray values of the pixels present in each image (or in a reference image), the individual gray value of each pixel, and the gray values in the local neighborhood of each pixel. The initial threshold value the algorithm selects is based on the probability estimation from a histogram of all pixels in the image stack. Therefore, maximum entropy offers the best robustness, rejection of undesirable elements, acceptable performance in low-contrast images, and good noise tolerance. The method is described in Kapur et al. (1985). Threshold values of 280 to 500 HU were chosen, depending on the developmental stage, as a best compromise on the images included. This value may need to be adjusted depending on the CT scanner, reconstruction kernel, and the sample. The segmentation was further refined via the Erode and Watershed tools (2D versions).

Generation and Labeling of Aligned Boxes With Particle Analyser
BoneJ's Particle Analyser code was altered to use the bone's inertia tensor to define the three principal axes of each element, and then generate the minimum-size cuboidal box that fits the skeletal element and that is aligned with the principal axes (BoneJ styloid-r11). The coordinates and dimensions of these boxes are then appended to the Results window in Fiji. The Label Elements (3D) feature then allows the user to Ctrl-click the surfaces of interest and assign a label to them. These are added as a new column to the Results window. The macro below also includes the possibility of saving a screenshot of the 3D Viewer window with the desired orientation.

Mimics Software and Pipeline
Mimics Research (v21.0; Materialize, Leuven, Belgium) equipped with the scripting module was used to develop the analysis pipeline and the Python script described here. See section "Results" for an overview and each step in the process is outlined in detail below.

Data Loading
DICOM files are imported into Mimics via the New Project wizard.

Thresholding
As soon as the DICOM data is uploaded into Mimics, the first step is to distinguish the bones from all the other tissue by defining a range of Hounsfield Units (HU) that corresponds to bone density. The first Python command in BASILISC creates a mask labeled "ALL, " which will segment all the skeletal elements present, creating a global threshold specific to this mask. Since the goal was to measure the developing mineralized part from end to end, this step had to detect immature trabecular bone at the ends of the growing elements. In our uncalibrated XMT scans, we realized that the custom minimum threshold for bone tissue defined by Mimics (226 HU) often over-represented the actual bone tissue in the scans as it selected a greater area of tissue. The optimal lower threshold for the developmental stages of interest had thus to be determined empirically. Although Mimics can take both gray scale values (GV) and HU units, the input in the script can only be into GV, and therefore the first step was to transform the data into GV to adequately segment all bones from the rest of the tissue. This is achieved through the "segment" attribute seen on the last line of code for this section. In principle, different optimal thresholds exist for different scanning conditions and certainly for different developmental stages, as the ratio between woven and lamellar bone decreases, and hence BASILISC was designed in such a way that the user can select among three pre-defined thresholds via a pop-up menu. This can be easily changed within the following section of the script (pre-defined values appear in orange font):

Landmarking
The purpose of this step is to segment and uniquely label all the bones of interest, using Mimics tools. This is achieved using a function that prompts the user (via a pop-up window) to select a landmark on the bone of interest. The first step in landmarking is to select the bones of interest to create a list of "landmarks." This list contains the unique name of each selected bone and defines the order of segmentation during the process. The "indicate_landmark" function guides the user through each of the bones to be segmented by means of a dialog box, asking the user whether a given element is present in the scan or not and with two active buttons: "Select" and "Skip" (Figure 1C). The user has the option to skip an element if a given bone is not present in the scan, this would then be excluded from the analysis. If the "Select" button is activated, a second dialog box prompts the user to select a region (landmark) of the indicated element by simply clicking on it on one of the 2D views of the sample. BASILISC will automatically label and segment the selected element without further user interaction, through the Mimics "region_grow" function. A FOR loop has been included in the BASILISC script when executing the "indicate_landmark" function, so that the steps above are recursively followed for each of the bones of interest sequentially, using the name of each bone as an index within the FOR loop.

Generation of 3D Models
Once all the elements of interest have been segmented and labeled accordingly, a function has been created in BASILISC that creates 3D models of each element, "create_3D." A FOR loop in the script steps through each of the segmented bones and creates a 3D model of each at the highest possible resolution ( Figure 1D). This provided the most accurate measurements possible and since a limited number of bones are analyzed, computing time to create each 3D model did not increase significantly.

Measurement
Once BASILISC has automatically made 3D models, it will fit a center line to each bone within the "create_3D" function. This is achieved through the "analyze.create_line_fit_to_surface" attribute in Mimics. The script has been designed to then automatically obtain the length of the fitted line and save the measurement in a text file ( Figure 1E). Since this step is included within the function described above, which includes a FOR loop, the line is fitted as each 3D element is made, and the measurement is recorded progressively. The text file created will have the name of the given part, e.g., RIGHT HUMERUS, followed by a comma and the corresponding length of the element. This step is done automatically without any user input required after the landmarking step has been finalized. As the file created is only labeled with the name of the developmental stage created, the user should change the name of the text file to be sample specific before analyzing the next sample.

Manual Corrections During Image Analysis
For E17.5 samples, the radius and ulna were segmented together at the thresholds we use, but they could be easily separated using the Split mask function of Mimics, as their interaction surface was quite reduced. This requires identification of specific areas where there was an overlap in pixels, manually selecting areas that corresponded to each bone in a 2D view in at least three areas across the length of the bone (proximal, distance and middle area), before extrapolating the selected regions to create two separate elements. This tool is not available in BoneJ.

Pipeline Benchmarking
For Figures 2, 3, each specimen was scanned in triplicate or quadruplicate (on three or four different days), at two resolutions each (20 and 40 µm), and each of the six scans was segmented at two different lower thresholds (in Mimics: 650 and 398 HU for P7, 398 and 226 HU for E17.5; in BoneJ: 500 and 385 HU for P7, 400 and 280 HU for E17.5) to perform length measurements. Humerus, radius, femur, tibia and clavicle (the latter only for E17.5) were analyzed for two (P7) or three (E17.5) different specimens.

Skeletal Preparations
After embryo collection, the skin, internal organs and adipose tissue were removed. The samples were then fixed in 95% EtOH overnight at room temperature. To remove excess fat, the samples were then incubated in acetone overnight at room temperature.
To stain the cartilage, the samples were submerged in a glass scintillation vial containing Alcian blue solution (0.04% (w/v), 70% EtOH, 20% acetic acid) and incubated at least overnight at room temperature. The samples were de-stained by incubating them in 95% EtOH overnight, and then equilibrated in 70% EtOH, prior to being pre-cleared in 1% KOH solution for 1-10 h at room temperature (until blue skeletal elements were seen through). The KOH solution was replaced with Alizarin red solution (0.005% (w/v) in 1% KOH) for 3-4 h at room temperature. The Alizarin red solution was then replaced with 1-2% KOH until most soft tissues were cleared. For final clearing, the samples were progressively equilibrated through 20% glycerol:80% (1% KOH), then 50% glycerol:50% (1% KOH) and finally transferred to 100% glycerol for long-term storage.

Statistical Analysis
Experimental data are presented as the mean ± SD. P < 0.05 (two-way ANOVA) was considered statistically significant.

A Script for Bone-Length Measurement on XMT Scans With Minimal User Input Using Mimics
After testing multiple software solutions, we chose a commercial one (Materialize Mimics Research software) to develop the initial pipeline, based on the promising initial results and the possibility of automation via scripting. We thus developed a Python script that utilizes Mimics capabilities to segment and measure the mouse bones of interest (humerus, radius, ulna, tibia and sometimes clavicle) from CT scans. This script is   Figure 1A). See Supplementary Video 1 for an overview of the whole procedure. The first Python command in BASILISC segments all the skeletal elements, using a global threshold for bone tissue (Figure 1B). Since these were developmental samples, segmentation was applied to the mineralized region, not the cartilage poles. To increase its applicability to different developmental stages and scanning conditions, BASILISC was designed in such a way that the user can select among three pre-defined thresholds [Low, Medium, High, ranging from ∼200 to 500 Hounsfield Units (HU)] via a pop-up menu. These predefined values can be easily changed within the script (see section "Materials and Methods"). The next section of the script is landmarking, which uses Mimics tools to segment and uniquely label all bones of interest via a pop-up menu (Figure 1C). The user is prompted to select a region (landmark) of the indicated element by simply clicking on it on one of the 2D views of the sample. BASILISC will automatically label and segment the selected element without further user interaction. In the third section of the script, once all bones of interest have been segmented and labeled accordingly, a 3D model of each skeletal element is created (Figure 1D). Then BASILISC automatically fits a "line to surface" running from end to end along the center of each element. In the last section, the script automatically obtains the length of the fitted lines and saves the measurements to a comma-separated-values (csv) text file (Figure 1E).

A Pipeline for Streamlined Bone Length Measurement in the Open-Source ImageJ Plugin BoneJ
We also developed a similar pipeline to semi-automatically measure bone length using open-source software, as a broadly useful approach. We chose ImageJ 2 , a popular Java-based image analysis program developed at the United States. National Institutes of Health, as a platform for such a solution. Importantly, one of us (M.D.) previously developed an ImageJ plugin (BoneJ) with modules designed for analysis of bone geometry (Domander et al., 2021). ImageJ's ability to load and segment XMT data was the starting point we required. We also developed new modules and new capabilities for existing ones to suit our purposes (see section "Materials and Methods"). These capabilities will soon be included as a BoneJ update and the code is currently available at https://github.com/bonej-org/ BoneJ2/blob/0e295520e4956662e94363ab61a09c190d24b727/ Legacy/bonej/src/main/java/org/bonej/plugins/extensions/ MouseSkeleton.java. The pipeline can be divided into loading, segmentation, fitting a minimal bounding box to each element and measurement of said box ( Figure 1F). A brief overview follows.
We first loaded the whole data set (DICOM format) into FIJI [a package that includes ImageJ and multiple biologyoriented plugins for image analysis (Schindelin et al., 2012)]. We then proceeded with segmentation (i.e., the isolation of skeletal elements from everything else). As with Mimics, segmentation was applied to the mineralized region, not the cartilage poles. We chose to use a maximum entropy method because in our experience it offers the best robustness, acceptable performance with low-contrast images, and good noise tolerance (Kapur et al., 1985). The macro prompts the user to enter a value for the lower threshold, which needs to be adjusted depending on the stage being analyzed, and for our scans it ranged between 280 and 500 HU. We then used the Fill holes tool (2D version) to close solid structures such as the marrow, which facilitates downstream analysis. To clean up the data, we used the Remove outliers tool to remove small radio-dense objects that may appear in some images, followed by the Erode and the Watershed commands (2D versions), to facilitate the separation of structures that are frequently close to each other (e.g., radius and ulna). Finally, to obtain the length of the different bones without the time investment of selecting points manually on a 3D object, we first attempted to use the Particle Analyser tool of BoneJ (Doube, 2021) to calculate approximations of bone dimensions and plot them on 3D models of the segmented elements (Supplementary Figure 1A). Supplementary Figure 1B shows examples of the common measurements and fitted shapes that could be computed with BoneJ before this project was started, namely the maximum Feret's diameter, a fitted ellipsoid and the three moments of inertia. The Feret's diameter (also known as the maximum caliper diameter) is the longest distance between any two points on the particle's surface. For rod-like structures with relatively flat ends, Feret's diameter is a reasonable approximation of length. However, we noticed that in the case of real-life bones, the Feret's diameter tends to find a diagonal that is not aligned with the main bone axis, leading to an overestimation of the length (Supplementary Figures 1B-B , green arrowheads). Additionally, it is implemented as a brute-force method so it is computationally demanding. The fitted ellipsoid is the bestfit ellipsoid to the particle's surface mesh. Since long bones are not ellipsoidal in shape, the ellipsoid fit is not precise and the fitted ellipsoid's long axis tends to be longer than the bone's (Supplementary Figure 1B, blue mesh). Finally, the moments of inertia are defined with respect to the three mean principal axes of the segmented element (longest, shortest and middle, Supplementary Figure 1B, red lines). Therefore, we reasoned that the intersections between the longest axis and the surface of the 3D model of the skeletal element could be used to define a straight line approximating bone length. However, we noticed that while the principal axis ran reasonably well along the center of bones such as the femur (e.g., Supplementary Figure 1B), this was not the case for other bones such as the humerus, where the identified principal axis formed a steep angle with the line that a human user would use to measure bone length (Supplementary Figure 1B ). In summary, BoneJ was able to automatically provide good approximations of bone length for only some bones, depending on their shape.
Given these limitations, we developed a new capability for BoneJ's Particle Analyser, to generate and measure Aligned boxes fitting each element ( Figure 1G). The new tool uses the bone's inertia tensor to define the three principal axes of each element, and then generates the minimum-size cuboidal box that fits the skeletal element and that is aligned with the principal axes. The coordinates and dimensions of these boxes are then appended to the Results window in ImageJ. One challenge of this approach is that for whole-body scans, hundreds of elements are segmented and measured (Figure 1G), and the numeric IDs assigned to the different bones are not the same from sample to sample, precluding the streamlined export of only the measurements of interest. While filtering by volume in the Particle Analyzer tool allows reducing the number of elements to a certain extent, it is often impossible to reduce it only to the bones of interest. Therefore, we generated a new analysis module for BoneJ, Label Elements (3D), which allows the user to interact with the 3D Viewer in order to Control-click the elements of interest, subsequently selecting or writing the label for the clicked element in an iterative manner ( Figure 1H). That label is appended to the Results window (Figure 1I), which is subsequently exported as a csv file. Lastly, we developed an R-script to clean up the data tables (i.e., to remove all the rows but the ones of interest), sort the remaining values in alphabetical order and finally to bind several data tables together, if required (see section "Materials and Methods").

Standardized Conditions to Achieve Robustness to Batch Effect at Multiple Stages
A semi-automatic protocol to measure bone length would only be useful if it yielded consistent measurements for a given sample, scanned and analyzed repeatedly on different days. We thus explored different scan resolutions (20 and 40 µm) to analyze at least three technical replicates per resolution, obtained from two different postnatal day (P) seven mouse specimens (see section "Materials and Methods"), and assessed the reproducibility of the results (Figure 2). We report the Mimics and BoneJ results separately below.
In the Mimics pipeline, 40-µm scans showed relatively high inter-batch effect, especially for hindlimb bones, regardless of the threshold (Figures 2A,B, left), whereas 20-µm scans yielded more consistent measurements, including hindlimb bones, especially for the lower threshold (Figures 2C,D, left). To compare the batch effect more quantitatively, we then calculated the coefficient of variability (CV) for each bone's measurements across the three batches, and compared the CV for the different conditions and bones. A 2-way analysis of variance (ANOVA) showed that there was a significant effect of the imaging and analysis conditions, although the extent of it was likely distinct for the different bones ( Figure 2E). In summary, these results identified a 20-µm resolution and a 398-HU threshold as the optimal conditions to minimize inter-batch variability in this type of samples (i.e., P7 mouse long bones). Importantly, this protocol succeeded to separate radius from ulna in most cases, despite their proximity, with only a few scans requiring the manual use of Split Mask. However, it was not able to separate tibia from fibula ( Figure 2F), potentially affecting tibial length measurement (see section "Discussion" below).
With BoneJ, the 20-µm resolution also led to somewhat less variable results, but in this case the differences between different resolution-threshold combinations were not statistically significant ( Figure 2G). Moreover, 20-µm Low Threshold was the condition that yielded length values more similar to those obtained by Mimics (e.g., femur data in Figures 2C,D). As in the Mimics case, the femur and to some extent the humerus were the long bones most affected by the changes in resolution and threshold ( Figure 2G). One obvious difference with the Mimics pipeline is that, in the BoneJ pipeline, radius and ulna were segmented together in most cases ( Figure 2H). In the absence of a Split mask function that could be applied to separate those elements in a quick and efficient way, we concluded that radius measurements could not be included in the analysis.
In order to test the versatility of BASILISC across developmental stages, we performed a similar battery of scan and measurement analyses exploring different scan resolutions and threshold values at embryonic day (E) 17.5 (Figure 3). With Mimics, most of the conditions performed similarly in terms of reproducibility across batches, except for low resolution and threshold, for which some femora were not properly segmented and consequently their length was overestimated (Figures 3A-D). Although there was no overall difference in the CV across conditions (Figure 3E), the data trends suggested that a 398-HU threshold outperformed a 226-HU threshold in terms of inter-batch robustness. Moreover, length variation in individual bones due to differences in scan resolution were minimized with a 398-HU threshold (compare Figure 3A with Figure 3C and Figure 3B with Figure 3D). Importantly, tibia and fibula were again segmented together no matter the conditions of analysis (Figure 3F).
With the BoneJ pipeline, variability was somewhat higher, although again high resolution and low threshold tended to yield more robust results, and closer to those obtained by Mimics (Figures 3A-D,G). At this stage, radius and ulna were, as with Mimics, individually segmented in all cases. One important difference with the Mimics pipeline is that tibia and fibula were not always segmented together. Namely, low resolution and high threshold led to separated tibia and fibula in approximately half of the cases and therefore to high variability (Figures 3A,G); low resolution and low threshold led to fibula and tibia always segmented together, such that there was less variability in the measurements, although the measurement itself was somewhat overestimated (Figures 3B,G); high resolution and high threshold led to separated tibia and fibula in ∼88% of the cases but also to highly eroded skeletal elements (Figure 3H), which was associated with intermediate variability (Figures 3C,G); and high resolution and low threshold also led to separated tibia and fibula in ∼88% of the cases but with less eroded surfaces, which likely corresponds to more exact measurements (Figures 3D,G).

Internal Consistency Across Stages, Scan Resolutions and Segmentation Thresholds
Besides inter-batch variability, a good analysis pipeline should yield low intra-scan variability. In other words, the ratios between two specific bones should be very similar across different scans. One of the advantages of working with paired bones is the possibility of assessing internal consistency of the BASILISC method by measuring the left/right ratio for each bone and condition. We therefore calculated a left/right ratio for the P7 samples, including replicates, to determine how close the ratio was to the hypothetical value of 1 (i.e., equally long left and right paired bones) and how much variability there was between replicates. As shown in Figures 4A,B and Supplementary Figure 2, for P7 samples the variability was frequently quite low, with 20-µm resolution and low threshold again showing the lowest inter-batch variability and a L/R ratio remarkably close to 1. As parameters moved away from these optimal settings, there were several bones (femur for Mimics, femur and humerus for BoneJ) for which either the average value was not as close to 1 as for other bones, and/or the variability between batches was higher than 5% (Figures 4A,B and Supplementary Figure 2). Similarly, we calculated internal ratios for E17.5 bones to determine optimal scan and segmentation parameters. In this case we chose intra-limb ratios (humerus/radius and femur/tibia) as a normalization approach that could be achievable in the case that contralateral bones were not available (as it was our case for these scavenged samples). With this approach, the parameter of interest to estimate the precision of the approach was the variability of each measurement across replicates. High resolution and low threshold minimized variability, and in all conditions the Mimics pipeline was more accurate (Figures 4C,D and  Supplementary Figure 2).
In summary, these results suggest that, as expected, 20-µm resolution is preferred to 40-µm in order to achieve more robust measurements. In terms of threshold, a higher threshold is preferred to segment out bones that are close to each other, but if the separation is not always achieved this leads to higher variability of the measurement. We thus concluded that the exact threshold needs to be determined for each scanner and/or scanning condition. Of note, the threshold can be easily changed in the Mimics script and the ImageJ macro that we present here.

Correlation Bone Lengths Obtained via Mimics, BoneJ and Skeletal Preparations on the Same Samples
We next compared the bone lengths obtained by BASILISC (Mimics pipeline) with the lengths obtained from the same samples via skeletal preparations and digital measurement of photographed bones (Supplementary Figure 2D), a method frequently-used in developmental biology studies. We used eight long bones from two different specimens at P7. The linear relationship between both measurements was very good in all conditions (Figure 5A, p-value for Pearson correlation < 0.0001 in all cases), and the slopes were not significantly different (p = 0.2506), with an average common value of 0.9336. As expected, however, the BASILISC measurements that used lower thresholds tended to overestimate bone length (as the resulting 3D model includes less dense tissue), as indicated by the differences in the intercepts with the axes (Figure 5A). Overall, the conditions that yielded measurements with better correlation to the skeletal preparations were 20µm resolution and low threshold. We thus restricted our next analysis to these conditions, comparing BoneJ and Mimics BASILISC with the skeletal preparations ( Figure 5B). Both correlations showed very similar slopes and Y-intercepts with the P7 samples. Both BASILISC approaches led to greater measurements than the skeletal preparation, especially in the case of Mimics ( Figure 5B).
As the last test for our pipelines, we compared the measurements obtained by Mimics and BoneJ from the same samples and scan batches, covering a broader range of stages, i.e., from E17.5 to P7. We focused on the conditions that yield most robust results, namely 20µm resolution and low threshold. As seen in Figure 5C, there is remarkable correlation between both methods (R 2 = 0.9982), although BoneJ leads to smaller lengths than Mimics (which itself leads to greater lengths than the skeletal preps, Figures 5A,B).
We concluded that both pipelines were suitable for highthroughput analysis of bone length from whole-body scans, with minimal user intervention. Next, we tested these methods in proof-of-principle studies. Correlation between bone length measurements of 22 bones (10 from E17.5 embryos and 12 from P7 pups) obtained with our Mimics (X axis) and BoneJ (Y) pipelines. Each Mimics-BoneJ comparison was done on the same scan batch. In panels (A-C), the dashed black line represents a 1:1 correlation as a reference, and the dotted ones delimit the 95% confidence interval of the regression.

Application of BASILISC to the Analysis of Genetic Mouse Models With Altered Skeletogenesis
To test the utility of the BASILISC pipeline for the detection of skeletal phenotypes that affect bone length, we applied it to one of our models of limb asymmetry that we recently reported (Ahmadzadeh et al., 2020). In this model, diphtheria toxin expression (DTA) is activated in an inducible and reversible manner in the cartilage template that drives growth of the left limb bones, killing chondrocytes and mostly sparing the right limbs (Ahmadzadeh et al., 2020). While continuous expression of DTA from E12.5 generates extreme asymmetries by birth (not shown), transient activation from E12.3 to E13.8 (see section "Materials and Methods") leads to a subtler asymmetry (Figure 6A), well suited to test the sensitivity of BASILISC. We scanned eight P3 mouse pups (four control and four experimental ones) and applied the Mimics version of BASILISC to measure left and right humerus, ulna, radius, femur and tibia. We then calculated the left/right ratio for each bone, and compared this parameter for all bones between experimental and control mice, via 2-way ANOVA ( Figure 6B). This analysis revealed a significant effect of the Genotype on limb asymmetry, with asymmetries ranging from ∼5 to 20%, similar to our previous report (Ahmadzadeh et al., 2020). These results suggest that BASILISC can be used to quickly detect and characterize skeletal phenotypes affecting the long bones.

DISCUSSION
Here we have presented a fast and easy method to determine calcified bone length from XMT scans of whole mouse samples, without the need for dissecting the limbs, skinning or eviscerating the bodies. We tested our algorithm on a range of developmental stages (E17.5 through P7) that covers 9 days of very fast growth (Sanger et al., 2011).

Advantages Over Classic Skeletal Preparations
As any developmental biologist working on limb patterning and/or growth has experienced, analyzing one litter's worth of samples by the classic method of skeletal preparation, limb microdissection, photograph acquisition and length measurement on the 2D pictures takes at least ten days and close to 20 h of dedicated hands-on work (Rigueur and Lyons, 2014). With the BASILISC approach, decapitation and storage of the mouse bodies takes just a few minutes per litter; scan time is roughly 10 min per sample (plus 30 min of set up per imaging session); data loading and analysis takes ∼5 min per scan. On average, this amounts to 3-4 h of hands-on work per litter. Another advantage is that the measurement is threedimensional, as opposed to two-dimensional, and therefore robust to orientation errors. Lastly, the scan is not destructive, and therefore the samples can be later on processed for histology or other procedures (Hopkins et al., 2015;Baier et al., 2019).

Comparison to Previous Automation Approaches
In principle, the ideal pipeline for the kind of analysis that we perform here would be a fully automated method that recognized each of the long bones from a full-body scan, measured their length and exported those measurements without user intervention. In fact, there have been very impressive attempts at achieving this goal, combining object-based image analysis (that utilizes shape and context-dependent information in addition to pixel intensity values) with machine learning. For example, Heidrich et al. (2013) used Cognition Network Technology to extract objects and their properties from XMT data of chicken embryos at multiple stages, and then used these data to train a machine learning tool for automatic long bone classification. BASILISC is obviously far from achieving that level of automation, and while ImageJ is progressively including more machine learning modules, Mimics does not currently allow it. However, one of the strengths of BASILISC stems from its simplicity, as it can be used directly on any set of data, with minimal modification of the script. Contrary to this simplicity, the pipeline described in Heidrich et al. (2013) required a large training set of close to 3,000 instances, and also complex iterative thresholding methods. Moreover, although the classification achieved via this complex process was remarkably accurate, it still required supervision and was only applied to a reduced developmental window.
In contrast to other automation procedures where edge detection has been used to determine optimal thresholds (Zhang et al., 2010;Rathnayaka et al., 2011), here we rely on a global threshold optimized by trial and error to find an optimal range of gray values. BASILISC could be further refined by implementation of widely used edge detection algorithms to further improve the segmentation process and potentially increase the accuracy of the measurements obtained. However, since intensity can vary across the length of long bones (Rathnayaka et al., 2011), edge detection would require the use of multiple thresholds to reduce the degree of error in segmentation. Thus, here we opt for a single global threshold to extend the capabilities of the algorithm for a range of developmental stages. With this approach the earliest stage at which we were able to detect the mineralized portion of the bone was E15.5 (not shown).
To our knowledge, our Mimics script is the first algorithm that makes use of the Python library within the Mimics software to automate the segmentation, 3D modeling and analysis of length of skeletal elements. Previously, Mimics has been complemented with other scripting languages like MATLAB (Huhdanpaa et al., 2011) for image processing before segmenting the data, or software like Creo elements (Rios et al., 2014) to analyze scans after they have been segmented. In the latter case, though, the reference points for length measurement had to be manually selected, which is a time-consuming step to do in 3D. Through BASILISC, segmentation and length measurements can all be obtained within the one program (be it Mimics or ImageJ) and extensive programming knowledge is not required. Furthermore, we provide a processing pipeline that goes from optimized scanning conditions of mouse samples across a range of developmental stages, to streamlined image processing and data analysis, making BASILISC a readily available tool for the research community.
It should be noted, however, that the BoneJ version of BASILISC is less accurate than the Mimics one, especially at young stages (Figures 2, 3), and this could lead to some limitations in its applications. We hypothesize that BoneJ is more sensitive to discretization and thresholding artifacts because it uses binary pixel values for input, while Mimics makes a mesh over grayscale data, smoothing out noisy pixels at the crucial areas on the bones' extremities.

Comparison With "Real" Length Measurements
Strictly speaking, the "true value" of bone length cannot be obtained with absolute certainty by any method, as no measurement is devoid of error. However, given the widespread use of skeletal preparation, flat mounting and imaging to obtain 2D length estimations, we compared the measurements obtained by the BASILISC approach (Mimics and BoneJ) with the length obtained by skeletal preparations (Rigueur and Lyons, 2014). Of note, all conditions showed remarkable correlation between both methods, with 20-µm resolution and 398-HU threshold yielding measurements very well correlated to those obtained via skeletal preparations across the whole range of lengths analyzed. In general, the lengths obtained with Mimics and BoneJ were bigger than those obtained via skeletal preparations. While the threshold choice could have some effect on this comparison, the most likely explanation is that the classic skeletal preparation method involves quite a harsh procedure, including increasing gradients of glycerol that can shrink the sample up to 3-6% (Mabee et al., 1998). Another important consideration of the Mimics version is that the "line to surface" fitting method generates the longest possible distance, which in some cases is not strictly running parallel to the element's main axis (e.g., Figure 2F). This obviously generates a small bias in the measurement, but as long as the same method is used to compare different experimental conditions, this bias will be consistent and is not expected to contribute to the observed biological effect. Similarly, the aligned box approximation used in our BoneJ version of BASILISC can also potentially lead to slight length overestimation, depending on the shape of the bone. Importantly, we also showed that the optimal Mimics and BoneJ versions of BASILISC yield highly correlated measurements when applied to the same scans.

Limitations and Future Improvements
One potential limitation of fitting lines or boxes to the skeletal elements is that the measurement is sensitive to shape distortion due to mutations. If the shape distortion is not consistent across different specimens, then the length comparison will be less reliable. In this case, accurate measurement would require manual identification of multiple points along the length of the element, as it would be done with a 2D picture of a skeletal preparation. A potential solution for future versions of the algorithm could be to identify bending points along the length of the element and then measure the line defined by the points.
Another limitation is that some long bones are often segmented together in our pipeline, most often tibia and fibula (Figures 2F, 3F), and sometimes radius and ulna. This is because their automatic separation would require too high a threshold. While radius and ulna can often be quickly separated manually using the Split_mask function in Mimics (see section "Materials and Methods"), this is not feasible for the tibia and fibula, because their interaction surface is too large. This issue has some impact on the tibial measurements, because the fibula protrudes a bit farther than the tibia on the distal end (Figures 2F, 3F). However, the effect is quite minor and we showed that under the right conditions the error is very consistent, as the left/right ratio for the tibia is quite tightly centered on 1 (Figure 4). Therefore, the slightly overestimated tibial lengths can still be used for comparison purposes between different genotypes and/or treatments. The decision to invest more time in splitting them as opposed to accepting the error is up to the user and depends on two main aspects: the degree of accuracy desired and the time investment required to correct the error in all samples. In our case, we opted not to correct this segmentation error, as the minor gain in accuracy would be outweighed by the extra time investment.
The aforementioned limitation would be corrected with an automatic classification system based on machine learning (Heidrich et al., 2013), but the implementation of these methods is not straightforward. If this capability is implemented in the future, it could speed up image processing even further, as in theory no user intervention would be required to seed landmarks and/or label the bones of interest.

Other Applications
The current BASILISC pipeline in Mimics only measures length of the elements, because it applies the "fit line to surface" tool to the 3D models of the bones. However, it could in principle be adapted to measure width, by fitting a cylinder to the model and interrogating the width of the cylinder. This approach would require careful selection of the fitting parameters, so that the surface of the cylinder coincides with the surface of the 3D model. The BoneJ version, in turn, can directly be used to measure bone width, just by using the other dimensions of the aligned box.

AUTHOR'S NOTE
Beltran Diaz et al. present a semi-automated pipeline for fast and versatile characterization of bone length from micro-CT images of mouse developmental samples.

DATA AVAILABILITY STATEMENT
All the measurements presented in the study are included in the article/Supplementary Material/Table 1, further inquiries can be directed to the corresponding author/s.

ETHICS STATEMENT
The animal study was reviewed and approved by Monash Animal Ethics Committee.

AUTHOR CONTRIBUTIONS
SBD, CH, and XQ acquired the raw data. SBD and MD developed the essential tools. SBD, CH, JN, and AR-D performed the measurements. AR-D conceived and supervised the study, designed experiments, analyzed and interpreted the data, and drafted the manuscript, helped by CH. MD and OP co-supervised aspects of the study. AR-D and SBD prepared the figures. MD, MV, and OP contributed to the scientific discussion and revised the manuscript. All authors critically reviewed the manuscript and approved the submitted version. Feret's diameter. The red arrowhead points to a region where the longest principal axis does not align with the skeletal element. (C) Two different samples were scanned four times at 40-µm resolution, and each of those scans were analyzed twice, with identical or nearly identical parameters. (D) Two different samples were scanned four times at 20-µm resolution, and most of those scans were analyzed once. (E) Heatmap with the average coefficient of variability for each bone and resolution.
Supplementary Figure 2 | Coefficients of variability for P7 and E17.5 analysis using Mimics and BoneJ. (A,B) Heatmaps for the CVs of the Left/Right length ratios obtained after analysis of P7 bones with Mimics (A) and BoneJ (B) pipelines. (C,D) Heatmaps for the CVs of the indicated ratios obtained after analysis of E17.5 bones with Mimics (C) and BoneJ (D) pipelines. (E) Left/Right ratios for the P7 bones used to benchmark BASILISC, measured after classic skeletal preparations.
Supplementary Table 1 | All the measurements obtained and analyzed in the manuscript.
Supplementary Video 1 | Overview of the BASILISC process in Mimics, performed on one of the scans used for this study. The messages prompted by the script were taken as screenshots and added to the video clip. Please note that an artifact of the video capture causes the mouse cursor to be shown slightly displaced from its real position.