A modular U-Net for automated segmentation of X-ray tomography images in composite materials

X-ray Computed Tomography (XCT) techniques have evolved to a point where high-resolution data can be acquired so fast that classic segmentation methods are prohibitively cumbersome, demanding automated data pipelines capable of dealing with non-trivial 3D images. Deep learning has demonstrated success in many image processing tasks, including materials science applications, offering a promising alternative for a human-free segmentation pipeline. In this paper a modular interpretation of the U-Net (Modular U-Net) is proposed and trained to segment 3D tomography images of a three-phase glass fiber-reinforced Polyamide 66. We compare 2D and 3D versions of our model, finding that the former is slightly better than the latter. We observe that human-comparable results can be achieved even with only 10 annotated layers, and that a shallow U-Net yields better results than a deeper one. Consequently, Neural Networks (NNs) indeed show a promising avenue to automate XCT data processing pipelines, needing no ad hoc human intervention.

X-ray Computed Tomography (XCT), a characterization technique used by materials scientists for non-invasive analysis, has progressed tremendously over the last 10 years, with improvements in both spatial resolution and throughput [1,2]. Progress with synchrotron sources, including the recent European Synchrotron Radiation Facility (ESRF) upgrade [3], made it possible to look inside a specimen without destroying it in a matter of seconds [4], sometimes even faster [5].
This results in a wealth of 3D tomography images (stacks of 2D images) that need to be analyzed and, in some applications, it is desirable to segment them (i.e. transform the gray-scaled voxels into semantic categorical values). A segmented image is crucial for quantitative analyses; for instance, measuring the distribution of precipitate length and orientation [6], or phase characteristics, which can be useful for downstream applications like estimating thermo-mechanical properties [7].
XCT images typically have billions of voxels, weighing several gigabytes, and remain complex to inspect manually even using dedicated, costly software (e.g. Avizo, VGStudioMax). Using thresholds on the gray-level image is an easy, useful method to segment phases in tomographies, but it fails in complex cases, in particular when acquisition artifacts (e.g. rings, beam hardening, phantom gradients) are present. Algorithms based on mathematical morphology, like the watershed segmentation [8,9], help tackle more complex scenarios, but they need human parametrization, which often requires expertise in the application domain. Thus, scaling quantitative analyses is expensive, creating a bottleneck to process 3D, or even 4D (3D with time steps), XCT data.
Deep learning approaches offer a viable solution to this issue because Neural Networks (NNs) can generalize patterns learned from annotated data. An NN is a statistical model, originating from perceptrons [10], capable of approximating a generic function. Convolutional NNs (CNNs) [11], a variation adapted to spatially-structured data (time series, images, volumes), have made great advances in computer vision tasks. Since the emergence of popular frameworks like TensorFlow [12], more problem-specific architectures have been proposed, such as fully-convolutional NNs [13], a convolution-only type of model used to map image pixel values to another domain (e.g. classification or regression).
[6] trained a model to segment three phases in 3D nanotomographies of an Al-Cu alloy, showing that even a simple CNN can reproduce patterns of a human-made segmentation. [14] optimized a SegNet [15] to segment dendrites of different alloys, including a 4D XCT. [7] identified aluminides and Si phases in XCT using a U-Net, an architecture that, along with its many flavors [16-19], has shown success in a variety of applications [20-22]. Finally, [23] combined U-Nets with classic segmentation algorithms (e.g. marker-based watershed) to segment grain boundaries in successive XCTs of an Al-Cu specimen as it undergoes Ostwald ripening steps.
In this paper, an annotated 3D XCT of glass fiber-reinforced Polyamide 66 is presented as an example of a segmentation problem in Materials Science that can be automated with a deep learning approach. Our NN architecture, the Modular U-Net (Fig. 2 and Fig. 3), is proposed as a generalized representation of the U-Net, explicitly factorizing the U-like structure from its composing blocks.
Like [23], we compare three variants on the composite material dataset, focusing on the dimensionality of the convolutions (2D or 3D), obtaining qualitatively human-like segmentations (Fig. 1 and Fig. 4) with all of them, although 2D convolutions yield better results (Fig. 5a). We find that (for the considered material) the U-Net architecture can be shallow without loss of performance, but batch normalization is necessary for the optimization (Fig. 5b). Finally, a model's learning curve (Fig. 6) shows that only ten annotated 2D slices are necessary to train our NN.
Our results show that NNs are not only a quality-wise satisfactory but also a practically viable solution for XCT segmentation, as they require relatively little annotated data and shallow models (therefore faster to train).

Data
The data used in this work is composed of synchrotron X-ray tomography volumes recorded using 2 mm × 2 mm cross-section composite specimens of Polyamide 66 reinforced by glass fibers. A volume of 2048³ voxels, referred to as Train-Val-Test (Fig. 1 and Fig. A.7), was cropped to get rid of the specimen's borders, and its ground truth segmentation was created semi-manually with ImageJ [24] (using Fiji [25]) and Avizo. The tomography of another specimen of the same material was also processed and partially annotated in order to evaluate our models (Fig. 4); it is further referred to as the "crack" volume because of the fracture inside it. Both image volumes and the former's annotations are publicly available [26].
Acquisition. X-ray tomography scans were recorded on the Psiché beamline at the Synchrotron SOLEIL using a parallel pink beam. The incident beam spectrum had a peak intensity at 25 keV, defined by the silver absorption edge, with a full width at half maximum bandwidth of approximately 1.8 keV. The total flux at the sample position was about 2.8 × 10¹² photons/s/mm². The detector placed after the sample consisted of a LuAG scintillator, a 5× magnifying optics, and a Hamamatsu CMOS 2048 × 2048 pixel detector (effective pixel size of 1.3 µm). 1500 radiographs were collected over a 180° rotation with an exposure of 50 ms (full scan duration of 2 minutes). The sets of radiographs were then processed using the PyHST2 reconstruction software [27] with the Paganin filter [28] activated to enhance the contrast between the phases.
Phases. The three phases present in the material are visible in Fig. 1: the polymer matrix (gray), the glass fibers (white, hatched in blue), and damage in the form of pores (dark gray and black, contoured in yellow), referred to as porosity here. One can observe that the orientation of the fibers is unevenly distributed; they are mostly along the axes Y (vertical) and Z (out of the plane).
Ground truth. The data was annotated in two steps: first, the fiber and the porosity phases were independently segmented using Seeded Region Growing [29]; then, ring artifacts that leaked into the porosity class were manually corrected. A detailed description of the procedure is presented in Appendix A.
Data split. The ground truth layers (of the Train-Val-Test volume) were sequentially split (their order was preserved to train the 3D models, Section 2.2) into three sets: train (1300 layers), validation (128 layers), and test (300 layers). A margin of 86 layers between these sets was adopted to avoid information leakage. The train layers were used to train the NNs, the validation layers were used to select the best model (during the optimization), and the test layers were used to evaluate the models (Section 3).
Class imbalance. Due to the material's nature, the classes (phases) in this dataset are intrinsically imbalanced. The matrix, the fiber, and the porosity represent, respectively, 82.3%, 17.2%, and 0.5% of the voxels.
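Class fractions like the ones above are straightforward to measure from a labeled volume. Below is a minimal sketch; the label convention (0 = matrix, 1 = fiber, 2 = porosity) follows the one used for the published volumes, but the array here is synthetic, not the actual dataset.

```python
# Sketch: computing class-wise voxel fractions from a segmented volume.
# The volume here is randomly generated with roughly the paper's imbalance.
import numpy as np

def class_fractions(labels: np.ndarray, n_classes: int = 3) -> np.ndarray:
    """Return the fraction of voxels belonging to each class."""
    counts = np.bincount(labels.ravel(), minlength=n_classes)
    return counts / labels.size

rng = np.random.default_rng(0)
volume = rng.choice([0, 1, 2], size=(64, 64, 64), p=[0.823, 0.172, 0.005])
fractions = class_fractions(volume)
assert np.isclose(fractions.sum(), 1.0)
```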

Neural Network
Problem formulation. Let x ∈ X = [0, 1]^(w×h×d) be a normalized gray 3D image. Its segmentation y ∈ Y = C^(w×h×d), where C = {0, 1, ..., C − 1}, contains a class value in each voxel, which may represent any categorical information, such as the phase of the material. In this setting, a segmentation algorithm is a function f : X → Y. In this section we present our approach (that is, the f) used to segment the data described in the previous section.
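The formulation above can be sketched in code: f maps a normalized gray volume to a label volume of the same spatial shape, typically by taking the most likely class at each voxel. The `dummy_model` below is a hypothetical stand-in for any per-voxel scorer such as a U-Net; it is not the paper's model.

```python
# Sketch of the problem formulation: a segmentation f maps x in [0, 1]^(w×h×d)
# to y in {0, ..., C-1}^(w×h×d) by taking the argmax of per-voxel class scores.
import numpy as np

C = 3  # number of classes (matrix, fiber, porosity)

def dummy_model(x: np.ndarray) -> np.ndarray:
    """Stand-in scorer: returns per-voxel class scores of shape (*x.shape, C)."""
    rng = np.random.default_rng(0)
    return rng.random(x.shape + (C,))

def segment(x: np.ndarray) -> np.ndarray:
    """f: X -> Y, taking the most likely class at each voxel."""
    scores = dummy_model(x)
    return scores.argmax(axis=-1)

x = np.random.default_rng(1).random((8, 8, 8))  # normalized gray volume
y = segment(x)
assert y.shape == x.shape
```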
First, a generic U-Net architecture, which we coined Modular U-Net (Fig. 2), is proposed; then, the modules used in this work are briefly exposed (detailed description and hyperparameters in Appendix B); finally, three variations of the Modular U-Net, based on the input, convolution, and output nature (2D or 3D), are presented. Our training setup (loss function, optimizer, learning rate, data augmentation, implementation framework, and hardware) is described in Appendix C.
The left/right side of the architecture corresponds to an encoder/decoder, a repetition of pairs of ConvBlock and DownBlock/UpBlock modules. They are connected by concatenations between their respective parts at the same U-level, which corresponds to the inner tensors' resolution (a higher U-level means a lower resolution). The U-depth, a hyperparameter, is the number of U-levels in a model, corresponding to the number of DownBlock (and equivalently UpBlock) modules.
The ConvBlock is a combination of operations that outputs a tensor with the same spatial dimensions as its input, though the number of channels may differ; in our models it always doubles. The assumption of equally-sized input/output is optional, but we adopt it for the sake of simplicity because it makes the model easier to use with an arbitrarily shaped volume. The number of channels after the ConvBlock at a given U-level is 2^(U-level) × f0, where f0 is the number of filters in the first convolution.
The DownBlock/UpBlock divides/multiplies the input tensor's shape by two in every spatial dimension: width, length, and depth in the 3D case. In other words, a tensor with shape (w, h, d, c), where c is the number of channels, becomes (w/2, h/2, d/2, c) after a DownBlock and (2w, 2h, 2d, c) after an UpBlock.
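The shape arithmetic described above can be checked with a few lines of plain Python. This is only a bookkeeping sketch of the 2D encoder path under the stated conventions (ConvBlock doubles the channels, DownBlock halves each spatial dimension); the starting channel count f0/2 is an assumption made so that the first ConvBlock output lands on f0.

```python
# Sketch: shape bookkeeping for the Modular U-Net blocks (2D case).
# ConvBlock keeps spatial size and doubles channels; DownBlock halves
# each spatial dimension; UpBlock doubles them back.
def conv_block(shape):
    w, h, c = shape
    return (w, h, 2 * c)

def down_block(shape):
    w, h, c = shape
    return (w // 2, h // 2, c)

def up_block(shape):
    w, h, c = shape
    return (2 * w, 2 * h, c)

f0 = 16
shape = (256, 256, f0 // 2)  # channels double inside each ConvBlock
for level in range(4):       # U-levels 0..3 of the encoder path
    shape = conv_block(shape)
    # channels after the ConvBlock at U-level `level` are 2**level * f0
    assert shape[2] == 2 ** level * f0
    shape = down_block(shape)
assert shape == (16, 16, 8 * f0)
```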
In [16], for instance, the ConvBlock is a sequence of two 3×3 convolutions with ReLU activation, the DownBlock is a max pooling, and the UpBlock is an up-sampling layer. In [17], the ConvBlock is a 3D convolutional layer, and in [18] it is a nested U-Net.
The ConvBlock used here (Fig. 3) is a sequence of two 3×3 (×3 in the 3D case) convolutions with ReLU activation, a residual connection with another convolution, batch normalization before each activation, and dropout at the end. The DownBlock is a 3×3 convolution with 2×2 stride, and the UpBlock is a 3×3 transposed convolution with 2×2 stride.

Variations: 2D, 2.5D, and 3D
Since our dataset contains intrinsically 3D structures, we compared the performance of this architecture using 2D and 3D convolutions. The former processes individual tomography z-slices (XY plane) independently, and the latter processes several at once (i.e. a volume).
We also compared an "intermediate" version, which we coined 2.5D, that processes one tomography z-slice at a time using 2D convolutions but takes five slices at the input, as if the pairs of slices above and below were channels of the 2D image. Table B.2 summarizes these differences.
The visual characteristics of the z-slices are mostly invariant, and we observed a high correlation between adjacent z-slices; therefore, the 2D and 2.5D models take 2D cuts in the XY plane, as in Fig. 1.
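Building a 2.5D input amounts to stacking a z-slice with its neighbors as channels. The sketch below assumes the z-axis is the last volume axis and clamps at the borders; these are implementation assumptions, not details taken from the paper.

```python
# Sketch: building a 2.5D input, where a z-slice and its two neighbors
# above and below are stacked as channels of a 2D image (5 channels total).
import numpy as np

def slice_25d(volume: np.ndarray, z: int, context: int = 2) -> np.ndarray:
    """Return slice z with `context` neighbors on each side as channels."""
    zs = np.clip(np.arange(z - context, z + context + 1), 0, volume.shape[2] - 1)
    return np.stack([volume[:, :, k] for k in zs], axis=-1)

vol = np.random.default_rng(0).random((64, 64, 100))
x = slice_25d(vol, z=50)
assert x.shape == (64, 64, 5)  # the center channel is the slice itself
```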

Results
In this section we present a compilation of the qualitative and quantitative results obtained. The segmentations from the three Modular U-Net versions presented in Section 2 are quantitatively compared, then an ablation analysis and the learning curve of the 2D model are presented. All the quantitative analyses were made on the test split (see Section 1), which contains 1300 × 1040 × 300 ≈ 406 × 10⁶ voxels. Other images and videos are provided, along with further detailed analyses, as supplementary material in Appendix D. The trained models and the data used to produce our results are publicly available online [26,30].
Qualitative results. Figure 1 shows two snapshots of the segmentation generated with a 2D model. Figure 4 shows the segmentation obtained with the 2D model on the crack volume, used to evaluate the method's usability in terms of processing speed and its applicability to other data. The segmented data was then used to generate a surface of the crack inside it. Using an NVIDIA Quadro P2000 (5 GB), it took 32 minutes to process a 1579 × 1845 × 2002 (≈ 5800 Mvoxels) volume. This shows that this type of analysis could be carried out almost in real time using typical hardware available at a synchrotron beamline.
Baseline. For the sake of comparison, we considered the expected performance of two theoretical models: the Zeroth Order Classifier (ZeroOC) and the Bin-wise ZeroOC. The ZeroOC relies only on the class imbalance, while the Bin-ZeroOC takes the individual gray values into consideration, leveraging information from the histograms of each class (Fig. A.8); see Table D.4 for more details.
Quantitative results. Figure 5 presents a comparison of the three model variations (2D, 2.5D, and 3D) and an ablation study of the 2D model in terms of number of parameters and performance. The three variations of the Modular U-Net are evaluated with varying sizes. The models are scaled with the hyperparameter f0 (values inside the parentheses in Fig. 5a). The performance is measured using the Jaccard index, also known as Intersection over Union (IoU), on each phase (class). Our main metric is the arithmetic mean of the three class-wise indices, and the baseline (minimum) is 76.2% (Table D.4). This metric provides good visibility of the performance differences and summarizes the precision-recall trade-off; other classic metrics, even the area under the ROC curve [31], are close to 100% (see Appendix D), so the differences are hard to compare.
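The main metric can be sketched as follows. This is a minimal NumPy implementation of the mean class-wise Jaccard index; equivalent functionality exists in scikit-learn (`jaccard_score` with `average="macro"`), which is not shown here.

```python
# Sketch: mean class-wise Jaccard index (IoU), the paper's main metric.
import numpy as np

def mean_classwise_jaccard(y_true, y_pred, n_classes=3):
    ious = []
    for c in range(n_classes):
        t, p = (y_true == c), (y_pred == c)
        inter = np.logical_and(t, p).sum()
        union = np.logical_or(t, p).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
# class 0: IoU = 1.0; class 1: 1/2; class 2: 2/3
assert abs(mean_classwise_jaccard(y_true, y_pred) - (1 + 0.5 + 2 / 3) / 3) < 1e-9
```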
Ablation study. Figure 5b is a component ablation analysis of the 2D model with f0 = 16. Starting with the 2D model with the default hyperparameters (see Fig. 3 and Appendix B), we retrained other models removing one component at a time. The learnable up/down-samplings were replaced by "rigid" ones (see Fig. 3), the 2D convolutions were replaced by separable ones, and the batch normalization was replaced by layer normalization. Finally, we varied the U-depth from 2 to 4.
Notice that the model without dropout performed better than the reference model, but we kept dropout in our default parameters because the same effect did not occur with other variations and sizes.
Learning curve. Finally, we computed the learning curve of the 2D model (Fig. 6) by logarithmically reducing the size of the training dataset from 1024 z-slices down to a single layer.
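The logarithmic reduction can be sketched as halving the number of training z-slices at each step, taking them sequentially along the z-axis. The exact subset sizes and selection below are assumptions for illustration, consistent with the described range (1024 down to 1).

```python
# Sketch: logarithmically reducing the number of training z-slices
# (1024, 512, ..., 1) to build a learning curve.
# Subsets are taken sequentially along the z-axis, with no selection strategy.
import numpy as np

n_train = 1024
sizes = [n_train // (2 ** k) for k in range(11)]  # 1024 down to 1

def take_first_slices(volume: np.ndarray, n: int) -> np.ndarray:
    """Sequentially reduced training volume: keep the first n z-slices."""
    return volume[:, :, :n]

vol = np.random.default_rng(0).random((32, 32, n_train))
subset = take_first_slices(vol, sizes[3])
assert subset.shape == (32, 32, 128)
```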

Overview
Our models, trained in one to three hours, achieved qualitatively very satisfactory results from a Materials Science application point of view, with a mean class-wise Jaccard index of 87% and an F1-score macro average of 92.4% (Table D.5). We stress the fact that these results were achieved without any strategy to compensate for the (heavy) class imbalance (82.3% of the voxels belong to the matrix class); they may be further improved using, for instance, re-sampling strategies [32,33], class-balanced loss functions [34,35], or self-supervised pre-training [36].
The results obtained with another specimen (the crack volume), thus with slight variations in the acquisition conditions, were of good quality (inspection by an expert showed no visible error in the segmentation) and obtained far faster than with the manual process. The crack was mostly, and correctly, segmented as porosity without retraining the model, showing its capacity to generalize, an important feature for its practical use, although some misclassified regions can be seen as holes (missing pieces) in the fracture's surface (Fig. 4a).
Moreover, the processing time achieved (32 minutes) is indeed a promising prospect compared to classic approaches.

Segmentation errors
As highlighted in Fig. 1, the model's mistakes (in red) are mostly on the interfaces of the phases, and are fairly comparable to a human annotator's. We (informally) estimate that they are within the error margin because, in some regions, there is no clear definition of the phases' limits. The fibers may show smooth, blurred phase boundaries with the matrix, while part of the porosities are below the image resolution.
Another issue is the loss of information (all-zero regions) in some rings (e.g.: Fig. A.7b).In such cases, even though one could deduce that there is indeed a porosity, it is practically impossible to draw a well-defined porosity/matrix interface.
Finally, we reiterate that the ground truth remains slightly imperfect despite our efforts to mitigate these issues. For instance, in Figure 1 we see, inside the C-shaped porosity, a blue region, meaning that it was "correctly" segmented as a fiber, yet there is no fiber in it.

Model variations
Figures D.9 and 6 confirm that, no matter the model variation, the porosity is harder to detect. Still, the qualitative results are reasonable, and we underline that the Jaccard index is more sensitive to underrepresented classes because the size of the union will always be smaller (see Equation C.1).
Contrary to our expectations, Figure 5a shows that the 2D model performed systematically better than the 3D one (albeit the difference is admittedly small). We expected the 3D model to perform better because the morphology of the objects in the image is naturally three-dimensional; besides, other work [23] has obtained better results in binary segmentation problems. We raise two hypotheses about this result: (1) the set of hyperparameters is not optimal, and (2) the performance metric is biased because the annotation process uses a 2D algorithm.

Model ablation
Figure 5b contains a few interesting findings about the hyperparameters of the Modular U-Net: 1. using learnable up/down-sampling operations indeed gives more flexibility to the model, improving its performance compared to "rigid" (not learnable) operations; 2. separable convolutions slightly hurt the performance, but they reduce the number of parameters by 60%; 3. decreasing the U-depth, therefore shrinking the receptive field, improved the performance while reducing the model size by 75%; on the other hand, increasing the depth had the opposite effect, multiplying the model size by four while degrading the performance; 4. batch normalization is essential for the training; notice that the version without batch normalization is out of scale in Fig. 5b, and its performance corresponds to the ZeroOC model (Table D.4).
Model depth (item 3). This finding gives valuable information for our future work because shallower models require less memory (i.e. bigger crops can be processed at once), making it possible to accelerate the processing. We hypothesize that the necessary receptive field is smaller than the depth-three model's; therefore, a spatially bigger input captures irrelevant, spurious context for the classification.

Learning curve
Figure 6 highlights the most promising finding in our results. Our model was capable of learning with only 1% of the training dataset (about ten z-slices) even with no "smart" strategy to select the layers in the experiment (the training volume was sequentially reduced along the z-axis), and even a single layer was sufficient to achieve nearly the same performance.

Conclusion
An annotated dataset of a three-phase composite material was presented, and a reinterpretation of the U-Net architecture as a modularized structure was proposed as a solution to scale up the segmentation of such images. Our models achieved satisfactory results, showing a promising avenue to automate processing pipelines of XCTs of this material with only a few annotated tomography slices. The Modular U-Net is a conceptually more compact interpretation of its precursor, providing a more abstract representation of this family of network architectures.
2D and 3D versions of the Modular U-Net were compared, showing that both were capable of learning the patterns in a human-made annotation, but the former was systematically better than the latter. An ablation study provided insights into the hyperparameters of our architecture, especially revealing that we might further accelerate the processing with smaller models. We also qualitatively analyzed the usability of our models on an image of a different specimen, confirming that it is a viable solution.
Our code is available on GitHub, and the data and trained models referred to here are publicly available on Zenodo [26,30].

Future work
Carrying on the encouraging results found in this work, we envision scaling up the use of our models to process other volumes and 4D XCTs. We also plan to identify better strategies to deal with ill-defined regions (e.g. matrix/fiber blurred interfaces), an issue modestly mitigated in our work, to improve our approach's usability. Two possibilities are considered: (1) post-processing the class probabilities to detect bad predictions, for instance using the method proposed by [37]; (2) defining a special class for uncertain regions. Finally, we might also search for better ways to measure a prediction's consistency with respect to the objects' 3D morphology.

Acknowledgement
João P. C. Bertoldo would like to acknowledge funding from the BIGMÉCA research initiative, funded by Safran and MINES ParisTech. Henry Proudhon would like to acknowledge beam time allocation at the Psiché beamline of Synchrotron SOLEIL for proposal number 20150371.
The three splits (train, validation, test) together comprise 1900 z-slices, while the volumes (c) and (d) correspond to their last 300 z-slices. The values in the segmentation volumes (predictions and ground truth) are 0, 1, and 2, which respectively correspond to the phases matrix, fiber, and porosity. The volumes (e) and (f) correspond to the volume in Fig. 4. The .raw files have complementary .raw.info files containing metadata (volume dimensions and data type) about their respective volumes.
Optimizer and Learning Rate. We used AdaBelief [42], an optimizer that combines the training stability and fast convergence of adaptive optimizers (e.g. Adam [43]) with the good generalization capabilities of accelerated schemes (e.g. Stochastic Gradient Descent (SGD) [44]). We used a learning rate of 10⁻³ for 100 epochs (10 batches each, with batch size 10), followed by a linearly decaying rate down to 10⁻⁴ over another 100 epochs. Adam gave equivalent results but took longer (more epochs) to converge.
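The learning-rate schedule above can be sketched as a plain function of the epoch index; this is an illustration of the described schedule, not the paper's implementation (which would typically plug such a function into a Keras `LearningRateScheduler` callback).

```python
# Sketch of the learning-rate schedule: 1e-3 for the first 100 epochs,
# then a linear decay down to 1e-4 over the next 100 epochs.
def learning_rate(epoch: int) -> float:
    if epoch < 100:
        return 1e-3
    t = min(epoch - 100, 100) / 100.0  # fraction of the decay phase elapsed
    return 1e-3 + t * (1e-4 - 1e-3)

assert learning_rate(0) == 1e-3
assert abs(learning_rate(150) - 5.5e-4) < 1e-12  # halfway through the decay
assert abs(learning_rate(200) - 1e-4) < 1e-12
```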
Data augmentation. In order to increase the variability of the data, random crops are selected from the data volume, then a random geometric transformation (flip, 90° rotation, transposition, etc.) is applied. As our training dataset is reasonably large, we used a simple data augmentation scheme, but richer transformations may be applied as long as they result in credible samples.
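A minimal sketch of such a scheme, assuming 2D crops and equal transformation probabilities (the actual crop sizes and probabilities are not specified in the text):

```python
# Sketch of the augmentation scheme: a random crop followed by random
# geometric transformations (flip, 90-degree rotation, transposition).
import numpy as np

def augment(image: np.ndarray, crop: int, rng: np.random.Generator) -> np.ndarray:
    h, w = image.shape[:2]
    y = rng.integers(0, h - crop + 1)
    x = rng.integers(0, w - crop + 1)
    patch = image[y:y + crop, x:x + crop]
    if rng.random() < 0.5:                 # random vertical flip
        patch = np.flip(patch, axis=0)
    patch = np.rot90(patch, k=rng.integers(0, 4))  # random 90-degree rotation
    if rng.random() < 0.5:                 # random transposition
        patch = np.swapaxes(patch, 0, 1)
    return patch

rng = np.random.default_rng(0)
img = rng.random((256, 256))
patch = augment(img, crop=64, rng=rng)
assert patch.shape == (64, 64)
```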
Implementation and hardware. We trained our models using Keras [45] with TensorFlow's [12] GPU-enabled version with CUDA 10.1, running on two NVIDIA Quadro P4000 GPUs (2 × 8 GB). The implementation of our experiments is available on GitHub. Table D.5 shows a report with classic classification metrics and the Jaccard index by class, along with their macro/micro averages. As one can see, the accuracy, precision, and recall of the matrix and the fiber are all close to 100%, while the porosity's scores are considerably lower.
Figure D.9 shows that, as the matrix and the fiber phases have high scores, the mean Jaccard index is driven by the performance on the porosity detection, which can also be seen in Fig. 6. Recall that the baseline model Bin-ZeroOC is expected to have a Jaccard index of 35% on the porosity and a mean of 76.2% (Table D.4).

Figure 2: Modular U-Net: a generalization of the U-Net architecture.

Figure 4: Segmentation of the crack volume. Color code: blue represents the fiber and yellow represents the porosity. Supplementary video: youtu.be/rmBTZrcMrCk. (a) Two orthogonal planes inside the specimen; the fibers are rendered in 3D at the bottom of the volume, and the crack is rendered as a surface. (b) A crop from the vertical plane in a slice passing through the fracture. The fiber segmentation is hatched in blue and the porosity segmentation is contoured in yellow.

Figure 5: Modular U-Net variations comparison. On the x-axis, the number of parameters; on the y-axis, the mean class-wise Jaccard indices. (a) The Modular U-Net 2D, 2.5D, and 3D versions are scaled with f0 (in parentheses), the number of filters of the first convolution of the first ConvBlock. (b) Components were removed individually, or replaced by alternatives. Removals: dropout, gaussian noise, residual (skip connection), batch normalization (out of scale). Replacements: convolutions by separable ones, learnable DownBlock/UpBlock by rigid ones (Fig. 3), and batch normalization by layer normalization. We also compare the effect of the U-depth, i.e. the number of levels in the U structure. Notice that the data point of the batch normalization removal is out of scale on the y-axis for the sake of visualization.

Figure 6: Learning curve of the 2D model with default hyperparameters (Appendix B).

Figure A.7 was generated in Fiji [25] with the volume (a). Figure 1 was generated in Avizo with volumes (a) and (d), the latter derived from volumes (b) and (c). Figure 4b was generated in Avizo with volumes (e) and (f). All the supplementary videos were generated in Avizo.

Figure D.10 presents detailed confusion matrices. The top one is expressed as a percentage of the total voxel count. The two bottom ones are normalized, respectively, by row and column, and their diagonals correspond to the recall and the precision of each class.

Figure D.10: Confusion matrices of the 2D model on the test set, normalized in three different ways. (top) Normalized by the sum of all cells. (bottom left) Normalized by the sum of ground truth labels (a.k.a. support) of each class (each line sums up to 100%); the diagonal corresponds to the recall values. (bottom right) Normalized by the sum of predicted labels of each class (each column sums up to 100%); the diagonal corresponds to the precision values.

Table A: Published 3D volumes. All the data necessary to train and test the models presented in this paper is publicly available on Zenodo [26]. A demo of how to read the data is available on GitHub.
pa66.test.error_volume.raw (d) Disagreement between the ground truth and the model's prediction on the test set: 1 means incorrect, 0 means correct.
crack.zip
crack.raw (e) Data of the non-annotated volume containing a crack inside.
crack.prediction.raw (f) Segmentation generated with the best 2D model on the crack volume.