Segmentation of tooth enamel microstructure images using classical image processing and U-Net approaches

Background: Tooth enamel is the hardest tissue in the human organism, formed by prism layers in regularly alternating directions. Under side illumination these prisms form the Hunter-Schreger Band (HSB) pattern, composed of light and dark stripes resembling fingerprints. We have shown in previous works that the HSB pattern is highly variable, appears to be unique for each tooth, and can be used as a biometric method for human identification. Since this pattern cannot be acquired with sensors, the HSB region in the digital photograph must be identified and correctly segmented from the rest of the tooth and the background. Although these areas can be removed manually, this process is not reliable because the excluded areas vary with the individual's subjective impression. Therefore, the aim of this work was to develop an algorithm that automatically selects the region of interest (ROI), making the entire biometric process straightforward. Results: We used two different approaches: a classical image processing method which we called anisotropy-based segmentation (ABS) and a machine learning method known as U-Net, a convolutional neural network. Both approaches were applied to a set of extracted tooth images. U-Net with some post-processing outperformed ABS in the segmentation task, with an Intersection Over Union (IOU) of 0.837 against 0.766. Conclusions: Even with a small dataset, U-Net proved to be a potential candidate for fully automated in-mouth application. However, the ABS technique has several parameters that allow a more flexible segmentation, with interactive adjustments specific to image properties.


Background
Tooth enamel is the most highly mineralized tissue (95% mineral) in the organism and is able to withstand high temperatures, abrasion, aggressive environments (e.g., humidity, pressure), and degradation over time [1,2]. These characteristics have proved beneficial for human identification in mass disasters, such as the identification of victims carbonized at high temperatures, for whom visual or fingerprint identification is no longer possible [2].
Most mammals, including humans, have tooth enamel with prism layers regularly arranged in alternating directions nearly perpendicular to each other [3]. These prisms act like optic fibers when exposed to a direct light source [4] and form the Hunter-Schreger Bands (HSB), an optical phenomenon that appears as light and dark stripes on the tooth surface and can be seen with a magnifier when a low-power light is placed sideways [5]. A thorough analysis of HSB may provide information on species life history and taxon identification [6], since the pattern is frequently preserved in tooth fossils dating from thousands to over 60 million years [5]; it can also be used to infer diet adaptation [7], and it provides data for biometric identification [8].
Biometric analysis is the process of personal identification using biological, physical, or behavioral features of an individual. Currently, there are several biometric methods, including facial recognition, iris, fingerprint, ear, and gait. These methods are applied using automated systems and software that can distinguish individuals reliably. Desirable characteristics of biometric data are that they be highly unique, easily obtainable, time-invariant (no significant changes over time), easily transmittable, non-intrusively acquirable, and distinguishable by humans without special training [9]. Previous works show that a photograph of a tooth with side lighting enhances the HSB appearance [10] and that the digital image of the HSB can be processed to highlight the bands [11]. The light and dark pattern formed by the HSB resembles fingerprints and also appears to be unique for each tooth and between individuals [8]. Moreover, tooth image acquisition is non-invasive, practical, and can be performed with a photographic camera using a macro lens.
To be used as a human identification method, the HSB features, unlike fingerprints, cannot be extracted directly from sensors, because they are only visible under side illumination. A problem in image processing of HSB is the removal of background image areas, including areas of the tooth with poorly defined bands and areas outside the tooth. These regions can interfere with digital image filtering and produce false HSB patterns. Although these areas can be removed manually, this process is not reliable since the selected areas vary with the individual's subjective perception. In addition, it is a time-consuming procedure. Therefore, the aim of the present project is to develop software that automatically selects the region of interest (ROI), that is, the HSB region whose bands are reliable for biometric comparison, thus making the entire process straightforward.
The workflow for the whole biometric process is illustrated in Figure 1. The segmentation approaches presented here were built to create a proper mask to segment the HSB region, which in turn is enhanced by an HSB filter and eventually passes through biometric feature extraction, either for storage as a template or for comparison for personal identification. Note that for real application, tooth photographs should be taken at tooth crown level in the mouth; anterior teeth are therefore preferred for ease of access. In the present study, we evaluated the segmentation techniques on extracted teeth, but with similar potential for in-mouth applications.

Methods
In this section, we present the methods developed to automatically segment the HSB. We considered two different approaches for automatic segmentation: classical image processing and a convolutional neural network. The first is an image processing pipeline based on the anisotropic property of HSB. The second is U-Net, a convolutional network created for medical image segmentation [12].

Anisotropy-based segmentation (ABS)
This approach takes advantage of the well-established orientation pattern of HSB and its relatively low variation in orientation. Since the tooth photograph is taken with the tooth vertically positioned, the bands will appear overall horizontal, so no rotation needs to be applied. The physical property of an object that presents different values when measured along different axes is called anisotropy. In these images, this property arises mainly in regions containing HSB patterns, and harnessing it to segment the area of interest is possible because: 1) anisotropic features around a tooth are very unlikely to exist, except for the HSB of neighboring teeth; 2) focusing the camera on the HSB pattern of the targeted tooth surface leaves other regions blurred; and 3) the HSB appears only under side lighting, and the residual light effect on neighboring teeth is expected to be weaker. The ABS workflow is shown in Figure 2; a complete description follows below. All parameter values chosen were validated across the available dataset using a visualization method described in [13].

Smooth and resize
Since the initial and enhanced HSB images are noisy, high-resolution images, and the segmentation process does not require this level of detail, we start by smoothing the enhanced HSB image E into E_s:

E_s = W_(27,27) ⋆ E,

where W_(27,27) is a 27 × 27 integral kernel (whose sum equals one) and the star sign (⋆) denotes cross-correlation between the kernel and the image E. In expanded form, let a be a kernel of size (2h + 1) × (2w + 1) and B an image; the cross-correlation between them can be computed as

(a ⋆ B)(i, j) = Σ_{m=−h..h} Σ_{n=−w..w} a(m, n) B(i + m, j + n).

The ROI roughly follows the shape of the tooth crown. Since anterior teeth are preferable, we assumed their dimensions as default. Consider a front-view image of an anterior tooth vertically positioned: its frontal surface has a vertical dimension greater than the horizontal dimension at the tooth's neck, which is where HSB appears most clearly due to a thinner enamel layer, so the ROI is usually taller than it is wide (Figure 3). Hence, after visual comparison of the algorithm results using width-to-height proportions from 1 (square) to 1/5, the chosen and validated proportion for the ROI was set to 1/2. To maximize performance and standardize the segmentation process, the HSB-enhanced grayscale image E_s, with dimensions of M rows and N columns (M × N), is resized to an image J of dimensions P × Q. The resize equations are parameterized by u, which was determined empirically by visually comparing results for different values and validated afterwards; it is proportional to the new desired image size, and we used u = 200 as default. The scalars k_x = 61 and k_y = 122 are related to a kernel filter size used later, and their values were chosen to fit the ROI proportion k_x / k_y = 1/2. With these settings, the size of image J is fixed at J_size = 1.5 × 10^6 pixels. Let J_b be the blurred version of J, computed as

J_b = W_(3,6) ⋆ J,

where W_(3,6) is a 3 × 6 integral kernel. This horizontally elongated kernel reduces noise in the background that could mimic HSB, mainly in regions of the tooth with artifacts from reflected light, while causing virtually no disturbance to the real HSB.
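The smoothing step above is a cross-correlation with a unit-integral (averaging) kernel. A minimal NumPy sketch follows; zero padding at the borders is an assumption, since the paper does not state its boundary handling:

```python
import numpy as np

def box_correlate(img, kh, kw):
    # cross-correlation with a kh x kw unit-integral kernel (all entries
    # equal, summing to one), i.e. a local mean; zero padding is assumed
    pad_h, pad_w = kh // 2, kw // 2
    padded = np.pad(img.astype(float), ((pad_h, pad_h), (pad_w, pad_w)))
    out = np.empty(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + kh, j:j + kw].mean()
    return out

# E_s = W_(27,27) * E and J_b = W_(3,6) * J would both reuse this helper
```

For production use, a separable or FFT-based filter (e.g. `scipy.ndimage.uniform_filter`) is far faster; the explicit loop above only mirrors the formula.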

Edge detection
After obtaining image J_b, we decompose its features into the x and y directions to initially locate the HSB region. Edge detection is performed with horizontal and vertical gradient kernels, producing the filtered images G_y and G_x. G_y highlights, among other possible undesirable horizontal features, the HSB region, while G_x highlights regions where vertical structures are predominant. To remove high-frequency gradients, mostly representing noise, we clip the edge maps to the range [0, 255].
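The exact gradient kernels are not reproduced above; as an illustration only, the decomposition into G_x and G_y with clipping might look like this sketch, which assumes 3 × 3 Sobel operators:

```python
import numpy as np
from scipy.signal import correlate2d

# Assumed kernels: the paper does not name them, so Sobel operators
# are used here for illustration only.
SOBEL_Y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]], dtype=float)
SOBEL_X = SOBEL_Y.T

def edge_maps(j_b):
    # G_y responds to horizontal structures (like HSB stripes),
    # G_x to vertical ones; both are clipped to [0, 255]
    g_y = np.abs(correlate2d(j_b, SOBEL_Y, mode="same"))
    g_x = np.abs(correlate2d(j_b, SOBEL_X, mode="same"))
    return np.clip(g_x, 0, 255), np.clip(g_y, 0, 255)
```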

Blurring
Once the edges are detected, blurring them makes regions with concentrated, stronger edges form blobs, whereas regions of sparse, weak edges tend to disappear. Considering that HSB is confined to a single region in G_y, after blurring its edges a few blobs are created and the most intense values appear near the core of the ROI. To perform this, let K be a unit-integral filter kernel of dimensions k_y × k_x. Both images G_x and G_y are filtered by three repeated cross-correlations with K, resulting respectively in the blurred images

B_x = K ⋆ K ⋆ K ⋆ G_x,  B_y = K ⋆ K ⋆ K ⋆ G_y.

The shape and size of kernel K are directly related to the level of blurring desired and, consequently, to how much detail of the ROI shape is captured. A larger K results in heatmaps (described in the next section) with more rounded borders, while a smaller K results in more jagged borders (Figure 4). The more cross-correlation repetitions, the larger and more vertically elongated the area becomes, because the number of rows in K is greater than the number of columns. The number of repetitions, three, was chosen after visual comparison of the general overlap between the ground truth (ROI) and the automatically segmented images across all sets.
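The triple blurring with K can be sketched with a box filter; `scipy.ndimage.uniform_filter`'s unit-integral box kernel matches K's definition, and the kernel sizes below are the paper's defaults:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def blur_edges(edge_map, ky=122, kx=61, reps=3):
    # three repeated cross-correlations with a ky x kx unit-integral
    # kernel: strong, clustered edges merge into blobs, sparse edges fade
    out = edge_map.astype(float)
    for _ in range(reps):
        out = uniform_filter(out, size=(ky, kx))
    return out
```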

Subtraction
The blurred images B_x and B_y represent blurred edge maps with gradients of vertical and horizontal features, respectively. Subtracting one from the other creates a heatmap in which isotropic regions have values around zero, high values match the locations of high values in the first term, and low values match the locations of high values in the second term. Therefore, with B_y as the first term, the highest values of the heatmap indicate the ROI. Let the heatmap H_1 for HSB location at position (i, j) be

H_1(i, j) = B_y(i, j) − 2 B_x(i, j),

where B_x is subtracted twice to compensate for the blur J → J_b (Equation 7). Then, from H_1 we remove negative values:

H_2(i, j) = max(H_1(i, j), 0),

so that H_2 pixel values lie in the interval R_p = [0, p], where p varies among images, depending on the size of the ROI.
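In code, the subtraction and clipping steps reduce to two lines (the factor 2 follows the paper's compensation for the horizontal pre-blur):

```python
import numpy as np

def hsb_heatmap(b_y, b_x):
    # H_1 = B_y - 2*B_x: B_x is subtracted twice to compensate for the
    # horizontal pre-blur of J; negative (non-HSB) values are zeroed
    h1 = b_y - 2.0 * b_x
    return np.maximum(h1, 0.0)  # H_2
```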

Equalize histogram
Since the range of gray-level values varies among different H_2 heatmap samples, an equalized heatmap H_3 is created to normalize grayscale values for the automated process. Let m_b be the b-th gray level and n_b the number of pixels in heatmap H_2 taking the value m_b. The equalized-histogram heatmap H_3 is obtained by the histogram-equalization mapping H_3 : R_p → {0, ..., 255}, assigning to every pixel with value m_b the gray level

γ(m_b) = round(255 · Σ_{k ≤ b} n_k / Σ_k n_k),

that is, the cumulative histogram of H_2 rescaled to [0, 255].
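This is standard histogram equalization, sketched below with NumPy; the exact rounding convention is an assumption:

```python
import numpy as np

def equalize_heatmap(h2):
    # map the gray levels of H_2 onto {0, ..., 255} through the
    # cumulative histogram, so all samples share one value range
    levels, counts = np.unique(h2, return_counts=True)
    cdf = np.cumsum(counts) / h2.size
    lut = np.round(255 * cdf).astype(np.uint8)
    return lut[np.searchsorted(levels, h2)]
```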

Define threshold
Since the image gradient of H_3 tends to increase toward the center of the ROI, descending this gradient leads to its borders (Figure 5A). The threshold for the lower bound determines the extent to which the HSB are enclosed. Let β be the ordered set of gray levels m_a present in H_3, and define f(b) = β_b, where β_b is the b-th value in β (Figure 5B). We find the adaptive threshold by looking for the last gap greater than one in the sequence of discrete values of β (Figure 5C). For that, the discrete derivative of f(b) is computed as

f′(b) = f(b + 1) − f(b).

To prevent premature selection of a high threshold, and the consequent segmentation of too small a region because of minor gaps among the highest values of β, f′(b) is smoothed by the linear cross-correlation

f′_s(b) = (W_(3) ⋆ f′)(b),

where W_(3) is a linear unit-integral kernel of length 3, ℓ = 1, and f′_s(b) ∈ ℕ. To solve the threshold decision, let {ı, ..., ȷ} be the ordered set of all b satisfying the gap inequality; the final threshold θ to segment H_3 is then computed from the last such gap.

Obtain raw mask
The raw mask M_1 is defined by thresholding H_3 with θ:

M_1(i, j) = 1 if H_3(i, j) ≥ θ, and 0 otherwise.
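The gap-based threshold search can be sketched as follows. The exact gap inequality and final formula for θ are not reproduced above, so this sketch simply takes the first gray level after the last smoothed gap larger than one; treat it as an assumption, not the authors' exact rule:

```python
import numpy as np

def adaptive_threshold(h3):
    beta = np.unique(h3)                    # ordered gray levels in H_3
    f_prime = np.diff(beta).astype(float)   # discrete derivative f'(b)
    # smooth with a length-3 unit-integral kernel W_(3) to skip minor gaps
    f_s = np.convolve(f_prime, np.ones(3) / 3, mode="same")
    gaps = np.where(f_s > 1)[0]
    # threshold: first level after the last large gap (fallback: minimum)
    return beta[gaps[-1] + 1] if gaps.size else beta[0]

def raw_mask(h3):
    return h3 >= adaptive_threshold(h3)     # M_1
```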

Distance transform
At this point, segmentation by the mask M_1 might still produce foreground branches that reach outside the ROI. In addition, the border of M_1 might have sharp angles, which are not useful for biometric comparison due to band discontinuity in concave areas. To mitigate this problem, we first obtain a distance map D using the Euclidean distance transform

D(y, x) = min_{(v,u)} sqrt((y − v)² + (x − u)²),

where (Y, X) is the set of foreground pixel coordinate pairs (y, x) and (V, U) is the set of background pixel coordinate pairs (v, u) in M_1.

Morphological growing
Next, a new background removal is performed on D, with threshold φ = α · max(D), where α determines the reduction rate of the foreground area of D based on its maximum pixel value. By default α = 0.6, resulting in the reduced foreground mask M_r = {D > φ}. M_r is then repeatedly dilated by a circle-shaped morphological operator ρ_c of radius k_x · 0.5 until any foreground pixel of M_r overlaps a background pixel of M_1, at which point the result becomes the final mask M_2. The pseudocode for this process is described in Algorithm 1.

Algorithm 1 Morphological growing mask
Require: M_1, M_r
ρ_c ← binary closed disk of radius k_x · 0.5
while True do
    M_g ← dilate(M_r, ρ_c)
    if M_g overlaps the background of M_1 then
        return M_2 ← M_r
    end if
    M_r ← M_g
end while
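The distance-transform reduction and the growing loop of Algorithm 1 might be sketched with SciPy as follows (k_x = 61 gives a disk radius of about 30; `radius` is exposed as a parameter here):

```python
import numpy as np
from scipy import ndimage

def morphological_growing(m1, alpha=0.6, radius=30):
    # D: Euclidean distance of each foreground pixel to the background
    d = ndimage.distance_transform_edt(m1)
    # reduced foreground M_r: keep only pixels deeper than alpha*max(D)
    m_r = d > alpha * d.max()
    # circle-shaped structuring element rho_c
    yy, xx = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    disk = (yy ** 2 + xx ** 2) <= radius ** 2
    # grow M_r until the next step would spill into M_1's background
    background = ~m1.astype(bool)
    while True:
        grown = ndimage.binary_dilation(m_r, structure=disk)
        if np.any(grown & background):
            return m_r        # final mask M_2
        m_r = grown
```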

Resize mask and segment image
In the end, the binary mask M_2 is resized back to M × N, resulting in the mask M_3, and the final segmented image S_E is defined as the pixel-wise product

S_E(i, j) = E(i, j) · M_3(i, j).

Convolutional Neural Network (U-Net)
The U-Net was initially created for precise segmentation of medical images and has proved useful in diverse branches of medical imaging, including Magnetic Resonance Imaging (MRI) [14,15], Optical Coherence Tomography (OCT) [16], and microscopy [17,18]. We harnessed the U-Net architecture to automate the segmentation of the HSB. The U-Net consists of two parts: an encoder and a decoder. The encoder has 4 blocks, each composed of 2 consecutive convolutions and one max-pooling operation. The decoder also has 4 blocks, each composed of a deconvolution (up-conv), a concatenation with the intermediate output (copy and crop) from the corresponding encoder block, and 2 convolutions. The encoder and decoder are connected by a 'bridge' consisting of 2 convolutions (Figure 6). Our architecture differs only in that padding keeps the feature-map size the same throughout the network rather than decreasing it, there is no cropping, and the input is a single-channel grayscale image of 480 × 320 pixels.
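The described architecture can be sketched in Keras as below. The filter counts (64 up to 1024) follow the original U-Net paper and are an assumption here, since the text does not restate them; likewise, 480 is taken as the image height. 'Same' padding replaces the original cropping, as described above.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # two consecutive 3x3 convolutions with 'same' padding (no cropping)
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(480, 320, 1)):
    inputs = layers.Input(shape=input_shape)
    skips, x = [], inputs
    # encoder: 4 blocks of (2 convs + max-pooling)
    for f in (64, 128, 256, 512):
        x = conv_block(x, f)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    # bridge: 2 convolutions
    x = conv_block(x, 1024)
    # decoder: 4 blocks of (up-conv + skip concatenation + 2 convs)
    for f, skip in zip((512, 256, 128, 64), reversed(skips)):
        x = layers.Conv2DTranspose(f, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, f)
    # sigmoid head producing the binary mask
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(inputs, outputs)
```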
The whole dataset was divided into three groups: training, validation, and test (15, 6, and 10 teeth, that is, 60, 24, and 40 images, respectively). To avoid biasing the results by letting the network see different images of the same tooth in the training and test groups, we sorted the images so that all 4 images of a tooth went into the same group. The teeth were randomly assigned to the three groups. Given the limited dataset size for such a deep network, data augmentation was performed during training to maximize variability, creating 5 variations of each training image and one variation of each validation image. These augmented images (and their masks) were created randomly at each epoch, totaling 300 training and 24 validation images per epoch. The augmentation parameters were: rotation in {−20, 20} degrees, height and width shift up to 20%, brightness range up to 50%, zoom (in or out) up to 20%, and vertical and horizontal flips.
Since the final output is a binary mask, the final activation is a sigmoid function. The optimizer of choice was Adam with learning rate α = 1.0 × 10⁻⁵, and the chosen loss was binary cross-entropy, defined as

L = −(1/N) Σ_{i=1..N} [y_i log p_i + (1 − y_i) log(1 − p_i)],

where y_i is the true label and p_i the predicted value for each of the N pixels in the image. The metric used to assess performance was Intersection Over Union (IOU). The U-Net described above was trained for 300 epochs with different batch sizes (1, 5, 10, 15, and 20), and the weights of the epoch with the best validation IOU score were used as the final trained model.
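The loss above can be written directly in NumPy; the clipping constant `eps` is an implementation detail added here for numerical safety:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-7):
    # y: ground-truth labels in {0, 1}; p: sigmoid outputs in (0, 1)
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
```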

Evaluation
To assess the results obtained with the developed methods, images were manually segmented to generate a ground truth, and the manually and automatically segmented regions of the same image were compared. The manual selection was performed on the preprocessed grayscale images (Figure 1B) as follows:
1. Select a region visually larger than the HSB area.
2. Apply the HSB filtering algorithm ("Filtering" in Figure 1) and visually locate improper regions.
3. Select a better-fitting region if there are improper areas.
4. Repeat steps 2 and 3 until no change is needed.
5. The final region is selected.
Since this process gradually decreases the segmented region size, the manually selected region of interest (ROI) is considered to be the largest area with recoverable HSB or, more precisely, the largest area whose filtered output from the HSB filtering algorithm has no visually important deformations for biometric usage.

ABS experimental results
The ABS algorithm described above was tested using four sets of images of the same 31 extracted teeth. All photographs were taken with the teeth vertically positioned, incisal portion upwards and cervical portion downwards. Two sets were taken with left side lighting (L1 and L2) and two with right side lighting (R1 and R2).
The average Intersection Over Union (IOU) score between manually and automatically segmented regions was 0.766 ± 0.106 SD (standard deviation). The residuals can be separated into the Exclusive Manual-Segmented region Over Union (EMSOU), a potential region not selected by the algorithm, and the Exclusive Automate-Segmented region Over Union (EASOU), the trespassed area outside the ROI. Since EASOU is considered an area with no recoverable HSB, its minimization is preferred over that of EMSOU. The average EMSOU was 0.191 ± 0.127 SD and the average EASOU was 0.042 ± 0.044 SD (Table 3, ABS all). The large SD relative to the average for EMSOU and EASOU indicates a skewed, non-normal distribution of these results, with sparse high errors in only a few images, as can be seen in Figure 7. Some examples of ABS performance are shown in Figure 8.
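The three scores can be computed from the two binary masks; note that they partition the union, so they always sum to 1:

```python
import numpy as np

def segmentation_scores(manual, auto):
    manual, auto = manual.astype(bool), auto.astype(bool)
    union = np.logical_or(manual, auto).sum()
    iou = np.logical_and(manual, auto).sum() / union      # shared area
    emsou = np.logical_and(manual, ~auto).sum() / union   # missed ROI area
    easou = np.logical_and(auto, ~manual).sum() / union   # trespassed area
    return iou, emsou, easou
```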

U-Net experimental results
According to the training results (Table 1), a batch size of 5 produced the best result, with a loss of 0.0544 and an IOU score of 0.827 on the validation dataset during training. This IOU score (IOUd) is computed using the predicted decimal values in [0, 1] for every pixel, without any thresholding. The training progression of the loss and IOU score is shown in Figure 9. The other metrics were very close among all batch sizes. The learning curves for batch sizes of 5, 10, and 15 were more consistent, with increasing overfitting (Figure 9), whereas batch sizes of 1 and 20 produced highly fluctuating values throughout training, with the lowest loss too early and too late, respectively. These cases of highly variable metrics (e.g., IOU score) during training exemplify a trade-off between frequent parameter updates per epoch without generalization when using small batches (the former) and infrequent parameter updates per epoch with excessive generalization when using too-large batches (the latter). It has been shown that large batches tend to reach sharp minima, whereas small batches move towards flat minima, reducing the generalization gap [19], that is, the difference between validation and test dataset results.

Comparison
The evaluation results of both models on the test dataset are shown in Table 2. The IOU score was computed using a threshold of 0.5 for every pixel with predicted value in [0, 1]. The model with the best IOU score (0.8393) was the U-Net trained with a batch size of 15, followed closely by batch sizes of 20, 10, and 5. The lowest loss (0.747) was again achieved by the model with batch size of 5, while the other batch sizes still showed large differences (≈ 0.10) between IOUd and the thresholded IOU ∈ [0, 1]. For further analysis, we kept only the masks predicted by the U-Net model with batch size of 5, because it has the lowest loss and the smallest gap between the IOUd and IOU scores. The 40 predicted test masks underwent post-processing with the convex hull algorithm [20] (U-Net-CH) in order to also select regions in concave areas left out by the foreground mask. Since the HSB pattern usually appears as a convex area, these concavities might handicap the subsequent biometric performance. The average IOU score between manually and automatically segmented regions was 0.837 ± 0.06 SD, almost 0.08 above that of ABS. The average EMSOU was 0.101 ± 0.084 SD and the average EASOU was 0.06 ± 0.057 SD. Except for the EASOU score, these results are better than those of ABS (Table 3). U-Net-CH improved the IOU and EMSOU compared to U-Net but also increased the average EASOU by 0.009. Overall, the trained U-Net-CH was able to select a larger region inside the ROI with a slightly greater overpass. Heatmaps of the IOU, EMSOU, and EASOU results for U-Net-CH are shown in Figure 10, a detailed image-by-image comparison between ABS and U-Net-CH on the test dataset is shown in Figure 11, and some examples of U-Net-CH performance on the test dataset can be seen in Figure 12.
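The convex-hull post-processing can be sketched with SciPy's Delaunay triangulation; `skimage.morphology.convex_hull_image` would be the off-the-shelf equivalent:

```python
import numpy as np
from scipy.spatial import Delaunay

def convex_hull_mask(mask):
    # every pixel inside the convex hull of the predicted foreground
    # becomes foreground, filling concavities left by the raw mask
    pts = np.argwhere(mask)
    tri = Delaunay(pts)
    rows, cols = np.indices(mask.shape)
    grid = np.stack([rows.ravel(), cols.ravel()], axis=1)
    inside = tri.find_simplex(grid) >= 0
    return inside.reshape(mask.shape)
```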

Discussion
Currently, both techniques for automated segmentation show acceptable performance on the ex-vivo tooth image sets. While the U-Net-CH approach resulted in better segmentation quality, it also more often overpassed the ground-truth boundaries. ABS is more conservative regarding the boundary limits, resulting in smaller segmented regions, but probably also in more consistent biometric identification results, since less false HSB is selected. Given the limited sample size for such a deep neural network as the U-Net, increasing the training sample size would probably improve its predictions and might also reduce the most critical error score of the segmented areas, the EASOU. In addition, different backgrounds should be taken into consideration during training, to make sure they will not interfere with the HSB area selection in real applications.

Conclusions
In summary, both techniques proved useful for the proposed task. On one hand, the Anisotropy-Based Segmentation can run seamlessly with limited computational resources and is readily available once implemented. Moreover, a semi-automated experience can be offered by exposing the ABS parameters for the user to tweak, adapting the algorithm to the nuances of individual images without incurring the errors caused by, e.g., a free-selection tool. On the other hand, a U-Net trained with a larger dataset might achieve more robustness and better overall performance.

Figures

Figure 3 Original photograph and enhanced HSB image of an anterior tooth with proportion measures. The green line is the vertical size of the tooth crown, the red line is the horizontal measure of the tooth frontal surface, and the blue box represents approximately the extension of recoverable HSB (ROI).

Figure 6 U-Net architecture used, adapted from the original paper [12].

Figure 7 Results of the relation between manual and automated segmentation for the four sets of images of 31 teeth. A) Intersection over union (IOU). B) Exclusive manual-segmented region over union (EMSOU). C) Exclusive automate-segmented region over union (EASOU). Note that IOU_i,j + EMSOU_i,j + EASOU_i,j = 1, for (i, j) ∈ ℕ².

Figure 8 Examples of manual and ABS automated segmentation and their mask overlaps for the teeth with indices 29, 35, 111, 140, 174, and 187. The three images for each tooth photograph (L1, L2, R1, R2) are, from left to right: HSB image with manual mask, HSB image with automated mask, and overlap of both masks (white is overlap, light gray is manual mask only, and dark gray is automated mask only).

Figure 9 Training progression through epochs. (A) Validation loss. (B) Validation IOUd score. (C) Training and validation loss for the model with batch size of 5.

Figure 11 Detailed scores for the 10 teeth (40 images) in the test dataset for ABS and U-Net-CH. (A) IOU score. (B) EMSOU score. (C) EASOU score.

Figure 12 Examples of manual and U-Net-CH automated segmentation and their mask overlaps for the teeth with indices 29, 35, 36, 111, 174, and 187. The three images for each tooth photograph (L1, L2, R1, R2) are, from left to right: HSB image with manual mask, HSB image with automated mask, and overlap of both masks (white is overlap, light gray is manual mask only, and dark gray is automated mask only).

Table 1
Training metrics of the best epoch on the validation dataset.

Table 2
Evaluation metrics for the trained U-Net models and the ABS model, for all images and for the test images only.

Table 3
Mean and standard deviation of model scores.