Deep Learning for Automatic Image Segmentation in Stomatology and Its Clinical Application

Deep learning has become an active research topic in the field of medical image analysis. In particular, great advances have been made in the performance of automatic segmentation of stomatological images. In this paper, we systematically reviewed the recent literature on deep-learning-based segmentation methods for stomatological images and their clinical applications. We categorized them into different tasks and analyzed their advantages and disadvantages. The main categories that we explored were the data source, backbone network, and task formulation. We categorized data sources into panoramic radiography, dental X-rays, cone-beam computed tomography, multi-slice spiral computed tomography, and intraoral scan images. For the backbone network, we distinguished methods based on convolutional neural networks from those based on transformers. We divided task formulations into semantic segmentation and instance segmentation tasks. Toward the end of the paper, we discussed the remaining challenges and provided several directions for further research on the automatic segmentation of stomatological images.


INTRODUCTION
Imaging examinations, intraoral scanning, and other technologies are often required to assist the diagnosis and treatment of diseases because of the complex structure of the oral and maxillofacial region and the variety of diseases that affect it. Imaging examinations use dental X-rays, panoramic radiography, cone-beam computed tomography (CBCT), and multi-slice spiral computed tomography (MSCT). These are widely used in stomatology and produce large amounts of medical image data. Efficient and accurate processing of medical images is essential for the development of stomatology. The key task is image segmentation, which enables the localization and the qualitative and quantitative analysis of lesions, helps to design a treatment plan, and supports the analysis of treatment efficacy. Traditional manual segmentation is time-consuming, and its quality depends on the experience of the doctor, often leading to unsatisfactory results. Therefore, the application of modern image segmentation technology to stomatology is very important.
Deep learning (DL) is a branch of machine learning and is a promising method of achieving artificial intelligence. Owing to the availability of large-scale annotated data and powerful computing resources, DL-based medical image segmentation algorithms have achieved excellent performance. These methods have successfully assisted the accurate diagnosis and minimally invasive treatment of brain tumors (1), retinal vessels (2), pulmonary nodules (3), cartilage, and bone (4). This paper reviews current DL-based medical image segmentation methods and their applications in stomatology. Existing automatic segmentation algorithms are classified according to the data source, the form of the automatic segmentation task, and the structure of the backbone network of the algorithm. The feasibility, accuracy, and application prospects of these algorithms are comprehensively analyzed, and their future research prospects are discussed.

STOMATOLOGICAL IMAGING DATA SOURCES AND COMPARISON
Common stomatological images can be categorized into five types: panoramic radiography, dental X-rays, CBCT, MSCT, and intraoral scanning (IOS). Each type is suitable for specific clinical applications, according to its unique imaging principles. Dental X-rays and panoramic radiography are mainly used for dental caries, alveolar bone resorption, and impacted teeth. CBCT is mainly used for the early diagnosis and comprehensive analysis of cracked teeth, diseases after root canal treatment, jaw lesions, and other diseases. CBCT can also assist the design of implant guides, orthodontic treatment, and maxillofacial disease treatment. MSCT is often conducted to assist the diagnosis, treatment, and postoperative efficacy analysis of soft- and hard-tissue lesions in the maxillofacial region. IOS is generally employed in chair-side digital restoration, digital orthodontics, and digital implant surgery. Structures overlap in dental X-rays and panoramic radiography because both produce 2D images; for readers without clinical experience in interpreting such films, missed diagnoses and misdiagnoses may easily occur. Although CBCT and MSCT produce 3D images with clear layers, traditional empirical reading can also lead to missed diagnoses and misdiagnoses of early and minor lesions. Table 1 shows the imaging characteristics, advantages, and disadvantages of these five imaging types.

AUTOMATIC SEGMENTATION ALGORITHMS FOR MEDICAL IMAGES
Image segmentation aims to simplify or change the representation of images, making them easier to understand and analyze. Image segmentation can be divided into the semantic segmentation task and the instance segmentation task. The semantic segmentation task focuses on differences between categories, whereas the instance segmentation task focuses on differences between individuals (Figure 1). In the semantic segmentation task, the teeth, jaws, and background must be separated, without distinguishing between individuals within each category ("Tooth" or "Jaw"). Conversely, in the instance segmentation task, both the category label and the instance label (within the class) are required; that is, the individuals within each category ("Tooth" or "Jaw") must be distinguished. Traditional image segmentation algorithms (5-7) cannot be directly applied to complex scenes because of the limitations of their manually designed features. The emergence of DL has made it possible to segment medical images efficiently and effectively. Segmentation algorithms based on convolutional neural networks (CNNs) have already become the de facto standard in image segmentation tasks. Their excellent segmentation ability has been demonstrated experimentally and theoretically and can be further applied to medical images. In addition to the popularity of CNNs, the transformer structure (8), originating from the field of natural language processing, has become an active research topic in computer vision because of its excellent long-range modeling ability. Therefore, according to these different types of backbone networks, we divide automatic image segmentation methods into CNN-based methods and transformer-based methods.
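To make the distinction between the two label formats concrete, they can be illustrated with a toy row of pixels (the labels and layout below are purely illustrative, not taken from any dataset in this review):

```python
# A toy row of pixels covering two teeth separated by background.
# Semantic labels: every tooth pixel shares the single class id 1 ("Tooth").
semantic = [0, 1, 1, 0, 1, 1, 0]  # 0 = background, 1 = tooth

# Instance labels: each tooth pixel keeps the class id plus a unique instance id.
instance = [(0, 0), (1, 1), (1, 1), (0, 0), (1, 2), (1, 2), (0, 0)]

# Semantic segmentation cannot tell the two teeth apart:
num_tooth_classes = len({c for c in semantic if c != 0})       # one class
# Instance segmentation can:
num_tooth_instances = len({i for c, i in instance if c != 0})  # two instances
```

The semantic map collapses both teeth into a single "Tooth" region, while the instance map keeps them separable, which is exactly the difference exploited later for tooth numbering and matching.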
We have collected and summarized about 30 articles on image segmentation tasks. An overview of these methods is shown in Figure 2, and their evolution over time is depicted in Figure 3.

CNNs
The main ingredients of a modern CNN are the convolution layer, nonlinear activation layer, pooling layer, and fully connected layer, of which the convolution layer is the core component. The main principles of the convolution layer are its local receptive fields and its weight-sharing strategy; the first refers to the limited range of data within a sliding window, and the second refers to convolution-kernel parameters that are shared across all sliding-window positions. The pooling layer reduces the resolution of extracted features, lowering the amount of computation, and selects the most robust features to prevent overfitting. The fully connected layer connects all nodes in two adjacent layers; such a layer can realize the integration and mapping of input features and is usually used to output classification results. The nonlinear activation layer provides nonlinearity to the neural network so that it can approximate arbitrary continuous functions.
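These building blocks can be sketched in plain NumPy (single-channel, no padding or stride, illustrative only): the convolution shares one kernel across all local windows, ReLU supplies the nonlinearity, and 2 × 2 max pooling halves the resolution.

```python
import numpy as np

def conv2d(x, k):
    """Valid 2D convolution: the kernel weights are shared across all windows."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # local receptive field: only a kh x kw window contributes
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def relu(x):
    """Nonlinear activation layer."""
    return np.maximum(x, 0)

def maxpool2x2(x):
    """2x2 max pooling: keeps the strongest response in each window."""
    H, W = x.shape
    return x[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)
feat = maxpool2x2(relu(conv2d(img, np.ones((3, 3)))))  # 6x6 -> 4x4 -> 2x2
```

A real CNN stacks many such layers with learned multi-channel kernels and ends with a fully connected (or fully convolutional) head; this sketch only shows the data flow.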
AlexNet (9), an early CNN model, adopted the ReLU activation function to accelerate network convergence and the dropout technique to prevent overfitting. VGG (10) achieved better performance than AlexNet by replacing the large 5 × 5 convolution kernel with two consecutive 3 × 3 convolution kernels and increasing the network depth. GoogLeNet (11) used the Inception module to increase the width of the network while reducing the number of parameters. Its subsequent version (12) improved performance by convolution decomposition, batch normalization, label smoothing, and other techniques. ResNet (13) solved the problem of network degradation by using skip connections and has been one of the most popular feature extractors in many vision tasks. DenseNet (14) made full use of extracted features by establishing dense connections between different layers. Moreover, lightweight models [e.g., MobileNet (15) and ShuffleNet (16)] and models designed by neural architecture search (NAS) [e.g., EfficientNet (17)] have already received widespread attention in the DL community.
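The VGG substitution can be verified with back-of-the-envelope parameter counting (the 64-channel width below is just an example): two stacked 3 × 3 convolutions cover the same 5 × 5 receptive field but with fewer weights and an extra nonlinearity in between.

```python
def conv_params(k, c_in, c_out, bias=True):
    """Number of learnable parameters in a k x k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

c = 64  # example channel width
one_5x5 = conv_params(5, c, c)      # 5*5*64*64 + 64 = 102,464 parameters
two_3x3 = 2 * conv_params(3, c, c)  # 2 * (3*3*64*64 + 64) = 73,856 parameters
```

The stacked variant saves roughly 28% of the parameters at this width while preserving the receptive field, which is why deeper stacks of small kernels became the dominant design.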

Transformers
The transformer structure, which originated from natural language processing (18)(19)(20), has recently attracted substantial attention from the vision community. The transformer (18) proposed the multi-head self-attention module and a feedforward network to model long-term relationships within input sequences; it also has an enhanced ability for parallel computing. Witnessing the power of transformers in natural language processing, some pioneering studies (21)(22)(23)(24) have successfully applied it to image classification, object detection, and segmentation tasks.
Vision Transformer (ViT) (21) split an image into patches and directly fed these patches into the standard transformer with positional embeddings, demonstrating that learning from large-scale data is better than inductive bias. Data-efficient image Transformers (DeiT) (22) achieved better performance through more careful training strategies and token-based distillation. Convolutional vision Transformer (CvT) (23) improved the performance and efficiency of ViT by introducing convolution into the ViT architecture. This was accomplished by two major modifications: a hierarchical transformer, containing convolutional token embedding, and a convolutional transformer block, using a convolutional projection. Swin-Transformer (24) is a hierarchical transformer whose features are calculated within a shifted window, providing higher efficiency by limiting the self-attention calculation to non-overlapping local windows and allowing cross-window connection; its computational complexity is linear with respect to image size. These features make Swin-Transformer compatible with a wide range of visual tasks, including image classification, object detection, and semantic segmentation.
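The ViT patch-splitting step can be sketched in a few lines of NumPy (the patch size, embedding dimension, and random projection weights below are illustrative stand-ins for the learned parameters):

```python
import numpy as np

def patchify(img, p):
    """Split an H x W x C image into non-overlapping p x p patches, flattened."""
    H, W, C = img.shape  # H and W assumed divisible by p
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))
patches = patchify(img, 16)          # 4 patches, each flattened to 16*16*3 = 768 values

d_model = 64                         # illustrative embedding width
W_embed = rng.standard_normal((768, d_model))   # learned in the real model
pos = rng.standard_normal((patches.shape[0], d_model))  # positional embeddings
tokens = patches @ W_embed + pos     # the input sequence fed to the transformer
```

From here, the standard transformer operates on `tokens` exactly as it would on a word sequence, which is what lets it model long-range relationships between distant image regions.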

Common Algorithms for Semantic Segmentation
The aim of the semantic segmentation task is to assign a unique category label to each pixel or voxel in the image, thereby both identifying and delineating each category region.

CNN-Based Algorithms
As an iconic model in image semantic segmentation, FCN (25) replaced all fully connected layers with convolutional layers to predict a dense segmentation map. In contrast to FCN, SegNet (26) performed nonlinear upsampling according to the indices of the max-pooling operations of the corresponding encoder, so that the spatial information from the encoding stage was maintained. PSPNet (27) obtained global context information through its pyramid pooling module, improving the parsing performance for complex scenes.
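SegNet's index-preserving upsampling can be sketched as follows (NumPy, 2 × 2 windows, single channel, even-sized inputs; a simplification of the real layer): pooling records where each maximum came from, and unpooling puts the values back at those positions.

```python
import numpy as np

def maxpool_with_indices(x):
    """2x2 max pooling that also records the argmax position in each window."""
    H, W = x.shape  # assumed even
    win = x.reshape(H // 2, 2, W // 2, 2).transpose(0, 2, 1, 3).reshape(H // 2, W // 2, 4)
    idx = win.argmax(axis=2)          # flattened position of the max inside each window
    return win.max(axis=2), idx

def max_unpool(y, idx, shape):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    out = np.zeros(shape)
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            di, dj = divmod(idx[i, j], 2)
            out[2 * i + di, 2 * j + dj] = y[i, j]
    return out

x = np.array([[1., 5., 2., 0.],
              [3., 4., 8., 1.],
              [0., 2., 1., 1.],
              [9., 1., 0., 6.]])
pooled, idx = maxpool_with_indices(x)
restored = max_unpool(pooled, idx, x.shape)  # maxima return to their original cells
```

This is why SegNet's decoder keeps boundaries sharp: the upsampled activations land exactly where the strong encoder responses were, instead of being spread uniformly.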
The DeepLab series focused on enlarging the receptive field and integrating multi-scale feature information. DeepLabV1 (28) used dilated (atrous) convolution and conditional random fields to obtain more informative feature maps. DeepLabV2 (29) featured the atrous spatial pyramid pooling (ASPP) module, which performed atrous convolutions at different sampling rates to obtain a multi-scale feature representation. DeepLabV3 (30) combined cascaded atrous convolutions, a multi-grid scheme, and an improved ASPP module. DeepLabV3+ (31) used an encoder-decoder structure to perform segmentation tasks and introduced the depthwise separable convolution from Xception into the ASPP module.
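The effect of the atrous sampling rate on the receptive field follows a simple formula: a k × k kernel with dilation r spans k + (k − 1)(r − 1) pixels per side. Checking it for ASPP-style rates (the specific rates below are illustrative):

```python
def effective_kernel(k, r):
    """Side length covered by a k x k kernel with dilation rate r."""
    return k + (k - 1) * (r - 1)

# 3x3 kernels at increasing atrous sampling rates:
spans = {r: effective_kernel(3, r) for r in (1, 6, 12, 18)}
# every variant still has only 9 learnable weights
```

So a single 3 × 3 kernel dilated at rate 18 sees a 37-pixel-wide neighborhood at no extra parameter cost, which is why ASPP can gather multi-scale context so cheaply.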
UNet (32) was one of the most influential segmentation models dedicated to biomedical fields. Compared with FCN, its major contributions lay in its U-shaped symmetric network and an elastic deformation-based data augmentation strategy. The U-shaped network consisted of symmetric compression paths and expansion paths, and the elastic deformation effectively simulated the normal changes in cell morphology. 3D UNet (33) implemented a 3D image segmentation task by replacing the 2D convolution kernel in UNet (32) with a 3D convolution kernel. VNet (34) used a new loss function, termed Dice loss, to handle the limited number of annotated volumes available for training. UNet++ (35) introduced a built-in ensemble of UNets of varying depths and had redesigned skip connections to enhance performance for objects of varying sizes.
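The Dice loss introduced by VNet follows directly from the Dice similarity coefficient; a minimal NumPy version for a binary mask (the real loss operates on soft network probabilities, and the smoothing term `eps` is a common stabilizing convention):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """1 - DSC, where DSC = 2|P ∩ T| / (|P| + |T|); pred in [0, 1], target binary."""
    pred, target = pred.ravel(), target.ravel()
    inter = np.sum(pred * target)
    dsc = (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)
    return 1.0 - dsc

t = np.array([[0, 1], [1, 1]], dtype=float)
perfect = dice_loss(t, t)                # full overlap -> loss near 0
empty = dice_loss(np.zeros_like(t), t)   # no overlap -> loss near 1
```

Because both numerator and denominator scale with the foreground size, the loss stays informative even when annotated structures occupy a tiny fraction of the volume, which is exactly the class-imbalance regime VNet targeted.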

Transformer-Based Algorithms
SETR (36) used a pure transformer to encode an image to a sequence of patches, without the need for a convolution layer or resolution reduction and showed the power of the transformer structure for segmentation tasks. In Segmenter (37), the global context relationship was established from the first layer and a pointwise linear decoder was employed to obtain the semantic labels. SegFormer (38) combined the hierarchical transformer structure with a lightweight multi-layer perceptron decoder, without the need for positional encoding or a complex decoder.
Swin-UNet (39) unified UNet with a pure transformer structure for medical image segmentation tasks by feeding tokenized image blocks into a symmetric transformer-based U-shaped encoder-decoder architecture with skip connections, in which local and global cues were fully exploited. The successful application of Swin-UNet to multi-organ and cardiac segmentation tasks demonstrated the potential benefits of the transformer structure for medical image segmentation. MedT (40) featured the gated axial-attention model, in which an additional control mechanism was introduced into the self-attention module. In addition, a local-global training strategy (LoGo) was proposed to further improve performance. UNETR (41) employed pure transformers as an encoder to capture global multi-scale information effectively. The effectiveness of UNETR on 3D brain tumor and spleen segmentation tasks (CT and MRI modalities) was validated by experiments on the MSD dataset. MBT-Net (42) applied a multi-branch hybrid transformer network, composed of a body branch and an edge branch, to the corneal endothelial cell segmentation task. Other transformer-based methods for medical image segmentation include TransUNet (43) and TransFuse (44).

Common Algorithms for Instance Segmentation
Depending on the backbone network used, instance segmentation methods can also be categorized into CNN-based and transformer-based methods. In addition, from the perspective of algorithms, instance segmentation methods can be divided into detection-based and detection-free methods. Detection-based methods can be regarded as extensions of object detection: they obtain bounding boxes by object detection methods and then perform segmentation within the bounding boxes. Moreover, the detection methods can be divided into single-stage and two-stage methods. Single-stage methods include You Only Look At CoefficienTs (YOLACT) (45), You Only Look Once (YOLO) (46), and Single Shot MultiBox Detector (SSD) (47). Two-stage methods include Mask R-CNN (48), PANet (49), Cascade R-CNN (50), and hybrid task cascade (HTC) (51). Detection-free methods first predict an embedding vector and then group the corresponding pixel points into a single instance by clustering; examples of such methods include Segmenting Objects by Locations (SOLO) (52), Deep Watershed Transform (DWT) (53), and DeepMask (54).
Most existing transformer-based instance segmentation algorithms are built on a detection method, DETR (55), so they belong to the class of detection-based methods; such methods include Cell-DETR (56) and ISTR (57). We have collected 12 articles on instance segmentation tasks; the overall development is shown in Figure 5.

CNN-Based Algorithms
Detection-based instance segmentation methods follow the principle of detecting first and then segmenting. The performance of such methods is heavily dependent on the performance of the object detector, so a better detector improves the quality of instance segmentation. As discussed above, detection methods can be divided into single-stage and two-stage methods. A typical example of a single-stage method is YOLACT (45), which first generated multiple prototype masks and then used the generated coefficients to combine the prototype masks, formulating the object detection and segmentation results. In addition, a popular single-stage object detector [e.g., YOLO (46)] could accomplish the instance segmentation task by adding a similar mask branch. A typical example of a two-stage method is Mask R-CNN (48), which used RoIAlign for feature alignment and added an object mask prediction branch to Faster R-CNN (58). PANet (49) further aggregated the low-level and high-level features on the basis of Mask R-CNN and performed fusion operations by adaptive feature pooling for subsequent prediction. Cascade R-CNN (50) continuously optimized the prediction results by cascading several detection networks with different IoU thresholds. HTC (51) had a multi-task, multi-stage hybrid cascade structure and incorporated a branch for semantic segmentation to enhance spatial context.
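The IoU thresholds that Cascade R-CNN varies between stages are computed from box overlap; a minimal implementation with boxes as (x1, y1, x2, y2) tuples (purely illustrative, not the library code of any cited framework):

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0, ix2 - ix1), max(0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# A proposal that passes a loose 0.5 threshold can fail a stricter 0.7 one,
# which is the quality gap successive cascade stages are trained to close:
score = iou((0, 0, 10, 10), (3, 0, 13, 10))  # ≈ 0.54
```

Each cascade stage is trained with a higher IoU threshold than the last, so later stages see progressively better-aligned proposals and refine them further.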
Detection-free instance segmentation methods learn the affinity relation by projecting each pixel onto an embedding space, pushing pixels of different instances apart, and pulling pixels of the same instance closer in the embedding space; finally, a postprocessing step such as grouping produces the instance segmentation result. SOLO (52) was an end-to-end detection-free instance segmentation method, which could directly map the original input image to the required instance masks, eliminating the postprocessing required in detection. DWT (53) combined the traditional watershed transform segmentation algorithm with a CNN to perform instance segmentation. DeepMask (54) simultaneously generated a mask, indicating whether each pixel of a patch belongs to an object, and an objectness score, indicating the confidence that an object is located at the center of the patch. Compared with detection-based instance segmentation methods, the performance of these detection-free methods is limited, and there is scope for improvement.
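The grouping step can be caricatured with a greedy clustering over pixel embeddings (toy 2D embeddings and a hand-picked distance threshold; real methods learn the embedding space and use more robust clustering):

```python
import numpy as np

def group_pixels(embeddings, delta):
    """Assign each pixel embedding to the first seed within distance delta."""
    labels, seeds = [], []
    for e in embeddings:
        for s_id, s in enumerate(seeds):
            if np.linalg.norm(e - s) < delta:  # pulled close -> same instance
                labels.append(s_id)
                break
        else:                                  # pushed apart -> new instance
            seeds.append(e)
            labels.append(len(seeds) - 1)
    return labels

# Two well-separated clusters of pixel embeddings -> two instances
emb = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
labels = group_pixels(emb, delta=1.0)
```

If training succeeds in pushing instances apart by more than the margin and pulling same-instance pixels within it, even this naive grouping recovers the instances; the fragility of that assumption is one reason detection-free methods have lagged behind detection-based ones.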

Transformer-Based Algorithms
Applying the transformer structure to instance segmentation tasks is a relatively new research area. Cell-DETR (56) was one of the first methods to apply the transformer structure to instance segmentation tasks on biomedical data, achieving performance comparable with that of the latest CNN-based instance segmentation methods, but having a smaller number of parameters and a simpler structure. ISTR (57), an end-to-end instance segmentation framework, predicted low-dimensional mask embeddings and assigned them with ground-truth mask embeddings to compute the set loss, achieving a significant performance gain for instance segmentation tasks by conducting a recurrent refinement strategy.

Characteristics of Automatic Segmentation Algorithms
The characteristics of semantic segmentation algorithms are shown in Table 2, and those of instance segmentation algorithms are shown in Table 3. Links to the code and data of the reviewed works are provided in the Appendix.

CLINICAL APPLICATION OF AUTOMATIC IMAGE SEGMENTATION IN STOMATOLOGY
The segmentation of teeth, jaws, and their related diseases is usually considered a preprocessing step for tooth matching (59)(60)(61)(62), tooth numbering (63)(64)(65), and the automatic marking of important anatomical landmarks, in addition to the intelligent diagnosis, classification, and prediction of diseases. Traditional methods for stomatological image segmentation include region-based (66), threshold-based (67), clustering-based (68), edge tracking (69), and watershed (8) methods. With the development of DL, many DL-based methods for stomatological image segmentation have been developed; these mainly focus on teeth, jaws, and their related diseases.

Application to Teeth and Related Diseases
Automatic segmentation of teeth in stomatological images can contribute to the location of supernumerary teeth and impacted teeth, as well as digital restoration, digital orthodontics, and digital implant surgery. Automatic segmentation of caries and other related lesions is helpful for the early diagnosis of caries, particularly those that are easily missed, such as hidden caries and adjacent caries. At present, the types of medical image data that are commonly used for the segmentation of teeth and related lesions include panoramic radiography, CBCT, dental X-rays, and IOS.
Semantic segmentation is the most common formulation for the DL-based automatic segmentation of teeth and their related diseases. This paper reviews 11 articles on semantic segmentation (Table 4), finding that semantic segmentation can mark the boundary between teeth and jaws, but the boundary is often imprecise, particularly for a malposed tooth that overlaps adjacent teeth. Instance segmentation, in contrast, can distinguish individual teeth; we review six relevant articles (Table 5). Compared with semantic segmentation, instance segmentation is better at marking the boundary of each tooth, but problems remain, such as the loss of fine detail and small sample sizes, which may affect segmentation accuracy.

Semantic Segmentation in Teeth and Related Diseases
For 2D images, different models can be trained to segment different areas, such as all teeth (70)(71)(72) or adjacent caries (73), depending on the artificially defined foreground. Wirtz (70), Koch (71), and Sevagami (72) all used the UNet network for the automatic segmentation of teeth from panoramic radiography. Wirtz et al. (70) also used the method to segment teeth in complex cases such as tooth loss, defect, filling, and fixed bridge restoration, achieving a Dice similarity coefficient (DSC) of 0.744 on their dataset. Koch et al. (71) proved that UNet improves the segmentation performance by exploiting data symmetry, an ensemble of the network, test-time augmentation, and bootstrapping; they measured a DSC of 0.934 on the dataset created by Silva (86). Sevagami et al. (72) believed that UNet could be combined with a generative adversarial network to exploit comprehensive semantic information for tooth segmentation, achieving an IoU of 0.9042 on the LNDb dental dataset. Choi et al. (73) first aligned the teeth horizontally, generated the probability map of dental caries in periapical images with FCN, then extracted the crowns, and finally refined the caries probability map, to achieve the automatic detection and segmentation of adjacent dental caries. The F1-score was 0.74 on their own dataset. These applications all perform semantic segmentation of 2D images; the only differences are the artificial definition of the foreground and the choice of semantic segmentation model. Most 3D images of teeth originate from CBCT data, and semantic segmentation of these 3D images requires a 3D semantic segmentation network, such as VNet (75), multi-task 3DFCN and marker-controlled watershed transform (MWT) (76), modified UNet (77), and the symmetric fully convolutional residual network with DCRF (78). Ezhov et al. (79) proposed a coarse-to-fine network structure to refine the volumetric segmentation of teeth, with an IoU of 0.94. The segmentation results can be used for applications such as tooth volume prediction (75), panoramic reconstruction (75), digital orthodontic simulation (76), and dental implant design (77).

Table 2 | Characteristics of common semantic segmentation algorithms.
FCN (25): The first fully convolutional network for the semantic segmentation task; ignores global context information and has relatively high GPU memory usage.
SegNet (26): Improves segmentation performance at boundaries; reduces the number of model parameters and the computational cost.
PSPNet (27): Takes global context information into consideration; improves the segmentation of small objects and co-occurring categories.
UNet (32), 3D UNet (33): Extremely suitable for segmenting medical images; can be trained on small-scale datasets with dedicated data augmentation.
VNet (34): A variant of UNet suitable for 3D image analysis.
UNet++ (35): An advanced UNet structure; improves performance on objects of varying size by unifying a set of UNets of different depths.
SETR (36): A novel and accurate transformer-based model for the semantic segmentation task, without the need for a convolution layer or resolution reduction.
Segmenter (37): Applies the transformer structure to obtain global context information; achieved state-of-the-art performance on the ADE20K dataset.
SegFormer (38): Simplifies the design of transformer-based models with a lightweight multilayer perceptron decoder, without the need for positional encoding or a complex decoder.
Swin-UNet (39): A combination of UNet and Swin-Transformer, carefully designed for medical image segmentation; achieves high performance with a small number of parameters.
MedT (40): A transformer-based medical image segmentation network that requires no pre-training.
UNETR (41): Effectively captures global and multi-scale information; achieves high performance on 3D brain tumor and spleen tasks.
MBT-Net (42): Fully exploits global and local context information through a transformer and a CNN, respectively; achieves high performance in segmenting corneal endothelial cells.
TransUNet (43): Combines the advantages of the UNet and transformer structures, giving a strong method for many medical applications, including multi-organ and cardiac segmentation tasks.

Table 3 | Characteristics of common instance segmentation algorithms.
Cell-DETR (56): The first transformer-based instance segmentation method for biomedical data; state-of-the-art performance.
ISTR (57): The first end-to-end transformer-based framework for the instance segmentation task; predicts low-dimensional mask embeddings and matches them with ground-truth mask embeddings for loss computation.
The gingival tissue cannot be shown on panoramic radiography or CBCT images, but clarifying the relationship between tooth and gingiva is very important for digital restoration, implants, and orthodontics. For this reason, another imaging method has emerged in stomatology, namely IOS, which obtains real-time 3D (point cloud) data of teeth and soft tissues. Zanjani et al. (80) proposed an end-to-end learning framework for the semantic segmentation of individual teeth and gingivae from IOS data. The method was based on PointCNN; it used a non-uniform resampling mechanism and a compatible loss weighting to improve performance, achieving an IoU of 0.94 on the authors' own dataset.
The performance of the above methods is shown in Table 4.

Instance Segmentation in Teeth and Related Diseases
Instance segmentation can mark both the boundaries between different categories in the image, such as the boundaries between teeth and jaws, and the boundaries between different individuals of the same category, such as the boundaries between different teeth. Tooth instance segmentation from panoramic radiography is a common task in dentistry. Using the Mask R-CNN algorithm, Jader et al. (81) performed instance segmentation of teeth from panoramic images, using a transfer learning strategy to address the problem of insufficient annotated data and proposing a data augmentation method based on separating the teeth; they achieved accurate segmentation, with an F1-score of 0.88 on the dataset of (86). Silva et al. (65) analyzed the segmentation, detection, and tooth numbering performance of four end-to-end deep neural network frameworks, namely Mask R-CNN, PANet, HTC, and ResNeSt, on a challenging panoramic radiography dataset. Of these algorithms, PANet achieved the best segmentation performance, with an F1-score of 0.916 on the UFBA-UESC Dental Images Deep dataset. Gurses et al. (82) proposed a method for human identification from panoramic dental images using Mask R-CNN and SURF; they used two datasets [DS1: part of the dataset from (81); DS2: their own dataset], achieving an F1-score of 0.95.
Tooth structures are clearer in 3D data, which is an important clinical advantage for the instance segmentation of teeth. Wu et al. (83) used a two-stage deep neural network, comprising a global stage (a heatmap-regression UNet) to guide the localization of tooth ROIs and a local stage (an ROI-based DenseASPP-UNet) for fine segmentation and classification, to perform tooth instance segmentation from CBCT; they achieved a DSC of 0.962 on their own dataset. Cui et al. (84) proposed a two-stage automatic instance segmentation method (ToothNet), based on a deep CNN, for CBCT images; it exploited a novel learned edge map, a similarity matrix, and the spatial relations between different teeth, obtaining a good result, with a DSC of 0.9264 on their own dataset.
Tooth instance segmentation with IOS is also an important research direction. Zanjani et al. (85) proposed a model named Mask-MCNet for the instance segmentation of teeth from IOS data. This model positioned each tooth by predicting its 3D bounding box and simultaneously segmented the points belonging to the individual tooth, without using voxelization or subsampling techniques. The model thus preserved the fine detail in the data, enabling the highly accurate segmentation required in clinical practice while obtaining results within a few seconds of processing time. On their own dataset, the model achieved an mIoU of 0.98.
The performance of the above methods is shown in Table 5.

Application in Jaws and Related Diseases
There are many types of jaw diseases; moreover, the numbers of benign and malignant samples are unbalanced, which may easily cause missed diagnoses and misdiagnoses. Minimally invasive and precise treatment of these diseases generally requires the precise location of lesions through preoperative planning and accurate intraoperative image guidance. Patients with craniomaxillofacial malformations require intelligent 3D symmetry analysis. The analysis of postoperative efficacy requires subjective evaluations by doctors and patients, in addition to the quantitative and objective evaluation of the progression and outcome of lesions. Precise segmentation of the jaw and related diseases is therefore important for clinical diagnosis and treatment. In the past, manual segmentation of the jaw and its lesions was time-consuming and laborious. Since the development of DL, researchers have used DL methods to learn the features of the jaw and its lesions, realizing automatic segmentation. The main sources of medical image data used for automatic segmentation are panoramic radiography, CBCT, and MSCT. Six relevant articles showed that current DL-based automatic segmentation methods for jaws and related diseases mainly focus on semantic segmentation (Table 6). Two factors affect segmentation performance: (1) the space between the mandibular condyle and the temporal articular surface is very small and contains an articular disc, which often affects the accuracy of mandibular segmentation; and (2) the segmentation accuracy of the jaw or teeth can differ between occlusion and non-occlusion scenarios because of the contact between the upper and lower teeth.
For panoramic radiography, Kong et al. (87) adopted the UNet structure for the rapid and accurate segmentation of the maxillofacial region, with an accuracy of 0.9928 on their own dataset. Li et al. (88) proposed the Deetal-Perio method to predict the severity of periodontitis from panoramic radiography. To calculate alveolar bone absorption, Deetal-Perio first segmented and indexed each individual tooth using Mask R-CNN with a novel calibration method. It then segmented the contour of the alveolar bone and calculated a per-tooth ratio representing alveolar bone absorption. Finally, Deetal-Perio predicted the severity of periodontitis from the ratios of all the teeth. The method was evaluated on two datasets, the Suzhou dataset and the Zhongshan dataset, with DSCs of 0.868 and 0.852, respectively. Egger et al. (89) automatically segmented the mandible using an FCN and carefully evaluated the mandible segmentation algorithm. The performance of the above methods is shown in Table 6.
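The idea behind a ratio-based severity prediction like Deetal-Perio's can be sketched as follows. All distances, function names, and thresholds here are hypothetical placeholders for illustration; the paper's actual ratio definition and decision rule may differ:

```python
import numpy as np

def absorption_ratio(cej_y, bone_y, apex_y):
    """Fraction of the root length no longer supported by bone.

    Illustrative distances along a tooth's long axis (pixels):
    cemento-enamel junction (CEJ) -> alveolar crest -> root apex.
    """
    return (bone_y - cej_y) / (apex_y - cej_y)

def periodontitis_severity(ratios, mild=0.15, severe=0.33):
    """Map the mean per-tooth absorption ratio to a severity grade.

    The thresholds are placeholders, not the values used by Deetal-Perio.
    """
    r = float(np.mean(ratios))
    if r < mild:
        return "none/mild"
    if r < severe:
        return "moderate"
    return "severe"

# two toy teeth with ratios 4/30 and 8/30
ratios = [absorption_ratio(10, 14, 40), absorption_ratio(12, 20, 42)]
print(periodontitis_severity(ratios))  # → moderate
```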

TRENDS AND FUTURE WORK
Integration and Improvement of Data Quality
As a consequence of the need to protect patient privacy and the right to be informed, there are not always enough cases to establish a large-scale dataset dedicated to the segmentation of stomatological images. Moreover, the collected data are usually taken from diverse hospitals and machines, which further increases the difficulty of formulating universal benchmarks. Therefore, methods to effectively integrate, store, and safely share these data are of vital importance and are urgently required. The establishment of a shared dentomaxillofacial database can help to solve this problem to some extent. Differences in the data acquisition settings and conditions (such as exposure time) used in each hospital lead to variation in the quality of image data (such as contrast and signal-to-noise ratio), which affects the accuracy and robustness of image segmentation. Further study should focus on standardizing and normalizing image data and improving data quality. Currently, most data used for image segmentation in stomatology are produced by a single modality (CBCT alone or MSCT alone). In the future, multi-modality data of the same case could be employed collaboratively, to fully exploit the correlative and complementary information among these modalities; this may further boost performance.

Model Design in the Fully Supervised Case
Most image segmentation methods in stomatology are built on top of a CNN. However, the transformer architecture has gradually emerged in the field of computer vision because of its global modeling ability. Researchers have applied it to mandible segmentation, where it outperformed current CNN-based models. Transformer-based methods therefore have the potential to obtain satisfactory results in medical image segmentation in the future. In addition, how to reduce the number of model parameters while preserving accuracy is an active research topic, which is particularly important for deploying medical image segmentation models and promoting the related technology in clinical settings.
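The global modeling ability mentioned above comes from self-attention, in which every token (e.g. a flattened image patch) attends to all others, in contrast to the local receptive field of a convolution kernel. A minimal single-head sketch in NumPy (dimensions and random weights are arbitrary, for illustration only):

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention: each token attends to every other
    token, so each output vector carries global context."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])      # scaled dot products
    scores -= scores.max(axis=-1, keepdims=True) # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)     # softmax over tokens
    return attn @ v

rng = np.random.default_rng(0)
n_patches, d = 6, 8  # e.g. 6 flattened CT patches, 8-dim embeddings
x = rng.normal(size=(n_patches, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # one globally contextualized vector per patch
```

In a real segmentation transformer this block is stacked with multi-head projections, positional encodings, and feed-forward layers; the sketch shows only the core operation that gives the global receptive field.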

Model Design for Insufficient Data Annotation
Existing DL-based algorithms rely heavily on large-scale data to learn discriminative features, but labeling stomatological data is time-consuming and labor-intensive; how to learn effectively from insufficient and imperfect datasets is therefore an active research topic. There are several ways to approach such problems. First, to reduce the burden of complete, fully supervised annotation, weakly supervised and semi-supervised methods can be adopted. Second, to handle the noise present in manual labels, algorithms that learn from noisy labels can be employed. Third, to address the inability of existing methods to generalize to new categories, techniques such as transfer learning, domain adaptation, and few-shot learning can be considered. In addition, unsupervised and self-supervised learning could be used to exploit the structural properties of the dental image itself, providing a better prior for downstream tasks.
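As a concrete example of the semi-supervised direction, pseudo-labeling trains on the labeled set, predicts labels for unlabeled samples, and keeps only confident predictions for retraining. A minimal sketch with a toy nearest-centroid classifier (all data and the confidence threshold are illustrative; real pipelines use a segmentation network in place of the classifier):

```python
import numpy as np

def nearest_centroid_fit(x, y):
    """Fit one centroid per class."""
    return {c: x[y == c].mean(axis=0) for c in np.unique(y)}

def predict_with_confidence(centroids, x):
    """Softmax over negative centroid distances -> label + confidence."""
    classes = sorted(centroids)
    d = np.stack([np.linalg.norm(x - centroids[c], axis=1) for c in classes])
    probs = np.exp(-d) / np.exp(-d).sum(axis=0)
    idx = probs.argmax(axis=0)
    return np.array(classes)[idx], probs.max(axis=0)

# small labeled set and a pool of unlabeled samples (1-D features)
x_lab = np.array([[0.0], [0.2], [1.0], [1.2]])
y_lab = np.array([0, 0, 1, 1])
x_unl = np.array([[0.1], [1.1], [0.6]])

model = nearest_centroid_fit(x_lab, y_lab)
pseudo, conf = predict_with_confidence(model, x_unl)
keep = conf > 0.6  # trust only confident pseudo-labels
x_aug = np.vstack([x_lab, x_unl[keep]])
y_aug = np.concatenate([y_lab, pseudo[keep]])
model = nearest_centroid_fit(x_aug, y_aug)  # retrain on the enlarged set
```

The ambiguous sample at 0.6 sits between both centroids, receives confidence 0.5, and is excluded, which is the mechanism that keeps unreliable pseudo-labels from polluting the training set.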

Interpretability of Deep Learning
Although existing DL methods have shown excellent performance in stomatology, they have not been widely adopted because of the limitations of DL. Beyond the high computing cost and the need for large-scale datasets, the "black box" nature of DL methods is the main factor hindering their application. To gain the trust of doctors, regulators, and patients, a medical diagnostic system must be transparent, interpretable, and explainable; ideally, it should explain the complete logic of how a decision is made. Research on interpretability is therefore urgently needed before DL can be applied to clinical diagnosis and treatment.
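One simple, model-agnostic interpretability technique is occlusion sensitivity: mask image regions one at a time and measure how much the model's output drops, so that large drops highlight the regions the model relies on. A toy sketch (the "model" here is a stand-in scoring function, not a real segmentation network):

```python
import numpy as np

def occlusion_map(image, score_fn, patch=2):
    """Slide a mean-valued patch over the image and record the score
    drop at each location; the resulting heat map marks the regions
    that drive the model's decision."""
    base = score_fn(image)
    h, w = image.shape
    heat = np.zeros_like(image, dtype=float)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = image.mean()
            heat[i:i + patch, j:j + patch] = base - score_fn(occluded)
    return heat

# toy 4x4 image with a bright 2x2 "lesion" in the top-left corner
img = np.zeros((4, 4))
img[0:2, 0:2] = 1.0
score = lambda x: x[0:2, 0:2].mean()  # stand-in model score
heat = occlusion_map(img, score)
print(heat[0, 0], heat[2, 2])  # lesion cells hot, background cells cold
```

Unlike gradient-based methods, this needs only forward passes, which makes it easy to apply to any trained black-box model.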

Clinical Application in Stomatology
First, regarding what to segment, current studies focus mainly on teeth and tooth-related diseases, whereas little attention is paid to the jaw and jaw-related diseases, and particularly to soft tissue and its diseases. Studies on the latter have greater clinical significance, so more of them are required. Second, the accuracy of image segmentation determines whether DL methods can be applied clinically. More studies are therefore needed to enhance the accuracy and precision of image segmentation and promote its clinical use. Finally, the first step of digital surgical technologies, such as guide templates, surgical navigation, and augmented reality, is to segment important structures or lesions, for which traditional manual segmentation is currently the main approach. In the future, DL-based automatic segmentation methods could be integrated with these technologies to assist clinical practice more accurately and efficiently.

CONCLUSION
This paper comprehensively reviews DL-based automatic segmentation algorithms and introduces their clinical applications. The review shows that DL has great potential to segment stomatological images accurately, which can further promote the transformation of clinical practice from experientialism to digitization, precision, and individuation. In the future, more research is needed to further improve the accuracy of automatic image segmentation and realize intelligent clinical application.

AUTHOR CONTRIBUTIONS
DL, WZ, JC, and WT contributed to the conception and design of the study. DL wrote the first draft of the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version.

FUNDING
This study was supported by Sichuan Province Regional Innovation Cooperation Project (2020YFQ0012).