AI-based diagnosis in mandibulofacial dysostosis with microcephaly using external ear shapes

Introduction Mandibulo-Facial Dysostosis with Microcephaly (MFDM) is a rare disease with a broad spectrum of symptoms, characterized by zygomatic and mandibular hypoplasia, microcephaly, and ear abnormalities. Here, we aimed to describe the external ear phenotype of MFDM patients, and to train an Artificial Intelligence (AI)-based model to differentiate MFDM ears from non-syndromic control ears (binary classification), and from ears of the main differential diagnoses of this condition (multi-class classification): Treacher Collins (TC), Nager (NAFD) and CHARGE syndromes. Methods The training set contained 1,592 ear photographs, corresponding to 550 patients. We extracted 48 patients completely independent of the training set, with only one photograph per ear per patient. After Convolutional Neural Network (CNN)-based ear detection, the images were automatically landmarked. Generalized Procrustes Analysis was then performed, along with dimension reduction using Principal Component Analysis (PCA). The principal components were used as inputs to an eXtreme Gradient Boosting (XGBoost) model, optimized using 5-fold cross-validation. Finally, the model was tested on an independent validation set. Results We trained the model on 1,592 ear photographs, corresponding to 1,296 control ears, 105 MFDM, 33 NAFD, 70 TC and 88 CHARGE syndrome ears. The model detected MFDM with an accuracy of 0.969 [0.838–0.999] (p < 0.001) and an AUC (Area Under the Curve) of 0.975 within controls (binary classification). Balanced accuracies were 0.811 [0.648–0.920] (p = 0.002) in a first multiclass design (MFDM vs. controls and differential diagnoses) and 0.813 [0.544–0.960] (p = 0.003) in a second multiclass design (MFDM vs. differential diagnoses). Conclusion This is the first AI-based syndrome detection model in dysmorphology based on the external ear, opening promising clinical applications both for local care and referral, and for expert centers.

Since 2012, the diagnosis of MFDM has been established based on clinical features and screening for a heterozygous pathogenic variant of the EFTUD2 gene (17q21.31), coding for the 116 kDa U5 nuclear ribonucleoprotein component (8). This variant frequently occurs de novo (80%) (4,9). The main mechanism of disease is haploinsufficiency (10), caused in 18% of cases by a missense substitution, in 38% by a stop-gain EFTUD2 heterozygous pathogenic variation and in 43% by a splice site variation (4,11). No genotype-phenotype correlations in patients with EFTUD2 heterozygous pathogenic variations have been identified (8,12).
The main differential diagnoses of MFDM are other mandibulofacial dysostoses, i.e., Nager type Acro-Facial Dysostosis (NAFD), Postaxial acrofacial dysostosis Miller type, and Treacher Collins (TC) syndromes, and CHARGE syndrome (14,15). MFDM patients are often misdiagnosed within this spectrum. Distinguishing MFDM ears from CHARGE ears can be difficult, and EFTUD2 heterozygous pathogenic variation screening is recommended in patients with unusual forms of CHARGE syndrome (14).
Based on these clinical questions, the three objectives of this study were: (1) to objectively determine the phenotype of pinna malformations in MFDM vs. controls using geometric morphometrics and machine learning techniques (design № 1), (2) to compare the ears of MFDM patients with ears from the main differential diagnoses, with or without controls (design № 2.1 and № 2.2, respectively) and (3) to compare phenotypes from the different genotypes causing MFDM (design № 3).

Training set
We included pictures from the photographic database of the Maxillofacial surgery and Plastic Surgery department and from the Medical genetics department of Hôpital Necker-Enfants Malades (Assistance Publique-Hôpitaux de Paris), Paris, France. This database contains 594,000 photographs from 22,000 patients followed in the department since 1981. All photographs were taken by a professional medical photographer using a Nikon D7000 device in standardized positions.
We included retrospectively and prospectively, from 1981 to 2023, all profile pictures of patients diagnosed with MFDM, TC, NAFD and CHARGE syndromes, with a visible pinna (Figure 1). The photographs were not calibrated. All patients had genetic confirmation of their syndrome. We excluded patients with ear reconstruction surgery. Multiple photographs per patient corresponded to different ages. Duplicate photographs were excluded.
Non-syndromic children were selected among patients admitted for wounds, trauma, infection and various skin lesions, without any record of chronic conditions. More precisely, follow-up for any type of chronic disease was considered as an exclusion criterion. The reports were retrieved using Dr Warehouse (16). For each patient, right and left sides were included.
The study was approved by the CESREES (Comité Ethique et Scientifique pour les Recherches, les Etudes et les Evaluations dans le domaine de la Santé, № 4570023bis) and by the CNIL (Commission Nationale Informatique et Libertés, № MLD/MFI/AR221900). Informed and written consents were obtained from the legal representatives of each child, or from the patients themselves if they were of age.
We also retrieved ear photographs of these syndromes of interest from the databases of the Maxillofacial surgery and/or Genetics departments of the University Hospitals of Lille (France), Montpellier (France), Nantes (France) and the King Chulalongkorn Memorial Hospital in Bangkok (Thailand). None of the patients in the validation set were present twice, and none were from the training set. For the control group, we selected a group of photographs from our local database, without any redundancy with the training set, using similar inclusion criteria.
We extracted data on age at the time of the photograph and gender. We excluded patients with no information on the contralateral ear, so that asymmetry and severity could be taken into account.
All photographs in the validation group were manually annotated by two independent raters (QH and MD), blinded to the diagnosis. The ICC (Intraclass Correlation Coefficient) was computed. ICC values greater than 0.9 corresponded to excellent reliability of the manual annotation (27).

Landmarking
We used an available template (28) based on 55 landmarks placed on the outer helix, the antihelix, the lobe, the tragus, the antitragus, the helix, the crus helicis, and the concha. We developed an automatic annotation model trained on 1,592 manually annotated ear photographs following a pipeline including: (1) a Faster R-CNN (Region-based Convolutional Neural Network) to detect ears in lateral face photographs and (2) a patch-AAM (Active Appearance Model) to automatically place landmarks.
The Faster R-CNN model (29) was trained on 5,154 ear photographs after data augmentation (1,718 images and their +10° and −10° rotations), with a learning rate of 0.001, a batch size of 4, a gamma of 0.05 and 2,000 iterations. The patch-AAM was trained on 1,221 ear photos, with 50 iterations and a Lucas-Kanade optimization (30). The Faster R-CNN was developed in PyTorch on Python 3.7 (31). The patch-AAM was developed using the menpo library on Python 3.7 (32). These two methods and the choice of hyperparameters have been described in a previous report by our team (33).
Each automatically annotated photograph was checked by the first author (QH) and landmarks were manually re-positioned when necessary, using landmarker.io (34).
To ensure a uniform distribution of landmarks along the curves of the ear (outer helix, inner helix, antihelix, concha), anatomical landmarks were transformed into sliding semi-landmarks using the geomorph package in R (35). Landmarks corresponding to the antihelix were removed because Hennocq et al. (33) showed that they were not reproducible between two annotators.
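As an illustration of what a uniform distribution of points along a curve means, the sketch below resamples an open polyline at equal arc-length intervals. This is a simpler stand-in and not the sliding procedure of geomorph: equidistant placement is often the starting configuration for semi-landmarks, and the sliding step itself is not shown. The function name `resample_polyline` is hypothetical.

```python
import math

def resample_polyline(points, n):
    """Return n points equally spaced in arc length along an open 2D polyline.

    Illustrative stand-in only: this is equidistant placement, not the
    sliding semi-landmark optimisation performed by geomorph.
    """
    if n < 2 or len(points) < 2:
        raise ValueError("need at least two input points and n >= 2")
    # Cumulative arc length at each input vertex.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    out = []
    seg = 0
    for k in range(n):
        target = total * k / (n - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(points) - 2 and cum[seg + 1] < target:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = 0.0 if seg_len == 0 else (target - cum[seg]) / seg_len
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out
```

For example, resampling the L-shaped curve `[(0, 0), (1, 0), (1, 1)]` with `n=5` yields points every 0.5 units of arc length.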
Ears were finally annotated based on 41 anatomical landmarks and semi-landmarks, placed automatically and double-checked manually.

Geometric morphometrics
We performed Generalized Procrustes Analysis (GPA) (36) on all landmark clouds using the geomorph package on R. Since the data were uncalibrated photographs, ear sizes were not available: shape parameters only were assessed and not centroid sizes.
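The principle of GPA on 2D landmarks can be sketched as follows. This is a minimal illustration, not the geomorph implementation used in the study: it assumes 2D configurations with the same number of landmarks, encodes each landmark as a complex number, and ignores reflections and semi-landmark sliding. The function names are hypothetical.

```python
import math

def center_and_scale(shape):
    """Translate a 2D configuration to its centroid and scale it to unit centroid size."""
    z = [complex(x, y) for x, y in shape]
    centroid = sum(z) / len(z)
    z = [p - centroid for p in z]
    size = math.sqrt(sum(abs(p) ** 2 for p in z))  # centroid size
    return [p / size for p in z]

def rotate_onto(shape, reference):
    """Rotate `shape` to minimise the summed squared distance to `reference`."""
    inner = sum(r * s.conjugate() for s, r in zip(shape, reference))
    if inner == 0:
        return list(shape)
    rotation = inner / abs(inner)  # unit complex number, i.e. a pure rotation
    return [rotation * s for s in shape]

def generalized_procrustes(shapes, iterations=10):
    """Iteratively align all shapes to their running mean shape.

    Returns the aligned shapes and the mean shape, as lists of complex numbers.
    """
    aligned = [center_and_scale(s) for s in shapes]
    mean = aligned[0]
    for _ in range(iterations):
        aligned = [rotate_onto(s, mean) for s in aligned]
        mean = [sum(column) / len(aligned) for column in zip(*aligned)]
        norm = math.sqrt(sum(abs(p) ** 2 for p in mean))
        mean = [p / norm for p in mean]
    return aligned, mean
```

Because every configuration is scaled to unit centroid size, size information is discarded, which matches the shape-only analysis imposed here by uncalibrated photographs.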
Procrustes coordinates were processed using Principal Component Analysis (PCA) for dimension reduction (37): 8 principal components (PC) accounting for more than 90% of the global variance were retained.
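The retention rule can be illustrated with a short sketch that, given the variance carried by each principal component, returns the smallest number of components whose cumulative share reaches a threshold. The function name and the example values are hypothetical and are not the study's eigenvalues.

```python
def components_for_variance(variances, threshold=0.90):
    """Smallest number of leading components whose cumulative variance share reaches `threshold`."""
    total = sum(variances)
    cumulative = 0.0
    # Components are considered from the largest variance to the smallest.
    for k, value in enumerate(sorted(variances, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(variances)

# Hypothetical variances: the first three components carry 95% of the total.
n_components = components_for_variance([50, 30, 15, 5])
```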
To take into account associated metadata (age and gender) and the fact that we had included more than one photograph per patient (that is, the non-independence of the data), a mixed model was designed for each principal component: $PC_{i,j} = \alpha + \beta_1 \cdot \mathrm{age} + \beta_2 \cdot \mathrm{gender} + b_{1,i} \cdot \mathrm{age} + \varepsilon_{i,j}$, where $b_{1,i} \cdot \mathrm{age}$ corresponded to a random slope for age per individual, and $\varepsilon_{i,j}$ was a random error term. We did not use an interaction term between age and gender as it did not increase the likelihood of the model. Age, gender and ethnicity are significant factors in dysmorphology because they influence the diagnosis, and must therefore be taken into account (38).

Asymmetry and severity of microtia
Accounting for the heterogeneity of external ear anomalies was difficult. We graded microtia in stages I-IV according to the Marx classification (39). Only grade I ears could be annotated, as the main anatomical structures were missing in grades II, III and IV. However, the frequency of ears above grade I had to be considered for each disease group as it was a potential diagnostic feature. Information on left/right asymmetry was also included as it could vary between syndromes.
The overall severity for each patient was defined as the sum of the microtia grades of both ears. Asymmetry was quantified using a mixed scale ranging from 0 to 3, corresponding to the absolute difference between the left and right microtia grades. A high score corresponded to high left/right asymmetry. For bilateral grade I ears, we computed an asymmetry index based on fluctuating asymmetry (40,41), normalized between 0 and 1. A patient with two grade II ears had an asymmetry score of 0. A patient with one grade III ear and one grade I ear had an asymmetry score of 2. A patient with two grade I ears had an asymmetry score corresponding to the normalized asymmetry index, ranging between 0 and 1.
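The scoring rules above can be written as a short sketch. It assumes that microtia grades are coded as integers 1 to 4 (grades I to IV), that normal ears are coded together with grade I, and that the normalized fluctuating-asymmetry index is supplied by the caller; the function and parameter names are hypothetical.

```python
def severity_score(left_grade, right_grade):
    """Overall severity: sum of the microtia grades of both ears."""
    return left_grade + right_grade

def asymmetry_score(left_grade, right_grade, shape_asymmetry_index=None):
    """Mixed asymmetry scale ranging from 0 to 3.

    Two grade I ears: the normalized shape asymmetry index (between 0 and 1).
    Otherwise: the absolute difference between left and right grades.
    """
    for grade in (left_grade, right_grade):
        if grade not in (1, 2, 3, 4):
            raise ValueError("grades are assumed to be coded 1 to 4")
    if left_grade == 1 and right_grade == 1:
        if shape_asymmetry_index is None or not 0.0 <= shape_asymmetry_index <= 1.0:
            raise ValueError("bilateral grade I ears need an index between 0 and 1")
        return shape_asymmetry_index
    return abs(left_grade - right_grade)
```

With these conventions, `asymmetry_score(2, 2)` gives 0 and `asymmetry_score(3, 1)` gives 2, as in the examples in the text.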
The severity and asymmetry scores were compared between different groups using mixed linear models to take into account repeated data per patient. The model coefficients for each group were compared to 0 by Student's t tests. The significance level was set at p < 0.05.

Uniform manifold approximation and projection (UMAP) representations
The residuals $\varepsilon_{i,j}$ were represented using UMAP (42), a nonlinear dimension reduction technique for data visualization. Each design was plotted with and without the severity and asymmetry scores. A k (local neighborhood size) value of 15 was used. A cosine metric was used to compute distances in the high-dimensional space; the effective minimal distance between embedded points was $10^{-6}$. The three conditions of UMAP, namely uniform distribution, local constancy of the Riemannian metric and local connectivity, were verified. UMAP analyses were performed using the package umap in R (43).

Machine learning models and metrics
The landmark clouds were superimposed with the previous generalized Procrustes analysis and PCA. With the metadata (age and gender), the residuals $\varepsilon_{i,j}$ were reported for each PC and each ear of the validation group. The inputs to the model were the residuals from the linear models described above.
We used XGBoost (eXtreme Gradient Boosting), a supervised machine learning classifier, for all the analyses (44). We set the following hyperparameters: learning rate = 0.3, gamma = 0, maximum tree depth = 6. We separated the dataset into a training set and a testing set, and a 5-fold cross-validation was used to define the ideal number of iterations to avoid overfitting. The model with the lowest logloss score was chosen for analysis. The chosen model was then used on the independent validation set to test performances, by computing accuracy, sensitivity, specificity, F1-score, precision and recall, and AUC (in a one vs. all design). The ROC (Receiver Operating Characteristics) curves were plotted in R using the plotROC package (45).
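The selection and evaluation steps can be sketched in plain Python. This is an illustration under stated assumptions, not the study's code: per-fold logloss values are assumed to be available as one list per fold, the confusion matrix is assumed to have true classes in rows and predicted classes in columns, and per-class balanced accuracy is taken as the mean of one-vs-all sensitivity and specificity. All names are hypothetical.

```python
def best_iteration(fold_logloss):
    """Number of boosting rounds (1-based) with the lowest mean logloss across folds.

    fold_logloss[f][r] is the logloss of fold f after round r + 1.
    """
    n_rounds = len(fold_logloss[0])
    means = [sum(fold[r] for fold in fold_logloss) / len(fold_logloss)
             for r in range(n_rounds)]
    return min(range(n_rounds), key=means.__getitem__) + 1

def one_vs_all_metrics(confusion, labels):
    """Per-class metrics from a confusion matrix (rows: true class, columns: predicted)."""
    total = sum(sum(row) for row in confusion)
    nan = float("nan")
    results = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(row[k] for row in confusion) - tp
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn) if tp + fn else nan  # also called recall
        specificity = tn / (tn + fp) if tn + fp else nan
        precision = tp / (tp + fp) if tp + fp else nan
        f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else nan
        results[label] = {
            "sensitivity": sensitivity,
            "specificity": specificity,
            "precision": precision,
            "f1": f1,
            "balanced_accuracy": (sensitivity + specificity) / 2,
        }
    return results

# Made-up 3-class example, unrelated to the study's results.
example = one_vs_all_metrics([[8, 1, 1], [2, 6, 2], [0, 0, 10]], ["A", "B", "C"])
```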

Training set
The training set contained 1,592 ear photographs, corresponding to 550 patients; 52% of patients were female and the mean age was 7.2 ± 5.9 years, ranging from 0 to 60.7 years.
We included 1,296 photographs of control ears, corresponding to 471 patients; 53% of controls were female, with a mean age of 7.2 ± 5.4 years.
The MFDM group included 105 photographs from 31 patients, all genetically confirmed (EFTUD2 heterozygous pathogenic variations); 52% were female and the mean age was 9.2 ± 9.8 years. Regarding ear aplasia, 92% of the ears were normal or grade I, 3% were grade II, 5% were grade III, and none were grade IV.
The NAFD group included 33 pictures from 9 patients, all genetically confirmed (SF3B4), with 56% females, and a mean age of 11.8 ± 8.8 years. All ears were normal or grade I.
We included 70 photographs corresponding to 15 patients in the TC group. The mean age was 5.5 ± 4.2 years and 40% were female. All had genetic confirmation (TCOF1 or POLR1D). Eighty percent of the ears were normal or grade I, 17% grade II, 3% grade III, and 0% grade IV.
The CHARGE group included 88 photos from 24 patients; 42% were female and mean age was 5.1 ± 5.9 years. All were genetically confirmed (CHD7). All ears were normal or grade I (Table 1).
In the MFDM group, 11 out of 31 patients (35%) had a heterozygous pathogenic variation in a splice site of EFTUD2. One of these patients had a Lys620Asn variant (1860G > C) which could be considered as a splice site variation and not as missense (35). Nine out of 31 patients (29%) had a frameshift EFTUD2 heterozygous pathogenic variation, 7/31 (23%) a nonsense variation, and 4/31 (13%) an intragenic deletion. No patient had a missense variation (Supplementary Table S1).
Average models per group were designed after Procrustes transformation, and compared (Figures 2, 3). Ears in the MFDM group had a clockwise rotation and a vertical shift of the concha (Figure 2) when compared to controls. Previously described features (thickened helix, enlarged and square lobe) were also observed.

Validation set
We extracted a total of 48 patients completely independent of the training set, with only one photograph per ear per patient. Severity and asymmetry scores were computed and only one side was then randomly selected. The validation set included 11 MFDM patients (23%), 2 NAFD (4%), 6 TC (13%), 8 CHARGE (17%) and 21 controls (44%) (Supplementary Table S3). We did not have access to the other ear for NAFD patients in the validation set and therefore the asymmetry and severity scores were not obtained. ICC was 0.991 between the two annotators and the reliability of the annotation was therefore considered as excellent (27).

Severity and asymmetry
Severity and asymmetry scores were compared between groups. In design № 1, TC ears were statistically more severely affected (p < 0.001). CHARGE and control groups had lower severity grades (p = 0.027 and p < 0.001, respectively), compared to MFDM. Control ears were less asymmetric (p < 0.001) than MFDM ears. CHARGE ears were less asymmetric than MFDM ears in design № 2.2 (Supplementary Table S2).

Design № 1
The best performances were obtained without integrating the asymmetry and severity parameters, after 114 iterations. The AUC was 0.985 in the training set (Figure 5A). Patients could be classified into MFDM or control groups in the validation set with a balanced accuracy of 0.969 [0.838-0.999] (p < 0.001) and an AUC of 0.975 (Table 2). Only one patient was misclassified (Table 3).
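As a plausibility check on these figures, the sketch below computes a plain accuracy and an exact binomial (Clopper-Pearson) confidence interval by bisection. It rests on two assumptions that the text does not state explicitly: that design № 1 was evaluated on the 11 MFDM and 21 control patients of the validation set (32 ears), and that the reported interval is an exact binomial interval. Under these assumptions, 31 correct classifications out of 32 give 0.96875 and an interval close to the reported [0.838-0.999]. The function names are hypothetical.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def exact_binomial_interval(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes out of n, solved by bisection."""
    def solve(increasing_function, target):
        low, high = 0.0, 1.0
        for _ in range(100):
            mid = (low + high) / 2
            if increasing_function(mid) < target:
                low = mid
            else:
                high = mid
        return (low + high) / 2

    # Lower bound: P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: binomial_upper_tail(k, n, p), alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2, i.e. P(X >= k + 1 | p) = 1 - alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: binomial_upper_tail(k + 1, n, p), 1 - alpha / 2)
    return lower, upper

# Assumed counts: 11 MFDM + 21 controls, one misclassified patient.
correct, total = 31, 32
accuracy = correct / total
lower, upper = exact_binomial_interval(correct, total)
```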

Design № 2.1
The best performances were obtained by integrating the asymmetry and severity parameters. The classification into MFDM, TC, CHARGE and control groups in the validation set was optimized after 76 iterations. The AUC was 0.912 for MFDM, 1.000 for controls, 0.855 for CHARGE, 0.772 for NAFD and 0.846 for TC in the training set (Figure 5B). On the validation data, the overall balanced accuracy was 0.811 [0.648-0.920] (p = 0.002). The balanced accuracy was 0.769 for the classification into MFDM, 0.721 for TC, 0.752 for CHARGE and 0.938 for controls. AUC in the validation set was 0.837 for MFDM, 1.000 for controls, 0.857 for CHARGE and 0.500 for TC (Tables 4, 5).

Design № 2.2
The best performances were obtained by integrating the asymmetry and severity parameters. The classification into MFDM, TC and CHARGE groups in the validation set was optimized after 91 iterations. The AUC was 0.974 for MFDM, 0.889 for CHARGE, 0.801 for NAFD and 0.914 for TC in the training set (Figure 5C). On the validation data, the overall balanced accuracy was 0.813 [0.544-0.960] (p = 0.003). With this classifier, the balanced accuracy was 0.944 for the classification into MFDM, 0.873 for CHARGE and 0.500 for TC. AUC in the validation set was 1.000 for MFDM, 0.969 for CHARGE and 0.500 for TC (Tables 6, 7).

Design № 3
AUC was 0.602 [0.483-0.734] (p = 0.370) on the training set. This classification was not statistically significant and was therefore not tested on the validation set. The UMAP representation did not find any clusters based on EFTUD2 heterozygous pathogenic variation type and site (Supplementary Figure S1).

Discussion
Applications of machine learning are increasing in healthcare (46)(47)(48)(49). The field of dysmorphology has been transformed by the framework for genetic syndrome classification called DeepGestalt (50), produced by the Face2Gene group. Publications comparing human performances to DeepGestalt performances are flourishing (51)(52)(53)(54), and some authors state that digital tools provide better results than human experts in terms of diagnosis. We do not believe that Artificial Intelligence (AI) algorithms can fully replace the experience of an expert practitioner, but AI-based tools can considerably increase diagnostic performances, and also contribute to the diffusion of specialized expertise. However, as in all deep learning approaches, DeepGestalt predictions are difficult to explain (50): the phenotypic traits leading to diagnosis cannot be traced. Moreover, only frontal facial pictures are considered within this framework, which does not take into account profile pictures and external ears. To our knowledge, we report the first machine learning classifier based on external ear shape. Even though the diagnosis of a given syndrome is never fully based on ear anomalies, this anatomical region is a major source of distinctive phenotypic features in a large array of syndromes (42)(43)(44).
Ear phenotype in MFDM has been previously reported. Guion-Almeida et al. described 4 Brazilian children with small ears, a large lobe, and preauricular skin tags in 2000 (55) and 2006 (1). In 2009 (2), the same team described small and cup-shaped ears with atretic external auditory canal in two other cases. Smigiel et al. (56) reported three MFDM cases with asymmetric microtia, a thickened helix, and protruding ear lobes. Lehalle et al. (17) described abnormalities of the external ear in 100% of 34 MFDM cases, with minor abnormalities in 29/34 cases (squared, flattened and externally deviated ear lobe), asymmetric ears in 24% of cases and preauricular tags in 33% of cases. Voigt et al. (6), Huang et al. (4), Lines et al. (8) and Yu et al. (57) described similar abnormal pinnae. We could not find any information in the literature on the frequency of grade >I ear involvement in MFDM, or on the asymmetry of microtia.
In TC, Katsanis & Jabs (58) reported absent or small, malformed, sometimes rotated ears. Abdollahi Fakhim et al. (59) compared NAFD and TC without mentioning ears. Bernier et al. (18) described pinnae malformations in NAFD without providing further details. We did not find detailed phenotypic descriptions of the external ear in TC and NAFD in the literature.
In contrast, Davenport et al. (60) described the phenotype of CHARGE ears in greater detail. CHARGE ears were small, wide and 'looked as if they were stretched or bent' (60). The most distinctive feature according to these authors was the triangular shape of the concha and a discontinuity between the antihelix and the antitragus. Davenport et al. (60) also explained that many patients had small or absent lobes, with significant left/right asymmetry.
We thus report new features for MFDM ears: clockwise rotation and vertical shift of the concha (Figure 2). We confirm previously described features such as helix thickening, and enlarged and squared lobes. MFDM ears were also more asymmetric than controls. These overall features were shared with the NAFD and TC groups. Microtia grades were nevertheless higher in TC. CHARGE ears had a specific shape, with a triangular concha, a smaller but wider overall size with a thinner helix and a smaller lobe. In brief, the shape of the pinna can be considered as a relevant feature to differentiate MFDM from CHARGE. The classification algorithm from design № 1 provides an accuracy of 96.9% for distinguishing MFDM from controls, with only 1 patient misclassified in the validation set. Results were poorer with multi-class classification, which provided an overall balanced accuracy of 81.1% in design № 2.1 (MFDM and its differential diagnoses + controls) and 81.3% in design № 2.2 (MFDM and its differential diagnoses). These results reflect the difficulty of distinguishing MFDM from NAFD and TC. On the other hand, our results were satisfactory for detecting CHARGE ears, with an AUC reaching 85.7% in design № 2.1, and 96.9% in design № 2.2. We could not detect any genotype-phenotype correlations (design № 3).
The clinical use of automatic ear-based diagnosis can be illustrated by a preliminary case study. A non-premature female child aged 9 days was admitted in fetal pathology with bilateral choanal atresia, inner ear malformations, agenesis of the acoustic-facial bundle and cerebellopontine hypoplasia. She died within a few days after birth. CHARGE syndrome was confirmed post-mortem by a heterozygous de novo pathogenic variation in the CHD7 gene (c.4353+1G>A). The patient also carried a heterozygous de novo variation of unknown significance in the EFTUD2 gene (c.1954G>A, p.Asp652Asn). Our ear-based model applied to the ears of this patient (with an XGBoost classifier) proposed: CHARGE syndrome 84%, control patient 11%, MFDM 3%, NAFD 2% or TC 1% (Figure 6), supporting the diagnosis of CHARGE syndrome, and showing little tendency towards MFDM ear. As systematic EFTUD2 heterozygous pathogenic variation screening is currently recommended in unusual CHARGE cases (9), our model, with further clinical validation, could be used as a clinical support for directing genetic investigations.
Here we report the first attempt at automatic ear-based diagnosis in craniofacial dysmorphology. The algorithms we propose have been tested on independent and international validation sets involving rare disease centers in Europe and Asia. Validation data was nevertheless limited for NAFD, highlighting the need for data sharing when designing machine learning-based clinical tools. AI-based automatic facial diagnostic algorithms, including profile and ear analysis, are powerful approaches in supporting practitioners in diagnostic processes.

FIGURE 3
Comparison of average MFDM (red) and the main differential diagnoses: NAFD (green) (A, B), TC (purple) (C, D) and CHARGE (yellow) (E, F), after Procrustes transformation. Vectors (A, C, E) represent distances between MFDM mean landmarks and other groups' mean landmarks.

TABLE 1
Description of the training set population.

TABLE 2
Classification results on the validation set for design № 1.

TABLE 3
Confusion matrix on the validation set for design № 1.

TABLE 4
Classification results on the validation set for design № 2.1.

TABLE 5
Confusion matrix on the validation set for design № 2.1.

TABLE 6
Classification results on the validation set for design № 2.2.

TABLE 7
Confusion matrix on the validation set for design № 2.2.