Impact of data synthesis strategies for the classification of craniosynostosis

Introduction: Photogrammetric surface scans provide a radiation-free option to assess and classify craniosynostosis. Due to the low prevalence of craniosynostosis and strict patient data restrictions, clinical data are rare. Synthetic data could support or even replace clinical data for the classification of craniosynostosis, but this has never been studied systematically.

Methods: We tested the combinations of three different synthetic data sources for a convolutional neural network (CNN)-based classification of craniosynostosis: a statistical shape model (SSM), a generative adversarial network (GAN), and image-based principal component analysis (PCA). The CNN is trained only on synthetic data but is validated and tested on clinical data.

Results: The combination of an SSM and a GAN achieved an accuracy of 0.960 and an F1 score of 0.928 on the unseen test set. The difference compared with training on clinical data was smaller than 0.01. Including a second image modality improved classification performance for all data sources.

Conclusions: Without a single clinical training sample, a CNN was able to classify head deformities with an accuracy similar to that of a CNN trained on clinical data. Using multiple data sources was key for a good classification based on synthetic data alone. Synthetic data might play an important future role in the assessment of craniosynostosis.


Introduction
Craniosynostosis is a group of head deformities affecting infants that involves the irregular closure of one or multiple head sutures. Its prevalence is estimated to be between four and ten cases per 10,000 live births [1]. As described by Virchow's law [2], distinct types of head deformities arise depending on the affected suture. Genetic mutations have been identified as one of the main causes of craniosynostosis [3,4], which has been linked to increased intracranial pressure [5] and decreased brain development [6]. The most commonly performed therapy is surgical intervention consisting of resection of the suture and cranial remodeling of the skull. It has a high success rate [7] and is usually performed within the first two years of life. Early diagnosis is crucial and often involves palpation, cephalometric measurements, and medical imaging. Computed tomography (CT) imaging is the gold standard for diagnosis, but it uses harmful ionizing radiation, which should be avoided, especially for very young infants. Black-bone magnetic resonance imaging (MRI) [8] is sometimes performed, but requires sedation of the infants to prevent motion artifacts. 3D photogrammetric scanning enables the creation of 3D surface models of the child's head and face and is a radiation-free, cost-effective, and fast option to quantify head shape. It can be employed in a pediatrician's office and has the potential to be used with smartphone-based scanning approaches [9].
Due to its low prevalence, craniosynostosis is included in the list of rare diseases by the American National Organization for Rare Disorders. Because of the scarcity of data, strict patient data regulations, and difficulties in anonymization (photogrammetric recordings show head and face), no clinical datasets of craniosynostosis patients are publicly available. Synthetic data could potentially be used as a substitute to develop algorithms and approaches for the assessment of craniosynostosis, but only one synthetic dataset, based on a statistical shape model (SSM) from our group [10], has been made publicly available so far. Scarce training data and high class imbalance due to the different prevalences of the different types of craniosynostosis [4] call for the use of synthetic data to support or even replace clinical datasets as the primary resource for deep learning (DL)-based assessment and classification. The inclusion of synthetic data could facilitate training by reducing class imbalance and could increase the classifier's robustness and performance. Additionally, synthetic data may be a cost-effective way to acquire the required training material for classification models without manually labeling and exporting large amounts of clinical data. Using synthetic data for classification studies, either in a supporting manner or as a full replacement for clinical data, has gained traction in several fields of biomedical engineering (e.g., [11,12]), especially where clinical data are not abundant. While classification approaches for craniosynostosis on CT data [13], 2D images [14], and 3D photogrammetric surface scans [15,16,17] have been proposed, the dataset sizes were below 500 samples (e.g., [17], [15], and [13]) and showed a high class imbalance. The use of synthetic data is a straightforward way to increase training size and to balance the class distribution.
However, although the need for synthetic data has been acknowledged [15], synthetic data generation for the classification of head deformities has not been systematically explored yet. Given the scarce availability of clinical data and the multiple options for synthetic data generation, we aim to test the effectiveness of multiple data synthesis methods, both individually and as multimodal approaches, for the classification of craniosynostosis. Using synthetic data as training material not only facilitates the development of larger and more robust classification approaches, but also makes data sharing easier and increases data availability. A popular approach for 3D data synthesis is statistical shape modeling, which models 3D geometric shape variations by means of statistical analysis. Applied to head deformities, SSMs have been employed to distinguish clinical head parameters [18], to evaluate head shape variations [19], to assess therapy outcome [20], and to classify craniosynostosis [16]. Although their value in the clinical assessment of craniosynostosis has been shown, the impact of SSM-based data augmentation on the classification of craniosynostosis has not been evaluated yet. With the conversion of the 3D head geometry into a 2D image, convolutional neural network (CNN)-based classification [17] can be applied to low-resolution images. Generative adversarial networks (GANs) [21] have been suggested as a data augmentation tool [15] and have been able to increase classification performance for small datasets [22].
The goal of this work is to employ a classifier based on synthetic data, using three different data synthesis strategies: SSM, GAN, and image-based principal component analysis (PCA). The three modalities are systematically compared regarding their capability in the classification of craniosynostosis when the classifier is trained only on synthetic data. We demonstrate that the classification of craniosynostosis is possible with a multimodal synthetic dataset, with a performance similar to that of a classifier trained on clinical data. Additionally, we propose a GAN design tailored towards the creation of low-resolution images for the classification of craniosynostosis. The GAN, the different SSMs, and the PCA were made publicly available, along with all 2D images from the synthetic training, validation, and test sets.

Dataset and Preprocessing
All data from this study were provided by the Department of Oral and Maxillofacial Surgery of the Heidelberg University Hospital, in which patients with craniosynostosis are routinely recorded for therapy planning and documentation purposes. The recording device is a photogrammetric 3D scanner (Canfield VECTRA-360-nine-pod system, Canfield Science, Fairfield, NJ, USA). We used a standardized protocol which had been examined and approved by the Ethics Committee of the Medical Faculty of the University of Heidelberg (Ethics number S-237/2009). The study was carried out according to the Declaration of Helsinki, and written informed consent was obtained from parents.
Figure 1: Landmarks provided in the dataset, used for the alignment for statistical shape modeling and the coordinate system creation of the distance maps [17]. The three landmarks on the right exist for both the left and the right part of the head.

Each data sample was available as a 3D triangular surface mesh. We selected the 3D photogrammetric surface scans from all available years (2011-2021). If multiple scans of the same patient were available, we selected only the last preoperative scan to avoid duplicate samples of the same patient. All patient scans had been annotated by medical staff with their diagnosis and 10 cephalometric landmarks. Fig. 1 shows the landmarks available in the dataset. We retrieved patients with coronal suture fusion (brachycephaly and unilateral anterior plagiocephaly), sagittal suture fusion (scaphocephaly), and metopic suture fusion (trigonocephaly), as well as a control group, with the dataset distribution displayed in Fig. 2. Besides healthy subjects, the control group also contained patients suffering from mild positional plagiocephaly without suture fusion. Subjects with positional plagiocephaly in the control group were treated with helmet therapy or laying repositioning. In contrast, all patients suffering from craniosynostosis required surgical treatment and underwent remodeling of the neurocranium. The head shapes of the four classes are visualized in Fig. 3.
We used the open-source Python module pymeshlab [23] (version 2022.2) to automatically remove recording artifacts such as duplicated vertices and isolated parts. We also closed holes resulting from incorrect scanning and removed irregular edge lengths by using isotropic explicit re-meshing [24] with a target edge length of 1 mm. In an earlier work [17], we defined a 2D encoding of the 3D head shape ("distance maps", displayed in Fig. 3, bottom row), which was also included in the pre-processing pipeline with the default parameters from [17].

Data subdivision
We did not use the full clinical dataset (validation and test set according to Fig. 4) as training data for the data generation models (GAN, SSM, and PCA), since the statistical information of the test set would then be included in the synthetic data sources, leading to leakage (an overestimation of the model performance due to statistical information "leaking" into the test set). Instead, we chose the scheme displayed in Fig. 4. We used a stratified 50-50 split of the clinical data, with one half of the samples serving as the validation set and the other half as the test set.
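The stratified 50-50 split described above can be illustrated with a minimal, dependency-free sketch. The function name `stratified_half_split`, the seed, and the toy labels are hypothetical and are not taken from the published code; how odd class counts were handled in the study is not stated, so assigning the extra sample to the test half is an assumption.

```python
import random
from collections import defaultdict


def stratified_half_split(labels, seed=0):
    """Split sample indices into two halves with (almost) equal class ratios.

    Assumption: for an odd class count, the extra sample goes to the test half.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    validation, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)          # random assignment within each class
        half = len(indices) // 2
        validation.extend(indices[:half])
        test.extend(indices[half:])
    return sorted(validation), sorted(test)


# Hypothetical toy labels (not the clinical class counts).
labels = ["control"] * 8 + ["sagittal"] * 4 + ["metopic"] * 2 + ["coronal"] * 2
val_idx, test_idx = stratified_half_split(labels)
```

In this sketch only `val_idx` would be passed on to the generative models, so that no statistical information from `test_idx` can reach the synthetic training data.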
The test set was separated from the validation set and only used for the final evaluation of the classifier. Following this approach, the test set neither had any influence on the synthetic data nor was incorporated into the validation set, and should therefore be a true representation of data unknown to the classifier. The validation set was used to select the best network during training and for hyperparameter tuning, but not as training material. Additionally, it was used as the original (training) data on which we built the synthetic image generators. The synthetic training set was then created from the validation set according to the three data synthesis approaches described below: SSM, GAN, and PCA. The three approaches operated on different domains: while the SSM was applied directly to the 3D surface scans, the GAN and the PCA used the 2D distance map images. All images were created as 28×28-sized craniosynostosis distance maps, a size that was sufficient for good classification in an earlier study [17]. We describe each of the three individual approaches SSM, GAN, and PCA below.

Statistical shape model
The pipeline for the SSM creation (similar to [25]) consists of initial alignment, dense correspondence establishment, and statistical modeling to extract the mean shape and the principal components from the sample covariance matrix (see also Fig. 5). For correspondence establishment, we employed template morphing.
Figure 5: The statistical shape model pipeline employed in this study. The target scan is colored green with the deforming template in white.

We used the mean shape of our previously published SSM [10] as a template, which was morphed onto each of the target scans. Procrustes analysis was employed on the ten cephalometric landmarks on the face and ears to obtain a transformation including translation, rotation, and isotropic scaling from the template to each target. For correspondence establishment, we employed the Laplace-Beltrami regularized projection (LBRP) approach [26] to morph the template onto each of the targets. We used two iterations: a high stiffness fit (providing a now landmark-free transformation from the template to the target, which also improves the alignment at the back of the head, which is not covered by landmarks) and a low stiffness fit (allowing the template to deform very close to the targets [27]). The deformed templates were then in dense correspondence, sharing the same point IDs across all scans, and were used for further processing.
Generalized Procrustes analysis (GPA) was performed to remove both rotational and translational components on all the morphed templates so that the mean shape could be determined and removed. The remaining zero-mean data matrix served as the basis for the principal component analysis. To counterbalance the higher point density in the facial regions, we used weighted PCA instead of ordinary PCA for the statistical modeling. The weights were assigned according to the surface area that each point encapsulated and computed using the area of each triangle of the surface model. We created one SSM for each class, ensuring that the models were independent from each other and did not contain influences from the other classes. We cut off the coefficient vectors after 95 % of the normalized variance to remove noise and to ensure that only the most important components were included in the SSMs. The synthesis of the model instances could then be performed as

$$s = \bar{s} + V \Lambda^{1/2} \alpha$$

with $\bar{s}$ denoting the mean shape, $V$ the principal components, $\Lambda$ the sample covariance matrix, and $\alpha$ the shape coefficient vector. We created 1000 random shapes of each class using a Gaussian distribution of the shape coefficient vector and created craniosynostosis distance maps for each sample.
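The 95 % variance cutoff can be sketched as follows. This is a minimal illustration, assuming that the cutoff is applied to the cumulative sum of the eigenvalues normalized by their total; the function name `components_for_variance` and the eigenvalues are hypothetical.

```python
def components_for_variance(eigenvalues, threshold=0.95):
    """Return the smallest number of leading principal components whose
    normalized cumulative variance reaches the given threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return count
    return len(ordered)


# Hypothetical eigenvalue spectrum of one class-specific model.
eigenvalues = [6.0, 2.5, 1.2, 0.2, 0.1]
n_kept = components_for_variance(eigenvalues)  # 3 components cover 97 % here
```

All components beyond `n_kept` would be discarded before sampling, which removes the low-variance directions that mostly carry noise.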

Image-based principal component analysis
We used ordinary PCA as the last modality to generate 2D image data. While the SSM also made use of PCA in the 3D domain, image-based PCA operated directly on the 2D images. This was a computationally inexpensive and less sophisticated alternative to both GANs and SSMs, since neither extensive model training and hyperparameter tuning nor 3D morphing and correspondence establishment was required. We employed ordinary PCA for each of the four classes separately and again created 1000 samples for each class. Since the SSM is related to PCA, the data synthesis could be performed as

$$i = \bar{i} + V \Lambda^{1/2} \alpha$$

with $\bar{i}$ denoting the mean image in vectorized shape, $V$ again the principal components, $\Lambda$ the sample covariance matrix, and $\alpha$ the coefficient vector of the principal components. We again drew 1000 random vectors from a Gaussian distribution and transformed them back into 2D image shape.
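Sampling from such a PCA model can be sketched without external dependencies. The sketch assumes the usual form "mean plus principal components scaled by the square roots of their variances, weighted by standard-normal coefficients"; the names `sample_instance` and `to_image` and the 2×2 toy model are hypothetical, whereas the study used 28×28 distance maps.

```python
import math
import random


def sample_instance(mean, components, variances, rng):
    """Draw one vectorized instance: mean + sum_k alpha_k * sqrt(var_k) * v_k.

    mean:       list of D floats (vectorized mean image or shape)
    components: list of K principal components, each a list of D floats
    variances:  list of K variances (eigenvalues) along the components
    """
    alpha = [rng.gauss(0.0, 1.0) for _ in variances]  # alpha ~ N(0, I)
    instance = list(mean)
    for a, variance, component in zip(alpha, variances, components):
        scale = a * math.sqrt(variance)
        for d, value in enumerate(component):
            instance[d] += scale * value
    return instance


def to_image(vector, width):
    """Reshape a vectorized image back into rows of the given width."""
    return [vector[r:r + width] for r in range(0, len(vector), width)]


# Hypothetical 2x2 toy model with two components.
mean = [0.5, 0.5, 0.5, 0.5]
components = [[0.5, 0.5, 0.5, 0.5], [0.5, -0.5, 0.5, -0.5]]
variances = [0.04, 0.01]
rng = random.Random(42)
samples = [to_image(sample_instance(mean, components, variances, rng), 2)
           for _ in range(1000)]
```

The same routine applies to the SSM when the vectors hold stacked 3D point coordinates instead of pixel intensities.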

Generative adversarial network
The GAN combines multiple suggestions from different GAN designs and was designed as a conditional [28] deep convolutional [29] Wasserstein [30] GAN with gradient penalty [31] (cDC-WGAN-GP). The design in terms of the intermediate image sizes is visualized in Fig. 6. For the full design including all layers, consult Appendix A.
We opted for a design including a mixture of transposed, interpolation, and normal convolutional filter kernels, which prevented checkerboard artifacts and large patches. The combination of interpolation layers and transposed convolutional layers led to better images than either approach alone (see also Appendix B, Fig. 12, and our previous approach [32]). The conditioning of the GAN was implemented as an embedding vector controlling the image label that we wished to synthesize. We trained the GAN for 1000 epochs using the Wasserstein distance [30], which is considered to stabilize training [33]. Instead of the originally proposed weight clipping, we used a gradient penalty [31] with $\lambda = 1$. We used 10 critic iterations before updating the generator and a learning rate of $\alpha = 3 \cdot 10^{-5}$ for both networks. The loss $L$ can be described as follows [31]:

$$L = \mathbb{E}_{\tilde{x}}\left[D(\tilde{x}|y)\right] - \mathbb{E}_{x}\left[D(x|y)\right] + \lambda \, \mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}|y) \rVert_2 - 1\right)^2\right]$$

with $x$ denoting the clinical samples, $\tilde{x}$ the generator samples $G(z|y)$, and $\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}$, where $\epsilon$ is a uniformly distributed random variable between 0 and 1 [31].
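The structure of the critic loss with gradient penalty can be illustrated numerically. The sketch below follows the general WGAN-GP formulation from [31] and takes the critic scores and gradient norms as precomputed numbers; in an actual training loop they would come from the critic network and automatic differentiation. The function names and all values are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def interpolate(real, fake, eps):
    """Blend a real and a generated sample: eps * real + (1 - eps) * fake."""
    return [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]


def critic_loss(scores_real, scores_fake, grad_norms, lam=1.0):
    """WGAN-GP critic loss from precomputed quantities.

    scores_real: critic outputs for real samples
    scores_fake: critic outputs for generated samples
    grad_norms:  L2 norms of the critic gradient at the interpolated samples
    lam:         gradient penalty weight (1 in the setup described above)
    """
    penalty = _mean([(g - 1.0) ** 2 for g in grad_norms])
    return _mean(scores_fake) - _mean(scores_real) + lam * penalty


# Hypothetical numbers for one mini-batch of two samples.
x_hat = interpolate([1.0, 1.0], [0.0, 0.0], eps=0.25)
loss = critic_loss(scores_real=[1.0, 3.0], scores_fake=[0.0, 1.0],
                   grad_norms=[1.0, 2.0], lam=1.0)
```

The penalty term vanishes only when the gradient norm at the interpolated samples equals 1, which is how the gradient penalty replaces weight clipping as a soft Lipschitz constraint.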

Image assessment
We used the structural similarity index measure to the closest clinical sample (SSIM_cc) as a metric to assess the similarity of the synthetic images to the clinical images. For each synthetic sample, the SSIM_cc is the SSIM with respect to the closest clinical sample among all clinical samples of the same class N. It has to be noted that the SSIM_cc itself did not assess the quality of the synthetic images, but was rather designed to evaluate the similarity to the clinical images. With this approach, we tried to quantify a "good" data generator: the data should not be very similar to the original data (because then we could simply use the original data), but also not too different (because then they might no longer be a true representation of the underlying class). The SSIM_cc of "good" images should therefore be neither "too close" to 1 nor "too low".

CNN Training
ResNet18 was used as the classifier since it showed the best performance on this type of distance map [17]. We used PyTorch's [34] publicly available, pretrained ResNet18 model and fine-tuned the weights during training. During training, all images were resized to 224×224 to match the input size of ResNet18. We performed a separate CNN training run for each of the seven combinations of the synthetic data. The CNN was trained only on synthetic data (except for the clinical scenario, which was trained on clinical data for comparison). During training, we evaluated the model on both the (purely synthetic) training data and the (clinical) validation set (see also Fig. 8). The best-performing network was chosen according to the maximum F1-score on the validation set. The test set was never touched during training and was only evaluated in a final run after training.
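Selecting the best network by validation F1-score can be sketched as follows. The averaging scheme for the four-class F1-score is not specified above, so the unweighted (macro) average used here is an assumption; the function names and toy values are hypothetical.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1-scores (assumed averaging scheme)."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


def best_epoch(validation_f1_per_epoch):
    """Index of the epoch with the highest validation F1-score."""
    return max(range(len(validation_f1_per_epoch)),
               key=validation_f1_per_epoch.__getitem__)


# Hypothetical two-class toy example.
f1 = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"], classes=["a", "b"])
chosen = best_epoch([0.70, 0.90, 0.85])
```

In a training loop, the model weights saved at epoch `chosen` would be the ones evaluated once on the untouched test set.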
When multiple data sources were used, the models had a different number of training samples (see Fig. 9). All synthetically trained models were trained for 50 epochs. Convergence was usually reached within the first ten epochs, indicating that there was sufficient training material for each model. We used the Adam optimizer, cross-entropy loss, a batch size of 32, a learning rate of $1 \cdot 10^{-4}$, and a weight decay of 0.63 after every 5 epochs. To evaluate the synthetically trained models against a clinically trained model, we additionally employed one CNN trained on clinical data with the same parameters except for a higher learning rate of $1 \cdot 10^{-3}$.
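The seven training scenarios and their sample counts can be enumerated with a short sketch. It assumes that every source contributes 1000 images for each of the four classes, which matches the 4000, 8000, and 12000 samples reported in Fig. 9; the constant and function names are hypothetical.

```python
from itertools import combinations

SOURCES = ("SSM", "GAN", "PCA")
SAMPLES_PER_CLASS = 1000   # assumed to be identical for all three sources
N_CLASSES = 4              # control, coronal, metopic, sagittal


def training_scenarios():
    """Map each non-empty combination of sources to its training set size."""
    scenarios = {}
    for size in range(1, len(SOURCES) + 1):
        for combo in combinations(SOURCES, size):
            name = " + ".join(combo)
            scenarios[name] = len(combo) * SAMPLES_PER_CLASS * N_CLASSES
    return scenarios


scenarios = training_scenarios()
```

Because all synthetic scenarios are class-balanced by construction, the larger combinations differ from the single sources in data diversity and size rather than in class distribution.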
We used the following types of data augmentation during training: adding random pixel noise (with σ = 1/255), adding a random intensity (with σ = 5/255) across all pixels, horizontal flipping, and shifting images left or right (with σ = 12.44 pixels). All these types of data augmentation corresponded to real-world patient and scanning modifications: pixel noise corresponded to scanning and resolution errors, adding a constant intensity was equal to a rescaling of the patient's head, horizontal flipping corresponded to mirroring the patient in real life, and shifting the image horizontally modeled an alignment error in which the patient effectively turns their head 20° left or right during recording.

All the clinical 2D data, the GAN, and the statistical models were made publicly available. We included a script to create synthetic samples for all three image modalities to allow users to create a large number of samples. The synthetic and clinical samples of this study are available on Zenodo [35].

From the quantitative comparison (see Fig. 11), ordinary PCA images were substantially and consistently more similar to the clinical images than the other two modalities (differences of the medians larger than 0.02), while SSM and GAN images were less similar, with the SSM images being the most dissimilar for the coronal class.

Classification results
All comparisons presented here were carried out on the untouched test set. According to the classification results for the synthetic training in Tab. 1, the SSM was the best single source of synthetic data with an F1-score higher than 0.85. All combinations of synthetic models showed F1-scores higher than 0.8. The classifier trained on the clinical data scored an accuracy above 0.96, but was surpassed by the combination of GAN and SSM. The F1-score was highest for the clinical classification (0.9533), but the combination of SSM and GAN scored a very close F1-score (0.9518). Including a second data source always improved the F1-score compared to a model with a single data source (adding PCA to GAN by 0.29, adding SSM to PCA by 0.16, adding SSM to GAN by 0.1).

Discussion
Without being trained on a single clinical sample, the CNN trained on the combination of the SSM and the GAN was able to correctly classify 95 % of the data. Classification performance with the synthetic data proved to be equal to or even slightly better than with training on the clinical data, at least for the data generated using the SSM and the GAN (and optionally also PCA). This suggests that certain combinations of synthetic data might indeed be sufficient for a classification algorithm to distinguish between types of craniosynostosis.
Compared with classification results from other works, the purely synthetic-data-based classification performs in a similar range and sometimes even better than other approaches on clinical data [15,17,16,36,13].
The SSM appeared to be the data source contributing the most to the improvement of the classifier: not only did it score highest among the single data sources, but it was also present in the highest-scoring classification approaches. One reason for this might be that it was also the least similar data source for most of the classes. Due to the inherent modeling of the geometric shape in 3D, the 2D distance maps of the SSM are always created from 3D samples, while PCA and the GAN could, in theory, create 2D images which do not correspond to a 3D shape. In contrast, the GAN-based classifiers only showed a good classification performance when combined with a different data modality, and the synthesized GAN images seemed to show less pronounced visual features than the other two modalities. The SSIM_cc-based metric, however, shows no substantial difference between the GAN images and the other two modalities. One possible reason might be that the GAN learned features of multiple classes, so that the images might still contain features derived from images of other classes. The PCA images were neither required nor detrimental for a good classification performance. According to the SSIM_cc, the PCA images were the most similar to their clinical counterparts.
Overall, a combination of different data modalities seemed to be the key element for achieving a good classification performance. Both SSM and PCA model the data according to a Gaussian distribution, while the GAN uses an unrestricted distribution model. Modeling the underlying statistical distribution as Gaussian (SSM and PCA) on the one hand, and without an assumed distribution (GAN) on the other hand, might have led to a compensation of their respective disadvantages, increasing the overall performance of the combinations. One limitation of this study is the small dataset. As the clinical classification uses the same dataset for training and validation, it might be prone to overfitting. However, the classification metrics achieved in this study were similar to those of a classification study on clinical data alone [17], which suggests that overfitting has not been an issue.

Conclusion
We showed that it is possible to train a classifier for different types of craniosynostosis based solely on artificial data synthesized by an SSM, PCA, and a GAN. Without having seen any clinical samples, a CNN was able to classify four types of head deformities with an F1-score higher than 0.95 and performed comparably to a classifier trained on clinical data. The key component in achieving good classification results was using multiple, but different, data generation models. Overall, the SSM was the data source contributing most to the classification performance. For the GAN, using a small image size and alternating between transposed convolutions and interpolations were identified as key elements for suitable image generation. The datasets and generators were made publicly available along with this work. We showed that clinical training data is not required for the classification of craniosynostosis, paving the way for a cost-effective use of synthetic data in automated diagnosis systems.

B Failed GAN attempts
Fig. 12 shows artifacts arising from using only transposed convolutional layers (ConvTranspose2d), from using only up-scaling interpolation layers (Interpolate), or from large gradient penalties, which prohibit training.

Figure 2: Pie chart of the class ratios in the clinical dataset (control 56 %, coronal 5 %, metopic 14 %, sagittal 25 %). The legend in the center shows the absolute number of samples in the dataset (in total 496 samples).

Figure 3: The four classes of the dataset with their distinct head shapes and their resulting distance map representations. Top row: frontal view, middle row: top view, bottom row: 2D distance maps.

Figure 4: Data subdivision for the synthetic-data-based classification and the creation of synthetic data. The test set was separated initially from the dataset, while the validation set was used to produce the synthetic samples on which the CNN was trained. Green: data, blue: 3D-2D image conversion, dark red: generative models.

Figure 6: Visualization of the intermediate image sizes of the GAN model. Left: generator, right: critic (discriminator). The filter kernel sizes are described in Appendix A.

Figure 7: Image development of the GAN generator during different stages of training visualized as a 2×2 grid.

Figure 8: Classification training using the synthetic data, the validation data, and the test set. The CNN classifier using clinical data uses the validation data as a training set. Green: data, violet: classification models.

Figure 9: Number of training samples in each classification scenario. The clinical scenario has fewer than 500 samples, while all synthetic scenarios have 4000, 8000, or 12000 samples.

Figure 10: Images of all three data modalities and clinical samples. From top to bottom, the image modalities: SSM, GAN, PCA, clinical. From left to right, the four classes: control, coronal, metopic, sagittal.

Fig. 10 shows images of each of the different data synthesis types compared with the clinical images. From a qualitative, visual examination, the synthetic images

Figure 11: Boxplots of the SSIM_cc (structural similarity index measure to closest clinical sample) of each class for each of the synthetic data generators.

Figure 12: Failed GAN images arranged in a 2×2 grid with artifacts arising from poor network design or bad training conditions. From left to right: deconvolution artifacts, interpolation artifacts, and noise artifacts.

Table 1: CNN classification comparison on the test set for training on different synthetic data sources. Boldface: best results among the data sources.