Automated T1 and T2 mapping segmentation on cardiovascular magnetic resonance imaging using deep learning

Introduction Structural and functional heart abnormalities can be examined non-invasively with cardiac magnetic resonance imaging (CMR). Thanks to the development of MR devices, diagnostic scans can capture more and more relevant information about possible heart diseases. T1 and T2 mapping are such novel technology, providing tissue specific information even without the administration of contrast material. Artificial intelligence solutions based on deep learning have demonstrated state-of-the-art results in many application areas, including medical imaging. More specifically, automated tools applied at cine sequences have revolutionized volumetric CMR reporting in the past five years. Applying deep learning models to T1 and T2 mapping images can similarly improve the efficiency of post-processing pipelines and consequently facilitate diagnostic processes. Methods In this paper, we introduce a deep learning model for myocardium segmentation trained on over 7,000 raw CMR images from 262 subjects of heterogeneous disease etiology. The data were labeled by three experts. As part of the evaluation, Dice score and Hausdorff distance among experts is calculated, and the expert consensus is compared with the model’s predictions. Results Our deep learning method achieves 86% mean Dice score, while contours provided by three experts on the same data show 90% mean Dice score. The method’s accuracy is consistent across epicardial and endocardial contours, and on basal, midventricular slices, with only 5% lower results on apical slices, which are often challenging even for experts. Conclusions We trained and evaluated a deep learning based segmentation model on 262 heterogeneous CMR cases. Applying deep neural networks to T1 and T2 mapping could similarly improve diagnostic practices. Using the fine details of T1 and T2 mapping images and high-quality labels, the objective of this research is to approach human segmentation accuracy with deep learning.


Introduction
Cardiovascular diseases are the leading cause of death worldwide, claiming approximately 18 million lives a year (1). Its high morbidity burden further underscores the need for improving our methods to address these diseases. Recent advances in cardiovascular imaging has enhanced our capability to better understand, prevent, diagnose and stratify patients (2). Cardiovascular Magnetic Resonance (CMR) is the gold standard imaging method for cardiac anatomy and function assessment, as well as enabling non-invasive tissue characterization. Besides late gadolinium enhancement, parametric cardiac mapping is a widely used technique to quantitatively measure the properties of the myocardial tissue. It provides direct visualization of tissue-specific MR properties like T1, T2 and T2*. Parametric mapping provides pixel-by-pixel representations of the numerical T1 and T2 myocardial tissue properties defined in units of time (milliseconds). In recent years mapping values have been established as valuable descriptors of myocardial alterations (3). These variations reflect on specific intracellular and/or extracellular tissue changes linked to specific pathologies. For example, intracellular accumulation of glycosphingolipid in Anderson-Fabry disease, extracellular fibrosis in cardiac amyloidosis or extra-and intracellular oedema in acute myocardial inflammation or injury.
Currently, the 2020 Society of Cardiovascular Magnetic Resonance (SCMR) guideline recommends using manually drawn regions of interest (ROI) within the mid-ventricular or basal septal segment to assess the global tissue character (4). This contouring practice reduces variability caused by motion artifacts, primarily affecting the lateral wall. On the other hand, manual ROI segmentation enables reader subjectivity and might increase interobserver variability. Moreover, several pathological alterations disproportionately affect the lateral wall of the left ventricle (LV), such as myocardial injury caused by acute myocarditis, necessitating reliable tissue quantification in all segments of the left ventricular (LV) myocardium (5). In contemporary clinical practice, labor-intensive manual segmentation is the prerequisite to extracting key clinical measures of the tissue character, such as global and segmental T1 and T2 mapping values. This task is not only time-consuming but also prone to several subjective practices, preventing the generalization of the results.
In recent years artificial intelligence driven methods have transformed the segmentation and reporting practices in CMR imaging. This effort, motivated by the time-consuming manual contouring for precise quantification of volumes-and function, has led to a multitude of automated tools for cine movie segmentation (6). In contrast, reliable tools for automated mapping contour generation is still scarce and did not reach clinical applicability.
In general, segmentation tools developed within the frameworks of large, generally very healthy cohorts such as the UK Biobank (UKB) Imaging substudy are mainly exposed to healthy anatomy (7); therefore, the performance of these algorithms is lower within clinical cohorts. Tools developed within these cohort studies are further limited by the specific set of acquired slices. For example the UKB contains only one mid ventricular T1 mapping slice per participant, which prevents algorithm training in hard-to-learn basal or apical slices. Moreover, a special type of mapping imaging sequence was used, which is not widespread in the clinical routine, potentially limiting clinical implementation of such algorithms.
In this study, we aimed to develop an automated segmentation tool for mapping images within a diverse clinical cohort to promote the generalizability of automated mapping segmentation tools.

Related work
There are several CMR mapping sequences suitable for tissue quantification. The most common variants based on inversionrecovery, saturation recovery or the combination of both (3). In this work, we are focusing on the most commonly used T1 mapping sequence in the clinical routine: the modified Look-Locker inversion recovery (MOLLI) (8) which permits measurement of T1 times in a single breath hold fashion over 17 heart beats.
Deep learning (DL) has become one of the most widely used approaches in cardiac imaging in several imaging modalities (e.g., Magnetic Resonance Imaging, X-Ray, Computed Tomography, Ultrasound). These modalities enable non-invasive qualitative and quantitative assessment of anatomical structures, functions and support the radiologist/cardiologist for diagnosis, follow-up, and prognosis as well.
In general DL, can be used within two fields of CMR, a.) at k-space field, when machine learning reconstruction approaches were used to learn non-linear optimization and improve the CMR reconstruction when an MR-based acceleration technique (e.g., Compressed Sensing) was used (9). In another perspective DL can be applied at k-space to suppress the artifacts before the image reconstruction (10).
The second field, b.) when DL could be useful is the image domain. A huge number of studies concentrate on cardiac MRI segmentation, because those provide an efficient way for segmenting tissues (e.g., left ventricle, right ventricle, vessels, etc.).
Tran (11) applies DL to segment the left ventricle, right ventricle and myocardium on short-axis images. Khened et al. (12) applied a U-net for segmentation with large anatomical variability. The majority of ventricle and myocardial segmentation studies used 2D networks instead of 3D networks, due to the low resolution, motion artifacts and the limited availability of 3D dataset. Fully Convolutional Neural Networks (FCNs) could be useful at atrial segmentation (13), scar segmentation/classification (14) or whole heart segmentation (15). Hybrid segmentation is also worth mentioning. In this case the DL-based algorithm is combined with the traditional segmentation approaches (e.g. deformable models (16) or atlas-based methods (17). These algorithms provide the segmentation accuracy. Chen

Materials and methods
In this chapter, we introduce the data we utilize in this study, its acquisition method, preprocessing and annotation methodology.
Then we present a deep convolutional network based automated segmentation method, including the network's architecture and its training.

Experimental data: participants
Overall, we included 262 participants in our study (71% male, mean age 46 + 16 years). To permit the consideration of a wide variety of cardiovascular phenotypes we consecutively included all clinical patients who were referred to CMR examination between September and November 2021 at Semmelweis University Heart and Vascular Center. Furthermore, we randomly selected healthy volunteers from the Center's CMR database.
Diagnostic labeling was provided for each case by two independent, experienced CMR readers with more than four and five years of experience (EACVI exams completed).
In case of uncertainty a third, EACVI level III reader was contacted for consensus. We labeled each participant as per the current clinical guidelines and expert recommendations. Please note, that this dataset is curated to produce a fair representation of the diversity of clinical datasets and to provide a real-life segmentation scenario for algorithm training. If the results of the CMR examination were not consistent with any specific underlying disease we provided the phenotypic label best describing the scan, including left ventricular hypertrophy, aspecific gadolinium enhancement, aspecific cardiomyopathy indicating further assessment. This approach will allow for the appreciation of the diversity of the data. For simplicity, we pulled together valvular diseases into one diagnostic label, as mitral prolapse, aortic stenosis and aortic insufficiency are often present at the same time. Diagnosis groups are shown in Figure 1. The proportion of healthy volunteers, athletes and pathologies in our dataset was reported using cardiovascular magnetic resonance (CMR) imaging. "No myocardial alterations" were reported, when the patient was referred to CMR due to a specific diagnostic question, but no pathology was found. CMR examinations were performed on a 1.5T MR scanner (Magnetom Aera, Siemens Healthcare, Erlangen, Germany) using a spine and an 18-channels phased array coil. The development and testing data contain 638 CMR pre-contrast (native) T1 maps where the typical imaging parameters are: TR ¼ 2:7 ms; TE ¼ 1:12 ms; pixelsize ¼ 1:5 Â 1:5 mm; slicethickness ¼ 8 mm; flipangle ¼ 35 ; FOV ¼ 323 Â 360; and 720 T2-maps with the following parameters: TR ¼ 2:4 ms; TE ¼ 1:06 ms; pixelsize ¼ 1:5 Â 1:5 mm; slicethickness ¼ 8 mm; flipangle ¼ 70 ; FOV ¼ 315 Â 360. MOLLI technique was used at both maps. All images were short-axis views of the left ventricular (LV) myocardium. We acquired mapping sequences in three short-axis slices according to the anatomical landmarks: in the basal, midventricular and apical part of the heart. Two-fold accelerated parallel imaging technique (GRAPPA) was used to shorten the breath-hold.
Conventional MOLLI sequence was used to produce T1 and T2 mapping images as output maps from individual images ascertained at predefined intervals. T2 maps were generated by using a T2-prepared steady-state free precession sequence (19) with three different echo times (TE): TE 1 ¼ 0 ms, TE 2 ¼ 25 ms, TE 3 ¼ 55 ms (20). In order to generate T2 maps, several images (in our case 3) with varying T2 sensitivity are acquired at the same cardiac phase over multiple cardiac cycles. Then, a mono-exponential curve is fit at each pixel on the T2-weighted images and finally a T2 map is generated (20). For T1 maps generation, multiple (in our case 8) T1 weighted images are acquired at different times after the inversion pulse and then the data can be fit to an equation: A À B exp ( À t=T1), where A and B are fitting parameters, t is the time after the preparation pulse and T1 is the T1 relaxation time (21). To increase the training and validation sample of our algorithm we included each image individually as well as T1 and T2 parametric maps.
We split the dataset into three partitions: training, validation and testing with 65%-20%-15% ratios of all study participants respectively. This partitioning is fixed for all trainings to help reproducibility and comparability of models. We perform partitioning on a case level (instead of for images) to avoid any information from the test set being included in the train and validation partitions. I.e., all data of a patient is strictly included in only one of the training, validation, or test partitions. This partitioning results in 832, 287, 239 T1 and T2 maps (172, 50, 40 cases) in the training, validation and test sets respectively. A summary of the number of samples per slice in both the training and test sets can be found in Table 1.

Experimental data: ground truth segmentation, and interobserver subset
Ground truth segmentation was performed by two experienced observers using the Medis Suite Software (Medis Medical Imaging Software, The Netherlands). We draw the endo-and epicardial contours manually on the motion-corrected images. After visual quality control, endo-and epicardial contours were manually drawn in basal, midventricular and apical slices covering the LV myocardium. To minimize partial volume effect care was taken to exclude the border zones between myocardium and the blood pool during segmentation.
In the test set (15% of the complete dataset, n ¼ 40 patients), we evaluate the interobserver variability of the manual segmentation approach. Three readers with one, five and four years of experience in CMR reporting, respectively provided the segmentation. They were blinded to clinical label and demographic characteristics of the patients.

Implementation: model architecture and training details
We apply a widely used U-Net (22) segmentation network to the cardiac mapping segmentation problem. Our motivation for choosing this architecture is to provide a method that can be trained quickly and achieve a strong baseline for segmentation. To simplify and shorten the training of this network we initialize the U-Net's encoder with weights which were previously pretrained on a large and diverse general image dataset, ImageNet (23). Even though properties and distribution of data from the ImageNet dataset differ from MRI images, such transfer-learning approach is often used successfully in medical image analysis problems (24). As opposed to the encoder, we do not initialize the decoder from pretrained weights, instead train its weights for the mapping segmentation problem from random initialization.
A U-Net-like encoder-decoder model can be constructed of multiple encoder and decoder architectures. Pretrained weights are available for many encoder architectures. We opted for a ResNet-50 (25) encoder due to its frequent use in computer vision literature for segmentation tasks. The encoder and the decoder consist of convolutional, pooling and upsampling layers, with 50 and 8 convolutional layers in the encoder and the decoder, respectively. The reason for choosing a smaller decoder is to introduce less randomly initialized weights into the segmentation network, thereby reducing its training time. The detailed architecture of our model is shown on Figure 2.
Our experts provide ground truth segmentation labelling as epiand endocardial contours. These are represented as ordered twodimensional point sets described by {(x, y) [ R} pixel coordinates for epicardial and endocardial left ventricle contours. Although this representation is more accurate and efficient than segmentation masks, the latter is required in order to train neural networks for segmentation. We convert contour point sets to segmentation masks, i.e., pixel-level classification labels, with three classes: left ventricular cavity (area enclosed by the endocardial contour), left ventricular myocardium (area between the endo-and epicardial contour), and "background," which is the class assigned for pixels that do not belong to the first two classes, i.e. all pixels outside the epicardial contour. Finally, we generate segmentation masks to match the resolution of the input samples (thereby achieving pixel-level classification). We apply random augmentations to the input samples during training. These are cropping, resizing, contrast and intensity perturbations. We use fixed cropping and resizing to match the input size of the network during inference (which is 224 Â 224 pixels).
We train the segmentation network using the Adam gradientbased optimization algorithm (26). The objective function for segmentation training is the Jaccard loss (27). We train the network for 150 epochs, which takes 50-60 min on a single NVIDIA TESLA V100 GPU. Based on empirical analysis, we apply a periodic cosine annealing learning rate schedule (see Figure 3).
During the first 70 epochs we include both T1 and T2 parametric maps and the images used to construct them in the training sample set, however, during the last 80 epochs we limit FIGURE 2 Deep neural network architecture: U-Net with ResNet-based encoder and convolutional decoder. Learning rate schedule used throughout the training as proposed by Loshchilov and Hutter (28). 3.5. Implementation: hardware and software architecture For training our automated segmentation method, we used a single Nvidia V100 16 GB GPU, 10 Intel(R) Xeon(R) Silver 4114 CPU cores (2.20 GHz) and 39 GB RAM. We implement our model and training using PyTorch Lightning and utilize methods from multiple open source repositories. 1,2,3,4 Training our deep learning model on this system takes cca. 50-60 min, while inference on 100 images takes 12.3 sec. We use the same hardware for inference on the test set.

Evaluation metrics
This chapter presents the metrics and comparison baselines we used to evaluate the proposed segmentation method.
We evaluate the performance of our segmentation method based on two metrics, Dice score and Hausdorff distance (DH) computed between corresponding predicted contours and expert annotations. For a single sample with contours from multiple annotators we compare the same prediction to each expert's contour. For computing the Dice score we convert ground truth contours to segmentation masks and compare these against predicted segmentation masks. Computing the Hausdorff distance requires predicted contours, which we obtain by fitting an epi-and endocardial contour on the predicted segmentation mask. We explain the applied contour fitting pipeline in the next paragraph. In addition to these metrics, we calculate myocardial T1 and T2 mapping values based on the contours predicted by our model and compare to those derived from expert-determined contours.
For fitting contours on predicted segmentation masks, we rely on an algorithm by Teh and Chin (29). However, to acquire accurate and robust contours, we apply further processing steps, as illustrated on Figure 4. Inaccurate predictions can include multiple distinct regions of the same class, which produce multiple epi-or endocardial contours on a single mapping. From these we select the ones which are "not too elongated" (have a less than 1:3 with-height ratio) and select the one with the largest area. On some mapping images the myocardium appears very thin, resulting in C shaped endocardial contours, which we correct by replacing the contour with its convex hull. Note, that we only perform this replacement if the contour is not fully closed ("C-shaped"), which we determine based on the difference between the contour's and its convex hull's area. As a last step, we interpolate the fitted contour with a B-spline (30) to improve the accuracy of the Hausdorff distance and Mean Surface Distance metric computation.

Implementation: experiment design
We compute evaluation metrics on T1 and T2 mapping images of our test set for which contours are available from 3 expert annotators. The train and the test partitions are disjunct sets. As our aim is to develop a segmentation method for T1 and T2 parametric maps, we only use these during evaluation, while during training we also use images that were used to compute the parametric maps. We compute metrics on each slice and mapping type (T1 or T2) separately and report the distributions and cumulate results of these.
In order to evaluate the statistical significance of differences in segmentation metrics of our model and the agreement of experts, we employed one-way Analysis of Variance (ANOVA) with Welch's correction and the Tukey Post-Hoc Test, considering a significance level of p , 0:05. We used Kruskal-Wallis non-parametric one-way ANOVA test to evaluate the statistical significance of differences in T1 and T2 values and used Dwass-Steel-Critchlow-Fligner test for post-hoc pairwise comparisons. We indicate the results of these tests in Tables 2 and  3 to 5 with identical letters in superscript for results that are not significantly different. We used the jamovi statistical software for these tests (31). Figure 5 and Table 2 show our primary results. On Figure 5, both metrics on epi-and endocardial contours indicate that our automated segmentation method performs comparable to how expert annotators agree, as the distributions and even standard deviation bands overlap. The Dice score and IoU score both show 5% difference between our model and experts' agreement (see Table 2).

General segmentation accuracy
The similar epi-and endocardial Hausdorff distance distribution of expert's agreement results indicates that experts can label epi-and endocardial contours equally accurately. This observation is in-line with clinical experience. In contrast, the epicardial Dice score shows approximately 15% lower agreement between experts compared to endocardial results. This can be explained by the difference in how we compute epi-and endocardial Dice score. We compute the epicardial Dice score on the myocardium, which is a region with ring-like topology, while the endocardial Dice is Frontiers in Cardiovascular Medicine computed on the closed disk-like region of the left ventricle. The sensitivity of the Dice score to similar errors on ring-like regions is higher than on closed disks, which can explain the lower epicardial Dice results. Therefore, we argue that our automated tool performs comparable on epicardial contours as well.
In Table 3, we present our results separately for T1 and T2 mapping images. All metrics indicate that our model performs better on T1 mapping than on T2 mapping, while metrics for experts indicate greater agreement on T2 mapping. Based on our comparison of expert agreements with our model, we observe that the model is closest to expert accuracy on T1 mapping, while the gap between model and expert accuracy is greater on T2 mapping. It can be explained by the fact that T1 mapping images are often noisier than T2 mapping images, which makes it more difficult for experts to determine epi-and endocardial contours with accuracy, but our method can learn to invariantly locate them despite these noises.   (30), then sampling it between the initial points. This step improves the numeric accuracy of contour-based metric computation (e.g. Hausdorff distance) and also enforces a smooth property on the fitted contours, similarly as the ones of expert annotators. On subfigure (A), steps indicated with dashed outline are only performed if necessary: -Filtering of contours is only necessary if the segmentation network incorrectly predicts multiple disjoint epi-or endocardial regions. We select the contour with the largest area and an approximately round shape (have a less than 1:3 with-height ratio) -Correcting a contour using its convex hull is necessary only in rare cases, when the segmentation mask includes a C-shaped region (e.g. the light blue region on (D)) In such case he fitted contour will inherit this C-shaped topology (see the red crosses on the figure). We can simply improve such contour by replacing in with its convex hull (green line on subfigure D)). For each metric, differences are statistically significant (p , 0:001, also denoted with different lettering in superscripts). * indicates higher-the-better metrics, while metrics + are lower-the-better. For each metric and mapping type (T1 or T2), differences are statistically significant (p , 0:001, also denoted with different lettering in superscripts). * indicates higher-the-better metrics, while metrics + are lower-the-better.  Figure 6 show examples for apical, midventricular and basal slices respectively. We include Figure 7 to illustrate the different levels of agreement between two experts as a reference. Figure 8 and Table 4 shows results separately for the three slices used in this study, and Table 1 the number of training and test samples for each slice. Both the Dice score and Hausdorff distance distributions show that the proposed automated method performs comparable to how experts agree on basal and midventricular slices. However, the lower apical Dice score highlights that apical slices are more challenging even for experienced annotators. Also, our deep learning method shows a higher accuracy degradation on apex slices than experts do.

T1 and T2 mapping values
Parametric T1 and T2 mapping images allow medical practitioners to diagnose cardiac diseases based on quantitative measures. The primary practical aim of our research is to develop methods that automate the identification of epi-and endocardial contours and thereby simplify the process of measuring myocardial T1 and T2 values. To assess our method's ability to aid T1 and T2 value measurement, we calculate myocardial T1 and T2 mapping values based on the contours predicted by our model and compared to those derived from expert-determined contours ( Table 5). Our model's median T1 and T2 values were 1041.73 ms (IQR: 110.53 ms) and 56.24 ms (IQR: 13.2 ms) respectively. These values were comparable to those obtained by each observer (median T1 and T2 values across all observers were 1026.41 and 50.96 ms respectively). The relatively close interquartile ranges (IQRs) suggest good overall agreement between our model and human observers, indicating the robustness of our method. This is further supported by median T1 values not differing significantly between our automated method and experts.
The intraclass correlation coefficient for the average ratings across observers was 0.90 and 0.81 for T1 and T2 mapping, respectively, indicating good reliability. The 95% confidence interval for the ICC was 0.87 to 0.93 for T1 and 0.75 to 0.86 for T2.

Network architecture
To investigate the effectiveness of our proposed network architecture, we compare our results to a U-Net that follows the original architecture proposed by Ronneberger et al. (22). Data for training and testing is identical to that used in the rest of our work. Table 6 shows that the proposed architecture (using a residual encoder and a lightweight decoder) outperforms the original U-Net architecture, while also offering faster training and inference times.

Pretrained encoder
We conducted experiments with initializing the encoder's weights randomly and from a pretrained ImageNet model. Table 7 shows that initializing the encoder's weights from an ImageNet pretrained encoder provides 5% in terms of Dice score FIGURE 5 Violin plots showing the results of pairwise comparison of our automated method (pink halves) to expert annotators (green halves). Horizontal dashed lines show median and Q1, Q3 quartiles. Violin plots are cut at the bounds of underlying data. Note the logarithmic y-axis on the Hausdorff Distance. We plot epi-and endocardial results separately to allow for accurate reporting of Dice score results (plotting the mean of epi-and endocardial dice score would hide the difference in these two metrics). over random initialization. All further inspected metrics were also superior for the pretrained approach.

Post-processing
Ablation experiments are conducted to verify that each post-processing step we propose in Section 3.6 improves the accuracy of the fitted contours. Following the contour fitting algorithm (29) we add each post-processing step one-by-one and report the mean Hausdorff distance results of the same model in Table 8. Additionally, we include the distribution of Hausdorff distance results over all samples for each postprocessing step in the Appendix, Figure A1. Since the Dice score is independent of the contour-fitting pipeline, we report only the Hausdorff distance results. We observe that each post-processing step improves the fitted contour accuracy. The most significant improvement is achieved by the interpolation step. In particular, this is due to the fact that we calculate the Hausdorff distance for the discrete points of the predicted and ground truth contours. Due to the interpolation, Hausdorff distance computations are considerably more accurate. In cases where the predicted segmentation map includes additional incorrectly labelled pixels, the fitted contours must be filtered to select the one that is most likely the correct one. This is accomplished through the filtering step, which results in a significant improvement in the Hausdorff distances. Finally, the correction step based on the convex hull of the predicted contour improves contours in a few cases and therefore improves the overall accuracy of the fitted contours. Predicted and ground truth contours with low, intermediate and high Hausdorff distance results, selected separately for apical, midventricular and basal T1 maping images.

Discussion
This single center study based on 262 heterogeneous CMR cases may have important implications into clinical practice by providing high accuracy automatic segmentation for clinical use. Our algorithm can be implemented into the everyday clinical workflow after rigorous external validation. Automated contour prediction using our tool takes 12.3 sec for 100 slices on average, FIGURE 7 Example contours from two experts with different levels of agreement (measured by Hausdorff distance), plotted for two observers as a reference. All subfigures show apical T1 maps. Per-slice violin plots comparing the agreement of our model with experts to agreement among experts. Table 1 shows the number of mapping images we used for training and evaluation (testing). Violin plots were generated on the test set. Frontiers in Cardiovascular Medicine while this requires approx. 100 min for experts. The automated segmentation algorithm will contribute to the optimization and time efficiency of mapping post-processing, which is still unsolved by most commercially available tools. Our segmentation method provided reasonably good results for apical segments which are notoriously challenging. This heterogeneous data source will help maintain the stability of the results across CMR vendors and mapping sequences. Future research looking into the clinical validity of global mapping values derived from the implementation of this code will later permit automated segmentation of long axis images. Our study's novelty resides in three significant aspects. Firstly, we leveraged a U-Net based deep learning architecture, which, while widely used in cine CMR image segmentation tasks (6), has not been explored in T1 and T2 mapping segmentation. Our work optimizes the U-Net architecture for this specific application, along with tailored data processing techniques applied in the model training phase. Secondly, the segmentation of T1 and T2 mapping images presents unique challenges, as it contains more detailed information about tissue properties and, thus, slight discrepancies in segmentation can result in notably different measurements. Our study is one of the first to address these distinct challenges of mapping image segmentation (7). Finally, we augmented the robustness and generalizability of our segmentation tool by training and validating it on a diverse dataset comprising both clinical patients and healthy volunteers. This heterogeneous dataset ensures that our model is effective and applicable across various cardiovascular conditions. Our tool holds significant implications for diagnosing and managing a variety of cardiovascular diseases by enabling faster quantitative analysis of critical diagnostic markers, potentially saving clinicians and researchers significant time and labor. Envisioned as an integral part of existing clinical workflows, our model's key goal is to serve as an inline segmentation tool on CMR scanners. This integration would allow real-time analysis, facilitating a more streamlined diagnostic workflow, and enabling radiographers and physicians to adjust scanning protocols onthe-go if necessary, thus significantly enhancing patient care.
Our method shows very similar mid-ventricular and basal performance and the notably worse apical segmentation performance. We would like to emphasize multiple factors contributing to these results, as they contradict findings of previous studies reporting cine segmentation tools (6). First, our training dataset contains a lower number of labelled Apical slices (see Table 1). Second, the segmented region is the smallest on apical slices and the resolution of these areas For each metric and each slice, differences are statistically significant (p , 0:001, also denoted with different lettering in superscripts). * indicates higher-the-better metrics, while metrics + are lower-the-better.  The decoder is always initialized randomly. * indicates higher-the-better metrics, while metrics + are lower-the-better.

Hausdorff distance (mm)
No post-processing (only contour-fitting by (29) is the worst, which makes this region challenging both for our model and experts. Third, in our center we take particular care during mapping slice planning, and we aim to ensure that the basal slice does not contain the aortic outflow tract. This planning approach reduces the complexity of the basal slices in our dataset, allowing our algorithm to perform more effectively in these regions. Finally, the blood-myocardium contrast in mapping images is inherently different from that in cine images, which are commonly used for most segmentation tool development. Our algorithm addresses the unique challenges posed by this contrast difference, demonstrating its ability to handle these characteristics and provide efficient, accurate segmentation. The potential of our tool to be incorporated in clinical practice is also highlighted by our results on estimating T1 and T2 values values. Non-significant differences between T1 values obtained using our tool and those obtained by experts indicated that our model's accuracy is comparable to that of experts on T1 mapping images.
Our automated model yielded a slightly elevated median T2 value of 56 ms, which can be attributed to multiple factors. Notably, our training and testing datasets encompassed a diverse set of patients with varying cardiac conditions, including myocarditis, post-heart transplant, and hypertrophic cardiomyopathy. Each of these conditions is known to potentially elevate T2 values due to the presence of myocardial edema or fibrosis. The borderline increased T2 values observed in both our training (median 52 ms, IQR 7 ms) and testing sets (median 51 ms, IQR) as provided by expert annotation confirms this assertion. It is important to clarify that our model was designed with an emphasis on segmentation accuracy, rather than specifically trained on predicting T2 values. Despite this, the input data, with its broad pathological range, may have indirectly influenced the resulting T2 values.
One of the challenges with automated models is the potential for segmentation errors. Even minor discrepancies in segmentation can significantly influence the derived T2 values. In our case, if the model tends towards over-segmentation, it might include areas affected by partial volume effect that is excluded by the expert readers, thereby inflating the computed T2 values. Our finding highlight the need for stringent quality control procedures when deploying automated tools in clinical scenarios. Such measures can help ensure segmentation accuracy, providing more reliable T2 mapping values. Furthermore, it underscores the necessity of ongoing research to understand these discrepancies better, to continuously refine our model (for example through further optimizing for mapping values), and to enhance its clinical utility in a diverse patient population. This is a relatively small, single center study using CMR data from a single vendor and scanner, which prohibits the generalizability of our results. Of note all study participants are of Caucasian ancestry, therefore we could not test the fairness of our algorithm in terms of a more diverse ethnic background. Moreover, the dataset is biased in gender ( 71% male vs 29% women), which might also affect our outputs. To maximally preserve patient privacy, we removed all metadata, such as sex, age, ethnicity and body size measure, from our dataset. As a result, we did not analyze our method's segmentation accuracy with respect to variables. We believe that different types of diseases (e.g., HCM, DCM) have a much greater influence on the efficiency of our segmentation model, however, to ensure optimal clinical implementation, future works should investigate the effect of these variables on the accuracy of our model.
Deep learning models could be improved in many cases by adding more data and building bigger models, according to the power law of deep learning (32). For much more general use, the inclusion of additional data (from other sites, scanners, field strength) would be essential. It is the theoretical best error, that can be reached by exploiting the information content of the data, namely T1 and T2 mapping images in this case. It is our belief that the consensus labeling is a good approximation of this region. Although the current discrepancy between the consensus label and the AI-based model is small, it may be further reduced by incorporating more data in the training phase.
Moreover, the limitation of the deep learning approach are the following: † Anamnesis or any additional information describing the patients are not used by our segmentation method, therefore it cannot incorporate these into its predictions. † Our approach is based on a simple 2D segmentation network, therefore it is unable to learn and reason based on 3D structure of the hearth. † In this study we do not conduct confidence estimation, or apply an explainability method to the applied automated method. † We ran experiments with a single model. Introducing other architectures (e.g. vision transformers (33)) and massive hyperparameter optimization (34) is the subject of future research. † Producing segmentation masks from ground truth contours results in some information loss, because contours are given FIGURE 9 Distribution of achievable best Hausdorff distance values. We measure achievable lowest contour accuracy for each mapping by converting ground truth contours to mask, fitting contours to these, then computing Hausdorff distance between these and the initial ground truth contour. Frontiers in Cardiovascular Medicine more accurately than the resolution of the segmentation masks. This leads to noticeable inaccuracy of contours fitted ground truth segmentation masks, which presents the best achievable accuracy for our model. We can measure this value for each mapping by converting ground truth contours to mask, fitting contours to these, then computing Hausdorff distance between these and the initial ground truth contour. The distribution of the best possible values is plotted on Figure 9.

Conclusion
In this study, we successfully developed and demonstrated the feasibility of a deep learning model for the segmentation of cardiac T1 and T2 mapping images. Our AI-based model, trained on over 7,000 raw mapping images from 262 participants, showed minimal discrepancies when compared with expert consensus, affirming its accuracy and reliability. The challenges encountered during the development of our model underlined the complexity of CMR T1 and T2 mapping segmentation. Our research underscored the need for future work to not only focus on achieving accurate segmentation, but also on ensuring the accurate measurement of tissue properties. The promising initial results from our model suggest potential for its integration directly into scanner systems. This prospect could significantly streamline clinical routines and enhance the diagnosis and management of cardiovascular diseases. Ultimately, our work represents a significant step towards the development of real-time, automated tools for CMR mapping analysis.

Data availability statement
The source code can be accessed on the following location: https://github.com/BME-SmartLab/CardiacMappingSeg. Raw data are available upon request to the corresponding author.

Ethics statement
The studies involving human participants were reviewed and approved by National Public Health Center in accordance with the ethical standards laid out in the 1964 Declaration of Helsinki and its later amendments (approval ID 55262-6/2020/EUIG). All participants gave their written informed consent. Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.  Predicted and ground truth contours with low, intermediate and high Hausdorff distance results, selected separately for apical, midventricular and basal T2 maping images.

FIGURE A3
Example contours from two experts with different levels of agreement (measured by Hausdorff distance), plotted for two observers as comparison baseline to Figure A2. All subfigures show apical T2 maps.