Improving the generalizability of convolutional neural network-based segmentation on CMR images

Convolutional neural network (CNN) based segmentation methods provide an efficient and automated way for clinicians to assess the structure and function of the heart in cardiac MR images. While CNNs can generally perform the segmentation tasks with high accuracy when training and test images come from the same domain (e.g. same scanner or site), their performance often degrades dramatically on images from different scanners or clinical sites. We propose a simple yet effective way for improving the network generalization ability by carefully designing data normalization and augmentation strategies to accommodate common scenarios in multi-site, multi-scanner clinical imaging data sets. We demonstrate that a neural network trained on a single-site single-scanner dataset from the UK Biobank can be successfully applied to segmenting cardiac MR images across different sites and different scanners without substantial loss of accuracy. Specifically, the method was trained on a large set of 3,975 subjects from the UK Biobank. It was then directly tested on 600 different subjects from the UK Biobank for intra-domain testing and two other sets for cross-domain testing: the ACDC dataset (100 subjects, 1 site, 2 scanners) and the BSCMR-AS dataset (599 subjects, 6 sites, 9 scanners). The proposed method produces promising segmentation results on the UK Biobank test set which are comparable to previously reported values in the literature, while also performing well on cross-domain test sets, achieving a mean Dice metric of 0.90 for the left ventricle, 0.81 for the myocardium and 0.82 for the right ventricle on the ACDC dataset; and 0.89 for the left ventricle, 0.83 for the myocardium on the BSCMR-AS dataset. The proposed method offers a potential solution to improve CNN-based model generalizability for the cross-scanner and cross-site cardiac MR image segmentation task.


I. INTRODUCTION
Automatic cardiac segmentation algorithms provide an efficient way for clinicians to assess the structure and function of the heart from cardiac magnetic resonance (CMR) images for the diagnosis and management of a wide range of abnormal heart conditions [1]. Recently, convolutional neural network (CNN)-based methods have become state-of-the-art techniques for automated cardiac image segmentation [2,1]. However, related work [3] has shown that the segmentation accuracy of a CNN may degrade if the network is directly applied to images collected from different sites or scanners. For instance, CMR images coming from different sites may comprise different population demographics in terms of cardiovascular diseases, resulting in clinically appreciable differences not only in cardiac morphology but also in image quality (e.g. irregular heartbeat can affect image quality) [4,5,6]. Thus, a CNN learned from a limited dataset may not be able to generalise over subjects with heart conditions outside of the training set. In addition, images from different scanners using different acquisition protocols can exhibit differences in terms of noise levels, image contrast, and resolution [7,8,9]. All these differences pose challenges for deploying CNN-based image segmentation algorithms in real-world practice, as illustrated in Fig. 1.
In general, a straightforward way to address this problem is to fine-tune a CNN learned from one dataset (source domain) with additional labelled data from another dataset (target domain). Nevertheless, collecting sufficient pixel-wise labelled medical data for every scenario can be difficult, since it requires domain-specific knowledge and intensive labour to perform manual annotation. To alleviate the labelling cost, unsupervised deep domain adaptation (UDDA) approaches have been proposed [10]. Compared to fine-tuning, UDDA does not require labelled data from the target domain. Instead, it only uses either feature-level information [11,12,13] or image-level information [13] to optimize the network performance on the target domain. However, these methods usually require hand-crafted hyper-parameter tuning for each scenario, which may be difficult to scale to highly heterogeneous datasets. Therefore, it is of great interest to explore how to learn a network that can be successfully applied to other datasets without the requirement of additional model tuning.
In this paper, we investigate the possibility of building a generalisable model for cardiac MR image segmentation, given a training set from only one scanner in a single site. Instead of fine-tuning or adapting to get a new model for each particular scenario, our goal is to find a generalisable solution that can analyse 'real-world' test images collected from multiple sites and scanners. These images consist of various pathology and cardiac morphology that may not be present in the training set, reflecting the complexity of a real-world clinical setting. To achieve this goal, we choose the U-Net [14] as the fundamental CNN architecture and apply it to segment the cardiac anatomy from CMR images (short-axis view), including the left ventricle (LV), the myocardium (MYO), and the right ventricle (RV). An image pre-processing pipeline is proposed to normalise images across sites before feeding them to the network in both training and testing stages. Data augmentation is employed during the training to improve the generalization ability of the network.

Fig. 1: Illustration of a cross-domain CMR image segmentation application. A CNN model which has been trained using a dataset collected from site A (source domain) is deployed onto data from other sites (target domains) to segment the left ventricle, the myocardium and the right ventricle from CMR images. In general, the model can perform well on test images from the same domain. However, whether this model can generalise well onto other sites is unknown. For example, the training set may have limited pathological cases, which may leave the model unable to generalise over subjects with heart conditions outside of the training set. In addition, images from different scanners may have different image appearance because of different imaging acquisition protocols. Both these differences pose challenges to applying a CNN-based cardiac segmentation model to everyday clinical practice.

arXiv:1907.01268v2 [eess.IV] 3 Jul 2019
Although a number of works [15,16] have already applied data normalization and data augmentation in their pipelines, these methods are designed for one specific dataset, and the importance of data augmentation for model generalization across datasets is less explored. Here we demonstrate that the proposed data normalization and augmentation strategies can greatly improve the model performance in the cross-dataset setting (section IV-B). The main contributions of the work are as follows:
• To the best of our knowledge, this is the first work to explore the generalizability of CNN-based methods for cardiac MR image multi-structure segmentation, where the training data is collected from a single scanner, but the test data comes from multiple scanners and multiple sites.
• The proposed pipeline, which employs data normalization and data augmentation (section III-C), is simple yet efficient and can be applied to training and testing of many state-of-the-art CNN architectures to improve the model segmentation accuracy across domains without necessarily sacrificing the accuracy in the original domain. Experimental results show that the proposed segmentation method is capable of segmenting multi-scanner, multi-vendor and multi-site datasets (sections IV-C and IV-D).
• Our work reveals that significant cardiac shape deformation caused by cardiac pathologies (section IV-E), low image quality (section IV-E), and inconsistent labelling protocols among different datasets (section V) are still major challenges for generalising deep learning-based cardiac image segmentation algorithms to images collected across different sites, which deserve further study.
II. RELATED WORK
A great number of works have developed sophisticated deep learning approaches to perform CMR image segmentation tasks on a specific dataset [16,1,15,3]. While these models can achieve overall high accuracy on samples from the same dataset, only a few have been explored in cross-dataset settings. Table I shows a list of related works that demonstrate the segmentation performance of their proposed method by first training a model on one set (source domain) and then testing it on other datasets (target domain). However, these approaches require re-training or fine-tuning to improve the performance on the target domain in a fully supervised fashion. To the best of our knowledge, few studies reported in the literature investigate the generalization ability of cardiac segmentation networks that can directly work across multiple sites. One recent work in this line of research [18] integrates training samples from multiple sites and multiple vendors to improve segmentation performance across sites. Their results show that the best segmentation performance on their multi-scanner test set was achieved when the data used for training and testing are from the same scanners. Nevertheless, their solution requires collecting annotated data from multiple vendors and sites. For deployment, this may not always be practical because of the high data collection and labelling costs as well as data privacy issues.
Another direction to improve model generalization is to optimize the CNN architecture. In [17], the authors proposed a novel network structure with residual connections to improve the network generalizability. They argued that networks with a large number of parameters may easily suffer from over-fitting with limited data [17]. They showed that their light-weight network trained on a limited dataset outperformed the U-Net [14], achieving higher accuracy on the LV, myocardium, and RV. Moreover, model generalization was demonstrated by testing it (without any re-training or fine-tuning) on the LV-2011 dataset [19]. This model produced comparable results to those from a network that had been trained on the LV-2011 dataset, achieving a high mean Dice score for the myocardium (0.84). However, because of the lack of RV labels in their test set, their network's performance on the RV is unclear. Segmenting the RV is considered to be harder than the LV and the myocardium because the RV has a more complex shape with higher variability across individuals, and its walls are thinner, making it harder to delineate from its surroundings.
In this study, we evaluate the generalizability of the proposed method not only on cardiac left ventricle segmentation but also on right ventricle segmentation. Different from [18,17], the proposed method demonstrates model generalizability in a more challenging but realistic setting: our training data was collected from only one scanner (most of the subjects are healthy), while the test data was collected from various unseen sites and scanners and covers a wide range of pathologies, reflecting the spectrum of clinical practice.

A. Data
Three datasets are used in this study and their general descriptions are summarised in Table II.

UK Biobank dataset: The UK Biobank (UKBB) is a large-scale data set that is open to researchers worldwide who wish to conduct a prospective epidemiological study. The UKBB study covers a large population, which consists of over half a million voluntary participants aged between 40 and 69 from across the UK. In addition, the UKBB study performs comprehensive MR imaging for nearly 100,000 participants, including brain, cardiac and whole-body MR imaging. An overview of the cohort characteristics can be found on the UK Biobank's website [20]. All CMR images we used in this study were collected from one 1.5 Tesla scanner (MAGNETOM Aera, syngo MR D13A, Siemens, Erlangen, Germany). Detailed information about the imaging protocol can be found in [21]. Pixel-wise segmentations of three essential structures (LV, MYO and RV) for both end-diastolic (ED) frames and end-systolic (ES) frames are provided as ground truth [22]. Subjects in this dataset were annotated by a group of eight observers and each subject was annotated only once by one observer. After that, visual quality control was performed on a subset of data to assure acceptable inter-observer agreement.
ACDC dataset: The Automated Cardiac Diagnosis Challenge (ACDC) dataset is part of the MICCAI 2017 benchmark dataset for CMR image segmentation. This set is composed of 100 patients that were evenly divided into 5 classes: dilated cardiomyopathy (DCM), hypertrophic cardiomyopathy (HCM), myocardial infarction with altered left ventricular ejection fraction (MINF), abnormal right ventricle (ARV) and patients without cardiac disease (NOR). Detailed information about the classification rules and the characteristics of each group can be found in [1] and the ACDC website [23]. All images were collected from one hospital in France. The LV, MYO and RV in this dataset have been manually segmented for both ED frames and ES frames. Images in this dataset were labelled by two cardiologists with more than 10 years of experience [24].
BSCMR-AS dataset: The British Society of Cardiovascular Magnetic Resonance Aortic Stenosis (BSCMR-AS) dataset from [25] consists of CMR images of 599 patients with severe aortic stenosis (AS), who had been listed for surgery. Images were collected from six hospitals across the UK, using 9 types of scanner (see Table II). Although the primary pathology is AS, several other pathologies coexist in these patients (e.g. coronary artery disease, amyloid) and have led to a variety of cardiac phenotypes including left ventricular hypertrophy, left ventricular dilatation and regional infarction [25]. A significant amount of diversity in image appearance and image contrast can be observed in this dataset. Different from the above two data sets, images in this dataset are partially labelled. Only the left ventricle in ED frames and ES frames, as well as the myocardium in ED frames, have been annotated manually. The contours on each slice were refined by an expert.

In this study, following the same data splitting strategy as in [3], we split the UKBB dataset into three subsets containing 3975, 300 and 600 subjects, respectively. Specifically, 3975 subjects were used to train the neural network, while 300 validation subjects were used for tracking the training progress and avoiding over-fitting. The remaining 600 subjects were used for evaluating the models' performance. In addition, we directly test this trained network on the other two unseen datasets without any further re-training or fine-tuning process. The diversity of pathology observed in the ACDC dataset and the diversity of scanners and cardiac morphologies in the BSCMR-AS dataset make them ideal test sets for evaluating the proposed method's segmentation performance across sites.

B. Network Architecture
In this paper, the U-Net architecture [14] is adopted to perform the cardiac multi-structure segmentation task since it is one of the most successful and commonly used architectures for biomedical segmentation. The structure of our network is illustrated in Fig. 2. The network structure is the same as the one proposed in the original paper [14], except for two main differences. One is that we apply batch normalization (BN) [26] after each hidden convolutional layer to stabilise the training. The other difference is that we apply dropout regularization [27] after each concatenating operation to avoid over-fitting and encourage generalization.
While both 2D U-Net and 3D U-Net architectures can be used to solve volumetric segmentation tasks [28,15], we opt for the 2D U-Net for several reasons. Firstly, performing segmentation tasks in a 2D fashion allows the network to work with images even if they have different slice thicknesses or have severe respiratory motion artefacts between the slices (which is not uncommon). Secondly, 3D networks require many more parameters than 2D networks. Therefore, it is more memory-consuming and time-consuming to train a 3D network than a 2D one. Thirdly, the manual annotations for images in the three datasets were done in 2D (slice-by-slice) rather than 3D. Thus, it is natural to employ a 2D network rather than a 3D network to learn segmentation from those 2D labels.

C. Training and Testing Pipeline
Since training images and testing images in this study were collected from various scanners, it is vital to normalise the input images before feeding them into the network. Fig. 3 shows an overview of the pipeline for image pre-processing during training and testing. Specifically, we employ image resampling and intensity normalization to normalise images in both the training and testing stages while online data augmentation is applied for improving the model generalization ability during the training process.
1) Image Resampling: Since the size of the heart in images with different resolution can vary significantly, it is essential to perform image resampling both in the training and testing phases before cropping such that the proportion of the heart and the background is relatively consistent for segmentation. The pixel spacing in the test image sets ranges from 0.78 to 2.33 mm whereas our training dataset from a single scanner has a uniform pixel spacing of 1.8 mm. Therefore, we choose a median value of 1.25 mm for resampling. All images are resampled to 1.25 × 1.25 mm across short-axis slices. After image resampling, data augmentation is applied to increase the variety of the training set in order to avoid over-fitting and encourage model generalization.
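The resampling step can be illustrated with a minimal NumPy sketch. It assumes bilinear interpolation on a single 2D slice and a (row, column) spacing tuple in mm; the function name `resample_slice` and the interpolation scheme are illustrative assumptions, since the text only states the 1.25 × 1.25 mm target spacing.

```python
import numpy as np

TARGET_SPACING_MM = 1.25  # in-plane target spacing stated in the text


def resample_slice(image, spacing_mm, target_mm=TARGET_SPACING_MM):
    """Bilinearly resample a 2D slice from `spacing_mm` to `target_mm` spacing.

    Hypothetical helper: the interpolation scheme is an assumption.
    """
    image = np.asarray(image, dtype=np.float64)
    rows, cols = image.shape
    # The physical extent is preserved, so the pixel count scales with the
    # ratio between the original and the target spacing.
    new_rows = max(1, int(round(rows * spacing_mm[0] / target_mm)))
    new_cols = max(1, int(round(cols * spacing_mm[1] / target_mm)))

    # Centres of the output pixels expressed in input pixel coordinates.
    r = np.clip((np.arange(new_rows) + 0.5) * rows / new_rows - 0.5, 0, rows - 1)
    c = np.clip((np.arange(new_cols) + 0.5) * cols / new_cols - 0.5, 0, cols - 1)
    r0, c0 = np.floor(r).astype(int), np.floor(c).astype(int)
    r1, c1 = np.minimum(r0 + 1, rows - 1), np.minimum(c0 + 1, cols - 1)
    wr, wc = (r - r0)[:, None], (c - c0)[None, :]

    # Interpolate along columns first, then along rows.
    top = image[r0][:, c0] * (1 - wc) + image[r0][:, c1] * wc
    bottom = image[r1][:, c0] * (1 - wc) + image[r1][:, c1] * wc
    return top * (1 - wr) + bottom * wr
```

With this sketch, a 100 × 100 slice at 1.8 mm spacing becomes 144 × 144 pixels, so the heart occupies a comparable number of pixels regardless of the scanner's native resolution. Label maps would need nearest-neighbour rather than bilinear interpolation so that class indices are not blended.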
2) Data Augmentation: Data augmentation has been widely used when training convolutional neural networks for computer vision tasks on natural images. While different tasks may have different domain-specific augmentation strategies, the common idea is to enhance the model's generalization by artificially increasing the variety of training images so that the training set distribution is closer to the test set population in the real world. In this study, the training dataset is augmented in order to cover a wide range of geometrical variations in terms of the heart pose and size. To achieve this goal, we apply:
• random horizontal and vertical flips with a probability of 0.5 to increase the variety of image orientation.
• random rotation to increase the diversity of the heart pose. The range of rotation is determined by a hyperparameter search process. As a result, each time, the angle for augmentation is selected from [−30, +30] degrees.
• random image scaling with a scale factor s ∈ [0.7, 1.4] to increase variations of the heart size.
• random image cropping. The random cropping implicitly performs random shifting to augment data context variety without black borders. This operation also crops images to sizes accepted by the network structure.
Note that cropping is done after all other image augmentations. Finally, all images are cropped to the same size of 256 × 256 before being sent to the network.
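The four augmentations listed above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the function names are hypothetical, nearest-neighbour sampling is used for both image and label to keep the code short, pixels mapped from outside the image are filled with zeros, and the input slice is assumed to be at least 256 × 256 pixels.

```python
import numpy as np


def random_affine_params(rng):
    """Draw flip, rotation and scale parameters from the ranges given in the text."""
    return {
        "flip_h": rng.random() < 0.5,
        "flip_v": rng.random() < 0.5,
        "angle_deg": rng.uniform(-30.0, 30.0),
        "scale": rng.uniform(0.7, 1.4),
    }


def warp_nearest(array, angle_deg, scale):
    """Rotate and scale about the image centre with nearest-neighbour sampling."""
    rows, cols = array.shape
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    yy, xx = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    # Inverse mapping: for every output pixel, locate its source pixel.
    y, x = (yy - cy) / scale, (xx - cx) / scale
    src_y = np.rint(cos_t * y + sin_t * x + cy).astype(int)
    src_x = np.rint(-sin_t * y + cos_t * x + cx).astype(int)
    inside = (src_y >= 0) & (src_y < rows) & (src_x >= 0) & (src_x < cols)
    out = np.zeros_like(array)  # assumption: zero fill outside the image
    out[inside] = array[src_y[inside], src_x[inside]]
    return out


def augment(image, label, rng, crop=256):
    """Apply one random flip/rotate/scale/crop to an image and its label map."""
    p = random_affine_params(rng)
    if p["flip_h"]:
        image, label = image[:, ::-1], label[:, ::-1]
    if p["flip_v"]:
        image, label = image[::-1, :], label[::-1, :]
    # The same geometric transform is applied to the image and the label.
    image = warp_nearest(image, p["angle_deg"], p["scale"])
    label = warp_nearest(label, p["angle_deg"], p["scale"])
    # Cropping comes last, as in the text; it also acts as a random shift.
    top = rng.integers(0, image.shape[0] - crop + 1)
    left = rng.integers(0, image.shape[1] - crop + 1)
    window = (slice(top, top + crop), slice(left, left + crop))
    return image[window], label[window]
```

A call such as `augment(image, label, np.random.default_rng(0))` returns one augmented 256 × 256 pair. A production pipeline would more likely interpolate the image bilinearly and keep nearest-neighbour sampling only for the label map.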
We also experimented with contrast augmentation [29] (random gamma correction where the gamma value is randomly chosen from a certain range) to increase image contrast variety, but only minor improvements were found in the experiments. Therefore, it is not included in the pipeline. For each cropped image, intensity normalization with a mean of 0 and a standard deviation of 1 is performed.
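The per-image intensity normalization amounts to a z-score transform of each cropped slice. A minimal sketch, assuming NumPy arrays; the small `eps` term is an added safeguard against constant images and is not mentioned in the text.

```python
import numpy as np


def normalise_intensity(image, eps=1e-8):
    """Shift and scale one cropped slice to zero mean and unit standard deviation.

    `eps` is a hypothetical safeguard against division by zero.
    """
    image = np.asarray(image, dtype=np.float64)
    return (image - image.mean()) / (image.std() + eps)
```

Because the statistics are computed per slice, the transform removes global brightness and contrast offsets between scanners without requiring any dataset-level statistics at test time.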
3) Training: After pre-processing, batches of images are fed to the network for training. To track the training progress, we also use a subset (validation set) from the same dataset to validate the performance of the segmentation and to identify possible over-fitting. Specifically, we apply the same augmentation strategy on the validation set and record the averaged accuracy (mean intersection over union between predicted results and ground truth) on it for each epoch. The model with the highest accuracy is selected as the best model. This selection criterion works as early stopping and has the benefit of allowing the network to explore whether there is further opportunity to generalise better before it reaches the final epoch.
4) Testing: For testing, 2D images extracted from volume data are first re-sampled and centrally cropped to the same size as the training images. Again, intensity normalization is performed on each image slice, which is then passed into the network for inference. After that, bilinear upsampling is performed on the outputs of the network to recover the original resolution. Finally, each pixel of the original image is assigned to the class that has the highest probability among the four classes (background, LV, myocardium, RV). As a result, a final segmentation map for one input image is generated.
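Two steps of this pipeline lend themselves to a short sketch: the per-pixel class assignment at test time and the mean intersection over union used for model selection. The class index order and the handling of classes absent from both maps are assumptions made here for illustration; the text does not specify them.

```python
import numpy as np

# Assumed index order; the text lists the four classes but not their indices.
CLASS_NAMES = ("background", "LV", "myocardium", "RV")


def labels_from_probabilities(probs):
    """Assign each pixel to its most probable class.

    `probs` has shape (4, H, W), ordered as in CLASS_NAMES.
    """
    return np.argmax(probs, axis=0)


def mean_iou(pred, truth, num_classes=len(CLASS_NAMES)):
    """Mean intersection over union across classes present in either map."""
    ious = []
    for k in range(num_classes):
        p, t = pred == k, truth == k
        union = np.logical_or(p, t).sum()
        if union > 0:  # assumption: skip classes absent from both maps
            ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else 1.0
```

During training, the epoch whose validation `mean_iou` is highest would be kept as the best model, which matches the selection criterion described above.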

D. Implementation Details
During training, a random batch of 20 2D short-axis slices was fed into the network for each iteration after data pre-processing. The dropout rate for each dropout layer was set to 0.2. In every iteration, the cross entropy loss was calculated to optimize the network parameters through back-propagation. Specifically, the stochastic gradient descent (SGD) method was used during the optimization, with an initial learning rate of 0.001. The learning rate was decreased by a factor of 0.5 every 50 epochs. The method was implemented using Python and PyTorch. We trained the U-Net for 1,000 epochs in total, which took about 60 hours on one NVIDIA Tesla P40 GPU using our proposed training strategy. During testing, the computation time for segmenting one subject is less than a second.
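The learning-rate schedule described above is a step decay, which can be written as a one-line function. The sketch assumes epochs are counted from 0 and that the first halving takes effect at epoch 50; the text does not state the exact indexing.

```python
def learning_rate(epoch, initial_lr=0.001, factor=0.5, step=50):
    """Step decay: multiply the rate by `factor` once every `step` epochs.

    Assumes zero-based epochs (an illustrative convention).
    """
    return initial_lr * factor ** (epoch // step)
```

In PyTorch, the same behaviour is usually obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)`. Under the zero-based convention assumed here, the rate has been halved 19 times by the last of the 1,000 epochs, to roughly 1.9e-9.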

E. Evaluation Metrics
The performance of the proposed method was evaluated using the Dice score (3D version), which was also used in the ACDC benchmark study [1] and in [3]. The Dice score evaluates the overlap between the automated segmentation A and the manual segmentation B, and is defined as:

$$\mathrm{Dice} = \frac{2\,|A \cap B|}{|A| + |B|}$$

The value of a Dice score ranges from 0 (no overlap between the predicted segmentation and its ground truth) to 1 (perfect match).
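A minimal sketch of the 3D Dice score for one structure, assuming integer label volumes of identical shape. Returning NaN when the structure is absent from both masks is a choice made for this illustration; the text does not say how that case is handled.

```python
import numpy as np


def dice_3d(pred, truth, label):
    """Dice overlap for one structure, computed over the whole 3D volume."""
    a = np.asarray(pred) == label   # automated segmentation A
    b = np.asarray(truth) == label  # manual segmentation B
    denom = a.sum() + b.sum()
    if denom == 0:
        return float("nan")  # assumption: undefined when both masks are empty
    return 2.0 * np.logical_and(a, b).sum() / denom
```

For example, if A covers 4 voxels, B covers 6 voxels and they share 4 voxels, the score is 2·4 / (4 + 6) = 0.8.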
We also compared the volumetric measures derived from our automatic segmentation results and those from manual ones (see section IV-F), since they are essential for cardiac function assessment. Specifically, for each manual ground truth mask and its corresponding automatic segmentation mask, we calculated the volumes of the LV and RV at ED frames and ES frames, as well as the mass of the myocardium estimated at ED frames. The myocardial mass around the LV is estimated by multiplying the LV myocardial volume by a density of 1.05 g/mL. After that, Bland-Altman analysis and correlation analysis for each pair were conducted. Of note, for Bland-Altman analysis, we removed the outlying mean values that fall outside the range of 1.5×IQR (interquartile range) in order to avoid the standard deviation of the mean difference being biased by extremely large values. These outliers are often associated with poor image quality. As a result, fewer than 3% of subjects were removed in each comparison.
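The volumetric measures and the Bland-Altman statistics can be sketched as below. Several details are assumptions made for illustration: the function names, the (dz, dy, dx) spacing convention, the reading of the outlier rule as Tukey fences (Q1 − 1.5×IQR, Q3 + 1.5×IQR) applied to the pairwise means, and the use of ±1.96 standard deviations for the limits of agreement.

```python
import numpy as np

MYOCARDIAL_DENSITY_G_PER_ML = 1.05  # density stated in the text


def volume_ml(mask, spacing_mm):
    """Volume of a binary 3D mask in mL, given (dz, dy, dx) voxel spacing in mm."""
    voxel_mm3 = float(np.prod(spacing_mm))
    return float(np.asarray(mask).sum()) * voxel_mm3 / 1000.0  # 1 mL = 1000 mm^3


def myocardial_mass_g(myo_mask, spacing_mm):
    """Myocardial mass in grams from the myocardial volume."""
    return volume_ml(myo_mask, spacing_mm) * MYOCARDIAL_DENSITY_G_PER_ML


def bland_altman(auto, manual, iqr_factor=1.5):
    """Bias and limits of agreement after removing outlying pairwise means."""
    auto = np.asarray(auto, dtype=float)
    manual = np.asarray(manual, dtype=float)
    mean, diff = (auto + manual) / 2.0, auto - manual
    # Assumed outlier rule: Tukey fences on the mean of each pair.
    q1, q3 = np.percentile(mean, [25, 75])
    iqr = q3 - q1
    keep = (mean >= q1 - iqr_factor * iqr) & (mean <= q3 + iqr_factor * iqr)
    diff = diff[keep]
    bias, sd = diff.mean(), diff.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

As a worked example, a mask of 1,000 voxels at 10 × 1.25 × 1.25 mm spacing has a volume of 15.625 mL, which corresponds to a myocardial mass of about 16.4 g.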

IV. RESULTS
We compared the proposed method with the method in our previous work [3] in terms of segmentation accuracy across three sets: the UKBB test set, the ACDC set, and the BSCMR-AS set. In [3], a fully convolutional neural network (FCN) was trained using the same UKBB training set and then tested on the same UKBB test set. This method was specifically designed to automatically segment a large number of scans for the same cohort study with maximum accuracy, whereas the proposed method focuses on improving the robustness of the neural network-based segmentation method (using the same UKBB training set as training data) for data from different domains (e.g. non-UKBB data). The comparison results are shown in Table III. While both models achieve very similar Dice scores on the UKBB test set with high accuracy, the proposed method significantly outperforms the approach proposed in [3] on the two cross-domain datasets: the ACDC set and the BSCMR-AS set. Compared to the results predicted by [3] on the ACDC data, the proposed method achieves higher mean Dice scores for all three structures: LV (0.90 vs 0.81), myocardium (0.81 vs 0.70), and RV (0.82 vs 0.68). On the BSCMR-AS dataset, the proposed method also yields higher average Dice scores for the LV cavity (0.89 vs 0.82) and the myocardium (0.83 vs 0.74). Fig. 4 compares the distributions of Dice scores for the results obtained by the proposed method and the previous work. The boxplots of the proposed method are shorter than those of the previous method and have higher mean values, which suggests that the proposed method achieves comparatively higher overall segmentation accuracy with lower variance on the three datasets.

The myocardium segmentation performance on the BSCMR-AS set was only evaluated on ED frames because of the lack of annotation at ES frames, whereas the performance on the other two datasets was evaluated on both ED and ES frames. For simplicity, Dice scores for the myocardium on the BSCMR-AS set in the following tables were calculated in the same way without further note.
In order to identify what contributes to the improved performance, we further compare the proposed method with [3] in terms of methodology. Two main differences are identified:
• Network structure and capacity. Compared to the U-Net we used in this study, the FCN in [3] has a smaller number of filters at each level. For example, the number of convolutional kernels (filters) in the first layer of the FCN is 16, whereas the number in the U-Net is 64. In addition, in the decoder part, the FCN directly upsamples the feature map from each scale to the finest resolution and concatenates all of them, whereas the U-Net adopts a hierarchical structure for feature aggregation.
• Training strategy in terms of data normalization and data augmentation. Compared to the image pre-processing pipeline in the previous work, the proposed pipeline adopts image resampling and random image flip augmentation in addition to the general data augmentation based on affine transformations.
In order to study the influence of the network structure as well as the data normalization and augmentation settings on model generalizability, extensive experiments were carried out and the results are shown in the next two sections.

A. The Influence of Network Structure and Capacity
To investigate the influence of network structure on model generalization, we trained three additional networks:
• FCN-16: the FCN network presented in [3], which has 16 filters in the first convolutional layer.
• FCN-64: a wider version of FCN where the number of filters in each convolutional layer is multiplied by four.
• UNet-16: a smaller version of U-Net where the number of filters in each convolutional layer is divided by four. Like FCN-16, it has 16 filters in the first layer.
All of them were trained using the same UKBB training set and with the same training hyperparameters. These networks were then compared to the proposed network (UNet-64). Table IV compares the performances of the four different networks over the three different test sets. It can be seen that while there is no significant performance difference among the four networks on the UKBB test set, the small networks (UNet-16 and FCN-16) perform much more poorly than their wider versions (UNet-64 and FCN-64) on the ACDC set (see red numbers in Table IV). This may indicate that in order to accommodate more variety of data augmentation for generalization, the network requires a larger capacity. It is also worth noting that UNet-64 outperforms FCN-64 on all three test sets, even though UNet-64 contains fewer parameters than FCN-64. This improvement may result from U-Net's special architecture: skip connections with its step-by-step feature upsampling and aggregation. The results indicate that the network structure and capacity can affect the segmentation model generalizability across datasets.

B. The Influence of Different Data Normalization and Data Augmentation Techniques
In this section, we investigate the influence of different data normalization and augmentation techniques on the generalizability of the network, including image resampling (data normalization) and scale, flip and rotation augmentation (data augmentation). We focus on these four operations because convolutional neural networks are designed to be translation-equivariant [33] but they are not rotation-equivariant, nor scale- and flip-equivariant [34,35]. This means that if we rotate the input, the networks cannot be guaranteed to produce the same predictions with the corresponding rotation, indicating that they are not robust to geometrical transformations of images. Current methods to improve these networks' ability to deal with rotation/flip/scale variations still heavily rely on data augmentation, while intensity-level differences might be addressed by applying domain adaptation techniques such as style transfer or adaptive batch normalization [36].
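The equivariance argument can be checked numerically on a toy example. The sketch below is an illustration only: it uses a single cross-correlation filter with wrap-around borders (so that translation equivariance holds exactly), an arbitrary asymmetric kernel, and a random 8 × 8 image, none of which come from the paper. Real CNNs with padding and pooling are only approximately translation-equivariant.

```python
import numpy as np


def circular_correlate(image, kernel):
    """2D cross-correlation with wrap-around borders (a toy convolution layer)."""
    out = np.zeros_like(image, dtype=np.float64)
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            # A negative roll brings pixel (r + i, c + j) to position (r, c).
            out += kernel[i, j] * np.roll(image, shift=(-i, -j), axis=(0, 1))
    return out


def shift(x):
    """Circular translation by 2 rows and 3 columns."""
    return np.roll(x, shift=(2, 3), axis=(0, 1))


rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = np.array([[-1.0, 0.0, 1.0]])  # asymmetric horizontal-gradient filter

# Filtering a shifted image equals shifting the filtered image.
translation_equivariant = np.allclose(
    circular_correlate(shift(image), kernel),
    shift(circular_correlate(image, kernel)),
)
# The same check with a flip or a 90-degree rotation in place of the shift.
flip_equivariant = np.allclose(
    circular_correlate(np.fliplr(image), kernel),
    np.fliplr(circular_correlate(image, kernel)),
)
rotation_equivariant = np.allclose(
    circular_correlate(np.rot90(image), kernel),
    np.rot90(circular_correlate(image, kernel)),
)
```

Here only `translation_equivariant` evaluates to `True`. The flip and rotation checks fail because the filter itself is not symmetric under those transforms, which is why such variations have to be covered by augmentation in the training data.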
To investigate the influence of these four operations on model generalization, we trained three additional U-Nets using the UKBB training set. Each of them was trained with the same settings except that one operation was removed. To save computational time for this ablation study, each network was trained for 200 epochs, which still took 10 hours for each network since the training set from the UKBB dataset was considerably large (3,975 subjects). The test results on the UKBB test set, the ACDC dataset, and the BSCMR-AS dataset are shown in Table V. It can be observed that while the results on the test data from the same domain (UKBB) with different settings do not vary much, there are significant differences on the other two test sets, demonstrating the importance of the four operations. For example, image resampling increases the averaged Dice score from 0.673 to 0.783 for the RV segmentation on the BSCMR-AS set, whereas augmentation by scaling improves the mean Dice score from 0.596 to 0.750 for the RV on the ACDC set. The best segmentation performance over the three sets is achieved by combining all four operations. These results suggest that increasing variations regarding pixel spacing (image scale augmentation), image orientation (flip augmentation) and heart pose (rotation augmentation), as well as data normalization (image resampling), can be beneficial for improving model generalisability over unseen cardiac datasets. While one may argue that there is no need to do image resampling if scale augmentation is performed properly during training, we found that image resampling can significantly reduce the complexity of real-world data introduced by heterogeneous image pixel spacings, such that training and testing data are more similar to each other, bringing benefits to both model learning and prediction.
In the following sections, we will use 'UKBB model' to refer to our best model (the U-Net which was trained using the UKBB training set with our proposed training strategy) for the sake of simplicity.

C. Segmentation Performance on Images from Different Types of Scanners
In this section, the UKBB model's segmentation performance is analysed according to different manufacturers (Philips and Siemens) and different magnetic field strengths (1.5 Tesla and 3 Tesla). The results on the two datasets (BSCMR-AS and ACDC) are listed in Table VI. For ACDC data, only the results regarding scans imaged using different magnetic field strengths are reported since these scans are all from Siemens. Furthermore, results in the ACDC dataset with Dice scores below 0.50 are not taken into account for this evaluation. This is because the number of subjects from a 3T scanner in the ACDC dataset is so small (33 subjects) that the averaged performance can be easily affected by only a few cases with extremely low Dice scores. Here, six subjects were excluded. The final results show that the model trained only using 1.5T Siemens data (UKBB data) could still produce similar segmentation performance on other Siemens and Philips data (top two rows in Table VI). Similar results are found on those images acquired from 1.5T scanners and those acquired from 3T scanners (see the bottom four rows in Table VI). This indicates that the proposed method has the potential to train a model capable of segmenting images across various scanners even if the training images are only from one scanner.

D. Segmentation Performance on Images from Different Sites
We also evaluate the performance of the UKBB model across seven sites: one site from the ACDC data and six sites from the BSCMR-AS data. Results are shown in Table VII. No significant difference in LV and myocardium segmentation performance is found among the seven sites (A-G), while the generalization performance for RV segmentation still needs further investigation when more data with annotated RV becomes available for evaluation.

E. Segmentation Performance on Images belonging to Different Pathologies
We further compare the segmentation performance of the proposed method on five groups of pathological data to the group of normal subjects (NOR), see Table VIII. Surprisingly, the UKBB model achieves satisfying segmentation accuracy over the healthy group as well as on DCM images and images from patients diagnosed with AS, indicating that the model is capable of segmenting not only subjects with normal cardiac structures but also some abnormal cases with the cardiac morphological variations seen in those DCM and AS images, see Fig. 5. However, the model fails to segment some of the other pathological images, especially those in the HCM, MINF, and ARV pathology groups where lower Dice scores are observed. For example, the mean Dice score for LV segmentation on HCM images is the lowest (0.84). Fig. 6 demonstrates some of the worst cases produced by our method. The first column in Fig. 6 shows a failure case where the UKBB model underestimated the myocardium and overestimated the LV when a thickened myocardial wall is present in a patient with HCM. Also, the model struggles to segment the cardiac structures of a patient with MINF, whose abnormal myocardial wall has non-uniform thickness (the second column in Fig. 6). Compared to images in the other four groups, images from patients with ARV seem to be more difficult for the model to segment, as the model achieves not only a low mean Dice score on the RV (0.79) but also a low average value on the myocardium (0.74).
Fig. 5: Each block contains a slice from the ED frame and its corresponding ES one for the same subject. Row 1: ground truth; row 2: results predicted by the UKBB model. This figure shows that the UKBB model produced satisfying segmentation results not only on healthy subjects but also on those DCM and AS cases with abnormal cardiac morphology. The AS example in this figure is a patient with aortic stenosis who previously had a myocardial infarction.
One possible reason for these unsatisfactory segmentation results might be the lack of pathological data in the current training set. In fact, the UKBB data contains only a small number of subjects with self-reported cardiovascular diseases, and the majority of the data are healthy subjects in middle and later life [22,37,3]. This indicates that the network may not be able to 'learn' the range of pathologies that are seen in everyday clinical practice, especially those abnormalities which are not currently represented in the UKBB dataset.

Failure Mode Analysis
We also visually inspected the images where the UKBB model produces poor segmentation masks. In general, we identified two main failure modes, apart from the failures on the abnormal pathological cases discussed above:
• Apical and basal slices. These slices are more error-prone than mid-ventricle slices, as also reported in [1]. Segmenting these slices is difficult because apical slices contain extremely tiny objects which can be hard to locate and segment (see Fig. 7 (a)), whereas the complex structures in basal slices make it harder to identify the contour of the LV (see Fig. 7 (b)).
• Low image quality. Images with poor quality are found among both 1.5T and 3T images (see Fig. 7 (c) and (d)). As reported in [7,8], 1.5T images are more likely to have low image contrast than 3T images due to their lower signal-to-noise ratio (SNR), whereas 3T images can have more severe imaging artefacts than 1.5T images. These artefacts and noise can greatly affect the segmentation performance.

F. Statistical Analysis on Clinical Parameters
We further compare the proposed automatic method with the manual approach on five clinical parameters: the end-diastolic volume of the LV (LVEDV), the end-systolic volume of the LV (LVESV), the left ventricular mass (LVM), the end-diastolic volume of the RV (RVEDV), and the end-systolic volume of the RV (RVESV). Fig. 8 shows the Bland-Altman plots for the five clinical parameters on the three datasets. The Bland-Altman plot is commonly used for analysing agreement and bias between two measurements. Here, each column shows the comparison between automated measurements and manual measurements for one particular parameter, including the mean difference (MD) with the corresponding standard deviation (SD) and the limits of agreement (LOA). In addition, we also conducted the Bland-Altman analysis for the automatic method (FCN) in our previous work [3], for ease of comparison.
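For readers less familiar with these parameters, the sketch below shows one common way to derive a volume and a mass from a segmentation mask: the number of labelled voxels is multiplied by the voxel volume, and the myocardial volume is converted to mass using a tissue density. This section does not state how the measurements were computed, so the function names, the voxel spacing in the example, and the density of 1.05 g/mL (a commonly used value) are assumptions.

```python
# Commonly used myocardial tissue density; an assumption in this sketch.
MYOCARDIAL_DENSITY_G_PER_ML = 1.05


def volume_ml(voxel_count, spacing_mm):
    """Volume in mL from a voxel count and (dx, dy, dz) voxel spacing in mm."""
    dx, dy, dz = spacing_mm
    return voxel_count * dx * dy * dz / 1000.0  # 1 mL = 1000 mm^3


def lv_mass_g(myocardium_voxel_count, spacing_mm):
    """Left ventricular mass in grams from the myocardium voxel count."""
    return volume_ml(myocardium_voxel_count, spacing_mm) * MYOCARDIAL_DENSITY_G_PER_ML


# Hypothetical example: 1000 voxels at 1.25 x 1.25 x 10 mm spacing.
example_volume = volume_ml(1000, (1.25, 1.25, 10.0))
example_mass = lv_mass_g(1000, (1.25, 1.25, 10.0))
```

Applying `volume_ml` to the LV or RV cavity label at the ED and ES frames would give the EDV and ESV values, respectively, under the same assumptions.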
From the first two columns in Fig. 8, one can see that both FCN and the proposed method achieve excellent agreement with human observers on the UKBB dataset, indicating that both of them can be used interchangeably with manual measurements. For the other two datasets, by contrast, the proposed method achieves much better agreement than FCN, as the LOA between the proposed method and manual results is narrower. For example, for LVM on the ACDC dataset, the LOA between the proposed method and the manual approach is from 5.07 to -39.93 (MD = -17.43), while the LOA between the FCN and the manual method is from 3.45 to -64.66 (MD = -30.61), see Fig. 8 (n) and Fig. 8 (m), respectively.
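The quantities reported in each Bland-Altman plot can be sketched in a few lines of Python. The limits of agreement are taken as the mean difference ±1.96 standard deviations, matching the caption of Fig. 8; the function name `bland_altman`, the use of the sample standard deviation, and the toy measurements are assumptions for illustration.

```python
from statistics import mean, stdev


def bland_altman(automatic, manual):
    """Mean difference, SD, and limits of agreement (automatic - manual)."""
    diffs = [a - m for a, m in zip(automatic, manual)]
    md = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1); an assumption
    return {
        "md": md,
        "sd": sd,
        "loa_lower": md - 1.96 * sd,
        "loa_upper": md + 1.96 * sd,
    }


# Toy measurements (not data from the study).
automatic = [100.0, 110.0, 120.0]
manual = [102.0, 108.0, 125.0]
stats = bland_altman(automatic, manual)
```

As a consistency check on the reported numbers, the LOA of 5.07 to -39.93 around MD = -17.43 corresponds to a half-width of 22.50, i.e. an SD of roughly 11.5, whereas the FCN interval of 3.45 to -64.66 has a half-width of about 34.06.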
Fig. 7: ... This sample is from the BSCMR-AS dataset where only the LV endocardial annotation is available. (c) Failure to recognize the LV due to a stripe of high-intensity noise around the cardiac chambers in this 1.5T image. This sample is an ES frame image from the BSCMR-AS dataset. (d) Failure to estimate the LV structure when unexpected strong dark artefacts disrupt the shape of the LV in this 3T image. Note that this image is an ED frame image from the BSCMR-AS dataset where the RV was not annotated by experts.
Finally, we calculate Spearman's rank correlation coefficients (r²) of the five clinical parameters derived from the automatic segmentation using the proposed method and from the manual segmentation, which are reported in Table IX. The clinical measurements based on the LV segmentation and the myocardium segmentation produced by our automatic model are highly positively correlated with the manual analysis (≥ 0.91), although the RV correlation coefficients on the ACDC dataset are relatively lower.
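Spearman's rank correlation measures how well the ordering of the automatic measurements matches the ordering of the manual ones. The following minimal sketch computes it as the Pearson correlation of the ranks, assigning average ranks to ties. The function names are hypothetical, the sketch returns the plain coefficient, and it does not handle constant inputs (which would divide by zero).

```python
from statistics import mean


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: a monotone but non-linear relation still correlates perfectly.
rho = spearman([1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 9.0, 16.0])
```

Because only the ranks enter the computation, a systematic bias between automatic and manual measurements does not lower this coefficient, which is why it complements rather than replaces the Bland-Altman analysis.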

V. DISCUSSION
In this paper, we developed a general training/testing pipeline based on data normalization and augmentation for improving the generalizability of neural network-based CMR image segmentation methods. We also highlighted the importance of the network structure and capacity (section IV-A) as well as the data normalisation and augmentation strategies (section IV-B) for model generalizability. Extensive experiments on multiple test sets were conducted to validate the effectiveness of the proposed method. The proposed method achieves promising results on a large number of test images from various scanners and sites even though the training set is from one scanner, one site (section IV-C, IV-D). Besides, the network is capable of segmenting healthy subjects as well as a group of pathological cases from multiple sources although it had only been trained with a small portion of pathological cases.
A limitation of the current method (the UKBB model) is that it still tends to underestimate the myocardium, especially when the myocardium is larger (see points in the right part of Fig. 8 (p)). Again, we attribute this limitation mainly to the lack of pathological cases in the training set.
Besides, we found that the difference (bias) between the automatic measurements and the manual measurements is larger in the cross-domain test sets (ACDC and BSCMR-AS) than in the intra-domain set (UKBB test set). The larger bias may be caused not only by the challenging pathological cases discussed above, but also by inter-observer bias and the inconsistent labelling protocols used in the three datasets. Evident inter-observer variability when delineating myocardial boundaries on apical and basal slices in a single dataset has been reported in [19]. In this study, however, there are three datasets which were labelled by three different groups of observers, each following an independent labelling protocol. As a result, significant variations of the RV labels and MYO labels on the basal planes are found among the three datasets. This inter-dataset inconsistency of the RV labels on basal planes has been reported in [38]. The mismatch of RV labels can partially account for the negative MD values for the RV measurements in the ACDC dataset (see Fig. 8 (s)). The differences in the labelling protocols, together with inter-observer variability in different datasets, make it challenging to evaluate model generalizability across domains accurately.
In the future, we will focus on improving the segmentation performance of the neural network by increasing the diversity of the training data in terms of pathology. A promising way of doing this, instead of collecting more labelled data, is to synthesize pathological cases by transforming existing healthy subjects with pathological deformations. A pioneering work [39] in this direction has successfully transported pathological deformations from certain pathological subjects (i.e. HCM, DCM) to healthy subjects, which can help to increase the number of pathological cases. This deformation-based data augmentation can be easily integrated into the proposed training pipeline without significant modifications.

VI. CONCLUSION
In this paper, we proposed a general training/testing pipeline for neural network-based cardiac segmentation methods and revealed that a proper design of data normalization and augmentation, as well as the design of the network, plays an essential role in improving its generalization ability across images from various domains. We have shown that a neural network (U-Net) trained with CMR images from a single scanner has the potential to produce competitive segmentation results on multi-scanner data across domains. Besides, experimental results have shown that the network is capable of segmenting healthy subjects as well as a group of pathological cases from multiple sources, although it had only been trained with the UK Biobank data, which contains only a small portion of pathological cases. Although it may still have limitations in segmenting images with low quality and some images with significant pathological deformations, higher segmentation accuracy for these subjects could be achieved in the future by increasing the diversity of the training data in terms of image quality and pathology.
Fig. 8: Agreement of clinical measurements from automatic and manual segmentations. Bland-Altman plots (automatic - manual) are shown for the three sets. In each Bland-Altman plot, the x-axis denotes the average of the two measurements whereas the y-axis denotes the difference between them. The solid line in red denotes the mean difference (bias) and the two dashed lines in green denote ±1.96 standard deviations from the mean. The title of each plot shows the mean difference (MD) and its standard deviation (SD) for each pair of measurements. FCN: the automatic method in our previous work [3], LV/RV: left/right ventricle, EDV/ESV: end-diastolic/systolic volume, LVM: left ventricular mass.

ETHICS APPROVAL AND CONSENT TO PARTICIPATE
UK Biobank has approval from the North West Research Ethics Committee (REC reference: 11/NW/0382). The ACDC data has been fully anonymised and handled within the regulations set by the local ethical committee of the Hospital of Dijon (France). The BSCMR-AS data has approval from the UK National Research Ethics Service (REC reference: 13/NW/0832), and conforms to the principles of the Declaration of Helsinki. All patients included in the BSCMR-AS study gave written informed consent.
CONSENT FOR PUBLICATION
Not applicable.

COMPETING INTERESTS
The authors declare that they have no competing interests.
AVAILABILITY OF DATA AND MATERIAL
Imaging data and manual annotations of the UKBB dataset were provided by the UK Biobank Resource under Application Number 2964. Researchers can apply to use the UK Biobank data resource for health-related research in the public interest [40]. The ACDC data is open to the public and can be downloaded from its website https://acdc.creatis.insa-lyon.fr/#challenges after registration. The BSCMR-AS dataset is available upon reasonable request. The code for training and testing the segmentation network will be available at https://github.com/cherise215/CardiacMRSegmentation. The code is used for data pre-processing, data augmentation, and the segmentation network training and testing.