MOCOnet: Robust Motion Correction of Cardiovascular Magnetic Resonance T1 Mapping Using Convolutional Neural Networks

Background: Quantitative cardiovascular magnetic resonance (CMR) T1 mapping has shown promise for advanced tissue characterisation in routine clinical practice. However, T1 mapping is prone to motion artefacts, which affect its robustness and clinical interpretation. Current methods for motion correction in T1 mapping are model-driven with no guarantee of generalisability, limiting their widespread use. In contrast, emerging data-driven deep learning approaches have shown good performance in general image registration tasks. We propose MOCOnet, a convolutional neural network solution, for generalisable motion artefact correction in T1 maps.

Methods: The network architecture employs a U-Net to produce displacement vector fields and uses warping layers to apply deformation to the feature maps in a coarse-to-fine manner. Using the UK Biobank imaging dataset scanned at 1.5T, MOCOnet was trained on 1,536 mid-ventricular T1 maps (acquired using the ShMOLLI method) with synthetic motion artefacts generated by a customised deformation procedure, and tested on a different set of 200 samples with a diverse range of motion. MOCOnet was compared to a well-validated baseline multi-modal image registration method. Motion reduction was visually assessed by 3 human experts, with motion scores ranging from 0% (strictly no motion) to 100% (very severe motion).

Results: MOCOnet achieved fast image registration (<1 second per T1 map) and successfully suppressed a wide range of motion artefacts. MOCOnet significantly reduced motion scores from 37.1±21.5 to 13.3±10.5 (p < 0.001), whereas the baseline method reduced them to 15.8±15.6 (p < 0.001). MOCOnet suppressed motion artefacts significantly better and more consistently than the baseline method (p = 0.007).

Conclusion: MOCOnet demonstrated significantly better motion correction performance than a traditional image registration approach. Salvaging motion-affected data robustly and in a time-efficient manner may enable better image quality and reliable images for immediate clinical interpretation.


INTRODUCTION
Quantitative T1 mapping is a novel approach in cardiovascular magnetic resonance (CMR) for myocardial tissue characterisation (1). Native and post-contrast T1 mapping offer quantitative, pixel-wise measures to detect tissue changes in myocardial composition (2) and have been used in the assessment of myocardial inflammation (3), oedema (4,5), infiltration (6), diffuse fibrosis (7), and other pathologies (8). Stress T1 mapping has the potential to assess coronary artery disease without the need for gadolinium-based contrast agents (9)(10)(11).
T1 mapping is obtained from pixel-wise exponential recovery curve fitting of multiple T1-weighted images. With advances made from the original Look-Locker spectroscopic method (12), current mapping techniques employ intermittent image acquisition using electrocardiographic gating during multiple heartbeats (2). The shortened modified Look-Locker inversion recovery (ShMOLLI) (13) allowed shorter breath-holds with 9 heartbeats with high precision and reproducibility. Although acquiring multiple T1-weighted images at the same cardiac phase largely reduces the influence of cardiac motion, undesired respiratory motion still poses significant challenges (14). Uncorrected and unrecognised respiratory motion artefacts may cause errors in T1 estimation and incorrect diagnoses (13).
Retrospective motion correction (MOCO) on the multiple T1-weighted images can significantly improve the robustness and clinical utility of mapping techniques (15). Such correction is accomplished by aligning the T1-weighted images before reconstruction. The main challenge is the variation in image contrast and signal nulling of the multiple T1-weighted images acquired at different inversion times. Model-driven registration methods for MOCO were developed to circumvent this limitation with promising results (16)(17)(18)(19). However, careful inspection for uncorrected residual motion or distortions from failures in motion correction is still needed (20). Although visual assessment in CMR is still the clinical standard for image interpretation (21), prolonged manual inspection is prone to error from inconsistency and operator fatigue, and slows the clinical workflow when large volumes of images are handled.
With the advent of deep learning, convolutional neural networks (CNN) have enabled unprecedented progress in image processing, shifting the paradigm from predefined, hand-crafted rules to automated learning procedures aided by large datasets. The rapid adoption of deep learning approaches within CMR provides fast, consistent, and accurate pipelines, primarily for image segmentation and analysis (22), significantly reducing physician labour hours. The field of clinical image registration with deep learning is also primed to replace iterative registration methods, with potential to improve accuracy, time efficiency and quality control (23), and to address the unmet need for MOCO in T1 maps. We hypothesised that a data-driven method for myocardial motion correction would suppress motion artefacts with more robustness and generalisability to serve large clinical datasets.
In this work, we present MOCOnet, a novel deep learning approach for myocardial motion correction developed using CMR T1 mapping from the UK Biobank (24). We adapted an encoder-decoder architecture with warping layers to aid the learning of such deformation in a coarse-to-fine manner. Given a set of T1-weighted images, MOCOnet can predict the deformation required to correct any motion artefacts present in a time-efficient manner. MOCOnet was tested for its motion correction performance against a well-validated multi-modal image registration method, using multiple blinded expert observers to validate the motion correction effectiveness.

Cardiac T1 Mapping and Motion Artefact
Cardiac ShMOLLI T1 mapping is calculated by fitting exponential recovery curves to 7 inversion recovery-weighted (IRW) images with multiple inversion times (Figure 1A), acquired within a short 9-heartbeat single breath-hold (13). The reconstructed T1 map (Figure 1B) enables pixel-wise quantification of T1 values. The associated map of coefficient of explained variance (R² map; Figure 1C) allows quality monitoring of the curve fitting in reference to a monoexponential T1 relaxation recovery model. A close fit to the reference model appears as uniform white in the relevant regions of interest in the R² map, whereas motion in the IRW images (Figure 1D, arrowed) reduces the interpretability of the T1 map (Figure 1E, arrowed) and produces dark bands in the motion-affected areas of the R² map (Figure 1F, arrowed). Besides motion artefacts, the R² map is also sensitive to off-resonance, fat inclusion, mistriggering, and other artefacts (5, 25).
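As an illustration of pixel-wise recovery curve fitting and the associated R² map, the following is a minimal sketch rather than the ShMOLLI reconstruction itself. It assumes a generic three-parameter model S(TI) = A - B·exp(-TI/T1*) with Look-Locker correction T1 = T1*(B/A - 1), signed (polarity-restored) signals, and hypothetical inversion times; the conditional fitting used by ShMOLLI is not reproduced. The function name `fit_t1_map` and the search grid are illustrative assumptions.

```python
import numpy as np


def fit_t1_map(irw, tis, t1star_grid=None):
    """Fit S(TI) = A - B * exp(-TI / T1*) per pixel; return (T1 map, R^2 map).

    irw: (n_ti, H, W) signed inversion-recovery signals; tis: (n_ti,) in ms.
    """
    tis = np.asarray(tis, dtype=float)
    n_ti, h, w = irw.shape
    y = irw.reshape(n_ti, -1).astype(float)
    if t1star_grid is None:
        t1star_grid = np.arange(100.0, 3000.0, 5.0)  # hypothetical search range, ms
    best_sse = np.full(y.shape[1], np.inf)
    best = np.zeros((3, y.shape[1]))  # rows: A, B, T1*
    for t in t1star_grid:
        # For a fixed T1*, the model is linear in A and B.
        design = np.stack([np.ones(n_ti), -np.exp(-tis / t)], axis=1)
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        sse = np.sum((y - design @ coef) ** 2, axis=0)
        better = sse < best_sse
        best_sse[better] = sse[better]
        best[:2, better] = coef[:, better]
        best[2, better] = t
    a, b, t1star = best
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = t1star * (b / a - 1.0)  # Look-Locker correction
        sst = np.sum((y - y.mean(axis=0)) ** 2, axis=0)
        r2 = 1.0 - best_sse / sst  # coefficient of explained variance
    return t1.reshape(h, w), r2.reshape(h, w)


def signal(t1, tis, a=1.0, b=2.0):
    """Ideal recovery curve; with B = 2A the apparent T1* equals T1."""
    return a - b * np.exp(-np.asarray(tis, dtype=float) / t1)


# Hypothetical inversion times (ms) and a synthetic 4 x 4 "myocardium" with T1 = 1000 ms.
tis = np.array([100.0, 180.0, 260.0, 1100.0, 2100.0, 3100.0, 4100.0])
clean = np.tile(signal(1000.0, tis)[:, None, None], (1, 4, 4))

# Simulated motion: in one IRW image, pixel (0, 0) samples tissue with T1 = 300 ms.
moved = clean.copy()
moved[3, 0, 0] = signal(300.0, tis)[3]

t1_clean, r2_clean = fit_t1_map(clean, tis)
t1_moved, r2_moved = fit_t1_map(moved, tis)
```

In this toy case the misaligned pixel can no longer be described by a single recovery curve, so its R² falls below that of the aligned pixels, which is the behaviour the R² map is used to reveal.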

Non-rigid Registration Approach
Given that a T1 map with motion artefacts is composed of 7 unaligned IRW images, a motion-corrected T1 map can be achieved by aligning the IRW images. The motion artefact can be synthesised as a deformation of aligned IRW images with a displacement vector field (DVF). The non-rigid registration problem is then solved by estimating the inverse DVF of a given set of unaligned IRW images.
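Applying a DVF to an image can be sketched as below. This is an illustrative NumPy implementation under stated assumptions, not the warping layer used in the network: the DVF is assumed to be stored as per-pixel (row, column) displacements in pixels, sampling is backward ("pull") warping with bilinear interpolation, and coordinates are clamped at the image border. The name `warp_image` is hypothetical.

```python
import numpy as np


def warp_image(image, dvf):
    """Warp a 2D image with a displacement vector field.

    image: (H, W); dvf: (2, H, W), dvf[0] = row and dvf[1] = column displacement.
    output[r, c] = image[r + dvf[0, r, c], c + dvf[1, r, c]] (bilinear, edge-clamped).
    """
    h, w = image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    r = np.clip(rows + dvf[0], 0, h - 1)
    c = np.clip(cols + dvf[1], 0, w - 1)
    r0 = np.floor(r).astype(int)
    c0 = np.floor(c).astype(int)
    r1 = np.minimum(r0 + 1, h - 1)
    c1 = np.minimum(c0 + 1, w - 1)
    fr = r - r0  # fractional row offset
    fc = c - c0  # fractional column offset
    top = image[r0, c0] * (1 - fc) + image[r0, c1] * fc
    bottom = image[r1, c0] * (1 - fc) + image[r1, c1] * fc
    return top * (1 - fr) + bottom * fr


rng = np.random.default_rng(0)
img = rng.random((8, 8))

# A pure translation by 2 columns, and its inverse (the negated field).
shift = np.zeros((2, 8, 8))
shift[1] = 2.0
moved = warp_image(img, shift)
restored = warp_image(moved, -shift)
```

For a pure translation the inverse DVF is simply the negated field, so `restored` matches `img` away from the clamped border. For general non-rigid deformations the inverse is not the negated field, which is why it has to be estimated.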

Multi-Scale Registration Neural Network
The proposed learning-based model corrects a T1 map by estimating the inverse DVF in each of its 7 IRW images to enable a non-rigid registration between them, before the T1 map reconstruction. The multi-scale registration CNN (Figure 2) adopts an encoder-decoder U-Net-like structure (26) and employs warping layers (27) between the contracting and expansive paths at each scale. The feature maps are downsampled with a series of 3 × 3 convolutional layers followed by a batch normalisation layer, a leaky rectified linear unit and a max-pooling layer, and similarly upsampled with a transposed convolutional layer. The warping layers speed up the training by imposing a loss function in a multi-scale manner and increase the registration accuracy by correcting motion starting from coarse levels and passing the residual motion to higher resolution layers for fine motion correction.
The IRW images are first fed into a sequence of convolution and downsampling operations to produce features at multiple scales on a per-channel basis. The features, from low to high resolution, are then used as input to convolution modules that produce DVFs. Each convolution module takes as input the features from the previous step, the DVF at the previous scale, and the warped features from the downsampling stage. Applying warping at each of the 4 scales allows the residual motion to be corrected and refined at the next scale. Hence, the neural network generates the DVFs in a coarse-to-fine manner and adds more detail at higher resolution in each subsequent level, with a loss function defined at each scale to further supervise the learning.
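A loss defined at each of the 4 scales can be sketched as follows. This is a NumPy illustration only, assuming that the per-scale loss is a mean squared error between predicted and reference DVFs, that the reference at coarser scales is obtained by 2 × 2 average pooling, and that the per-scale losses are averaged with equal weight. The helper names `downsample2` and `multiscale_mse` are hypothetical, and the actual network was implemented in TensorFlow.

```python
import numpy as np


def downsample2(dvf):
    """Average-pool the last two axes of an array of shape (..., H, W) by 2."""
    *lead, h, w = dvf.shape
    return dvf.reshape(*lead, h // 2, 2, w // 2, 2).mean(axis=(-3, -1))


def multiscale_mse(pred_pyramid, true_dvf):
    """Average of per-scale mean squared errors.

    pred_pyramid: predicted DVFs ordered from finest to coarsest, where the
    k-th entry has shape (7, 2, H / 2**k, W / 2**k).
    true_dvf: reference DVFs at full resolution, shape (7, 2, H, W).
    """
    losses = []
    target = true_dvf
    for k, pred in enumerate(pred_pyramid):
        if k > 0:
            target = downsample2(target)  # assumed reference at the coarser scale
        losses.append(np.mean((pred - target) ** 2))
    return float(np.mean(losses))


rng = np.random.default_rng(0)
true_dvf = rng.normal(size=(7, 2, 32, 32))  # synthetic stand-in for 7 DVFs

# A "perfect" prediction at all 4 scales, and one offset by 1 pixel everywhere.
perfect = [true_dvf]
for _ in range(3):
    perfect.append(downsample2(perfect[-1]))
loss_perfect = multiscale_mse(perfect, true_dvf)
loss_offset = multiscale_mse([p + 1.0 for p in perfect], true_dvf)
```

Supervising every scale gives the coarse levels a direct training signal instead of relying only on gradients propagated back from the finest DVF.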

Imaging Data and Inclusion Criteria
The imaging data comprised over 5,000 CMR native T1 maps from the UK Biobank Imaging Component (24), acquired at the mid-ventricular short-axis view using the ShMOLLI T1 mapping sequence (13). For quality control, a trained human operator (EL), with over 10 years of experience in CMR image analysis, assessed the presence of any artefact in the left ventricular myocardium in the 7 IRW images for each T1 map. A total of 1,536 T1 maps were scored strictly as good quality with no artefact. The remaining data were marked as having either mild to severe motion or other imaging artefacts and were excluded from the training dataset. This strict quality control ensured that the neural network learnt to align the images accurately with no distraction from residual motion artefacts in the training data, i.e., with images that did not require any motion correction.

Training Procedure
The quality-controlled imaging data were used to generate a training dataset, with 10% of the data reserved for validation. Artificial DVFs were generated as previously described (28) and applied to the IRW images without motion artefacts to synthesise random non-rigid motion without requiring segmentations. Specifically, 7 DVFs were generated with random parameters preserving anatomical topology. The mean displacement at each pixel was calculated and removed from all 7 DVFs to focus on relative displacement between images. The generated DVFs were applied to the respective IRW images to produce deformed IRW images. The proposed model was trained to predict 7 inverse DVFs from 7 deformed IRW images, with the synthetic inverse DVFs as ground truth (Figure 3A).
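The mean-removal step can be illustrated with a short sketch. The random field generator here is a stand-in and not the procedure of reference (28): it assumes smooth displacements obtained by bilinearly upsampling a coarse random grid, with a hypothetical maximum displacement, and it does not enforce topology preservation. The names `smooth_random_field` and `synthetic_dvfs` are illustrative.

```python
import numpy as np


def smooth_random_field(rng, shape, grid=6, max_disp=5.0):
    """Smooth random displacement component: coarse random grid, bilinearly upsampled."""
    h, w = shape
    coarse = rng.uniform(-max_disp, max_disp, size=(grid, grid))
    xc = np.linspace(0, w - 1, grid)
    yc = np.linspace(0, h - 1, grid)
    # Interpolate along columns first, then along rows.
    tmp = np.stack([np.interp(np.arange(w), xc, row) for row in coarse])  # (grid, w)
    return np.stack([np.interp(np.arange(h), yc, col) for col in tmp.T], axis=1)  # (h, w)


def synthetic_dvfs(shape, n_images=7, seed=0):
    """Generate one DVF per IRW image and remove the per-pixel mean displacement."""
    rng = np.random.default_rng(seed)
    dvfs = np.stack([
        np.stack([smooth_random_field(rng, shape) for _ in range(2)])
        for _ in range(n_images)
    ])  # (n_images, 2, H, W): row and column displacement per image
    # Subtracting the mean over the 7 images leaves only relative displacement.
    dvfs -= dvfs.mean(axis=0, keepdims=True)
    return dvfs


dvfs = synthetic_dvfs((64, 64))
```

After mean removal, the 7 displacements at any pixel sum to zero, so a shift common to all images, which would not misalign them with respect to each other, is not something the network is asked to undo.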

Testing Procedure
Once trained, MOCOnet reads a given set of 7 IRW images with or without motion artefacts and estimates the deformation required to correct any motion present (Figure 3B), without ground truth. The last warping layer generates the inverse displacement vector field (DVF), i.e., the deformation required to correct the motion artefacts, in a groupwise manner. The T1 map is then reconstructed offline from the motion-corrected images with an open-source library for CMR parametric mapping (29).

Implementation Specification
All images were zero-padded to the same size of 384 × 384 pixels and image intensities were pre-processed with quantile normalisation to ensure generalisability (30). The multi-scale loss was calculated as the average of the mean squared errors of the predicted DVFs at each scale and resolution. The neural network was optimised using the Adam method (31) with an initial learning rate of 0.001, a learning rate scheduler to reduce the learning rate during training, and a mini-batch size of 4. Training was stopped once the validation loss had not decreased for 50 epochs. The network was implemented in TensorFlow (32) and trained on an NVIDIA TITAN Xp GPU for approximately 48 h, until the training curve converged with low bias and variance. After training, correcting motion for each set of 7 IRW images took less than 1 s on a GPU or a modern CPU.
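The pre-processing can be sketched as below. The details are assumptions made for illustration: the text does not state where the padding is placed or how the quantile normalisation of reference (30) is defined, so this sketch centres the image in the padded frame and rescales intensities between the 1st and 99th percentiles with clipping to [0, 1]. The names `quantile_normalise` and `pad_to` are hypothetical.

```python
import numpy as np


def quantile_normalise(image, low=1.0, high=99.0):
    """Rescale intensities so the low/high percentiles map to 0 and 1 (assumed scheme)."""
    image = np.asarray(image, dtype=float)
    lo, hi = np.percentile(image, [low, high])
    if hi <= lo:
        return np.zeros_like(image)
    return np.clip((image - lo) / (hi - lo), 0.0, 1.0)


def pad_to(image, size=384):
    """Zero-pad a 2D image to size x size, centred (placement is an assumption)."""
    h, w = image.shape
    if h > size or w > size:
        raise ValueError("image is larger than the target size")
    top = (size - h) // 2
    left = (size - w) // 2
    out = np.zeros((size, size), dtype=float)
    out[top:top + h, left:left + w] = image
    return out


rng = np.random.default_rng(0)
raw = rng.gamma(shape=2.0, scale=100.0, size=(300, 280))  # synthetic stand-in image

# Normalise before padding so the added zeros do not distort the percentiles.
prepared = pad_to(quantile_normalise(raw))
```

Normalising before padding is a design choice of this sketch: the padded zeros would otherwise pull the lower percentile towards zero for small images.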

Baseline Deformable Image Registration Method
The proposed method's performance was compared against a well-validated multi-modal image registration algorithm (33) as the baseline method. This registration method alleviates the problem of artificial motion discontinuities by combining a bilateral filter with an additional deformation field-based filter and a diffusion regularisation algorithm and, unlike conventional methods, does not require a prior image segmentation step. The baseline method, implemented in C, used the first IRW image as the reference image for all subsequent pairwise registrations and took approximately 30 s per T1 map on a modern CPU.

Test on Respiratory Motion With Human Observer Scores
A multi-observer experiment was designed to evaluate the effectiveness and robustness of motion correction, as well as any noise potentially introduced into cases that originally had no motion. From the UK Biobank, a test set of 200 real acquired T1 maps with varying degrees of motion artefacts was selected based on the existing quality scores by an experienced human observer. Specifically, 50 samples presented severe motion artefacts affecting all myocardial segments, 100 presented moderate motion affecting individual segments, and 50 presented mild to no motion.
The extent of motion in the test set was assessed on a 5-point categorical scale ('no motion', 'mild motion', 'moderate motion', 'severe motion', and 'very severe motion'), backed by a numerical scale from 0 to 100% behind the interface, to ensure both intuitiveness for human operators and practicality for statistical analyses. The baseline and proposed methods were applied to all samples unselectively, giving 400 motion-corrected samples in total. One hundred and twenty samples (20%) were randomly chosen from the mixed 600 samples and duplicated to evaluate intra-observer variability. Three trained human observers (IP, MB and MS) were instructed to score the resultant 720 samples for the extent of motion. All observers were blinded to the original artefact scores and to which motion correction method had been applied. To reduce the variance of the human scores $X_i$, the weighted average score $\bar{X}$ of the three observers ($i = 1, 2, 3$) was calculated as $\bar{X} = \sum_i W_i X_i / \sum_i W_i$. The weights $W_i$ were calculated as the inverse of the intra-observer variance $\sigma_i^2$ (34,35) based on the duplicated 20% of cases, i.e., $W_i = 1/\sigma_i^2$ for the i-th observer. The expected standard error of the weighted average scores was $\mathrm{SE}(\bar{X}) = \sqrt{1/\sum_i W_i}$.
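The inverse-variance weighting can be written out in a few lines. This sketch assumes the standard form in which each observer's weight is the inverse of the intra-observer variance, the combined score is the weighted mean, and the standard error is the square root of the inverse of the summed weights. It also assumes that the intra-observer variabilities reported in the Results (10.6, 17.3 and 21.9) are standard deviations. The example scores and the name `weighted_score` are hypothetical.

```python
import numpy as np


def weighted_score(scores, sigmas):
    """Inverse-variance weighted mean across observers and its standard error.

    scores: (n_observers, n_samples) motion scores in percent.
    sigmas: (n_observers,) intra-observer standard deviations.
    """
    scores = np.asarray(scores, dtype=float)
    weights = 1.0 / np.asarray(sigmas, dtype=float) ** 2
    mean = (weights[:, None] * scores).sum(axis=0) / weights.sum()
    standard_error = np.sqrt(1.0 / weights.sum())
    return mean, standard_error


# Intra-observer variabilities as reported; treated here as standard deviations.
sigmas = [10.6, 17.3, 21.9]

# Hypothetical scores from three observers for two samples.
scores = [[40.0, 10.0],
          [55.0, 20.0],
          [30.0, 0.0]]

combined, se = weighted_score(scores, sigmas)
```

Under these assumptions the standard error evaluates to between 8.3 and 8.4, in line with the value of 8.3 given in the Results, and the most self-consistent observer contributes most to the combined score.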

Statistical Analysis
Quality scores were reported as mean ± standard deviation. The non-parametric Wilcoxon signed-rank test was used to assess the statistical difference between the data with and without motion correction by the baseline and proposed methods. Given the modest number of repeated comparisons within each group, the statistical significance threshold was kept at the standard p < 0.05 (36). Statistical analysis was performed using the Python programming language.
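A paired comparison of this kind can be run in Python as sketched below. The use of `scipy.stats.wilcoxon` is an assumption, since the text only states that Python was used, and the scores are synthetic values generated for illustration, not study data.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Synthetic paired motion scores (percent) for 200 cases: before and after correction.
before = rng.uniform(20.0, 80.0, size=200)
after = np.clip(before - rng.uniform(5.0, 40.0, size=200), 0.0, 100.0)

# Two-sided Wilcoxon signed-rank test on the paired differences.
statistic, p_value = wilcoxon(before, after)
significant = p_value < 0.05
```

The signed-rank test is appropriate here because each case is scored both with and without correction, and it makes no assumption that the score differences are normally distributed.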

RESULTS
The results of human observer validation on the 200 cases from the UK Biobank are reported in Table 1. Intra-observer variabilities of the three observers on the 20% duplicated cases were 10.6, 17.3 and 21.9, respectively. The standard error of the final weighted-average scores used to compare the motion correction methods was 8.3 on a scale from 0 to 100%. Overall, both methods significantly reduced the motion artefacts, from an average motion score of 37.1 ± 21.5 to 15.8 ± 15.6 (baseline method) and 13.3 ± 10.5 (MOCOnet; both p < 0.001). MOCOnet was significantly more effective at reducing motion artefacts than the baseline method for the subgroups with severe motion (N = 50, p = 0.006) and moderate motion (N = 100, p = 0.04). For the subgroup with mild to no motion (N = 50), both methods significantly further reduced the motion artefacts (both p < 0.001); neither added noise, and the two methods did not differ significantly from each other (p = 0.2). Overall, MOCOnet suppressed motion artefacts to a greater extent and more consistently than the baseline method, as evidenced by its lower maximum score and variability (N = 200, p = 0.007). The boxplot of motion scores (Figure 4) further illustrates the above dependencies in non-parametric terms. This demonstrates that MOCOnet achieved a tighter span of perceived motion estimates, with better perceived robustness to outliers. MOCOnet successfully learnt from synthetic random motion to predict the DVFs required to correct the motion of IRW images, ensuring a motion-corrected T1 map in real acquired data. Figure 5 exemplifies the robustness of the method. One training sample was falsely considered to have no motion artefacts, as evidenced by the overlaid contours of both myocardium and stomach, but this did not cause overfitting or affect the final results. The data-driven process aided the learning of the general rule, as MOCOnet managed to correct the error in this training sample instead of replicating it.

Table 1 note: The quality scores are inverse variance-weighted scores of three human observers, reported as mean ± SD (maximum value). The best results are highlighted in bold.

DISCUSSION
In this work, MOCOnet, a novel end-to-end motion correction neural network for CMR T1 maps, was developed using a large-scale dataset and validated with expert human analysts.
MOCOnet was able to automatically predict the deformation required to correct real motion artefact cases. The proposed method has a fast processing speed of <1 s per T1 map and does not require modification of image acquisition sequences, external hardware, or user intervention, enabling direct implementation in clinical practice.
Although the principle of estimating the required DVFs on a given set of images to correct their mutual alignment was tested on myocardial ShMOLLI T1 maps, the problem formulation and solution are not limited to this mapping method or region of interest. The deformation estimation is simplified by considering the images 'as is' with a data-driven procedure (37), without relying heavily on the differences in contrast, the specific inversion recovery times or prior user input. This principle can be directly applied to other T1 mapping methods that require multiple T1-weighted images to be aligned in a groupwise manner to ensure accurate exponential recovery curve fitting (38), to other organs that are evaluated with parametric mapping, such as the brain (39) and liver (40), and to other imaging modalities with varied image contrast (41).

The potential clinical impact of the method is promising. A large portion of the UK Biobank T1 mapping data analysed in this study presented mild to severe motion, hampering the diagnostic utility of T1 mapping. Although recent progress on automated motion artefact detection methods (42) may ease the quality monitoring process, rescanning to ensure a motion-free T1 map would increase scan times and reduce patient throughput. The presented data-driven MOCOnet approach provides an attractive solution to retrospectively suppress the motion using most of the acquired data to enhance T1 map quality, which is expected to salvage data corrupted by motion, reduce the need for rescanning and improve diagnosis. MOCOnet also holds promise for stress T1 mapping applications (9)(10)(11)(38), which may be subject to greater motion artefact.
With the rapidly evolving field of deep learning, further research can be done to assess potential benefits of incorporating a more diverse variety of learning-based registration methods (23, 43) into a quality-control driven pipeline (44)(45)(46) to verify the registration accuracy on-the-fly, including the R² maps. With further work, MOCOnet together with T1 protocol quality assurance (47,48) and automated myocardial segmentation (45) could ultimately lead to a comprehensive framework for robust T1 mapping for clinical use.
Despite good performance in motion correction, as evidenced by the large improvement in the motion score, the human observer experiments revealed that MOCOnet could still fail to correct images with severe motion. The challenge arises not only from the difficulty of motion correction but also from through-plane motion, which results in T1 values being fitted to signals from different tissue locations. Breath-holding remains crucial in acquiring good quality T1 maps. Future work will include validation on a multi-vendor, multi-centre population, expansion to other regions of interest, and direct implementation on the scanner for robust inline motion artefact correction to generate good quality and reliable images for immediate clinical interpretation.

CONCLUSION
MOCOnet is an effective and robust convolutional neural network for correction of artefacts from myocardial motion. The technique can be readily deployed for post-processing of T1 mapping to restore T1 values in images affected by motion artefacts. This non-rigid registration solution can be further extended to other mapping methods to generate good quality and reliable images for immediate clinical interpretation. MOCOnet can eventually enhance parametric mapping methods, paving the way towards more reliable quantitative CMR imaging.

DATA AVAILABILITY STATEMENT
The imaging data were provided by the UK Biobank under the technical access agreement. Data access must be obtained directly from the UK Biobank.

AUTHOR CONTRIBUTIONS
RG, QZ, BP, and KW contributed to the design of the study. RG drafted the manuscript. QZ curated the database and analysed the results. EL quality controlled the UK Biobank data. IP, MB, and MS scored the motion extent. VF and SP provided research guidance and conceived the study. All authors critically edited and revised the article.