Recent Advances in Fibrosis and Scar Segmentation From Cardiac MRI: A State-of-the-Art Review and Future Perspectives

Segmentation of cardiac fibrosis and scars is essential for clinical diagnosis and can provide invaluable guidance for the treatment of cardiac diseases. Late Gadolinium enhancement (LGE) cardiovascular magnetic resonance (CMR) has been successful in guiding the clinical diagnosis and treatment reliably. For LGE CMR, many methods have demonstrated success in accurately segmenting scarring regions. Co-registration with other non-contrast-agent (non-CA) modalities [e.g., balanced steady-state free precession (bSSFP) cine magnetic resonance imaging (MRI)] can further enhance the efficacy of automated segmentation of cardiac anatomies. Many conventional methods have been proposed to provide automated or semi-automated segmentation of scars. With the development of deep learning in recent years, we can also see more advanced methods that are more efficient in providing more accurate segmentations. This paper conducts a state-of-the-art review of conventional and current state-of-the-art approaches utilizing different modalities for accurate cardiac fibrosis and scar segmentation.


INTRODUCTION
Necrosis regions found in the heart [including left atrium (LA) pre-ablation fibrosis, LA post-ablation scar and left ventricle (LV) infarction] can, depending on their location and size, have various implications for the cardiac condition of patients. For example, ventricular scars can be signs of earlier episodes of myocardial infarction (MI) (Choi et al., 2001; Krittayaphong et al., 2008; Wu et al., 2008; Larose et al., 2010). Locating and quantifying fibrosis and scars has also been demonstrated to be a valuable tool for the treatment stratification of patients with atrial fibrillation (AF) (Allessie, 2002; Boldt, 2004) or ventricular tachycardia (Ukwatta et al., 2015) and provides guidance for surgical or ablation-based procedures (Vergara and Marrouche, 2011). Imaging of post-ablation scars may also give valuable information on treatment outcomes (Peters et al., 2007; Badger et al., 2010). Cardiovascular magnetic resonance (CMR) is a modern imaging technique that is widely used for the qualitative and quantitative evaluation of cardiac conditions and to support diagnosis, monitoring of disease progression and treatment planning (Kim et al., 2009). In particular, Late Gadolinium enhancement (LGE) CMR has emerged as a technique for locating and quantifying regions of fibrosis and scars across the LA and the LV (Peters et al., 2007; McGann et al., 2008; Oakes et al., 2009; Akkaya et al., 2013; Bisbal et al., 2014). Clinical validation has also shown that pre-ablation fibrosis quantification with LGE CMR improves ablation strategy planning, treatment stratification and prognosis (Akoum et al., 2011). It has also enabled computationally guided and personalized targeted ablation for treating AF in clinical practice (Boyle et al., 2019).
Many algorithms have been developed for the segmentation of cardiac scarring regions, and a few challenges have benchmarked some of the high-performing methodologies (Table 1). Among these, 2-SD (standard deviation) has been advocated by the official guidelines (Kramer et al., 2013), while the full width at half maximum (FWHM) technique has been advocated as the most reproducible method to segment ventricular scars (Flett et al., 2011) (see Section 3.2 for descriptions of the 2-SD and FWHM methods). As these algorithms usually rely on a prior successful segmentation of the corresponding anatomical regions as an accurate initialization, there has also been rising attention to the automated segmentation of LA and LV anatomy from LGE CMR images (Table 1).
With the development of artificial intelligence techniques, a rising number of deep learning models using convolutional neural networks [e.g., the fully convolutional neural network (FCNN) (Szegedy et al., 2016) and U-Net (Ronneberger et al., 2015)] have demonstrated encouraging results in the segmentation of cardiac substructures in recent years. It has also been found that deep learning can be directly applied to scar segmentation as a fully automated end-to-end solution for the input LGE CMR images. With co-registration of different modalities and deep learning based transfer learning, the combination of LGE CMR with other CMR imaging modalities [e.g., balanced steady-state free precession (bSSFP)] may further improve the efficacy and efficiency of the segmentation.
The use of Gadolinium-based contrast agents (GBCA) has led to concerns over patient safety, particularly for patients with renal impairment (Ledneva et al., 2009). With deep learning based methods, cardiac scarring regions can now be localized and quantified in non-Gadolinium enhanced CMR images without GBCA injections.
As the pre-2013 and pre-2016 cardiac scarring segmentation methods have been carefully benchmarked and summarized by Karim et al. (2013, 2016), this paper instead focuses on surveying post-2016 methodologies for fibrosis and scar delineation and for the segmentation of LA and LV anatomy from LGE CMR images. This study also discusses the potential use of modalities other than LGE CMR in locating and quantifying scars.

Search Criteria
To identify related contributions, search engines like Scopus and Google Scholar were queried for papers published on or after 01 Jan 2016 containing ("atrial" OR "ventricular") and ("cardiac") and ("segmentation") with or without ("scar") in their titles or abstracts. Papers that do not primarily focus on the segmentation of cardiac scar or scar-related cardiac anatomy were excluded. Each paper was reviewed and agreed upon by at least two of us (Y.W., Z.T., B.L.) before inclusion. We found 4,384 papers from the search engines and shortlisted 110 of them following the criteria above (Figure 1). After full-text screening for relevance to the topic, we eventually included 47 of them in this study. The last update to the included papers was on 13 May 2021.

LGE CMR
Fibrosis found in the LA is a sign of atrial structural remodeling and can be considered a major risk factor in the progression of atrial fibrillation (AF) (Allessie, 2002; Boldt, 2004); the identification of scarring and fibrosis regions in the LA is therefore crucial for diagnosis, prognosis and treatment planning. Native pre-ablation fibrosis can be a sign of AF recurrence (Oakes et al., 2009), and post-ablation detection of ablation-induced scars can facilitate the identification of gaps in the ablation lines, which are the main reason for ablation failure (Peters et al., 2007; Badger et al., 2010). In contrast to the traditional electro-anatomical mapping (EAM) system, which is an invasive technique that localizes atrial scar and fibrosis with suboptimal accuracy (Zhong et al., 2007; Schmidt et al., 2009), LGE CMR enables atrial scarring and fibrosis regions to be localized and quantified non-invasively and without ionizing radiation. LGE CMR exploits the slow washout kinetics of Gadolinium in these regions to highlight them (Peters et al., 2007; McGann et al., 2008; Oakes et al., 2009; Akkaya et al., 2013; Bisbal et al., 2014).
In addition to the atrium, LGE CMR has also been considered as a gold-standard modality for the assessment and quantification of the scarring regions in the left ventricle (Simonetti et al., 2001;Wu et al., 2001;Wagner et al., 2003a;Hendel et al., 2006), where fibrotic and scarring regions found can be considered as a sign of earlier or current episodes of the MI (Choi et al., 2001;Krittayaphong et al., 2008;Wu et al., 2008;Larose et al., 2010). In addition to MI, with growing prognostic evidence, LGE has been successful in the identification of scarring regions in cardiomyopathy, inflammatory and infiltrative conditions (Wagner et al., 2003b;Maceira et al., 2005;Smedema et al., 2005;Flett et al., 2009).
However, the LGE CMR modality often suffers from poor image quality, which may be due to residual respiratory motion, variability in the heart rate and gadolinium washout during the currently long acquisition time (Yang et al., 2017). In particular, the spatial resolution of the left atrium in LGE CMR images is limited (Figure 2). The variable morphology of the LA and pulmonary veins (PV) also imposes an additional challenge on LGE CMR segmentation. To improve the visualization of these scar regions, maximum intensity projection (MIP) has been used successfully to enhance intensities on post-ablation LA LGE CMR (Knowles et al., 2010). Moreover, some irrelevant cardiac substructures may be highlighted in LGE CMR images in addition to the scarring and fibrosis regions. These may be due to, for example, the navigator beam artifact, which is often seen near the right PV, Gadolinium uptake by the aortic wall and valves, and confounding enhancement in the spine, esophagus, etc. (Karim et al., 2013; Yang et al., 2017). As a result, the delineation of LA and LV scar or fibrosis regions can be poor, and segmentations of these structures and regions can even contain a significant number of false positives. In addition, although LGE CMR has been successful as the gold standard reference technique for AF and MI, including LGE in MRI significantly extends the scanning time. There have also been growing concerns regarding the safety of the Gadolinium based contrast agent used, particularly for patients with renal impairment (Ledneva et al., 2009).
Figure 1 (search criteria): published on or after 01 Jan 2016; contain ("atrial" OR "ventricular") AND ("cardiac") AND ("segmentation"); contain ("scar") OR ("fibrosis") OR ("infarction") OR ("necrosis"); contain ("LGE") OR ("Gadolinium").

LGE CMR With Other Modalities
In addition to LGE MRI, which highlights the scarring regions, segmentation of the anatomy and scarring regions can also utilize other modalities (Figure 3) to further improve the accuracy when they are co-registered with LGE CMR (Zhuang, 2019).
A few challenges have benchmarked a range of algorithms for the cross-modality fusion based segmentation of anatomy, scar and oedema:
(1) The MS-CMR challenge (MS-CMR Challenge, 2019; Pop et al., 2020) presented a range of algorithms that take multiple modalities as input to further improve the segmentation accuracy of the LV myocardium, LV blood cavity and RV.
(2) The MyoPS challenge (MyoPS Challenge, 2020; Zhuang and Li, 2020) presented algorithms to delineate the LV myocardium with scarring and oedema.
Other modalities and sequences include:
(1) Magnetic resonance angiography (MRA): images the LA and PV with high contrast, and has been demonstrated by Tao et al. (2016) to improve the error distance in segmenting LA anatomy to within 1.5 mm. However, MRA is usually ungated and usually acquired in an inspiratory breath-hold, so anatomy delineated from MRA is significantly distorted relative to LGE CMR.
(2) Balanced steady-state free precession (bSSFP): provides a clear boundary between the myocardium and blood cavity under movement and is usually respiratory and cardiac gated. It can offer cine CMR with a uniform texture.
(3) T2: high intensities in T2 indicate myocardial oedema with high specificity and sensitivity (Gannon et al., 2019), so T2 could be helpful in segmenting myocardial oedema and scar simultaneously if combined with LGE CMR and bSSFP. Identification of oedema on CMR can help clinicians to differentiate between acute and remote myocardial infarction (Friedrich, 2017). The presence of oedema in patients without extensive irreversible injury (e.g., scar) can serve as a marker for clinicians to predict the recovery of LV systolic function (Vermes et al., 2014).

CONVENTIONAL METHODS
Conventionally, a two-stage approach is adopted in the identification and evaluation of fibrotic and scarring tissue: (1) segmentation of the relevant anatomical structure (LA and PV in the case of LA fibrosis/scar segmentation and LV in the case of LV infarction segmentation), followed by (2) segmentation of the fibrotic and scarring regions. This two-stage approach is particularly beneficial for the LA and PV, as they are highly variable in morphology and relatively small in size. The following subsections elaborate on recent developments in the methodologies for each stage.

Segmentation of Anatomical Structures
The delineation of anatomical structures, e.g., the LA and LV wall, from their surroundings can be difficult in LGE CMR images. In LGE, scarring tissues are significantly enhanced while the signals from healthy tissues are attenuated (Keegan et al., 2015), making the segmentation of LA, PV and LV anatomical structures very challenging.
3.1.1 Why Is Accurate Segmentation of Anatomical Structure Necessary Before Scar Segmentation?
Accurate segmentation of the anatomy (LA or LV wall) is essential as it gives an accurate initialization for the scar segmentation. Therefore, traditionally, the segmentation of these structures was done manually. In the cDEMRIS challenge at ISBI 2012 (Karim et al., 2013), algorithms with manually initialized LA segmentation showed significantly better performance than the others. Together with Rajchl et al. (2015), this demonstrated the need for an accurate anatomy segmentation ahead of the scar segmentation. Moccia et al. (2018) also demonstrated that accurate manual segmentation of the LV wall could improve the deep learning based segmentation of the LV infarction.
Frontiers in Physiology | www.frontiersin.org

Conventional Methods in Segmenting Anatomical Structures
In the early 21st century, radiologists looked back and forth between LGE CMR and cine CMR to delineate the myocardium. To mimic this, methods from the first decade and the early second decade of this century used both LGE and cine modalities, for example through non-rigid registration, to achieve high accuracy in segmenting the myocardium on LGE CMR (Dikici et al., 2004; Ciofolo et al., 2008; Wei et al., 2011, 2013). However, the result may then suffer from registration misalignment between the LGE and cine modalities, and the model may be computationally demanding. As such, from 2014 onward, methods appeared that are less computationally demanding and use the LGE modality only (Albà et al., 2014; Kurzendorfer et al., 2017a,b,c).
Conventional methods in medical image segmentation usually have limited efficacy. Representative methods are summarized in Table 2, which mainly include the following methodologies.
For the LA, in particular, the methods involving pre-defined shape priors (Zhu et al., 2013; Veni et al., 2017) often suffer from a relatively poor error distance, which exceeds the 1-2 mm required (Xiong et al., 2021) in the clinical setting considering the thin LA wall. However, one of them reported a relatively high Dice score (79%) (Zhu et al., 2013).

Segmentation of Scarring Regions
Upon successful segmentation of the anatomy, the scarring regions can be identified by a range of approaches. These approaches can be mainly divided into the following categories: threshold based methods, classification methods, or the combination of both.

Fixed Threshold Based Methods (n-SD and FWHM)
Traditionally, the scarring regions can be detected as they are accentuated in LGE CMR. Among a range of conventional techniques, 2-SD has been advocated by official guidelines (Kramer et al., 2013), while the full width at half maximum (FWHM) technique has been advocated as the most reproducible method to segment ventricular scars (Flett et al., 2011).
2-SD and FWHM are both fixed threshold methods for segmenting the scarring region, where pixels with intensities above a fixed threshold are labeled as scar. 2-SD, or more generally n-SD, methods define this threshold as the mean plus two (or n) standard deviations of the signal intensities in a remote reference region, whereas FWHM defines it as half of the maximum signal intensity within the scar. Karim et al. (2016) evaluated the 2, 3, 4, 5, 6-SD and FWHM methods on a public human LV infarct dataset and showed that FWHM outperformed all n-SD methods tested in terms of Dice score and that the Dice scores rose slightly as the threshold increased from 2-SD to 6-SD. However, this was not the case when Karim et al. (2013) evaluated 2, 3, 4-SD and FWHM on a public human LA fibrosis/scar dataset. For pre-ablation LA fibrosis, FWHM performed much worse than all n-SD methods tested. For post-ablation LA scar, FWHM gave Dice scores similar to those of 2-SD, while the Dice scores of the 3, 4, 6-SD methods were much lower than these two.
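As a minimal sketch of the two fixed thresholds described above, the following Python snippet computes an n-SD threshold from a remote reference region and an FWHM threshold from a scar region, then applies either threshold within a pre-segmented wall mask. The function names, the toy image and the masks are illustrative assumptions rather than part of any benchmarked implementation, and the population standard deviation (NumPy's default) is assumed.

```python
import numpy as np


def nsd_threshold(image, remote_mask, n=2.0):
    """Mean plus n standard deviations of a remote (healthy) reference region.

    Assumption: population standard deviation (ddof=0), NumPy's default.
    """
    remote = image[remote_mask]
    return float(remote.mean() + n * remote.std())


def fwhm_threshold(image, scar_mask):
    """Half of the maximum signal intensity within an indicated scar region."""
    return 0.5 * float(image[scar_mask].max())


def segment_scar(image, wall_mask, threshold):
    """Label pixels inside the wall mask whose intensity exceeds the threshold."""
    return wall_mask & (image > threshold)


# Hypothetical 4 x 4 toy "LGE" image; values are arbitrary intensities.
image = np.array([
    [10.0, 12.0, 10.0, 12.0],   # remote reference region (top row)
    [11.0, 14.0, 18.0, 11.0],
    [11.0, 30.0, 40.0, 11.0],   # bright scar core
    [11.0, 11.0, 11.0, 11.0],
])
wall_mask = np.ones_like(image, dtype=bool)    # assume the whole patch is wall
remote_mask = np.zeros_like(wall_mask)
remote_mask[0, :] = True
scar_seed_mask = np.zeros_like(wall_mask)
scar_seed_mask[2, 1:3] = True

t_2sd = nsd_threshold(image, remote_mask, n=2.0)
t_fwhm = fwhm_threshold(image, scar_seed_mask)
scar_2sd = segment_scar(image, wall_mask, t_2sd)
scar_fwhm = segment_scar(image, wall_mask, t_fwhm)
print(t_2sd, int(scar_2sd.sum()))
print(t_fwhm, int(scar_fwhm.sum()))
```

On this toy patch the 2-SD threshold is lower than the FWHM threshold, so 2-SD labels a larger region; real images would of course behave differently depending on noise and the chosen reference regions.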
However, these fixed-threshold techniques, including n-SD and FWHM, are unlikely to handle variations well (Oakes et al., 2009). The variations can come from two sources: the scar itself and external circumstances. Scars are highly variable in their morphology and their brightness distribution on LGE CMR. Varied external factors including resolution, contrast, signal-to-noise ratio (SNR), inversion time and surface coil intensity variation can also adversely impact the accuracy of the segmentation. This is particularly the case for the pulmonary veins, which are highly variable in morphology.

Conventional Adaptive Methods
An LV scar segmentation challenge (Karim et al., 2016) organized at MICCAI 2012 and an LA scar segmentation challenge (Karim et al., 2013) organized at ISBI 2012 carefully benchmarked and summarized the majority of the pre-2013 conventional methods. In the 2012 LV segmentation challenge, none of the benchmarked algorithms outperformed FWHM, although they did perform better than the n-SD methods.

Adaptive thresholding based methods
Conventional threshold based approaches are summarized in Table 3A, which mainly include the following methodologies.

Classification based methods
In addition, conventional classification approaches are summarized in Table 3B, which mainly include the following methodologies.
(4) Simple linear iterative clustering (SLIC) and support vector machine (Yang et al., 2018b).
(5) Random forest classification (Kurzendorfer et al., 2018).

Table fragment (LGE MRI).
Method: (1) LV localization: image registration; (2) short axis estimation: principal component analysis (PCA); (3) endocardial refinement: a minimal cost path search (MCP) in polar space using the edge and scar information; (4) epicardial refinement: by shape and inter-slice smoothness constraints; (5) surface extraction: 3D mesh generation by the marching cubes algorithm (Lorensen and Cline, 1987).
Advantages: Fast speed and low computational workload by using simple texture features; considers image data along the longitudinal axis in addition to the short axis, improving inter-slice smoothness and avoiding inter-slice shift; no need to register with another modality (e.g., bSSFP cine MRI).
Limitations: Poor performance in the apex and LV outflow tract, poor accuracy in basal regions; since this method is texture based, the distribution of scar and the small size of the atrium adversely affect its performance.

Table fragment. [...] Karim et al., 2013): Hysteresis thresholding (Canny, 1986). Advantage: coherent segmentation (adjacent faint scar sections can still be segmented).

DEEP LEARNING BASED METHODS
Deep learning based methods are constructed from deep artificial neural networks. In this section, we briefly introduce the common types of artificial neural networks (ANNs) and then focus on their variants targeting cardiac anatomy and scar segmentation. Interested readers are referred to Goodfellow et al. (2016) for more detailed explanations and mathematical illustrations of these networks and to Chen C. et al. (2020) for a more thorough demonstration of these networks in general cardiac image analysis.

Neural Networks of Deep Learning in Image Analysis
Convolutional neural networks (CNNs), particularly fully convolutional neural networks (FCNNs), have demonstrated success in delineating anatomical structures in medical images (Shelhamer et al., 2017), especially in cardiac MR. Successful examples include ResNet (Szegedy et al., 2016) and U-Net (Ronneberger et al., 2015), among others. U-Net, in particular, is known for its ability to gather latent information in medical image analysis and thus to achieve better segmentation performance; it has become the most popular CNN backbone architecture, especially after demonstrating success in the ISBI cell tracking challenge in 2015.
The recurrent neural network (RNN) is another type of ANN. The RNN is more useful for processing sequential data, as it can 'memorize' past data and use this 'memory' to assist its current prediction. Widely used RNN structures include the long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) and the gated recurrent unit (GRU) (Cho et al., 2014).
Autoencoders (AEs) are also a type of ANN and are able to learn latent features of imaging data. Unlike CNNs and RNNs, AEs learn these features without supervision. The latent features gathered by AEs can be used to guide the segmentation of medical images (Oktay et al., 2016; Yue et al., 2019).
Generative adversarial networks (GANs) were initially proposed for image synthesis (Goodfellow et al., 2014). With its two-player structure (a generator network that produces a synthesized image and a discriminator network that tries to differentiate that synthesized image from a true image), the model can enhance the resolution of the synthesized image through adversarial training. A GAN can also be used for segmentation, where its discriminator network instead attempts to judge whether the output label has an anatomically plausible shape (Luc et al., 2016).
[...] Pop et al., 2020) in MICCAI 2019, and MyoPS 2020 (MyoPS Challenge, 2020; Zhuang and Li, 2020) in MICCAI 2020 for LV. With the recent development in deep learning, we can observe a range of methodologies developed for LA and LV segmentation in LGE CMR (Jamart et al., 2020).

Segmentation of Anatomical Structures
In particular, in LASC'18, all deep learning methods had their mean surface distance in LA wall segmentation below 1.7 mm, with the minimum mean value of 0.748 mm. This demonstrated the efficacy of the deep learning based methods by the surface distance, which is required to be less than 1-2 mm under the clinical setting (Xiong et al., 2021).
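To illustrate how a surface distance of this kind might be computed, the sketch below measures the Hausdorff distance and a symmetric mean surface distance between two 2D binary masks. It is a simplified illustration under stated assumptions: boundaries are taken as foreground pixels with at least one 4-connected background neighbour, distances are brute-force Euclidean, both masks are non-empty, and the exact definitions used in LASC'18 may differ. All function names are hypothetical.

```python
import numpy as np


def boundary_points(mask):
    """Coordinates of foreground pixels that touch background (4-connectivity)."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, constant_values=False)
    # A pixel is interior when its four direct neighbours are all foreground.
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return np.argwhere(mask & ~interior)


def surface_distances(mask_a, mask_b, spacing=(1.0, 1.0)):
    """Nearest-boundary distances from A to B and from B to A (in mm)."""
    a = boundary_points(mask_a) * np.asarray(spacing, dtype=float)
    b = boundary_points(mask_b) * np.asarray(spacing, dtype=float)
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return pairwise.min(axis=1), pairwise.min(axis=0)


def hausdorff_distance(mask_a, mask_b, spacing=(1.0, 1.0)):
    """Largest nearest-boundary distance in either direction."""
    a_to_b, b_to_a = surface_distances(mask_a, mask_b, spacing)
    return float(max(a_to_b.max(), b_to_a.max()))


def mean_surface_distance(mask_a, mask_b, spacing=(1.0, 1.0)):
    """Average of the two directed mean nearest-boundary distances."""
    a_to_b, b_to_a = surface_distances(mask_a, mask_b, spacing)
    return float((a_to_b.mean() + b_to_a.mean()) / 2.0)


# Hypothetical example: a 4 x 4 square and the same square shifted by one pixel.
truth = np.zeros((10, 10), dtype=bool)
truth[2:6, 2:6] = True
pred = np.zeros((10, 10), dtype=bool)
pred[2:6, 3:7] = True
print(hausdorff_distance(truth, pred), mean_surface_distance(truth, pred))
```

The brute-force pairwise distance matrix keeps the sketch dependency-free but scales quadratically with the number of boundary pixels; a distance-transform based implementation would be the more usual choice for full 3D volumes.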

Deep Learning Methodologies in the Anatomical Structure Segmentation
Networks that have demonstrated success in delineating anatomical structures include VGG-net (Simonyan and Zisserman, 2014), U-Net (Zabihollahy et al., 2019b), and V-Net (Milletari et al., 2016). To further exploit information along the z-axis, LSTM and its variants (Yang et al., 2018a; Zhang et al., 2020) and dilated residual learning blocks (Yang et al., 2018a) can be introduced into the widely used U-Net.
On top of the U-Net, Xiong et al. (2019) proposed a dual-path U-Net variant, which achieved the best Dice score (0.942), followed by VGG-net (0.864), in their benchmarking of a range of popular CNNs, including the original U-Net, and one non-deep-learning based method (Zhu et al., 2013) for LA segmentation. Multi-view learning, which incorporates the axial, sagittal and coronal views together, gave superior performance compared with models based on one view only (Xiao et al., 2020).
On the contrary, further research showed that structural variations in U-Net are unlikely to cause a significant improvement in its performance in LA segmentation from LGE CMR, and that deep supervision and attention blocks are unlikely to further improve LA segmentation performance either (Borra et al., 2020b).
In addition to these supervised learning based methods, Chen J. et al. (2019) proposed a feature-matching based semi-supervised learning technique to further improve the segmentation efficacy.
All the methods discussed above are summarized in Table 4.

Segmentation of Scarring Regions
We can observe a range of deep learning based methods in segmenting scars (Table 5).

LA Scar Segmentation Models
For LA (Table 5A), Yang et al. (2017) proposed a deep learning based method using Stacked Sparse Auto-Encoders to delineate the LA fibrosis region, which is based on accurate anatomical structure delineation. Li et al. (2020) proposed a graph-cuts framework based on multi-scale CNN to further incorporate local and global texture information of the images.

LV Scar Segmentation Models
For LV (Table 5B), E-Net (Moccia et al., 2018) and FCNN (Moccia et al., 2019) were demonstrated to achieve high accuracy when provided with manually segmented LV walls. A multi-view U-Net has also been developed to segment the scar in a cascaded way (Zabihollahy et al., 2020).

Development of End-To-End Scar Segmentation Models Instead of Staged Segmentation Networks
With more recent developments in deep learning, models can extract further latent information from LGE CMR images and segment the scar directly, without requiring an accurate prior segmentation of the relevant cardiac anatomical structures (e.g., LA wall), while maintaining accuracy. There has also been a range of methods (Table 6) that can complete the segmentation of both the anatomy of the cardiac chambers and the scar simultaneously (referred to as "two tasks" below). This is particularly the case for the LV, where there is much less variability in anatomical shape.

LA End-To-End Scar Segmentation Models
For LA (Table 6A), due to the thin LA wall, it is particularly difficult to achieve an end-to-end segmentation of scar directly from LGE CMR. A multi-view two task (MVTT) deep learning based method with a dilated attention network was proposed to complete the two tasks simultaneously (Chen et al., 2018; Yang et al., 2020). This study also benchmarked a range of popular deep learning networks such as U-Net and V-Net on each of the two tasks. It compared the performance of its network with conventional methods such as 2-SD and k-means to demonstrate the superiority of its network in completing both tasks accurately on both pre-ablation and post-ablation datasets. This study also suggested that the 2-SD, k-means and fuzzy c-means methods clearly over-estimated the enhanced LA scar region. Later, with a joint GAN discriminator, Chen et al. were able to further improve the segmentation accuracy by dealing with the significantly unbalanced two LA targets (LA wall and scar) (Table 7). In their method, cascaded learning, a widely applied technique for learning labels with unbalanced classes in natural image segmentation (Dai et al., 2016; Murthy et al., 2016; Li et al., 2017; Lin et al., 2017; Ouyang et al., 2017; Cai and Vasconcelos, 2018), demonstrated superiority in learning.

LV End-To-End Scar Segmentation Models
As the LV has less variable morphology and a greater size, more methods have successfully demonstrated their efficacy and efficiency in LV scar segmentation (Table 6B). E-Net (Moccia et al., 2018) and FCNN (Moccia et al., 2019) were the first few [...] (Zabihollahy et al., 2019a).
*As [...] and (Chen et al., 2018) reported very similar methodologies, we reported [...] only in this table.
In addition, Fahmy et al. (2020) showed, via a multicenter multivendor study, that a 3D CNN based deep learning approach could be applied to LV scar segmentation in patients with hypertrophic cardiomyopathy (HCM).
Inspired by the two-stage approach, a multi-view cascaded U-Net aiming for even higher segmentation efficacy was developed to perform the two tasks sequentially while considering the sagittal, axial and coronal views (Moccia et al., 2019).

Segment LGE CMR Jointly With Other Modalities
As explained in Section 3.1.2, clinicians traditionally check both bSSFP cine and LGE MRI modalities to ensure accurate segmentation of the myocardium and then the scar. To mimic this, many methods use both the bSSFP cine and LGE modalities to delineate anatomical structures and scar. For the LA, MRA is also known to give a clear boundary at the PV, which helps with LA wall segmentation, and many methods take MRA as an extra modality to enhance their segmentation accuracy. However, many studies chose bSSFP over MRA, as bSSFP can be acquired in the same phase as LGE CMR by cardiac gating. Although MRA provides better resolution, it is not cardiac gated, so its co-registration with LGE CMR can be difficult and error-prone, causing misalignments in the registered images. Additionally, as explained in Section 2.2, integration with other modalities (e.g., T2) may enable more findings from CMR (e.g., oedema) in addition to scars.
A few challenges have benchmarked a range of algorithms for the cross-modality fusion based segmentation of anatomy, scar and oedema. The MS-CMR challenge (MS-CMR Challenge, 2019; Pop et al., 2020) presented a range of algorithms that take multiple modalities as input to further improve the segmentation accuracy of the LV myocardium, LV blood cavity and RV. The MyoPS challenge (MyoPS Challenge, 2020; Zhuang and Li, 2020) presented algorithms to delineate the LV myocardium with scarring and oedema.
Common methods to segment anatomy and scar from multiple modalities include:
(1) Cross-modality style and feature propagation (typically from bSSFP to LGE-MRI) [e.g., multi-atlas label fusion (MAS)].
(3) A two-stage approach that first co-registers the anatomical segmentation from one modality to another (typically from bSSFP segmentation to LGE-MRI) and then segments scars based on the co-registered anatomy segmentation (Leong et al., 2019).
However, respiratory and/or cardiac motion complications between acquisitions of different modalities can still cause errors in registration and possible misalignments.

SCAR SEGMENTATION WITH NON-CONTRAST-AGENT (NON-CA) ENHANCED IMAGING MODALITY ONLY
Although LGE CMR has been very successful as the gold standard reference technique for AF and MI, including LGE in an MRI scan significantly extends the scanning time. Moreover, there have been growing concerns regarding the safety of the Gadolinium based contrast agent used, particularly for patients with renal impairment (Ledneva et al., 2009). There has been rising attention to methods that segment scars on non-CA modalities, without injecting contrast agents into patients. Non-CA modality based cardiac scar segmentation methods have been widely demonstrated for LV scar delineation but have not been realized for LA scar delineation. Dastidar et al. (2019) and Liu et al. (2018) demonstrated the potential of pre-contrast scar segmentation by comparing inter-modality manual observations of myocardial infarction regions on LGE CMR and on native T1 mapping without Gadolinium contrast agents.

Relaxation Time Based Scar Segmentation in T2
T1 and T2 are modalities that are not enhanced by any contrast material; relaxation times in MI are longer than in healthy myocardium and can be used as a reference for reproducible segmentation of the MI region (Abdel-Aty et al., 2004; Kali et al., 2014; Smulders et al., 2015). However, the relaxation time is field strength specific (Raman et al., 2013; Haaf et al., 2017) and requires the acquisition of images in additional breath holds, which significantly extends the CMR acquisition time.

MRI Feature Tracking
Magnetic resonance imaging feature tracking is also an approach to differentiate MI induced cardiac wall abnormalities from normal myocardium, which can be acquired as part of a standard CMR scanning examination (Muser et al., 2017;Ogawa et al., 2017). However, this technique can only detect and locate the position of MI without quantifying it.

Scar Segmentation in CINE MRI
To further improve scar segmentation on non-contrast enhanced CMR, SVM based texture analysis, trained on co-registered LGE and cine MRI modalities, can discriminate between nonviable, viable and remote segments using pre-contrast cine MRI only (Larroza et al., 2018). Non-contrast enhanced CMR scar segmentation has also been demonstrated via neighborhood approximation forests (Bleton et al., 2016) and Simple Linear Iterative Clustering (SLIC) (Achanta et al., 2012) based supervoxels (Popescu et al., 2017).

Deep Learning Based Scar Segmentation in CINE MRI
With the development of deep learning, a method based on a combination of long short-term memory (LSTM), recurrent neural network (RNN) and fully convolutional neural network (FCNN) (Xu et al., 2017) and a generative adversarial network (GAN) based method (Xu et al., 2018) have demonstrated accuracy in detecting, locating and quantifying LV scarring regions from non-contrast-enhanced CMR images. Zhang et al. proposed a deep learning based framework to greatly improve the efficacy of LV scar segmentation on cine MRI, with stages consisting of (1) ROI localization, (2) RNN based motion pattern extraction, and (3) pixel classification by FCNN, and assessed their network extensively under a clinical setting. On top of the deep learning based workflow, Xu et al. (2020) proposed a progressive sequential causal GAN to simultaneously synthesize LGE-equivalent images and perform multi-class tissue segmentation (including LV blood cavity, LV myocardium and scar region) from cine CMR images. Its stages consist of (1) a priori coarse tissue mask generation GAN, (2) a conditional LGE-equivalent image synthesis GAN, and (3) a fine segmentation GAN. It segments more than just the LV scar (LV blood pool, myocardium and scar regions) and further improves temporal-spatial learning through a two-stream structure that includes a spatial perceptual pathway, a temporal perceptual pathway, and a multi-attention weighing unit. A detailed summary and results of a private benchmarking of all these algorithms can be found in Table 8.

EVALUATION METRICS
A range of evaluation metrics can be employed for assessing the results of the segmentation of the anatomy. These include Dice score, sensitivity, specificity, Hausdorff distance (HD) and surface-to-surface distance (STSD).
(1) Dice Score
The Dice score coefficient, DICE, is one of the most widely used metrics for evaluating segmentation accuracy. It is particularly sensitive to differences between the ground truth label and the predicted label.
Given a 3D prediction label tensor, A, and a 3D ground truth label tensor, B, the Dice score can be defined as:
$$\mathrm{DICE}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$

(2) Sensitivity
The sensitivity score, also known as the True Positive Rate, can be adapted to reflect the success of the algorithm in segmenting the foreground (cardiac anatomy) as:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
where TP stands for true positive and FN stands for false negative.

(3) Specificity
The specificity score, also known as the True Negative Rate, reflects the success of the algorithm in segmenting the background as:
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
where TN stands for true negative and FP stands for false positive.

(4) Hausdorff Distance
The Hausdorff distance, HD, is an important parameter for evaluating geometrical characteristics. It measures the maximum local distance between the surfaces of the predicted LA volume label tensor, A, and the ground truth label tensor, B, given by:
$$\mathrm{HD}(A, B) = \max\left\{ \max_{a \in A} \min_{b \in B} \lVert a - b \rVert,\; \max_{b \in B} \min_{a \in A} \lVert a - b \rVert \right\}$$
where a and b are pixel locations within A and B, respectively.
In practice, using the HD directly is generally not recommended, because it is highly sensitive to outliers, and noise and outliers are quite common in medical image segmentation (Gerig et al., 2001;Zhang and Lu, 2004). However, Huttenlocher et al. (1993) proposed handling outliers by defining the HD as the q-th quantile of the distances instead of the maximum, which excludes the outliers.
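The overlap-based metrics (1) to (3) above can be illustrated with a minimal sketch. It assumes the masks have been flattened into equal-length lists of 0/1 values; the function names and the toy masks are hypothetical, and real pipelines would typically operate on array types instead of plain lists.

```python
def confusion_counts(pred, truth):
    """Return (TP, FP, FN, TN) for two flattened binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = len(pred) - tp - fp - fn
    return tp, fp, fn, tn


def dice_score(pred, truth):
    tp, fp, fn, _ = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn  # equals |A| + |B|
    # Assumption: two empty masks are treated as perfect agreement.
    return 1.0 if denom == 0 else 2 * tp / denom


def sensitivity(pred, truth):
    tp, _, fn, _ = confusion_counts(pred, truth)
    return tp / (tp + fn) if (tp + fn) else float("nan")


def specificity(pred, truth):
    _, fp, _, tn = confusion_counts(pred, truth)
    return tn / (tn + fp) if (tn + fp) else float("nan")


# Toy flattened masks: 1 = foreground voxel, 0 = background voxel.
pred = [1, 1, 1, 0, 0, 0, 0, 0]
truth = [1, 1, 0, 1, 0, 0, 0, 0]
print(dice_score(pred, truth), sensitivity(pred, truth), specificity(pred, truth))
```

In this toy case there are two true positives, one false positive, one false negative and four true negatives, so the Dice score and sensitivity are both 2/3 while the specificity is 4/5.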

(5) Surface-to-Surface Distance
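Surface-to-surface distance, STSD, measures the average distance error between the surfaces of the predicted LA volume, A, and the ground truth, B:
$$\mathrm{STSD}(A, B) = \frac{1}{n_A + n_B} \left( \sum_{p \in A} d(p, B) + \sum_{p' \in B} d(p', A) \right)$$
where $n_A$ and $n_B$ are the numbers of pixels in A and B, respectively, $p$ and $p'$ denote the points of A and B, and $d(p, B)$ is the distance from $p$ to its nearest point in B.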
The maximum error distance acceptable in the LA wall segmentation should be 1-2 mm under the clinical setting considering the thin LA wall (Xiong et al., 2021).
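The two surface-distance metrics, including the quantile variant of the HD, can be sketched as follows. This is a brute-force illustration under stated assumptions: surfaces are given as lists of point coordinates already expressed in mm, the symmetric form of the HD is used, and the quantile is taken by rank. The function names and toy points are hypothetical.

```python
import math


def nearest_distances(src, dst):
    """For every point in src, the Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def hausdorff(a, b, q=1.0):
    """Symmetric Hausdorff distance between point sets a and b.

    q = 1.0 gives the classical maximum; q < 1.0 uses the q-th quantile
    (by rank) of the nearest-point distances, which discards outliers.
    """
    def directed(src, dst):
        d = sorted(nearest_distances(src, dst))
        idx = max(0, math.ceil(q * len(d)) - 1)
        return d[idx]

    return max(directed(a, b), directed(b, a))


def stsd(a, b):
    """Average nearest-point distance taken over both surfaces."""
    total = sum(nearest_distances(a, b)) + sum(nearest_distances(b, a))
    return total / (len(a) + len(b))


# Toy 2D "surfaces": b runs parallel to a at 1 mm, plus one outlier point.
a = [(0, 0), (1, 0), (2, 0), (3, 0)]
b = [(0, 1), (1, 1), (2, 1), (3, 1), (10, 1)]
print(hausdorff(a, b), hausdorff(a, b, q=0.75), stsd(a, b))
```

The single outlier at (10, 1) drives the classical HD to about 7.07 mm, whereas the 0.75-quantile HD stays at 1 mm. The STSD is raised by the outlier but far less dramatically, to roughly 1.67 mm.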

(6) Error of the Anterior-Posterior Diameter of the Anatomical Structure
The anterior-posterior diameters of LA and LV are widely used as an essential clinical measure in clinical diagnosis and treatments.
The diameters can be estimated by finding the maximum Euclidean distance along the anterior-posterior axis of each CMR scan (Xiong et al., 2021).

(7) Error of Volume of the Anatomical Structure
The anatomical volumes of the LA and LV are widely used as essential measures in clinical diagnosis and treatment.
The volume of the structure can be found as the sum of positively labeled voxels. Given the volume of the predicted anatomical structure, $V_A$, and the volume of the ground truth, $V_B$, the total volume error can be defined as:
$$\delta V = \lvert V_A - V_B \rvert$$

(8) Scar Volume Percentage
In addition to the metrics mentioned above, scar segmentation also employs a scar volume based metric, which is much more widely used because quantification of scar is important clinically. The volumetric percentage of the scarring region is calculated for both the predicted and the ground truth labels, and the two values are compared.
The scar percentage is defined as the percentage of the volume of the scarring region, $V_{scar}$, relative to the volume of the relevant anatomical wall, $V_{wall}$ (e.g., LA wall) (Tao et al., 2010):
$$\mathrm{Scar\,\%} = \frac{V_{scar}}{V_{wall}} \times 100\%$$
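The volume-based metrics (7) and (8) can be sketched in a few lines. The sketch assumes flattened binary masks, takes the volume error as the absolute difference between the two volumes, and counts only scar voxels that fall inside the wall mask; these choices, the voxel size and all names are illustrative assumptions.

```python
def volume_mm3(mask, voxel_volume_mm3):
    """Volume of a binary mask: number of positive voxels times voxel volume."""
    return sum(1 for v in mask if v) * voxel_volume_mm3


def volume_error_mm3(pred, truth, voxel_volume_mm3):
    """Absolute difference between predicted and ground truth volumes."""
    return abs(volume_mm3(pred, voxel_volume_mm3) - volume_mm3(truth, voxel_volume_mm3))


def scar_percentage(scar_mask, wall_mask):
    """Scar volume as a percentage of wall volume (voxel volume cancels out)."""
    wall = sum(1 for w in wall_mask if w)
    if wall == 0:
        raise ValueError("wall mask is empty")
    # Assumption: scar voxels outside the wall mask are ignored.
    scar = sum(1 for s, w in zip(scar_mask, wall_mask) if s and w)
    return 100.0 * scar / wall


# Toy flattened masks with a hypothetical voxel volume of 1.5 mm^3.
wall = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
scar_truth = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scar_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

print(volume_error_mm3(scar_pred, scar_truth, 1.5))
print(scar_percentage(scar_truth, wall), scar_percentage(scar_pred, wall))
```

Here the predicted scar contains one voxel outside the wall, so its raw volume matches the ground truth (zero volume error) even though its scar percentage within the wall is 25% against a ground truth of 37.5%. This is one reason the two metrics are reported side by side.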

Inter-Observer Variability in the Manual Annotation of Ground Truth Labels
For validation and benchmarking of different methods and training of deep learning based methods, accurate, consistent and reproducible acquisition of ground truth labels is essential. Validation by employing labels from a single clinician may not be ideal as these labels may exhibit bias and intra-observer variances when the same clinician is asked to repeat their labeling. Thus, it is recommended that we take observations from multiple clinicians and fuse them together.
However, significant inter-observer variances can be seen, particularly for LA anatomical segmentation in LGE-MRI, where the boundaries of the LA walls are very blurred. Kurzendorfer et al. (2017c) attempted to compensate for inter-observer variances by additional smoothing but achieved only a slight improvement in Dice score (+0.04).
It is recommended that the data source reports the inter- and intra-observer variances using evaluation metrics such as the Dice score coefficient. The currently widely used method of label fusion is to obtain a 70% consensus label among multiple annotations, which can give labels with low consistency. The level of each observer's expertise (novice, medical student, trainee, junior clinician or senior clinician) must also be clearly noted, particularly when multiple observers are involved. It may also be recommended that all observers be experienced senior clinicians, to maintain high accuracy and low variance in the manual annotation.
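A minimal sketch of consensus label fusion and of reporting inter-observer agreement is given here. It assumes that "70% consensus" means a voxel is kept when at least 70% of observers marked it, and it reports agreement as the mean pairwise Dice score; both are illustrative readings, and the function names and annotations are hypothetical.

```python
from itertools import combinations


def consensus_label(annotations, min_agree_percent=70):
    """Fuse binary annotations voxel-wise by percentage agreement."""
    n = len(annotations)
    # Integer comparison avoids floating-point issues at the threshold.
    return [1 if sum(votes) * 100 >= min_agree_percent * n else 0
            for votes in zip(*annotations)]


def dice(a, b):
    inter = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    return 1.0 if total == 0 else 2 * inter / total


def mean_pairwise_dice(annotations):
    """Average Dice score over every pair of observers."""
    pairs = list(combinations(annotations, 2))
    return sum(dice(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical observers annotating the same five voxels.
observers = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
]
print(consensus_label(observers))
print(round(mean_pairwise_dice(observers), 3))
```

With three observers, a 70% threshold keeps only voxels marked by all three, so the fused label is `[1, 1, 0, 0, 0]`. The mean pairwise Dice score of about 0.775 quantifies how much the observers disagree with each other.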

Dataset Sources
Many methods use single-vendor single-center datasets to validate their methods, which may not demonstrate the ability to generalize the accurate segmentation methodology to centers with CMR machines of different settings and compositions.
There have been some trials assessing the performance of models based on multi-vendor and multi-center data (Engblom et al., 2016;Fahmy et al., 2020). However, evaluation based on multi-vendor and multi-center data with a more significant patient population should be introduced for a more comprehensive unbiased validation, comparison of performances of different methods and assessment for their scalability.

Quantitative Result Reporting
We would like to alert readers that nearly all studies summarized in this review used their own distinct private datasets when reporting results. Biased conclusions may be drawn when these quantitative metrics are compared directly across studies. When comparing quantitative results across different studies, we ask all readers to refer to the dataset acquisition settings stated in the original articles, instead of looking only at the numbers reported. We also look forward to a public benchmarking of all these methods as a fairer review of their performance.
We would also ask readers to be cautious when directly comparing Dice scores reported for the segmentation of LV infarction with those reported for LA necrosis. As the LA is much smaller than the LV, an equivalent volume of discrepancy may cause a larger reduction in the Dice score for LA necrosis than for LV infarction. Additionally, LA necrosis tends to be more challenging to segment accurately than LV infarction, as explained above.
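The effect of structure size on the Dice score can be made concrete with a small worked example. The voxel counts are hypothetical and chosen only for illustration, and the sketch assumes pure under-segmentation (the prediction is the ground truth minus a fixed number of missed voxels).

```python
def dice_after_missing(truth_voxels, missed_voxels):
    """Dice score when the prediction misses `missed_voxels` of the ground truth."""
    overlap = truth_voxels - missed_voxels  # predicted voxels, all inside the truth
    return 2 * overlap / (truth_voxels + overlap)


# Same 200-voxel discrepancy on a small and on a large target (hypothetical sizes).
small_target = dice_after_missing(1_000, 200)
large_target = dice_after_missing(10_000, 200)
print(round(small_target, 3), round(large_target, 3))
```

The same 200 missed voxels lower the Dice score to about 0.889 for the small target but only to about 0.990 for the large one, which is why scores for differently sized targets are not directly comparable.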
In addition, the image quality, contrast, class imbalance and other properties of the image data can directly affect the results generated from it and thus the accuracy reported. In particular, we advocate that future studies report, along with the evaluation metrics, (1) the scar to blood pool contrast ratio (SC-BP) (Karim et al., 2014) to show the scar contrast and (2) the signal-to-noise ratio (SNR) to show the noise variation, so that readers can better understand the experimental settings before interpreting the metrics reported. These two additional metrics are essential, particularly for LA scar segmentation, where the segmentation is more difficult and where a higher SC-BP can give higher Dice scores (Karim et al., 2014).

Advantages: Computational Load and Explainability
As conventional methods place lower demands on computing hardware, they can be deployed for wider clinical use more easily. This is an advantage for the scalability and generalizability of the product, as a standard computer is enough for its deployment.
Conventional methods are also more explainable than deep learning. This explainability eases acceptance by clinicians, as the product may appear more trustworthy and more reliable.

Fixed threshold conventional methods
Fixed threshold methods may not fit some LGE CMR images, as they are unlikely to handle variations well (Oakes et al., 2009).
Scars are highly variable in their morphology and their brightness distribution on LGE CMR. Some severe LV scars may appear bright at the periphery and very dark at the center, as the center of the scar is so severely infarcted that very little perfusion carrying the Gadolinium based contrast agent (GBCA) arrives there. N-SD and FWHM, which require the pixel intensity to exceed a certain threshold for that pixel to be recognized as scar, may not label these dark centers as scar. Additionally, due to the partial volume effect, regions containing intermingled bundles of fibrotic tissue and viable myocytes will be darker than regions of complete necrosis. The low intensity of such fibrotic regions may fall below the fixed threshold, so that they are falsely recognized as healthy myocardium.
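The dark-center failure mode can be reproduced with a minimal sketch of the two thresholding rules. It assumes the commonly used forms: n-SD sets the threshold at the mean of a remote healthy myocardium region plus n standard deviations, and FWHM sets it at half of the maximum myocardial intensity. The intensity values, the choice n = 5 and all names are hypothetical.

```python
import statistics


def n_sd_threshold(remote_intensities, n=5):
    """Mean of remote (healthy) myocardium plus n sample standard deviations."""
    return statistics.mean(remote_intensities) + n * statistics.stdev(remote_intensities)


def fwhm_threshold(myocardium_intensities):
    """Half of the maximum signal intensity found in the myocardium."""
    return 0.5 * max(myocardium_intensities)


def label_scar(intensities, threshold):
    """A pixel is labeled as scar only if it is brighter than the threshold."""
    return [1 if v > threshold else 0 for v in intensities]


# Hypothetical intensity profile across a scar with a dark center (indices 3-4).
remote = [100, 105, 95, 100]
profile = [100, 400, 420, 110, 105, 410, 390, 98]

print(label_scar(profile, n_sd_threshold(remote)))
print(label_scar(profile, fwhm_threshold(profile)))
```

Both rules label the bright rim but miss the two dark-center pixels, which illustrates why a purely intensity based fixed threshold can underestimate severe scars.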
Varied external factors including resolution, contrast, signal-to-noise ratio (SNR), inversion time and surface coil intensity variation can also adversely affect the accuracy of the scar segmentation.
The LGE CMR modality often suffers from poor image quality, which may be due to residual respiratory motion, variability in the heart rate and gadolinium wash-out during the currently long acquisition time (Yang et al., 2017). Considering the thin transmural thickness of the atrial wall [mean = 2.2-2.5 mm (Sánchez-Quintana et al., 2005)] (Figure 2), the spatial resolution of LGE CMR images is relatively limited, particularly for the left atrium. The variable morphology of the pulmonary veins (PV) also imposes an additional challenge on LGE CMR segmentation. In addition, some substructures of no interest may be highlighted in LGE CMR images alongside the scarring and fibrosis regions. These may be due to the navigator beam artifact (which is often seen near the right PV), Gadolinium uptake by the aortic wall and valves, and confounding enhancement in the spine, esophagus, etc. (Karim et al., 2013;Yang et al., 2017).

Conventional adaptive methods
Although adaptive conventional methods may mitigate the adverse impacts of variable scar shapes and varied external factors, they can also be affected by sizes, variances and artifacts in the test image data, as they rely on learned prior information. Kurzendorfer et al. (2017c) showed that a particular scar distribution over the myocardium could adversely affect their method in segmenting endocardial contours. Such vulnerability may be more problematic for LA anatomical structures, as the PVs are morphologically highly variable and the LA walls are much smaller and thinner.

How Could We Make Deep Learning Perform Even Better?
For detailed designs of the deep learning networks, the LASC'18 challenge (Xiong et al., 2021) benchmarked a range of U-Net variants for LA wall segmentation from LGE CMR. This challenge, along with other literature on cardiac scar segmentation, demonstrated the following.
(1) Image Sources
(a) Higher image quality (in terms of signal-to-noise ratio) tended to give a higher Dice score, although the linear relationship was not statistically significant. (b) In addition, models using contrast normalization as a preprocessing technique performed significantly better than those without normalization.
(2) Model Backbone
(a) CNN based methods delivered better results than atlas based methods. (b) U-Net based methods outperformed other networks using VGGNet, ResNet, etc. (c) There was no statistical difference between the segmentation performances of models based on 2D CNNs and models based on 3D CNNs. However, further research showed that 3D CNNs greatly outperformed 2D CNNs with the same model architecture in terms of the Dice scores of their segmentation results (Borra et al., 2020a).
(3) Segment on ROI or the Whole Image?
(a) Centring the LA within the ROI used as input to the second sequential model made the model perform significantly better than models with non-centered ROIs. (b) Class imbalance induced by an excessively large or small ROI could adversely affect the segmentation results in terms of Dice scores. (c) Double sequential CNNs (Li et al., 2019;Xia et al., 2019;Yang et al., 2019;Xiong et al., 2021), in which one model first detects the region of interest (ROI) and a second model then performs regional segmentation within the detected ROI, achieved much better results than methods with a single CNN. (d) Double sequential 3D CNNs outperformed single 2D CNN and single 3D CNN models in terms of Dice scores, surface distance, LA diameter error and LA volume error.
(4) Model Architecture
(a) Models with residual connections performed significantly better than those without residual connections. (b) Models using dropout blocks did not perform significantly better than those without dropout. (c) Models trained with the Rectified Linear Unit (ReLU) did not perform significantly better than models trained with the Parametric Rectified Linear Unit (PReLU).
(5) Loss Functions
(a) Models trained with Dice loss performed significantly better than models trained with cross-entropy.
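The two loss functions compared in (5) can be sketched as follows. This is a minimal illustration assuming a binary task, flattened per-pixel foreground probabilities and a soft (differentiable) form of the Dice loss; the smoothing constants, names and toy values are assumptions. One commonly cited reason for the difference is class imbalance, which the toy example is built to show, although the benchmark itself does not state the cause.

```python
import math


def soft_dice_loss(probs, targets, eps=1e-6):
    """1 - soft Dice, computed on probabilities rather than hard labels."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean per-pixel binary cross-entropy."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


# 100 pixels, only 2 of them scar; the model predicts "background" everywhere.
targets = [1, 1] + [0] * 98
probs = [0.01] * 100

print(round(binary_cross_entropy(probs, targets), 3))
print(round(soft_dice_loss(probs, targets), 3))
```

For this prediction, which misses the scar entirely, the cross-entropy is only about 0.10 because the many correct background pixels dominate the average, whereas the Dice loss is about 0.99 and therefore penalizes the missed foreground much more strongly.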

Computational load
Although deep learning based methods generate much better results, they also raise the computational demand. For deep learning based methods, high-end graphics processing units (GPUs) become a necessity when deploying these models, whereas standard computers with CPUs only are sufficient for most conventional methods. Under a clinical setting, access to a GPU is not always possible, as it is not part of a standard clinical computing workstation. The requirement for a high-end computer with a GPU may therefore significantly limit the ability of these methods to scale. However, if a standard computer is used only for inference with a deep learning model, the runtime may be longer but still falls within the maximum time that clinicians can accept (usually a few minutes per slice for models that are not overly complex). These models can therefore be deployed and scaled only if they are delivered fully trained, as training on the clinician's side, where a GPU is unlikely to be available, is not usually possible. As inference time on CPUs may vary significantly across models, depending on their architectures and complexities, reporting the inference time per slice on a standard computer without a GPU should be mandatory, in addition to the inference time on a GPU.
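Reporting CPU inference time per slice can be as simple as the sketch below. The `dummy_segment` function is a hypothetical stand-in for a trained model's inference call, and the warm-up pass is an assumption made so that one-off initialization costs are not counted.

```python
import statistics
import time


def time_per_slice(infer, slices, warmup=1):
    """Return (mean, worst) wall-clock seconds per slice for an inference callable."""
    for s in slices[:warmup]:
        infer(s)  # warm-up runs are excluded from the timings
    timings = []
    for s in slices:
        start = time.perf_counter()
        infer(s)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), max(timings)


def dummy_segment(slice_pixels):
    """Placeholder for a real model: thresholds a flattened slice."""
    return [1 if v > 200 else 0 for v in slice_pixels]


# Ten hypothetical 64 x 64 slices, flattened.
slices = [[(i * j) % 256 for j in range(64 * 64)] for i in range(10)]
mean_s, worst_s = time_per_slice(dummy_segment, slices)
print(f"mean {mean_s:.6f} s/slice, worst {worst_s:.6f} s/slice")
```

Reporting both the mean and the worst case, together with the CPU model and the slice dimensions, would make such timings comparable across studies.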

Scarcity of annotated data
Training datasets with abundant paired labels are essential to the success of deep learning model training. However, there has been a scarcity of labels due to the tedious process of manually annotating the ground truths in medical imaging. In order to mitigate such scarcity in ground truth labels, several methods can be adopted, including the following.
In addition, to ease the challenging training process caused by the large amount of data required to train a scalable network, active learning (Mahapatra et al., 2018) has been introduced to reduce both the manual annotation workload and the computational load.
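The general idea of active learning can be sketched with a simple uncertainty sampling rule, in which the cases the current model is least sure about are sent for annotation first. This is a generic illustration and not the specific method of Mahapatra et al. (2018); the uncertainty score, case identifiers and probabilities are hypothetical.

```python
def mean_uncertainty(probs):
    """Average per-pixel uncertainty: 0 when p is 0 or 1, and 1 when p is 0.5."""
    return sum(1.0 - abs(2.0 * p - 1.0) for p in probs) / len(probs)


def select_for_annotation(predictions, budget):
    """Pick the `budget` unlabeled cases with the most uncertain predictions.

    predictions maps a case identifier to the model's flattened
    foreground probabilities for that case.
    """
    ranked = sorted(predictions,
                    key=lambda case: mean_uncertainty(predictions[case]),
                    reverse=True)
    return ranked[:budget]


# Hypothetical model outputs on three unlabeled cases.
predictions = {
    "case_a": [0.99, 0.01, 0.98],  # confident everywhere
    "case_b": [0.50, 0.60, 0.40],  # highly uncertain
    "case_c": [0.90, 0.20, 0.70],  # moderately uncertain
}
print(select_for_annotation(predictions, budget=2))
```

Only the selected cases are annotated and added to the training set before the model is retrained, so annotation effort is focused where the model is expected to benefit most.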

Explainability in deep learning
Although a wide range of evidence has demonstrated the efficacy of deep learning in medical image analysis, deep learning networks behave more like a 'black box,' with poor interpretability. It has been shown that these networks can be attacked by adversarial noise or even just rotation in medical images (Finlayson et al., 2019), which calls into question the reliability and scalability of these deep learning models in assisting diagnosis. To alert users to these possible failures, segmentation quality scores (Robinson et al., 2019) and confidence maps [e.g., uncertainty maps (Sander et al., 2019) and attention maps (Heo et al., 2018)] should be provided to highlight uncertainties in the model prediction.
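One simple way to produce an uncertainty map is sketched below. It assumes that several stochastic predictions of the same image are available (for example from repeated forward passes with dropout kept active) and uses the binary entropy of the mean probability as the per-pixel uncertainty. This is a generic illustration, not the method of any work cited above, and the names and values are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli variable with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def uncertainty_map(samples):
    """Per-pixel mean probability and entropy over repeated stochastic predictions."""
    count = len(samples)
    mean_probs = [sum(pixel) / count for pixel in zip(*samples)]
    return mean_probs, [binary_entropy(p) for p in mean_probs]


# Three hypothetical stochastic predictions for a three-pixel image.
samples = [
    [0.9, 0.5, 0.1],
    [0.9, 0.3, 0.1],
    [0.9, 0.7, 0.1],
]
mean_probs, entropy = uncertainty_map(samples)
print([round(p, 2) for p in mean_probs])
print([round(h, 2) for h in entropy])
```

The middle pixel, on which the predictions disagree, receives the highest entropy (about 1 bit), while the two confidently predicted pixels receive about 0.47 bits each. Displaying such a map next to the segmentation lets the user see where the model output should be checked.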

Non-CA Modality Segmentation: Bye-Bye to Gadolinium?
Although many methods can accurately segment scars on non-CA cine MRI, the impact of the number of cardiac phases in cine MRI has not been assessed.
In addition, a binary classification into either normal or scar may be too simplistic. Quantification of the so-called "gray zone" immediately surrounding the ventricular scar, which has been proposed to have clinical implications for ventricular arrhythmia (Jablonowski et al., 2017), may be useful clinically.
Also, Gadolinium based contrast agents are used not only for scar imaging but also for assessing myocardial perfusion, which is usually assessed together with scar in LGE CMR and for which additional classification and differentiation of ischemic and remote regions of the myocardium would be useful (Leiner, 2019). To achieve that, Liu et al. (2016) demonstrated non-Gadolinium contrast adenosine stress and rest T1 mapping for the identification and classification of normal, infarcted, ischemic and remote regions in the LV myocardium.
We are glad to see a range of algorithms demonstrated for LV scar segmentation in non-contrast-enhanced CMR. However, this has not been realized for CMR images of the LA, which is more difficult because LA scarring regions in CMR suffer from greater variance in morphology and the relatively lower resolution of CMR. Moreover, LA scars can appear in discrete regions (Figure 2), which imposes further challenges on LA scar segmentation from non-CA modalities.

CONCLUSION
This study summarizes the recent developments in cardiac scar segmentation, covering a wide range of conventional and deep learning techniques. In particular, we presented and discussed the usefulness of non-LGE modalities in cardiac anatomy and scar segmentation. We then further discussed the recent progress in segmenting the cardiac scarring region from non-contrast-enhanced images. We hope this review can provide a comprehensive understanding of the segmentation methodologies for cardiac scar and fibrosis and increase the awareness of common challenges in these fields that can call for future research and contributions.

AUTHOR CONTRIBUTIONS
YW and GY contributed to study design and writing -original draft preparation. ZT, BL, YW, and GY contributed to data collection. YW, ZT, BL, and GY contributed to data visualization. YW, ZT, BL, DF, and GY contributed to writing -review and editing. DF and GY contributed to supervision and funding acquisition. All authors have read and agreed to the published version of the manuscript.

FUNDING
This work was supported in part by the British Heart Foundation (Project Number: TG/18/5/34111, PG/16/78/32402), Heart Research UK (RG2584), the Hangzhou Economic and Technological Development Area Strategical Grant (Imperial Institute of Advanced Technology), the European Research Council Innovative Medicines Initiative on Development of Therapeutics and Diagnostics Combatting Coronavirus Infections Award "DRAGON: rapiD and secuRe AI imaging based diaGnosis, stratification, fOllow-up, and preparedness for coronavirus paNdemics" (H2020-JTI-IMI2 101005122), the AI for Health Imaging Award "CHAIMELEON: Accelerating the Lab to Market Transition of AI Tools for Cancer Management" (H2020-SC1-FA-DTS-2019-1 952172), and the UK Research and Innovation (MR/V023799/1).