Interrater variability of ML-based CT-FFR during TAVR-planning: influence of image quality and coronary artery calcifications

Objective To compare machine learning (ML)-based CT-derived fractional flow reserve (CT-FFR) in patients before transcatheter aortic valve replacement (TAVR) by observers with differing training and to assess influencing factors. Background Coronary computed tomography angiography (cCTA) can effectively exclude CAD, e.g. prior to TAVR, but remains limited by its specificity. CT-FFR may mitigate this limitation also in patients prior to TAVR. While a high reliability of CT-FFR is presumed, little is known about the reproducibility of ML-based CT-FFR. Methods Consecutive patients with obstructive CAD on cCTA were evaluated with ML-based CT-FFR by two observers. Categorization into hemodynamically significant CAD was compared against invasive coronary angiography. The influence of image quality and coronary artery calcium score (CAC) was examined. Results CT-FFR was successfully performed on 214/272 examinations by both observers. The median difference of CT-FFR between both observers was −0.05(−0.12-0.02) (p < 0.001). Differences showed an inverse correlation to the absolute CT-FFR values. Categorization into CAD was different in 37/214 examinations, resulting in net recategorization of Δ13 (13/214) examinations and a difference in accuracy of Δ6.1%. On patient level, correlation of absolute and categorized values was substantial (0.567 and 0.570, p < 0.001). Categorization into CAD showed no correlation to image quality or CAC (p > 0.13). Conclusion Differences between CT-FFR values increased in values below the cut-off, having little clinical impact. Categorization into CAD differed in several patients, but ultimately only had a moderate influence on diagnostic accuracy. This was independent of image quality or CAC.


Introduction
Patients evaluated to be treated with transcatheter aortic valve replacement (TAVR) are generally elderly and have a high prevalence of coronary artery disease (CAD) (1)(2)(3).CAD is recommended to be excluded and if needed to be treated before the procedure (3)(4)(5)(6)(7).Coronary computed tomography angiography (cCTA) is the first line diagnostic tool for the exclusion of CAD in other patient groups (7) and its high negative predictive value (NPV) is known to be preserved also in patient before TAVR (3,6,8).Thus, its use is increasingly recognized as part of the standard CT evaluation protocol for TAVR-planning (3)(4)(5)(6)(7)(8).However, cCTA remains limited by its low specificity, particularly in this patient group.CT-derived fractional flow reserve (CT-FFR) has been described as a promising tool to mitigate this limitation by non-invasively predicting hemodynamic relevance (9)(10)(11) also in patients prior to .
Machine learning (ML)-based CT-FFR is a computationally less demanding approach, which makes on-site computation of CT-FFR feasible on standard workstations and is known to correlate well with the more conventional computational fluid dynamics (CFD) approach (17).As opposed to the commercial off-site approaches, where the exact segmentation process is unknown, the user himself performs the segmentation in MLbased CT-FFR.While a high reliability of segmentation is presumed, the significance of the segmentation process and observer experience on the reliability of CT-FFR has not been well examined (18,19), with no systematic analysis as of now.
In this study, we systematically compared the ML-based CT-FFR measurements carried out by two observers with differing expertise, on segment, vessel and patient level in a large group of patients before TAVR.Furthermore, we analyzed the frequency of conflicting categorizations and the influence of image quality and coronary artery calcium score (CAC).

Study design and patient population
The patient population and study design have previously been reported on (8,13).In short, consecutive examinations with retrospectively ECG-gated CT for TAVR-planning over a period of 7 months were screened.Only patients having undergone invasive coronary angiography (ICA) within 3 months of CT were considered for the current analysis.Of the 388 patients, 272 had at least one coronary stenosis (≥50%) on cCTA being of interest for CT-FFR evaluation (Figure 1).

CT acquisition
The scan protocol has previously been described in detail (8).Briefly, a retrospectively ECG-gated helical CT of the heart was performed from caudal to cranial, immediately followed by a high-pitch helical CT in the opposite direction for depiction of the aorta and iliofemoral access route using a single bolus of 70 ml iodinated contrast medium.All patients were examined with the same scanner (Somatom Definition Flash; Siemens).No beta blockers or nitrates were given.The ECG-gated scan of the heart was used for computation of the ML-based CT-FFR.

cCTA and CT-FFR analysis
Coronary arteries were analyzed morphologically by segment according to the 18-segment model (20).When a stenosis of ≥50% diameter was identified on cCTA, CT-FFR values were obtained approximately 2 cm distal to the stenosis (21).The standard of reference was ICA with quantitative coronary analysis (QCA) with the same threshold and ≥70% for a secondary evaluation.
ML-based CT-FFR (cFFR version 3.2.0,Siemens; not commercially available) was performed by observer B on all examinations previously analyzed by observer A (13).The MLbased prototype used for this study has been described in detail before (13,17,22).The computationally less demanding process enabled on-site computation on a desktop workstation.
Per-segment interpretations were combined to form per-vessel and per-patient ratings, considering the respective worst segment (highest grade of stenosis; lowest CT-FFR value).CT-FFR values of ≤0.80 were considered as hemodynamically significant CAD (CAD f+ ) (23).
Both observers received the same instructions for segmentation and measurement of CT-FFR.Observer A had received several weeks of training in coronary artery imaging, including formal reading of cCTA and case discussions with correlation to ICA.Observer B only received comprehensive instructions on coronary artery segmentation and handling of the CT-FFR prototype at hand.Both measurements were taken within 18 months.The methods adopted for this study comply with the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) (24).

Statistical analysis
Continuous variables are presented as median and [interquartile range (IQR)].Differences between the two observers in CT-FFR values and evaluation times were assessed using the Wilcoxon signed-rank test.Interobserver agreement for CT-FFR values was evaluated using intra-class correlation (ICC) type ICC (1,3) according to the convention proposed by Shrout and Fleiss (25).For interpretation of ICC coefficients, we followed the guidelines given by Cicchetti, which identify values <0.5 as a poor, 0.5-0.75 as moderate, 0.75-0.9as a good, and >0.9 as an excellent correlation (26).Interobserver agreement with respect to categorization into hemodynamically significant CAD according to CT-FFR was assessed using Cohen's kappa and interpreted as proposed by Landis and Koch, which classifies correlation as follows: <0.2 slight, 0.2-0.4fair, 0.4-0.6 moderate, 0.6-0.8substantial, >0.8 almost perfect (27).Correlation between CT-FFR differences and covariates was calculated using Spearman's rank correlation (quantitative image quality measures and calcium burden) or Kendall's rank correlation (qualitative image quality).
Correlation between mismatched coronary artery disease categorization and covariates was determined using the pointbiserial correlation (quantitative image quality and CAC) or rankbiserial correlation (qualitative image quality).A p value of <0.05 was considered statistically significant.Statistical analysis was performed using R (version 4.1.2;R Foundation for Statistical Computing, Vienna, Austria).
Of the included patients 90 (42.1%) were female and the mean body mass index was 29.2 ± 5.5 kg/m 2 .).The LM had a much lower number of stenoses (n = 13) and thus could not be considered for analyses (Table 1).
Patients recategorized as false negative (FN) from true positive (TP) by either observer showed CT-FFR values closest to the cutoff of ≤0.80 (observer A: n = 17, 0.85 [0.83-0.87];observer B: n = 28, 0.84 [0.82-0.89]).The distribution of CT-FFR values of both observers is shown in Figure 2. Observer A measured more outliers, particularly in the RCA and LCX.Discrepancies of CT-FFR values between the observers were smaller for high CT-FFR values, and larger for low values (Figure 3).

Interobserver variability
Analysis on patient level showed fair-good agreement between both observers (ICC coefficient: 0.567; p < 0.001).On vessel level, correlation between both observers was fair-good in the RCA and LAD.Agreement of measured values in the LCX was fair (Table 1).
Categorization into hemodynamically significant CAD according to CT-FFR correlated between both observers.On patient level, interobserver agreement was moderate-substantial (Cohen's kappa: 0.570; p < 0.001).Correlation in the RCA was moderate-substantial, in the LAD fair, and in the LCX fair (Table 2).
Observer B recategorized 14/214 patients from negative to positive and 23 patients from positive to negative, with 13 recategorizations being incorrect in regard to the standard of reference, resulting in a difference of diagnostic accuracy of Δ−6.07%.Specificity and negative predictive value (NPV) on patient level were decreased by Δ−1.82% and Δ−12.84%, respectively.On patient and vessel level, more recategorizations occurred from positive to negative than vice versa (Table 3).Gross frequency of different categorizations on vessel level was LCX: 38.6% >LAD: 23.2% >RCA: 20.9%, resulting in a net difference of Δ−3.34% in accuracy.On vessel level, specificity was slightly higher for observer B (Δ+0.73%), while other test metrics were slightly lower (Table 3).There was no discernable trend towards a differing frequency in discrepant categorizations on segment level, e.g., in the distal segments.The gross rate of discrepant categorizations into CADf + or CAD f− on segment level is shown in Appendix 1.On segment level the difference in accuracy was Δ+1.46%.

Influence of image quality and coronary artery calcifications
Absolute CT-FFR values did not correlate with quantitative image quality (CNR, HU).CT-FFR values correlated weakly with qualitative image quality on patient level and in the RCA (patient: r = −0.116;RCA: r = −0.16;p < 0.03).CT-FFR values correlated weakly with CAC on patient level and in the LAD (patient: r = 0.18; LAD: r = 0.206; p < 0.009).
Categorization into CAD was independent of quantitative or qualitative image quality and of CAC (Table 4).

Discussion
Interobserver variability of ML-based CT-FFR has not been studied extensively on a large patient cohort with a high prevalence of CAD.This study on patients before TAVR was carried out by observers with differing levels of experience and rendered somewhat different results.This led to occasional differences in categorization of hemodynamically significant CAD and moderate changes in diagnostic performance.Recategorization of patients was independent of image quality or CAC.Absolute values of CT-FFR showed significant differences between the observers from 0.03 to 0.04 on vessel and 0.05 on patient level (Table 1).This led to occasional recategorizations into hemodynamically significant CAD when CT-FFR values fell close to the cut-off [grey zone 0.75-0.80(28)] and likely was the most relevant reason for differences in diagnostic performance between both observers.No observer was clearly superior to the other, with observer A having higher diagnostic accuracy on patient and vessel level and observer B performing slightly better on segment level.The difference in measured values between the observers was lower for high CT-FFR values and much larger for low values.There is no clear cut-off for all levels of observation, but differences are higher below the clinical cut-off CT-FFR ≤0.80 (Figure 3).A possible explanation for this observation could be that segmentation of larger vessel lumina is easier and thus more reproducible; while the opposite is true for small vessels.This is consistent with our observation of the smallest vessel, the LCX, having the weakest interobserver agreement.The higher discrepancy of small values is of little concern, as values far below the common cut-off (0.80 or 0.75) are of little to no significance for clinical decision-making (21, 29).Absolute CT-FFR values measured by the more experienced observer (observer A) had higher variance compared to observer B (Figure 2).A possible reason for this may be a more conservative segmentation of the contrasted lumen, while observer B may have tried to extrapolate the lumen in the presence of blooming artifacts at heavily calcified lesions in a more generic way (30).
Correlation of CT-FFR values in RCA and LAD and on patient level was moderate or borderline-good.Regardless, overall correlation in our patient cohort was lower than that reported in other patient groups.Ko et al. reported a median difference of 0.03 on patient level vs. 0.05 in this study.Studies with more experienced observers reported good-to-excellent interobserver agreement (18,19,31).Observers with less training showed higher discrepancies and moderate agreement (32,33).The studies consisted of much younger patient groups (60.0 ± 8.5 years; 64.6 ± 8.9 years; 61.8 ± 10.2 years; 62.7 ± 8.9 years vs. 78.9± 9.7 years) with fewer stenoses per patient [0.53; 1.47; 0.33; 1.22 (stenosed vessels) vs. 1.6 stenoses per patient] (18,19,32,34).Median evaluation time was vastly different (Ko et   the lower experience of the observers, the most likely cause for the weaker interobserver agreement in our study lies in the much different and more challenging patient group of patients prior TAVR with more frequent and perhaps higher grade and frequently calcified stenoses.Overall, more experienced observers had better agreement, while less experience only had moderate agreement between the observers (18,32,33).This is a direct result of differences in lumen segmentation by the user himself in ML-based CT-FFR.Resulting differences are thus no different from the reproducibility of other techniques e.g., of cCTA with good interobserver and intraobserver agreement between trained observers, and moderate agreement between untrained observers (35,36), or even ICA interpretation (37).Other CT-FFR solutions, namely the only commercially available and FDA approved CFD-based technique has not publish data concerning observer experience and reliability.Overall, observer experience seems to have a large influence on reliable CAD diagnosis.
Standardized training and certification may likely improve reliability of ML-based CT-FFR further.Interobserver agreement of categorization of patients into hemodynamically significant CAD was similar to the correlation of absolute values with moderate and sometimes moderate-togood correlation for the RCA, LAD and patient level.This may be reassuring as in clinical decision making most commonly a discreet cut-off is used.Despite the agreement between the observers not being optimal, it can be considered fair, taking into account the observer's differing experience (Table 2).Notably, the LCX had the lowest correlation of absolute values and lowest agreement between categorized values.Although there is no definitive answer for this observation, the LCX is generally the smallest vessel with relatively few and short segments and the second highest rate of motion during the cardiac cycle (38).This

Median difference of CT-FFR values in dependence of absolute values. Median difference of CT-FFR values at different cut-offs between both observers at patient (A) and vessel level (B-D).
The median difference of CT-FFR values is higher for low absolute values and lower for high values.Note, there is no discrete cut-off for all levels of observation, but differences are higher below the clinical cut-off CT-FFR ≤0.80.The dashed red lines correspond to the CT-FFR cut-off used to characterize hemodynamically significant CAD.The lines of the graph have been smoothed with a Gaussian filter to help avoid over-interpretation of small steps.CT-FFR, CT-derived fractional flow reserve; LAD, left anterior descending artery; LCX, left circumflex artery; RCA, right coronary artery.corresponds to the influence of CNR, the second to HU, the third to QIQ while the fourth p value column (and 95% CI) corresponds to the influence of calcium burden.
P values <0.05 were considered statistically significant.
may contribute to motion-and step-artefacts consequently decreasing diagnostic performance (39) and ultimately making segmentation the most challenging in this vessel.Observer B showed lower diagnostic performance on patient level, with especially lower sensitivity (Δ−10.58%).However, on vessel level, specificity of observer B was higher (Δ+0.73%)(Table 3).A possible explanation may be a different, more conservative segmentation approach for the less experienced observer B, e.g., when encountering artifacts.The much lower NPV on all levels of observation might be caused by observer B's lack of clinical experience, possibly leading to a generic extrapolation of the lumen in calcified lesions and failure to differentiate plaque from artifact and vice versa.Thus, more hemodynamically relevant stenoses were missed (higher falsenegative count).However, it must be kept in mind that the standard of reference in this study was anatomical (ICA with QCA).The hemodynamic significance of stenosis in ICA, especially in the context of aortic valve stenosis (AS) and subsequent left ventricular (LV) hypertrophy, is unclear.Because CT-FFR is derived from the vessel cross-section and dependent on specific vessel anatomy and LV mass, AS may influence this technique and generate different values than in patients without AS and none of the resulting adaptations.
The change in diagnostic performance and specificity (Δ−1.82%) on patient level is dependent on the standard of reference.The very conservative cut-off of ≥50% for QCA might not be optimal for clinical decision-making, as it likely includes many stenoses without hemodynamic relevance.Meanwhile, CT-FFR may have classified these as not hemodynamically relevant causing false-negative categorizations.A higher ICA cut-off (e.g., QCA ≥70%) would lead to fewer false-negative categorizations by CT-FFR.Changing the standard of reference to this more stringent cut-off would decrease sensitivity, potentially increase specificity and may decrease the differences in diagnostic performance between both observers (accuracy: Δ−6.07-Δ−1.40;Table 3).
Overall, observer A likely evaluated stenoses more strictly, which explains the higher sensitivity.More clinical experience is the most probable reason for the better performance on the clinically relevant levels of observation, namely patient and vessel level.Minute differences in segmentation of the lumen may lead to different categorization into CAD whenever values fall close to the cut-off.Notably, specificity remains very similar between the observers.This can be explained by many true-negative categorizations of values relatively clearly above the cut-off.The patients that are categorized as false-negative presented CT-FFR values closer to the grey zone (0.75-0.80) than other patients (Observer A: 0.85; Observer B: 0.84).These borderline cases are prone to recategorization between both observers.As many recategorizations are correct in regard to ICA and cancel each other out, their influence on diagnostic performance is much smaller than their number leads to believe.This is supported by the number of differing CAD categorizations being larger than the actual change in diagnostic performance (recategorizations: 37/214; accuracy: Δ−6.07).
Image quality and calcium burden may interfere with the correct assessment of coronary arteries.However, the categorization into hemodynamically relevant CAD with CT-FFR was independent of CAC and image quality, which is encouraging for the use of CT-FFR in the group of patients prior to TAVR.Small, likely not clinically relevant correlations of absolute CT-FFR values and image quality and CAC were noted.Considering the number of tests performed, these findings should not be overestimated.Our findings with little to no influence of CAC on CT-FFR are consistent with the literature, with only Tesche et al. finding a degrading effect CAC on CT-FFR with very high scores (18,19,30,39,40), even though also patients with much higher CAC were included in our study.The virtual lack of correlation of CT-FFR values to image quality suggests that once a certain threshold of image quality is reached, CT-FFR may be expected to be performed reliably.Even a new deep learning algorithm for the improvement of image quality was not able to increase diagnostic performance of CT-FFR further (30).
Patients prior to TAVR assessed with ML-based CT-FFR by two observers with differing experience were sometimes categorized differently into having hemodynamically relevant CAD or not.This was independent of image quality or CAC.This can easily be understood if values fall close to the cut-off and CT-FFR is only measured at a single point in a fixed distance distal to the stenoses (Figure 4).However, hemodynamical implications of luminal narrowing can manifest distal to that point of measurement (21).On the other hand, diffuse arteriosclerosis without a distinct stenosis may have a cumulative effect (41) additive to or independent of the stenosis measured.Instead of a single measurement with a fixed cut-off, a relative decrease of CT-FFR values along the coronary tree could perhaps prove more representative for the global hemodynamic situation (21, 41-44) of the coronary arterial vasculature.

Limitations
Several important limitations to our study must be noted.First, the standard of reference in this study was morphological, not functional with the conservative cut-off of ≥50% diameter on QCA.We explored how discrepancies with a more stringent cut-off would change.But ultimately, a functional standard of reference like invasive FFR would be desirable, particularly in the patient group before TAVR with hemodynamical changes, likely also in coronary artery physiology.Independently of the applied standard, the observed differences in CT-FFR values between the observers are real and likely to be similar in practical application and should be considered whenever performing CT-FFR for clinical decision making.Furthermore, patients before TAVR generally have severe AS with subsequent LV-hypertrophy.As ML-based CT-FFR also considers LV-mass in addition to vessel cross section and specific vessel anatomy for its computation, AS and underling secondary changes may influence computed CT-FFR values.Though different, clinical experience of both observer A and B was limited and may not reflect clinical practice at academic centers with dedicated experts performing such analysis, it can be assumed that expert observers are more consistent (18,19).However, so far no data is available about the segmentation process of the commercially available off-site solution.Furthermore, the limited experience of the observers likely amplified the differences in read values, perhaps even allowing for a better evaluation of the potential disturbing factors of image quality and CAC.

Conclusion
Measurement of ML-based CT-FFR in patients prior to TAVR by observers with different clinical experience lead to discrepancies in CT-FFR values and CAD categorization, with larger discrepancies in low values and smaller discrepancies in high values.This caused a moderate difference in diagnostic accuracy.Image quality and CAC appear not to influence categorization according to CT-FFR.It seems advisable for segmentation to be performed by expert observers, particularly when values around the "grey zone" are to be expected.

FIGURE 2
FIGURE 2 Distribution of absolute CT-FFR values.Box plots of CT-FFR values measured by two observers on patient (A) and vessel level (B-D).Observer B measured higher median CT-FFR values with a smaller interquartile range on patient and vessel level (p ≤ 0.003) (A-D).The difference between both observers were larger on patient level (A) compared to vessel level (patient: 0.05> RCA: 0.05; LAD: 0.04; CX: 0.04) (B-D).Observer A shows more outliers, especially in RCA and LCX (B,C).CT-FFR, CT-derived fractional flow reserve; LAD, left anterior descending artery; LCX, left circumflex artery; RCA, right coronary artery.

FIGURE 3
FIGURE 3 Correlation of CNR, HU, qualitative image quality and CAC with absolute CT-FFR differences between the two observers and mismatch of categorization into hemodynamically significant CAD.The first p value column (and 95% CI)

FIGURE 4
FIGURE 4 Patient with severe LAD stenosis and discrepant categorization according to CT-FFR values.Patient with severe stenosis (arrow in a-d) in the middle LAD (S7) on ICA (QCA: 78%) (A,B) and results of CT-FFR of observer A (C) and observer B (D) CT-FFR values were taken approximately 2 cm distal to the stenosis (asterisk in C,D).The CT-FFR value measured by observer A was 0.79, indicating hemodynamically significant CAD (C), the value measured by observer B was 0.86, indicating non-significant CAD (D) The threshold for hemodynamically significant CAD was ≤0.80.CAD, coronary artery disease; CT-FFR, CT-derived fractional flow reserve; ICA, invasive coronary angiography; LAD, left anterior descending artery; QCA, quantitative coronary analysis.

TABLE 1
Interobserver variability of absolute CT-FFR values.

TABLE 2
Interobserver agreement of categorization according to CT-FFR.

TABLE 3
Differences in categorization and changes in diagnostic performance.
(Re)categorization into hemodynamically significant CAD and differences in diagnostic performance of CT-FFR between the observer A and B. Thresholds for significant CAD were respectively.Acc, accuracy; LAD, left anterior descending artery; LCX, circumflex artery; LM, left main coronary artery; NN, remained negative; NP, negative-to-positive; NPV, negative predictive value; PN, positive-to-negative; PP, remained positive; PPV, positive predictive value; RCA, right coronary artery; Sen., sensitivity; Spe., specificity.

TABLE 4
Influence of image quality and coronary arterial calcifications.