CARDIAN: a novel computational approach for real-time end-diastolic frame detection in intravascular ultrasound using bidirectional attention networks

Introduction Changes in coronary artery luminal dimensions during the cardiac cycle can impact the accurate quantification of volumetric analyses in intravascular ultrasound (IVUS) image studies. Accurate ED-frame detection is pivotal for guiding interventional decisions, optimizing therapeutic interventions, and ensuring standardized volumetric analysis in research studies. Images acquired at different phases of the cardiac cycle may also lead to inaccurate quantification of atheroma volume due to the longitudinal motion of the catheter in relation to the vessel. As IVUS images are acquired throughout the cardiac cycle, end-diastolic frames are typically identified retrospectively by human analysts to minimize motion artefacts and enable more accurate and reproducible volumetric analysis. Methods In this paper, a novel neural network-based approach for accurate end-diastolic frame detection in IVUS sequences is proposed, trained using electrocardiogram (ECG) signals acquired synchronously during IVUS acquisition. The framework integrates dedicated motion encoders and a bidirectional attention recurrent network (BARNet) with a temporal difference encoder to extract frame-by-frame motion features corresponding to the phases of the cardiac cycle. In addition, a spatiotemporal rotation encoder is included to capture the IVUS catheter's rotational movement with respect to the coronary artery. Results With a prediction tolerance range of 66.7 ms, the proposed approach was able to find 71.9%, 67.8%, and 69.9% of end-diastolic frames in the left anterior descending, left circumflex and right coronary arteries, respectively, when tested against ECG estimations. When the result was compared with two expert analysts’ estimation, the approach achieved a superior performance. Discussion These findings indicate that the developed methodology is accurate and fully reproducible and therefore it should be preferred over experts for end-diastolic frame detection in IVUS sequences.

Introduction: Changes in coronary artery luminal dimensions during the cardiac cycle can impact the accurate quantification of volumetric analyses in intravascular ultrasound (IVUS) image studies.Accurate ED-frame detection is pivotal for guiding interventional decisions, optimizing therapeutic interventions, and ensuring standardized volumetric analysis in research studies.Images acquired at different phases of the cardiac cycle may also lead to inaccurate quantification of atheroma volume due to the longitudinal motion of the catheter in relation to the vessel.As IVUS images are acquired throughout the cardiac cycle, end-diastolic frames are typically identified retrospectively by human analysts to minimize motion artefacts and enable more accurate and reproducible volumetric analysis.Methods: In this paper, a novel neural network-based approach for accurate enddiastolic frame detection in IVUS sequences is proposed, trained using electrocardiogram (ECG) signals acquired synchronously during IVUS acquisition.The framework integrates dedicated motion encoders and a bidirectional attention recurrent network (BARNet) with a temporal difference encoder to extract frame-by-frame motion features corresponding to the phases of the cardiac cycle.In addition, a spatiotemporal rotation encoder is included to capture the IVUS catheter's rotational movement with respect to the coronary artery.
Results: With a prediction tolerance range of 66.7 ms, the proposed approach was able to find 71.9%,67.8%, and 69.9% of end-diastolic frames in the left anterior descending, left circumflex and right coronary arteries, respectively, when tested against ECG estimations.When the result was compared with two expert analysts' estimation, the approach achieved a superior performance.

Introduction
Intravascular ultrasound (IVUS) is the preferred modality to accurately assess lumen dimensions and coronary atheroma burden in clinical practice and in research studies, playing a pivotal role in diagnosing, treating, and monitoring coronary artery disease (CAD).In contemporary practice, IVUS image acquisition is performed using an automated pull-back device that withdraws the catheter at a constant speed without gating.However, the dynamic changes in luminal dimensions during the cardiac cycle can introduce significant variability, affecting the accuracy of volumetric analysis (1).Moreover, the IVUS catheter's movement in relation to the vessel during the cardiac cycle introduces additional errors in the quantification of atheroma volume (2).Recent reports have highlighted the superiority of IVUS volumetric analysis performed in end-diastolic (ED) frames, where cardiac motion is minimized, in providing more consistent and reproducible assessments of atheroma volume.Yet, the accurate detection of these ED-frames remains a challenge due to the intricate motion of the epicardial coronary arteries and the simultaneous motion of the IVUS catheter (3).Compounding this challenge are factors like noise, artifacts, and the complex imaging environment, which further hinder the correct identification of ED frames (4), as exemplified in Figure 1.Notably, even trained experts, despite their extensive experience, often struggle to consistently identify the ED-frames.Given these challenges, there's a pressing need for a fully automated, accurate, and reproducible method for ED-frame detection, which holds the promise of revolutionizing CAD management and treatment outcomes.
Neural networks have recently been proposed for the analysis of sequential data series.The long short-term memory (LSTM)-based method has been developed for processing sequential musical audio data (5) and the detection of deception from gaze and speech (6).Recently, transformer-based methodologies were adapted for temporal information processing in natural language (7), audio (8), image (9), and video processing (10).However, these approaches require training on analyzed datasets, meaning they have limited generalizability to the ED-frame detection problem for which accurate manual labelling is unavailable.
Key frame detection in computational image analysis has been attempted with various approaches including neural networks (11,12), clustering algorithms (13) and bidirectional LSTM (14,15,22).Moreover, video action recognition with skeleton-based and video-based methods has been also used for this purpose (16,17).However, these methods contain complex encoder structures, and thus need to be trained on even larger datasets.This constraint prevents them from being readily appliable to IVUS sequences which have challenging image qualities and motion patterns.In addition, the amount of time it takes for the above approaches to process an IVUS sequence is prohibitive, making them unsuitable for clinical applications of automated IVUS analysis.
Human expert analysts tend to capture sudden changes in motion patterns when identifying ED-frames-such as reverse rotation of blood vessels and sudden start or stop of the vesselwith the presumption that the period before the largest movement of the vessel corresponds to end-diastole.Existing computational approaches for IVUS gating are based on similar assumptions and can be broadly divided into two categories: feature extraction and supervised methods.Feature extracting methods extract motion signals from IVUS pullback sequences, and gate them by identifying local extrema in the entire sequence.Since automatic ED-frame detection requires extracting key features from relative vessel motions, the main innovation has previously been to focus on exploiting motion features from shallow-learned feature representation.Several methodologies have been introduced over the recent years for IVUS ED-frame detection that relies on feature extraction including local mean intensity-based (18-20), cross-correlation based (19,21,22), longitudinal displacement based (21,(23)(24)(25), clustering-based (26), filter-based (3), and wavelet transform-based algorithms (27).The supervised EDframe detection methods can be further divided into two groups: electrocardiogram (ECG)-guided methods and expert annotationguided methods.Most current ED-frame detection methods based on ultrasound images are guided by expert annotations, meaning they use expert annotations as the gold standard (3,27,28).In contrast, many traditional shallow learning-based algorithms are available to solve ED-frame detection in IVUS with the support of ECG gating, such as Darvishi et (18).These ECG-guided methods use simultaneously captured ECG signal to train the machine learning models (31,32).This paper focuses on a Deep Learning-based IVUS gating approach, which distinguishes itself by employing deep learning techniques for gating, departing from the reliance on image features and signal processing for identifying key frames.
In the realm of IVUS gating, traditional methods such as ECGbased gating have been limited by synchronization challenges and susceptibility to arrhythmias.Image-based gating, although simpler, often compromises on accuracy due to the inherent complexities of images.Deep learning-based gating, as exemplified by our prior work (33), employed recurrent neural networks (RNNs) and served as a foundational step in liberating IVUS gating from ECG synchronization, thereby enhancing resilience to noise and improving accuracy.However, the current study introduces CARDIAN, a more advanced computational framework for realtime ED-frame detection in IVUS.Unlike the previous work that primarily utilized a bidirectional gated-recurrent-unit (Bi-GRU), CARDIAN incorporates a more complex BARNet to exploit both forward and backward motion features in IVUS sequences.It also employs meticulously designed high-performance encoders-Temporal Difference and Spatiotemporal Rotation-for robust feature extraction.The framework is further enriched by a duallayer Bidirectional Long Short-Term Memory (Bi-LSTM) structure with attention mechanisms, allowing for the processing of longer input sequences and offering more accurate post-processing.Rigorous training and testing protocols, including leave-one-out and three-fold cross-validation methods, are outlined.Additionally, novel strategies for unit acquisition and data augmentation have been introduced to adapt the model to various vessel wall motions and other artifacts.Developed in partnership with industry (InfraReDx, Inc., Burlington, Massachusetts), CARDIAN has the potential to be incorporated into commercially available systems for real-time processing of near-infrared spectroscopy-IVUS images.This multi-faceted approach significantly extends the scope, robustness, and versatility of our previous work, aiming to set a new standard in the accuracy, efficiency, and reliability of IVUS EDframe detection.To substantiate the efficacy of CARDIAN, we have conducted rigorous internal validation using three-fold crossvalidation methods.Furthermore, we have benchmarked our approach against Image-based gating methods, which are widely employed in commercial IVUS analysis software, thereby providing a comparative perspective on its performance.
The main contributions are summarized as follows: - In this study, we provide a detailed description of the proposed method in Chapter 2. The performance of the proposed method is evaluated and analysed in practical scenarios, and compared against the stateof-the-art methods in Chapter 3. The discussion in Chapter 4 provides further insights into the effectiveness of the proposed method, and its potential for clinical applications is analysed.

Materials and methods
The overall architecture of the proposed CARDIAN methodology is illustrated in Figure 2. In the following, every component in this architecture is introduced in detail.

Data acquisition
In the data acquisition process, this study encompassed six participants diagnosed with obstructive coronary artery disease who were undergoing coronary angiography and percutaneous coronary intervention.These individuals were enlisted in the "Evaluation of the effectiveness of computed tomographic coronary angiography (CTCA) in the evaluation of coronary artery morphology and physiology" investigation (NCT03556644), forming the basis of the current analysis.The core procedure involved subjecting all patients to near-infrared spectroscopy intravascular ultrasound (NIRS-IVUS) imaging of their coronary arteries and significant lateral branches.This imaging was conducted using the innovative Dualpro NIRS-IVUS system, developed by Infraredx, located in Burlington, MA.
To orchestrate the process meticulously, the NIRS-IVUS probe was systematically retracted at a uniform pace of 0.5 mm/s, facilitated by an automated pull-back apparatus.Concurrently, data acquisition occurred at a rate of 30 frames per second (fps).While this transpired, an electrocardiogram (ECG) trace was concurrently recorded alongside the IVUS sequence.Notably, the frame rate for this ECG data was elevated to 120 fps.A visual representation of this synchronization and coordination can be observed in Figure 1, where the IVUS sequence and ECG trace converged, allowing for seamless co-registration and precise identification of the IVUS frame that corresponded to the zenith of the R-wave, an event termed the ED-frame.
The 50 MHz Dualpro system developed by Infraredx in Burlington, Massachusetts was employed for NIRS-IVUS imaging.Localization of the NIRS-IVUS probe was achieved through the introduction of a contrast agent, which was succeeded by the acquisition of an angiographic projection following the administration of 400mcg of nitrates.This preparatory phase facilitated the subsequent advancement of the NIRS-IVUS probe towards the distal section of the vessel.

FIGURE 2
The overall architecture of the proposed CARDIAN methodology, depicting the consecutive stages from data processing to ED-frame detection.The process commences with using a 140-frame acquisition window to extract relevant segments, followed by identifying ED-frames using ECG signals.Next, feature signal extraction is performed to gather relevant motion information.Data augmentation is then applied to improve model adaptability and accuracy.Then, BARNet is employed to predict the likelihood of a frame being an ED-frame in each unit.Finally, the ED-frame search module arranges the units sequentially and further explores the accurate positions of ED-frames based on a generated probability graph.During the pullback procedure, precision was maintained through the use of an automated pullback device, ensuring a consistent velocity of 0.5 mm/s.This pullback action was concomitantly accompanied by the capture of an ECG trace.The NIRS-IVUS pullback process was meticulously synchronized with the ECG data.A frame rate of 30 fps was allocated to the automated pullback mechanism, while a camera equipped with a heightened frame rate of 120 fps was deployed to record the amalgamated display, showcasing both the NIRS-IVUS data and the ECG trace.
A manual inspection was then administered by experts in the field.They engaged in a comprehensive review of the amalgamated video footage, which harmonized NIRS-IVUS data and ECG signals.This meticulous analysis involved the manual annotation of all end-diastolic frames within the NIRS-IVUS recording, guided by the cues provided by the ECG signals.The identification of the IVUS frame corresponds to the peak of the R-wave within the ECG signal, which was aptly documented as the end-diastolic frame and gold stranded of the research.For the purpose of rigorous validation, the dataset was carefully partitioned into three folds, adhering to a patient-based stratification approach.This ensured that pullbacks from the same patient were not present in both the training and test sets within each fold.Additionally, efforts were made to balance the number of ED-frames across these folds to maintain a consistent level of challenge for the model during the cross-validation process.

Frame denoising and motion encoding
The conventional encoders in CNN mainly focus on pixellevel short-term relationships (34,35).We observed that the most relevant information for identifying ED-frames in IVUS pullbacks is the relative motion of the coronary arteries with regards to the IVUS probe.Thus, we aim to design encoders that can extract dynamic change data across frames as descriptive features.The IVUS images contain significant noise that can interfere with the encoding of key information.It is essential to smooth this noise and reduce its obfuscating effects, allowing the model to focus on the periodic motion features induced by the cardiac cycle rather than fluctuations from noise, and consequently improve extraction of clinically relevant features from the IVUS imagery.To achieve this, we have implemented a guided image filter for smoothing out perturbations.The filter operates according to the equation G(x) ¼ a Á I(x) þ b, where G(x) is the output, I(x) is the input image, and a and b are two constants.This denoising technique effectively mitigates the impact of noise, thereby allowing the model to concentrate on the cyclical motion characteristics intrinsic to cardiac activity.
After noise filtering, we developed two specialized encoders: the Temporal Difference Encoder and the Spatiotemporal Rotation Encoder, to extract motion feature caused by cardiac motion.The details of which will be elaborated upon subsequently.The encoders not only capture the dynamic changes across frames but also enhance the model's resilience to noise and other confounding factors.

Temporal difference encoder
The IVUS sequences are projected onto a one-dimensional feature signal where each value represents how much difference there is between every two adjacent IVUS frames.The change data between every two frames are calculated by the sum of absolute pixel intensity differences: where e n is the encoded motion feature between two consecutive frames f n and f nþ1 with frame resolution H Â W. P i,j n is a pixel's intensity in the frame f n , where i and j representsthe pixel coordinates.Some examples of the temporal difference motion features are shown in Figure 3.It can be observed that most motion peaks are strongly correlated to the R peaks in ECG, but not always, particularly in vessels with excessive motion like the right coronary artery (RCA).
The temporal difference encoding is simple yet effective in reducing the quality demand for the input sequence, allowing the method to work on pullbacks captured by different catheters with different frame rates.

Spatiotemporal rotation encoder
The rotational motion of the vessel during the cardiac cycle is key information for ED-frame feature extraction.Therefore, a spatiotemporal rotation encoder is designed to extract the vessel's rotation features.First, a rotation angle extractor (RAE) is developed to align two consecutive frames.Figure 4B illustrates the rotation of the vessel around the catheter center o between two adjacent IVUS frames.Assuming a rotation angle u the pixel d is rotated to the position q in the next frame.To estimate the actual rotation angle u, we first determine the angle range based on the prior information of the acquisition device.Then we obtain a set of angles u x by averagely sampling all angles in this range, where x is the angle index.We denote the rotation operation around the center of the catheter as R. The estimated angle u Ã is calculated by minimizing the difference between adjacent rotated frames by Equation 2 and Equation 3. In Figure 4A, the catheter in f nþ1 is rotated by u Ã degrees clockwise to align the two adjacent frames.After that, the pixel d in the current frame, and q0 in the next frame are on the same line.We apply the spatiotemporal rotation encoding frame by frame in each IVUS sequence.
The artery motion captured by spatiotemporal rotation encoding is calculated for each two adjacent and aligned IVUS frames:

Training the CARDIAN
In the training stage, first, the input IVUS frames are de-noised to minimize the influence of imaging artifacts.Then, the frames are encoded by the two lightweight encoders, the temporal difference encoder and the spatiotemporal rotation encoder, to generate a descriptive representation of the IVUS sequence in the temporal domain.The representation is then reorganized into small units which are used to train the BARNet model together with the reference ECG-derived ED-frame as the gold standard.Through the training, BARNet learns to predict the likelihood of each frame being an ED-frame.

Testing the CARDIAN
The trained model is then tested using the leave-one-out crossvalidation approach-all sequences are used for training apart from the sequence of one vessel.This is done for each vessel type, namely, left anterior descending (LAD), left circumflex (LCx) and right coronary artery (RCA).This process is repeated leaving a different vessel out from the training set each time until all vessels are used for testing.
A three-fold cross-validation method was applied to the matched ECG-IVUS data to compare the performance of experts and automated methodologies.In each fold, the dataset was evenly distributed among the three types of coronary arteries: RCA, LAD, and LCx.To tackle the data imbalance problem, the underrepresented vessel type frames were multiplied to ensure an equivalent representation for each type.To mitigate the risk of data leakage, each pull-back was strictly allocated to either the training or testing set within a given fold, ensuring no overlap between the two sets.Specifically, in each fold, approximately 67% of the pull-backs were designated for training, while the remaining 33% were exclusively used for testing.The performance metrics were then averaged across all folds to provide a robust estimate of the overall efficacy of the proposed method.We then averaged the results to estimate the overall performance of the proposed method.

Unit acquisition
The distributions of cardiac cycle durations in three types of arteries are shown in Figure 6.Based on our dataset, the average cardiac cycle is about 900 ms or 27 frames.To prepare suitable inputs to the detection model, a sliding window of 140-frame length is applied to the encoded motion feature of every pullback with a step size of 1.The 140-frame window roughly covers the motion of five cardiac cycles.This means that for each IVUS sequence with K, a total number of frames K À 139 encoded motion segments {E 1 , E 2 , : : :, E kÀ139 } are acquired, where En ¼ {e 1 n , e 2 n , : : :, e 140 n }, n ¼ 1, 2, . . ., K À 139.Each corresponding ECG signal goes through the same process to obtain matching ground truth units {Y 1 , Y 2 , : : :, Y kÀ139 }.
After that, pairs of encoded IVUS feature units and ECG signal units covering a 140-frame length are prepared as the training input.
The units of 140 frames, which roughly cover five cardiac cycles can provide a wider view of the network and reduce the chance of confusing T peaks with R peaks.An ED-frame lies around the middle between every two cardiac cycles.This unit setting allows our model to determine the locations of ED-frames based on the temporal motion information over a longer term.In addition, among the multiple ED-frames, each one can use the others as references in the prediction process.

Augmentation
Data augmentation serves as a critical step in our pipeline, performed prior to feature extraction.The primary objective is The distribution of the cardiac cycle period in the 3 coronary arteries.The average heart rate in the dataset is 67 heartbeats per minute-corresponding to a 27 frames interval per heartbeat.(A) LAD, (B) LCx, and (C) RCA. to equip the model with the ability to generalize across a wider range of vessel wall motions, cardiac cycle amplitudes, and other IVUS-specific artifacts such as plaque morphology, ventricular or atrial ectopics, and random noise.This strategy aims to mitigate the impact of these variables on detection accuracy.
Our augmentation techniques include random interpolation, frame elimination, and the addition of Gaussian noise.Specifically, for each 140-frame unit, we randomly remove 1-5 frames and replace them with new synthetic frames, the values of which are computed as the average of adjacent frames.This stochastic alteration of the IVUS sequence effectively modulates the cardiac cycle period, thereby training the model to adapt to varying heart rates.Furthermore, we introduce a random scaling factor between 0.8 and 1.2 to each frame's pixel values and add a 10% Gaussian noise to the feature signal.These steps are designed to make the model resilient against IVUS artifacts and improve its ability to discern genuine vessel and lumen characteristics.
While generative models offer the potential for creating synthetic inputs, they often require a large volume of training data to produce reliable and highly resembling outputs.Given the specialized and complex nature of IVUS imaging, and the challenges associated with data collection, a poorly trained generative model could introduce more noise and confounding variables, thereby potentially degrading the model's performance.Therefore, we opted for targeted, clinically explainable augmentation techniques that are specifically tailored to address the unique challenges of IVUS imaging.By employing these augmentation techniques, we aim to create a more versatile and robust training set, thereby enhancing the model's performance and generalizability.

Bidirectional attention recurrent network
RNN (36) and attention (37) have become milestone techniques for text classification and speech recognition tasks.Recently, the transformer has become one of the most popular state-of-the-art attention branches (38).Inspired by its robust performance, we design a bidirectional attention recurrent network (BARNet) to detect ED-frames on the encoded IVUS features, as depicted in Figure 7.We apply two bidirectional gated recurrent units (GRUs) (39) or LSTM (40) as the first two layers of BARNet.The reason for considering a bidirectional RNN (41) is that the relevant motion features involve adjacent frames both before and after the target position.The output features of the bidirectional RNN pass through an attention layer to further learn the long-term dependency inside each unit, as shown in the BARNet block in Figure 2.For a long sequence with 140 cells, some intermediate state information will inevitably be lost in the middle cells.Compared with bidirectional LSTM and GRU, the attention layer in BARNet has a higher reception field for better learning the long-term dependencies.This capability has obvious advantages in ED-frame detection, by giving higher weights to frames with larger motion amplitude and focusing on these key frames for generating predictions.
We input the encoded feature units E n into the two-layer bidirectional RNN and denote the output of the layer as F n [ R 140Â1 .Then the attention weight A n [ R 140Â1 of the feature A schematic illustration of the proposed bidirectional attention recurrent network (BARNet) for ED-frame detection.The network consists of two bidirectional RNN layers (GRU or LSTM), followed by an attention layer to learn long-term dependencies, and finally generating the predicted EDframe likelihoods.The encoded IVUS features are input into the network, and the output is the probability of each frame being an ED-frame.5: where W t F n,t is the influence of the n th unit on the t th feature element of the target.To bring in more non-linear information and increase the margin between ED-frames and non-EDframes, we obtain the local attention B n using a tanh function and an exponential exp on the normalized A n .The exponential function is used to alleviate the gradient vanishing problem of the tanh operation.The predicted ED-frame likelihood Y 0 n [ R 140Â1 of each element n is generated by the element-wise multiplication between the local attention B n and the encoding F n , as shown in Equation 6: The ground truth on ED-frames is denoted as . y n ¼ 1 when the n th frame is an ED-frame.A BARNet model is then trained using encoded motion segments E n and their corresponding ground truth ED-frames Y n .The ED-frame likelihood Y 0 n of the motion segment E n is predicted by minimizing the mean squared error (MSE) loss L MSE in every training epoch (42).

ED-frame search module
In the proposed method, the spatiotemporal rotation feature and temporal difference feature are used as inputs to generate more robust prediction results.Inspired by the argumentation methods in image classification and segmentation tasks, we proposed a test-time embedding augmentation (TTEA) scheme for IVUS ED-frame detection.In the prediction stage, for a unit of 140 frames {E 1 , E 2 , : : :, E kÀ139 }, each frame is randomly multiplied by a number from 0.8 to 1.2 and added with a 10% Gaussian noise.The model will generate a prediction on this version of the unit.The augmentation and prediction processes are repeated 50 times, and the output of 50 probability graphs is averaged to obtain the final probability graph, indicating the likelihood of each of the 140 frames being an ED-frame {y 0 1 , y 0 2 , . . ., y 0 KÀ139 }.The mean likelihood value v for each frame is considered as the final likelihood of the corresponding frame being an ED-frame.Since the duration of an average cardiac cycle is equivalent to 27 frames, a Hanning smoothing window of size 13 is performed on the final probability graph v.To avoid identifying more than one EDframe in a cardiac cycle, a 13-frame sliding window will go through the smoothed probability graph, and the local maxima on the 7 th frame of a sliding window will be finally identified as an ED-frame.

Performance evaluation
The performance of the CARDIAN was compared with the visual screening results of two expert analysts from an intravascular imaging core-lab.They reviewed the IVUS pullbacks and identified the end-diastolic frames as the frame with the minimum vessel motion before a sudden motion of the vessel in relation to the catheter.Furthermore, we compared the performance of CARDIAN with an automated ED-frame detection methodology for retrospective gating of IVUS images.This automatic method relies on detecting neighboring frames where the lumen motion is minimal (LM-method) (43).This methodology has been incorporated in a user-friendly software, the QCU-CMS IVUS image analysis software (Leiden University Medical Center, Leiden, The Netherlands), and has been extensively used in the past to identify the ED frames in clinical research.

Statistical analysis
For quantitative evaluation in this study, numerical variables are presented as mean ± standard deviation (SD), and categorical variables as absolute values and percentages.The chi-squared test was used to compare categorical variables.Bland-Altman analysis was employed to compare the estimations of expert analysts, the conventional image-based approaches (LM) (43) and the proposed method CARDIAN.
To effectively demonstrate the performance in ED-frame detection, in this paper, we define a few new metrics, namely, group-of-pictures (GoP) recall, GoP precision and GoP F1 score.In calculating these recall/precision values, each detection is considered a hit if the predicted ED-frame is within a tolerance range of ±3 frames of the target frame, that is, ±66.7 milliseconds (ms) in time.This approach follows the evaluation paradigm used in evaluating human labelling of ED-frames in previous studies (33), but it uses a tighter range of ±number of frames or time.
The GoP recall is defined mathematically as: In this study, the prediction tolerance (PT) is set to be a range of ±66.67 ms or 2 frames from the peak of the R-wave on the ECG, but it can be adapted to other values as appropriate.GoP recall represents the percentage of correctly detected frames (P i ) within the range of PT from their closest ED frame by ECG (R i ).The total number of ECG-derived EDframe is denoted as 'n'.The summation runs from i ¼ 1 to n, the indicator function I(condition) is calculated at each position, which equals 1 if the absolute distance between the predicted frame (P i ) and its corresponding closest ED frame in ECG (R i ) is smaller than the specified range PT; otherwise, it equals 0.
Similarly, group of Picture (GoP) precision is defined mathematically as: Here, 'm' denotes the total number of the detected frames.The summation runs from i ¼ 1 to m, and each indicator I(condition) is calculated, which equals 1 if the absolute distance between each frame classified as ED (P i ) by the experts or the tested methodology and its corresponding closest ECG-derived ED frame (R j ), is smaller than the specified range PT; otherwise, it equals 0. GoP precision represents the percentage of the correctly detected ED frames (P i) out of the total number of the detected frames by an expert or an algorithm.
GoP F1 score is a measure that combines both GoP recall and GoP precision into a single value, providing a balanced representation of the method's performance.It can be calculated using the following formula: The F1 score ranges from 0 to 100%, with higher values indicating a better performance in terms of high detection rate as well as minimized errors.Additionally, nearest prediction interval (NPI) was employed to measure the average time interval between every detected EDframe and its closest ECG-derived ED-frame.Given the total number of predictions (n) and the distance between each detected ED frame (P i ) and its closest real frame (R i ), NPI can be computed as: These metrics together offer a multi-perspective insight into the effectiveness of the automated and manual methods for EDframe detection.

Results
This study involved patients with an average age of 61.7 ± 10.3 years and 83.3% of them were male.None was a smoker but most of them had a positive family history of CAD (66.7%), hypertension (66.7%) and hypercholesterolemia (66.7%).Five patients (83.3%) had normal and one had impaired left ventricular function.The studied vessels (n = 20) included 9 LCx, 6 LADs and 5 RCAs.Out of the 92,526 frames acquired from these vessels; after excluding cases of non-interpretable IVUS images and ECG tracings because of artifacts, 3,271 were classified as ED by the ECG.The average heart rate was 66 beats per minute.
After adding segments in each vessel type with fewer ED frames to obtain a more balanced dataset-as described in the methodology section-a total of 3,556 ED-frames were included in the analysis, of which 1,269 ED-frames were located in the LAD, 1,133 in the LCx, and 1,154 in the RCA.

Ablation study
An ablation study was performed to determine the effect of each proposed module, as reported in Table 1.The proposed training augmentation mechanism significantly improved the efficacy of the method to detect the correct ED frames.Since RNN-like structures are sensitive to the cardiac cycles, by randomly changing each cardiac cycle length, networks can better adapt to patients' data with different cardiac cycles.Further, the noise and random disturbance added into training set reduced the influence of challenging areas like frames with large plaques, artifacts, side-branches, or noise from the catheter, helping the model to capture critical features from the input signal.The proposed TTEA module also marginally increased the performance of all experiments.
We also found that Bi-LSTM models outperforms Bi-GRU.Further, the attention layer provides an 1%−3% increase in both models.Since the Bi-LSTM or Bi-GRU in the experiment both have a length of 140 cells, a primary issue for such long RNN structures is that the information shared by distancing cells is Most of the ED-frames can be predicted correctly by our endto-end CARDIAN approach.Figure 8 illustrates the detection results on four IVUS sequences based on the best configuration.In Figure 8A-C, with clear cardiac motion patterns, the proposed framework detects all the ED frames with a high accuracy.In an IVUS pullback with irregular cardiac movement and extensive disease (Figure 8D), the performance of the model is impaired, but most of the ED frames can still be roughly located.
The detection errors are often due to the variance in motion patterns, the accuracy of the network declines when the IVUS frames have an irregular motion or barely move.This usually occurs in sequences portraying coronary heart diseases as shown in Figure 8D.
Tables 3A,B presents the overall performance of two analysts, the LM methodology, and the proposed CARDIAN method regarding GoP recall, precision, and F1 score.CARDIAN outperforms the other methods, with a significant

Prediction interval evaluation
The nearest prediction interval measurements for analysts 1 and 2, the LM and the CARDIAN methods across the three coronary arteries are shown in Table 4.It is apparent that the largest prediction interval values for Expert 1 and 2 are noted in the RCA where the vessel motion is larger, while for the LM method, the largest interval is noted in LCx.Conversely, the smallest nearest prediction interval values for both experts and the LM method are noted in the LAD.Compared to these results, the CARDIAN method demonstrated a significantly smaller nearest prediction interval and minimum variations across the three coronary arteries (Figure 10).For LCx, LAD, and RCA, the median of nearest prediction interval between the predicted ED-frame and the ground truth is 33.3, 33.3, and 66.6 ms, respectively.

Discussion
This paper introduces a novel ED-frame detection approach, CARDIAN, that uses ECG-estimations as the gold standard for training and testing purposes.This approach takes advantage of specific features seen in IVUS sequences and the synchronous ECG tracings to accurately detect ED-frames, achieving superior performance compared to human experts and conventional image-based approaches (LM).
Over the last years, several computational approaches have been introduced for IVUS gating based on feature extraction and supervised methods, which assume that ED-frames are highly correlated to sudden changes in motion patterns.However, we have previously demonstrated (33) that extrema point detection cannot solely be used to reliably indicate the ED phase in an IVUS sequence.This should be attributed to the complex artery motion, which varies depending on the studied vessel, and imaging artifacts such as the presence of side branches or significant atherosclerotic lesions, making the visual identification of ED-frames a challenging task for humans and traditional image-analysis approaches (18).
The experiments in this study show that the proposed method CARDIAN can effectively achieve the real-time ED-frame detection task in IVUS sequences.The accurate detection of EDframes is crucial for guiding interventional decisions, optimizing therapeutic interventions, and ensuring standardized volumetric analysis in IVUS studies.The CARDIAN framework integrates several dedicated computational methods to extract motion features and predict ED-frames, including dedicated motion encoders, a bidirectional attention recurrent network (BARNet), and a spatiotemporal rotation encoder.The motion encoders, including a temporal difference encoder and a spatiotemporal rotation encoder, extract frame-by-frame motion features corresponding to the phases of the cardiac cycle.The BARNet model predicts the likelihood of each frame being an ED-frame using a bidirectional recurrent network with an attention layer.The spatiotemporal rotation encoder captures the IVUS catheter's rotational movement with respect to the coronary artery.All     The deep learning solution of CARDIAN is shown to overcome the challenges and outperform both human expert analysts and conventional approaches, with results even superior to a GRU approach as previously described.This is attributed to the integration of motion, temporal, and spatiotemporal rotation encoders with a BARNet network.Unlike previous deep learning approaches that focus on feature extraction and shallow-learned representations, the CARDIAN approach takes advantage of the bidirectional attention mechanism of the BARNet to capture the temporal relationship between frames, essential for cardiac gating.The CARDIAN approach also considers the rotation of the IVUS catheter with respect to the coronary artery, an important motion feature previously overlooked.This highlights the novelty and superiority of the proposed CARDIAN approach for ED-frame detection in IVUS sequences.
Testing of the developed methodology against the ECG estimations underscores the potential but also the limitations of the CARDIAN methodology.We found that in contrast to the expert analysts and the LM method, the CARDIAN approach provides consistent results in all the 3 epicardial coronary arteries.More importantly, the performance of our approach is 2-3 times better than the conventional methodologies or manual screening.The performance of the CARDIAN approach was excellent in detecting the ED when a prediction tolerance of 100 ms was used, however, the superiority was still present when this cutoff was 66 ms.
A limitation of the present analysis is that it did not include patients with arrhythmias-such as atrial fibrillation or frequent ectopics-to evaluate the performance of the CARDIAN approach in these cases.Arrhythmia can affect the R-R interval, which is a crucial component of the CARDIAN approach for The result of the proposed gating method compared with the expert's predictions in 6. ED-frame detection.Despite this limitation, the proposed ED frame detection constitutes a key advance in IVUS image analysis and is expected to positively influence subsequent research.We have previously demonstrated that ED-frame-based volumetric IVUS analysis is more reproducible than conventional IVUS segmentation (44).These findings are important for longitudinal intravascular imaging-based studies assessing the implications of pharmacotherapies on plaque volume, as a more reproducible IVUS analysis is expected to reduce the number of vessels that should be included in these studies to demonstrate statistically significant changes in plaque burden (45,46).Another limitation is undersized dataset based on which the experiments are performed, this is due to unavoidable constraints in the current practices, such as limited equipment availability, labor-intensive data collection, limited suitable patient cases, etc.These problems limit the number of samples available for training the model.We acknowledge and appreciate the support from InfraReDx, Inc. for this work.They have expressed interest in incorporating the developed CARDIAN approach into their system to accurately detect the ED-frame for more reproducible volumetric analysis.This collaboration further validates the potential impact and usability of the CARDIAN approach in real-world clinical and research settings.

Conclusions
This study introduces a novel computational approach for realtime end-diastolic frame detection in intravascular ultrasound using bidirectional attention networks, CARDIAN, that is capable to accurately detect the EDs in the three coronary arteries.The proposed method operates in real-time and has superior performance to expert analysts and conventional LM methods.These advantages prove that this method is useful in clinical research and, in particular, in the analysis of large imaging datasets collected in longitudinal studies of coronary atherosclerosis.

FIGURE 3
FIGURE 3Examples of the normalized encoded temporal feature sequences, from original IVUS sequences.The blue line represents the temporal feature data, which is extracted by the proposed temporal difference encoder.The red dots represent the ED frames detected by the ECG.Each sequence includes 300 frames.

Figure 5
Figure 5 shows the signals encoded by the temporal difference encoder and spatiotemporal rotation encoder, together with the corresponding longitudinal view for the original IVUS pullback.In the first three cases, the encoded signals from intense and regular

FIGURE 4 (
FIGURE 4 (A) The correspondence between the peak of R wave in ECG and ED-frames in IVUS.(B) Catheter rotation alignment between two consecutive frames.The spatiotemporal rotation encoder aligns two consecutive IVUS frames by rotating the vessel around the catheter center.The rotation angle u is estimated by minimizing the difference between the adjacent rotated frames, and the aligned frames are then used for feature extraction.The pixel d in the first frame is rotated to the position q in the next frame.

FIGURE 5 (
FIGURE 5 (A) Partial longitudinal views and the proposed encoded sequence in 4 IVUS sequences-the length of each partial longitudinal view is 800 frames (26.6 s).The image on the left is a representative IVUS frame with the red line showing the sampled pixels used in the partial longitudinal view (0 ).The red lines in the longitudinal view indicate the ED-frames annotated by ECG.The yellow curves show the feature sequence encoded by the spatiotemporal rotation encoder and the green the temporal difference encoded sequence.

Four
result samples generated by BARNet in 300 frames (10 s).Blue curves: features encoded by the temporal difference encoder; red dots: ED-frames ground truth based on ECG; yellow curves: BARNet generated prediction; green lines: final ED-frame prediction based on the prediction scores of CARDIAN.Images (A-C) show results generated on regular cardiac cycles, and (D) presents the result for a patient with frequent ectopics and irregular cardiac cycles.
these designs are obtained based on an in depth understanding of the true requirements of cardiovascular research, by analysing the unique properties of the data and task, and by putting together the most advanced computer vision and machine learning algorithms in a dedicated way for tackling the end diastolic frame detection challenge.

FIGURE 9
FIGURE 9 6 s or 200 frames.Red lines: ED-frames annotated by ECG; green lines: ED-frame predictions.The examples in LAD (A-C), LCX (A-C) and RCA (B,C) represent vessels with common motion patterns.In the cases LAD (D), LCX (D) and RCA (A,D), there is a smaller range of motion and artifacts making it harder for ED-frame detection.Huang et al. 10.3389/fcvm.2023.1250800Frontiers in Cardiovascular Medicine 13 frontiersin.org 10ang et al.10.3389/fcvm.2023.1250800cardiac motion show a clear cycle of motion between ED-frames, demonstrating the regular systolic relaxation of the heart.In the last case on the lower right, the movement of the vessel is small, and thus the coded movement feature is weak as well as vessel's rotation.The proposed temporal difference encoder and spatiotemporal rotation encoder transfers 2D image signal to 1D global temporal features to reduce the demand for GPU performance for ED-frame detection and eliminates the interference factors during data collection such as noise, rotation, and local diseases of vessels.
E n is calculated based on a normalized element-wise multiplication with the learnable weight W [ R 140Â1 , as shown in Equation unit

TABLE 1
Ablation study on the ECG-IVUS dataset based on GoP recall.In this situation, the attention layer plays an important role by providing a larger field of view to help generate more accurate predictions, eliminate false predictions caused by other movements and prevent overfitting.

TABLE 2
Efficacy of the expert analysts of the LM and of the CARDIAN method in detecting the ED, using a prediction tolerance of 66.7 ms (2 frames).
TABLE 3A Comparative performance evaluation of two experts the LM, and CARDIAN method in the entire dataset, based on a prediction tolerance of 66.7 ms (2 frames).

TABLE 3B
The bold values indicate the highest scores for each coronary.

TABLE 4
The nearest prediction interval (in ms) of the two experts, the LM and the CARDIAN methodology.
The values in bold represent the best performance in each coronary artery.