Machine learning-based detection of cardiovascular disease using ECG signals: performance vs. complexity

Introduction. Cardiovascular disease remains a significant problem in modern society. Among non-invasive techniques, the electrocardiogram (ECG) is one of the most reliable methods for detecting cardiac abnormalities. However, ECG interpretation requires expert knowledge and is time-consuming. A method that detects the disease early could improve the quality and efficiency of medical care.
Methods. The paper presents several modern approaches for classifying cardiac diseases from ECG recordings. The first approach combines a Poincaré representation of the ECG signal with deep-learning-based image classifiers. In addition, the raw signals were processed with one-dimensional convolutional models, while an XGBoost model was used to predict from time-series features.
Results. The Poincaré-based methods showed decent performance in predicting AF (atrial fibrillation) but not other types of arrhythmia. The XGBoost model gave acceptable performance on long-term data but had a long inference time due to computationally expensive calculations in the pre-processing phase. Finally, the 1D convolutional models, specifically the 1D ResNet, showed the best results on both studied datasets, CinC 2017 and CinC 2020, reaching F1 scores of 85% and 71%, respectively, and outperforming the first-ranking solution of each challenge. The 1D models also showed high specificity. Our paper also investigated efficiency metrics, including power consumption and equivalent CO2 emissions; the one-dimensional models, 1D CNN and 1D ResNet, were the most energy efficient. Model interpretation analysis showed that DenseNet detected AF using heart rate variability, whereas the 1D ResNet assessed AF patterns in the raw ECG signal.
Discussion. Despite their weaker results, Poincaré diagrams are still worth further study because they are accessible and inexpensive to obtain. In the 1D convolutional models, residual connections help keep the model simple without decreasing performance. Our approach to power measurement and model interpretation helped us understand the numerical complexity of the models and the mechanism behind their decisions.


Introduction
Cardiovascular disease is a serious public health problem that affects millions of people worldwide and is a leading cause of death [1]. Healthcare expenses, lost productivity, and diminished quality of life due to heart disease have a significant economic and social impact on individuals, families, and society as a whole [2]. This underscores the value of early disease identification. While the electrocardiogram (ECG) is considered the most crucial method for detecting and diagnosing cardiac problems [3], interpreting ECGs takes time and requires trained professionals with specialized skills. ECG analysis comprises two tasks: beat annotation and signal classification. The former aligns signal segments with heart contractions, whereas the latter predicts the disease from the signal data.
In the domain of ECG classification, methods range from feature-based models to deep-learning-based ones. The feature-based approach combines a feature extraction technique with a machine learning model. Feature extraction methods are very diverse; however, domain-dependent features, statistical descriptors, morphological characteristics, and frequency-based features are widely chosen [4].
ECG classification challenges that provide annotated datasets, such as the PhysioNet/Computing in Cardiology Challenge (CinC) 2017 and 2020 [5,6], aim to give the data science community opportunities to develop novel methods for automatic detection. While CinC 2017 focused on arrhythmia detection from single-lead recordings, CinC 2020 covered ECG signals from a wide range of cardiac abnormalities. Although there have been many efforts to apply machine learning and deep learning approaches to reach the highest performance, the results were still modest, especially in CinC 2020.
In addition, state-of-the-art methods have improved the performance of signal classification, especially deep-learning-based methods. Alongside these accuracy gains, model architectures have become more complicated, so they require more energy to train and have long inference times. This limits their application, especially in handheld and wearable devices.
In this study, we focused on improving ECG classification approaches in terms of performance, numerical complexity, inference time, and interpretability.
Contribution. The contribution of our paper is threefold:
• First, we introduce a pipeline for evaluating heart disease classifiers in terms of both performance and numerical complexity.
• Second, we achieved state-of-the-art performance (with respect to the CinC 2017 and CinC 2020 challenges) with a 1D ResNet model on both benchmarks.
• Third, we provide interpretation techniques for the DenseNet121 and 1D ResNet models.

Related work
Many previous works on ECG classification focus on detecting heart diseases, including atrial fibrillation, tachycardia, bradycardia, arrhythmia, and other problems [7,8,9]. Other studies try to predict mortality or demographic characteristics [10,11]. The input of these models is usually the raw ECG signal from a single lead or multiple leads; however, some works use only ECG images [12].
The data can either be transformed into informative features before being fed to a machine learning model or be processed automatically by a deep learning model into a high-dimensional representation. In the feature-based method, four groups of descriptors can be extracted from the ECG signal: time-domain features, nonlinear-domain features, distance-based features, and time-series features [13]. A classifier, such as logistic regression, a support vector machine, or a boosting algorithm, then makes the prediction based on these features. By contrast, deep-learning-based approaches perform feature extraction and prediction jointly. These models use multilayer perceptrons [8], CNN models [7], or LSTM models [14] to extract high-level features of the ECG signal.
The work by Jun et al. [12] focused on predicting arrhythmia from ECG beats. The authors used data from the MIT-BIH arrhythmia database. In the pre-processing phase, the ECG signal was partitioned into segments centered on the Q-wave peak time, which were then plotted to generate 128 x 128 grayscale images as the input for learning. Data augmentation was performed by cropping and resizing the training images. The CNN model classified these plots into eight labels, and the authors used AlexNet, VGGNet, and a customized CNN architecture to optimize performance. The proposed model reached an AUC of 0.989 and over 99% accuracy, and data augmentation increased the sensitivity of the models. In contrast, our Poincaré-diagram approach uses a scatter plot of heart rate variability instead of the raw signal, and we used more recent CNN architectures, namely ResNet and DenseNet. We did not apply cropping and resizing augmentation, because it could alter the dispersion of the Poincaré plot and lead to wrong predictions. Instead, we used limited random erasing on the input images to regularize the CNN model.
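Random erasing blanks out a randomly placed rectangle of the input image. The following NumPy sketch shows one way to keep the erased area small, which is the point of the "limited" variant. The function name `limited_random_erasing`, the 5% area cap, the aspect-ratio range, and the fill value are illustrative assumptions, not the paper's settings. torchvision ships a comparable transform, `torchvision.transforms.RandomErasing`, controlled by its `p`, `scale`, and `ratio` arguments.

```python
import numpy as np

def limited_random_erasing(image, max_area_frac=0.05, p=0.5, fill=0.0, rng=None):
    """Erase one random rectangle covering roughly at most `max_area_frac` of the image.

    Hypothetical helper: the bounds and fill value are illustrative assumptions.
    Accepts an (H, W) or (H, W, C) array and returns a modified copy.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = image.copy()
    if rng.random() >= p:
        return out  # with probability 1 - p the image is left untouched
    h, w = out.shape[:2]
    # Sample a small target area and an aspect ratio for the erased rectangle.
    area = rng.uniform(0.01, max_area_frac) * h * w
    aspect = rng.uniform(0.5, 2.0)
    eh = int(min(h, max(1, round(np.sqrt(area * aspect)))))
    ew = int(min(w, max(1, round(np.sqrt(area / aspect)))))
    # Place the rectangle uniformly at random inside the image bounds.
    top = int(rng.integers(0, h - eh + 1))
    left = int(rng.integers(0, w - ew + 1))
    out[top:top + eh, left:left + ew] = fill
    return out
```

Because only a small patch is removed, most of the point cloud, and hence its overall spread, survives each augmentation, unlike a crop-and-resize that rescales the whole scatter.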
Hong et al. [7] developed an ensemble system for processing waveform data.
Their architecture has three key parts: the model zoo, the ensemble composer, and the real-time serving system. The model zoo trains models on the data with several sets of hyperparameters, while the ensemble composer selects the best set of models under constraints on validation performance and latency. The serving system takes the output of the ensemble composer and deploys a system that can handle massive input data and queries in real time. The authors performed many experiments on different signal leads with several types of deep learning models, such as CNN, ResNet, ResNeXt, and RegNet. The experiments showed that this system could reach an accuracy of 95% with a latency under one second in a 64-bed simulation.
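To make the role of an ensemble composer concrete, here is a toy greedy selection under a latency budget. It is not the algorithm of Hong et al. [7]; the `Candidate` type, the greedy rule, and the assumption that member models run sequentially (so latencies add up) are all hypothetical simplifications.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    val_score: float   # validation metric of the trained model, higher is better
    latency_ms: float  # per-query inference latency of the model

def compose_ensemble(candidates, latency_budget_ms, max_models=5):
    """Greedily add the best-scoring models while the summed latency fits the budget.

    Toy illustration only: members are assumed to run one after another.
    """
    chosen, total = [], 0.0
    for c in sorted(candidates, key=lambda c: c.val_score, reverse=True):
        if len(chosen) == max_models:
            break
        if total + c.latency_ms <= latency_budget_ms:
            chosen.append(c)
            total += c.latency_ms
    return chosen, total
```

With a 1000 ms budget and a zoo of a 700 ms ResNeXt (0.94), a 300 ms ResNet (0.93), a 200 ms RegNet (0.91), and a 120 ms CNN (0.90), this rule keeps the ResNeXt and the ResNet and stops at exactly 1000 ms; a real composer would also score the ensemble as a whole rather than its members individually.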
In the study by Ribeiro et al. [8], the authors investigated how to train a one-dimensional convolutional neural network to predict cardiac diseases. The training data comprised more than two million 12-lead ECG recordings lasting 7 to 10 seconds. The dataset was annotated semi-automatically by combining algorithms with human verification. The chosen model was based on the standard 1D ResNet architecture, with four residual blocks whose number of filters increases from 128 to 320. This pipeline surpassed human performance, with an F1 score of over 80%.
The work by Zhang et al. [9] demonstrated the superior results of deep learning compared with feature-based machine learning models. In this study, a 1D CNN was trained on 12-lead ECG recordings from the CPSC 2018 database. The model could predict 9 subtypes of arrhythmias with an F1 score of over 80%. A notable aspect of this work is the use of SHAP values to explain the model output at both the individual and the population level. At the individual level, the model highlighted the ECG characteristics supporting its decision, such as the abnormal QRS pattern following the P waves in AF or the prolonged PR interval in IAVB. At the population level, the authors examined the contribution of each lead to the model output and found that leads II, aVR, V1, V2, V5, and V6 were the most important in their model.
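SHAP values are Shapley values of a model's output, where each feature (or lead) is a "player" and the prediction is the payout. The SHAP library approximates them efficiently for deep networks; the brute-force sketch below computes them exactly for a tiny toy function, only to show what the attributions mean. The toy model, the all-zero baseline, and the rule "absent features take their baseline value" are assumptions for illustration, not the setup of Zhang et al. [9].

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley attributions of f at input x relative to a baseline input.

    Features outside a coalition are replaced by their baseline values.
    The cost grows exponentially with the number of features (toy inputs only).
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                s = set(subset)
                # Shapley weight |S|! (n - |S| - 1)! / n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Toy model: one additive term and one pairwise interaction.
toy = lambda z: 2 * z[0] + z[1] * z[2]
attributions = shapley_values(toy, x=[1, 2, 3], baseline=[0, 0, 0])
```

For this toy model the attributions are [2, 3, 3]: the additive term goes entirely to feature 0 and the interaction is split evenly. They sum to f(x) − f(baseline) = 8, the same "efficiency" property that lets SHAP explanations be read as a decomposition of one prediction.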
Besides signal classification, beat-level annotation has also been investigated. Corradi et al. [14] used recurrent neural networks to annotate ECGs with over 90% accuracy on multiple public datasets. Their model was also efficient enough to be deployed on wearable devices. In 2020, Teplitzky et al. [15] presented BeatLogic, a comprehensive system that detects and classifies cardiac beats and rhythms simultaneously using a 1D-CNN-based model. BeatLogic outperformed other methods on each of these tasks.
Our proposed method also uses the Poincaré plot, a standard tool for studying heart rate variability. Early work [16] used point coordinates in the Poincaré plot to compute the interbeat interval and its variability before using a support vector machine to classify AF and non-AF patients. Another study [17] combined the RR interval and the difference of successive RR intervals to build a robust AF classifier that needs only a few heartbeats. In [18], an ensemble of neural networks extracted five geometric patterns from the Poincaré plot (comet, torpedo, fan, double side lobe, and multiple side lobes), which were then used to classify the major cardiac arrhythmias. Finally, [19] proposed a modified Poincaré plot built from heart rate differences; features extracted from these plots by image processing help distinguish AF from premature atrial/ventricular contractions.
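The interval variability read off a Poincaré plot is commonly summarized by two standard descriptors: SD1, the spread perpendicular to the identity line (short-term variability), and SD2, the spread along it (long-term variability). The sketch below computes them from successive RR intervals. It illustrates these standard descriptors and is not the exact feature set of [16]-[19]; the use of population rather than sample standard deviations is an assumption, since conventions differ between tools.

```python
import math
import statistics

def poincare_descriptors(rr):
    """Return (SD1, SD2) for the Poincaré plot of successive RR intervals (seconds).

    Points are (RR_n, RR_{n+1}); rotating them by 45 degrees separates the
    spread across the identity line (SD1) from the spread along it (SD2).
    Population standard deviations are used here.
    """
    if len(rr) < 3:
        raise ValueError("need at least three RR intervals")
    x, y = rr[:-1], rr[1:]
    across = [(b - a) / math.sqrt(2) for a, b in zip(x, y)]
    along = [(b + a) / math.sqrt(2) for a, b in zip(x, y)]
    return statistics.pstdev(across), statistics.pstdev(along)
```

An irregular, alternating rhythm such as [0.8, 1.0, 0.8, 1.0, 0.8] has a large SD1 and an SD2 of zero, whereas a perfectly regular rhythm collapses to a single point with both descriptors equal to zero; AF typically spreads the cloud in both directions.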

Data
The data were collected from two challenges, the PhysioNet/CinC Challenges 2017 and 2020 [5,6]. The publicly released data were split into train/validation/test subsets with a 60/20/20 ratio. In CinC 2020, the CPSC dataset is a collection of ECG signals from Chinese patients recorded at 500 Hz. Patient gender and age are disclosed in this dataset; however, the age of patients over 89 years old is masked as 92 in accordance with HIPAA guidelines. The INCART database contains 30-min recordings at 257 Hz, while the PTB and Georgia datasets consist of 10-second recordings only. The private database is not publicly available, so it was not included in our work. The remaining CinC 2020 data were likewise split into train/validation/test subsets with a 60/20/20 ratio. Like the CinC 2017 dataset, the CinC 2020 data are WFDB-compliant, and the header files contain the demographic information and diagnosis labels.

Learning over Poincaré representation
For the methods based on the Poincaré diagram, the input ECG signals were preprocessed with biosppy [20] to extract the R-peak positions [21]. This library band-pass filters the ECG signal between 3 and 45 Hz before applying the Hamilton algorithm [22] to detect R-peaks. The RR intervals (distances between consecutive R-peaks) were then computed from the R-peak locations. In our study, we used only the NN intervals, i.e., the intervals between normal R-peaks after removing noise and artifacts. The Poincaré diagram was constructed as a scatter plot of NN_i against NN_{i+1}. Figure 3 shows examples of Poincaré diagrams for a short and a long recording.
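Starting from R-peak sample indices (for example the `rpeaks` returned by biosppy's `biosppy.signals.ecg.ecg`), the diagram's points can be built as in the sketch below. The NN filter used here (a physiological range of 0.3 to 2.0 s plus at most a 20% change from the previous interval) is a hypothetical stand-in for the paper's noise and artifact removal, and `rasterize` is a simplified replacement for rendering a plotted scatter chart into a CNN input image.

```python
import numpy as np

def poincare_pairs(rpeaks, fs, lo=0.3, hi=2.0, max_rel_change=0.2):
    """Build (NN_i, NN_{i+1}) pairs in seconds from R-peak sample indices.

    The NN filter (range [lo, hi] s and at most `max_rel_change` relative
    change from the preceding RR interval) is an illustrative assumption.
    """
    rr = np.diff(np.asarray(rpeaks, dtype=float)) / fs  # RR intervals in seconds
    valid = (rr >= lo) & (rr <= hi)
    rel = np.abs(np.diff(rr)) / rr[:-1]
    valid[1:] &= rel <= max_rel_change
    # Keep only pairs of consecutive intervals that are both accepted as NN.
    keep = valid[:-1] & valid[1:]
    return np.column_stack([rr[:-1][keep], rr[1:][keep]])

def rasterize(pairs, size=128, lo=0.3, hi=2.0):
    """Draw the pairs into a size x size binary image (NN_i on x, NN_{i+1} on y)."""
    img = np.zeros((size, size), dtype=np.float32)
    idx = np.clip(((pairs - lo) / (hi - lo) * (size - 1)).round().astype(int), 0, size - 1)
    # Flip rows so that larger NN_{i+1} values appear near the top, as in a plot.
    img[size - 1 - idx[:, 1], idx[:, 0]] = 1.0
    return img
```

Dropping any pair that touches a rejected interval avoids linking beats across an artifact, which would otherwise create spurious points far from the identity line.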
To predict heart disease from the Poincaré diagram, the standard ResNet50 [23] and DenseNet121 [24] architectures were trained from scratch (without pre-trained weights). The last layer of each model was adapted to match the number of classes in each dataset. Gradient-weighted Class Activation Mapping (GradCAM) [25] was applied to explore the mechanism behind the model decisions.
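GradCAM [25] weights each feature map of a chosen convolutional layer by the spatial average of the class-score gradient and keeps the positive part of the weighted sum. The NumPy sketch below implements that formula for activations of any spatial rank, so the same function covers (C, H, W) maps from the 2D models and (C, T) maps from the 1D models. It assumes the activations and gradients were already captured (for example with framework hooks) and omits the final upsampling to input resolution.

```python
import numpy as np

def grad_cam(activations, gradients):
    """GradCAM map from one layer's activations A^k and gradients dy_c/dA^k.

    Both arrays have shape (C, *spatial), e.g. (C, H, W) or (C, T).
    Returns a map over the spatial axes, normalized to [0, 1].
    """
    spatial_axes = tuple(range(1, activations.ndim))
    weights = gradients.mean(axis=spatial_axes)            # alpha_k: averaged gradients
    cam = np.tensordot(weights, activations, axes=(0, 0))  # sum_k alpha_k * A^k
    cam = np.maximum(cam, 0.0)                             # ReLU keeps positive evidence
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

Regions where channels with positive average gradients are strongly active light up; channels whose gradients are negative on average suppress the map, which is why the ReLU step is needed to keep only evidence for the target class.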

Learning over 1D signal
The 1D CNN model comprises twelve base blocks. Each base block consists of a 1D convolutional layer, 1D batch normalization, an activation function, a pooling layer, and a dropout layer. In every 1D convolutional layer, the padding is 'valid' and the stride is 1. The number of output channels starts at 256 and decreases gradually to 32 in the last convolutional layer; the kernel size is 20 in the first layer, 5 in the next five layers, and 3 in the remaining layers. Each batch normalization layer has as many features as the preceding convolutional layer has output channels, and its momentum is 0.99 in every block. The pooling layer of each base block is MaxPool1d with a kernel size and stride of 2. The dropout probability is 0.3 throughout. Before the tensor is flattened and fed to the final fully connected layer that outputs the logits, there is an average pooling layer with a kernel size of 1 and a stride of 2.
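Because every convolution is 'valid' and every block halves the sequence, the input length determines the size of the flattened vector. The sketch below traces the temporal length through the layers exactly as described (kernel sizes 20, then five 5s, then six 3s; max pooling with kernel and stride 2; a final average pooling with kernel 1 and stride 2). The 18,000-sample example assumes a 60 s recording at 300 Hz, the CinC 2017 sampling rate; how shorter recordings were padded or cropped is not covered here.

```python
def cnn1d_output_length(n_samples):
    """Per-stage temporal lengths of the described 1D CNN for a given input length."""
    kernels = [20] + [5] * 5 + [3] * 6  # twelve base blocks
    lengths = [n_samples]
    length = n_samples
    for k in kernels:
        length = length - k + 1  # 'valid' convolution with stride 1
        if length < 2:
            raise ValueError("input too short for this architecture")
        length = (length - 2) // 2 + 1  # max pooling, kernel 2, stride 2
        lengths.append(length)
    length = (length - 1) // 2 + 1  # average pooling, kernel 1, stride 2
    lengths.append(length)
    return lengths

stages = cnn1d_output_length(18000)
```

For 18,000 samples the last block outputs a length of 2 and the final pooling leaves a length of 1, so the flattened vector has as many entries as the 32 output channels of the last convolution. Inputs of a few hundred samples are too short and raise an error.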
The base block of the 1D ResNet consists of a 1D convolutional layer, 1D batch normalization, a ReLU activation, a dropout layer, another 1D convolutional layer, and 1D batch normalization. The block input passes through these layers and is then added to the residual, i.e., the input tensor itself. This sum is passed through a ReLU activation on leaving the block.
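An inference-time forward pass of this base block can be written in plain NumPy, as sketched below. For the identity shortcut to be added directly, the sketch assumes 'same' zero padding, odd kernel sizes, and equal input and output channel counts; the text does not specify these details, and a projection shortcut would be needed otherwise. Batch normalization uses stored running statistics, and dropout is omitted because it is the identity at inference time.

```python
import numpy as np

def conv1d_same(x, w):
    """Cross-correlation with zero 'same' padding. x: (C_in, T); w: (C_out, C_in, K), K odd."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((c_out, x.shape[1]))
    for o in range(c_out):
        for c in range(c_in):
            out[o] += np.correlate(xp[c], w[o, c], mode="valid")
    return out

def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    """Inference-mode batch normalization with per-channel running statistics."""
    return gamma[:, None] * (x - mean[:, None]) / np.sqrt(var[:, None] + eps) + beta[:, None]

def residual_block(x, w1, bn1, w2, bn2):
    """conv-BN-ReLU-(dropout)-conv-BN, add the block input, then ReLU."""
    h = np.maximum(batch_norm(conv1d_same(x, w1), *bn1), 0.0)
    # Dropout is the identity at inference time, so it is skipped here.
    h = batch_norm(conv1d_same(h, w2), *bn2)
    return np.maximum(h + x, 0.0)
```

If both convolutions output zeros, the block reduces to ReLU(x): the shortcut passes the input through unchanged, which is why stacking such blocks does not degrade the signal the way a plain deep stack can.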
In our work, the 1D ResNet starts with a 1D convolutional layer with a kernel size of 15 and 64 output channels, followed by 1D batch normalization, a ReLU activation, and a max pooling layer. Four base blocks follow, with kernel sizes increasing from 65 to 256. The output of the last base block goes through two pooling layers in parallel, an average pooling layer and a max pooling layer, whose outputs are concatenated before being fed to the final fully connected layer that computes the output logits.
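The head described above can be sketched as follows, assuming that both pooling layers are global (they reduce the whole time axis) and that a single recording is processed without a batch dimension; both are assumptions, since the text does not give the pooling sizes. The helper names are hypothetical.

```python
import numpy as np

def concat_global_pool(features):
    """(C, T) output of the last block -> (2C,) vector [average || max] over time."""
    return np.concatenate([features.mean(axis=1), features.max(axis=1)])

def linear_logits(pooled, weight, bias):
    """Final fully connected layer: weight (n_classes, 2C), bias (n_classes,)."""
    return weight @ pooled + bias

features = np.array([[1.0, 3.0], [2.0, -2.0]])  # 2 channels, 2 time steps
pooled = concat_global_pool(features)
```

One common rationale for concatenating both poolings is that the average summarizes activity over the whole recording while the maximum preserves short, strong local patterns that averaging would dilute. Global pooling also makes the head independent of the recording length.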
In both the 1D CNN and 1D ResNet models, the signals are converted to their first-order difference and scaled to zero mean and unit variance before being fed to the models. We also adapted GradCAM [25] to identify which regions of the ECG recordings contribute to the model output.
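The input transformation can be sketched as below. Computing the mean and variance per recording and per lead is an assumption (the statistics could also come from the training set), and the small `eps` guarding against flat signals is an added safeguard.

```python
import numpy as np

def preprocess(signal, eps=1e-8):
    """First-order difference followed by per-recording, per-lead z-scoring.

    signal: (n_leads, n_samples) or (n_samples,). The output is one sample shorter.
    """
    x = np.diff(np.asarray(signal, dtype=float), axis=-1)  # x[t] = s[t+1] - s[t]
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)
```

Differencing removes any constant offset and attenuates slow baseline drift, so the scaling step then normalizes amplitude differences between devices and leads.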

Learning over XGBoost feature space
In the XGBoost pipeline, the processed ECG signals are passed to the tsfresh library to extract features before training the model. Feature extraction used the default settings of the EfficientFCParameters subset but excluded the following time-consuming features: approximate_entropy, sample_entropy, matrix_profile, number_cwt_peaks, partial_autocorrelation, agg_linear_trend, and augmented_dickey_fuller. In the resulting feature matrix, the pipeline filled missing values with −999 and removed low-variance features.
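The settings filter and the cleaning step can be sketched as below. In tsfresh, `EfficientFCParameters()` is a dictionary mapping calculator names to their parameters, and a filtered dictionary would be passed as `default_fc_parameters` to `tsfresh.extract_features`; that call is left out so the sketch runs without tsfresh. The helper names and the variance threshold are assumptions.

```python
import numpy as np

# Calculators the text excludes because they are slow to compute.
EXCLUDED = {
    "approximate_entropy", "sample_entropy", "matrix_profile",
    "number_cwt_peaks", "partial_autocorrelation", "agg_linear_trend",
    "augmented_dickey_fuller",
}

def drop_slow_calculators(settings):
    """Return a copy of a tsfresh-style settings dict without the excluded calculators."""
    return {name: params for name, params in settings.items() if name not in EXCLUDED}

def clean_feature_matrix(X, fill_value=-999.0, min_variance=1e-8):
    """Fill missing values with a sentinel, then drop near-constant feature columns.

    X: (n_records, n_features) float array. Returns the cleaned matrix and the
    indices of the kept columns. The variance threshold is an assumption.
    """
    X = np.where(np.isnan(X), fill_value, X)
    keep = np.flatnonzero(X.var(axis=0) > min_variance)
    return X[:, keep], keep
```

A large negative sentinel such as −999 lets tree models like XGBoost separate "feature could not be computed" from ordinary values with a single split, while the variance filter discards columns that carry no information for any split.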
The XGBoost hyperparameters were optimized within a predefined search space (Table 2). The best configuration was found by Bayesian optimization as implemented in the scikit-optimize library [26]. The number of search trials was limited to 100 because of time constraints.

Cardiovascular diseases classification
The experimental results showed the superior performance of the 1D ResNet model trained on raw data in both datasets. In CinC 2020 in particular, this model surpassed the first-ranked solution by a large margin. Table 3 compares the F1 scores and the efficiency metrics (power consumption and equivalent CO2 emissions). To better illustrate the performance vs. complexity trade-off of the examined models, Figure 4 gives cross-plots of F1 score against CO2 emissions for both datasets. In particular, the DenseNet121 and ResNet50 models trained on Poincaré diagrams stand out from the other models as inefficient, while the 1D ResNet trained on raw ECG signals outperforms the rest. The Poincaré-based methods perform adequately in the CinC 2017 challenge but poorly in the CinC 2020 challenge, where some classes cannot be discriminated in the Poincaré diagrams. The ResNet50 and DenseNet121 models identified only AF, SB, SNR, STach, and other, while their metrics for the remaining types are close to zero. This result is understandable, as heart rate variability alone does not carry enough information to identify many types of heart disease. These models are also the most power-hungry: ResNet50 and DenseNet121 consumed 2 to 3 times more energy than the others.
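For reference, a per-class F1 score is 2TP / (2TP + FP + FN), the harmonic mean of precision and recall, and a macro score averages it over a chosen set of classes. The official CinC 2017 score averages F1 over the Normal, AF, and Other classes and leaves out the Noisy class, which the `labels` argument below can express. How the CinC 2020 F1 values in this paper were averaged is not stated, so that use of the function is an assumption; the helper names and class codes are hypothetical.

```python
def f1_per_class(y_true, y_pred, labels):
    """One-vs-rest F1 = 2TP / (2TP + FP + FN) for each requested class."""
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores over `labels`."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)

# CinC 2017 style: N = normal, A = AF, O = other rhythm, ~ = noisy (excluded).
y_true = ["N", "N", "A", "A", "O", "~"]
y_pred = ["N", "A", "A", "A", "O", "N"]
score = macro_f1(y_true, y_pred, labels=["N", "A", "O"])
```

Here the per-class scores are 0.5 for N, 0.8 for A, and 1.0 for O, giving a macro score of about 0.767. Because every class counts equally, a model that ignores rare classes, as the Poincaré-based models do on CinC 2020, is penalized heavily even if its overall accuracy looks reasonable.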
XGBoost performed below expectations: it ranked lowest in CinC 2017 and only third in CinC 2020, even though gradient-boosting models usually rank at the top of many machine learning benchmarks. In terms of power consumption, this model is very efficient when processing the short-term signals in CinC 2017; however, its energy requirement increases sevenfold when processing the long-term signals in CinC 2020. This is a consequence of the heavy preprocessing step in this pipeline.
We also analyzed the performance of the investigated models on each source of the CinC 2020 dataset (Table 4). ResNet50 was good at short-term recordings but performed poorly on long-term data. DenseNet121 was better than ResNet50 at long-term signal classification but did not surpass the 1D convolutional model. XGBoost outperformed the others on long-term ECGs. However, the number of long-term signals is modest, so these metrics may not be stable.
Figure 6 shows where the 1D ResNet focuses when predicting an AF signal. The yellow areas are the segments that attract the model's attention. These heatmaps show that the classifier focused on the signal in the neighborhood of the QRS complex, regions that correspond to the P-wave and T-wave of the ECG recordings. Indeed, the absence or abnormality of the P-wave and T-wave is related to heart rate fluctuation and is predictive of arrhythmia [27].

Feature importance of XGBoost
To explore how the XGBoost model predicts classes, feature importance scores were calculated and summarized in Table 5. The results show that features related to signal peaks, such as fft_coefficient and ratio_beyond_r_sigma, are highly important. This suggests that the XGBoost model infers heart rate information indirectly through these peak-related features and then predicts arrhythmia from it.
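Per-feature importances (for example the `feature_importances_` attribute of a fitted `xgboost.XGBClassifier`) can be aggregated into feature groups using tsfresh's column naming, `<kind>__<calculator>__<param>_<value>...`. The sketch below sums the scores per calculator; summation as the aggregation rule behind Table 5 is an assumption, and the example names and scores are made up.

```python
from collections import defaultdict

def group_importance(feature_names, importances):
    """Sum importance scores per tsfresh calculator, sorted from highest to lowest.

    The calculator is the second '__'-separated field of a tsfresh column name,
    e.g. 'ecg__fft_coefficient__attr_"abs"__coeff_3' -> 'fft_coefficient'.
    """
    totals = defaultdict(float)
    for name, score in zip(feature_names, importances):
        parts = name.split("__")
        group = parts[1] if len(parts) > 1 else parts[0]
        totals[group] += float(score)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

names = ['ecg__fft_coefficient__attr_"abs"__coeff_3',
         'ecg__fft_coefficient__attr_"real"__coeff_1',
         'ecg__ratio_beyond_r_sigma__r_2',
         'ecg__mean']
ranking = group_importance(names, [0.2, 0.1, 0.25, 0.05])
```

Grouping matters because a calculator such as fft_coefficient produces many columns; their individual scores can look small even when the group as a whole dominates the model.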

Inference Time
Table 6 compares the inference time of the trained methods. Although XGBoost's prediction step itself is very fast, its heavy preprocessing makes its total inference time the longest in the benchmark: 24 times longer than that of the next-slowest method. The Poincaré-based methods require roughly twice the inference time of the 1D CNN or 1D ResNet. This result is consistent with the computational cost of 2D versus 1D convolution: a 2D kernel covers a two-dimensional neighborhood at every position of the image, whereas a 1D kernel covers only a window along the time axis.
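One way to make the cost difference concrete is to count multiply-accumulate operations (MACs) of a stride-1 convolution with 'same' padding. The layer sizes below are hypothetical and not those of the paper's models; MAC counts also ignore memory traffic and implementation-level optimizations, so they indicate only the trend, not the measured times.

```python
def conv1d_macs(length, k, c_in, c_out):
    """MACs of a stride-1 'same' 1D convolution: one k-tap dot product per position and channel pair."""
    return length * k * c_in * c_out

def conv2d_macs(height, width, k, c_in, c_out):
    """MACs of a stride-1 'same' k x k 2D convolution: k*k taps per position and channel pair."""
    return height * width * k * k * c_in * c_out

# Hypothetical layers with the same number of positions (4096) and kernel width 9.
macs_1d = conv1d_macs(4096, 9, 64, 64)
macs_2d = conv2d_macs(64, 64, 9, 64, 64)
```

With the same number of positions and the same kernel width k, the 2D layer performs k times as many operations (here 9 times), because each output sums over k² taps instead of k.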

Statement on computational resources and environmental impact
The experiments were performed on a workstation with one Intel Core i7-9700F CPU and one NVIDIA RTX 3600 GPU. In total, this work produced 1.8 kg of CO2-equivalent emissions. The carbon emission figures were generated with the open-source library eco2AI [28].
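Tools such as eco2AI [28] estimate the energy drawn by the hardware and convert it to CO2-equivalent emissions with an emission factor for the local electricity grid. The arithmetic behind that conversion is sketched below; the average power, duration, and emission factor in the example are hypothetical values, not measurements from this work.

```python
def energy_kwh(avg_power_w, hours):
    """Energy in kWh from an average power draw in watts over a duration in hours."""
    return avg_power_w * hours / 1000.0

def co2_kg(energy, emission_factor_kg_per_kwh):
    """CO2-equivalent emissions in kg for a given energy use and grid emission factor."""
    return energy * emission_factor_kg_per_kwh

# Hypothetical run: 250 W on average for 10 hours on a grid emitting 0.4 kg CO2e per kWh.
run_energy = energy_kwh(250, 10)
run_co2 = co2_kg(run_energy, 0.4)
```

This hypothetical run uses 2.5 kWh and emits about 1 kg of CO2e. Because emissions scale linearly with energy, the sevenfold energy increase reported for XGBoost on long-term signals translates directly into sevenfold emissions for the same grid.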

Conclusion
In this paper, we have presented several novel approaches for classifying cardiac diseases from ECG recordings. The first approach used Poincaré diagrams with deep-learning-based image classifiers; the ResNet50 and DenseNet121 architectures were chosen to process the diagrams. The experimental results showed that these methods are decent for atrial fibrillation but not good at predicting other types of arrhythmia. In particular, the Poincaré-based methods perform adequately on the CinC 2017 data but poorly on the CinC 2020 data. However, RR or NN intervals, and therefore Poincaré diagrams, are much more accessible and can be obtained without the relatively complicated and expensive ECG procedure. Thus, this approach is still worth further study. XGBoost performed better on the long-term subset than on the short-term data. This gradient-boosting model has a long inference time because of the expensive calculations in the preprocessing step. The one-dimensional convolutional models showed the best results on both studied datasets; in particular, the 1D ResNet outperformed the first-ranking solution of each challenge. The residual connections proved advantageous for propagating information without making the model excessively deep.
We also investigated efficiency metrics during training, including power consumption and equivalent CO2 emissions. Because processing 2D images involves a heavy workload, the 2D ResNet and DenseNet rank highest in power consumption. XGBoost is energy efficient on short-term signals, but its power requirement grows roughly sevenfold when training on long-term signals. Because the 1D convolution operator is computationally cheaper, the one-dimensional models, 1D CNN and 1D ResNet, are the most energy efficient among the studied methods.
For model interpretation, three models (DenseNet, 1D ResNet, and XGBoost) were analyzed to determine how they discriminate between normal and AF data. DenseNet detected AF using heart rate variability, measured by the spread of the data cloud and the presence of points in the upper-left and lower-right regions of the Poincaré diagram. The 1D ResNet, on the other hand, assessed AF patterns in the raw ECG signal much as a medical expert would: it focused on the area around the QRS complex, which is also where the P and T waves are located.

Figure 1: Class distribution in the CinC 2017 (left) and CinC 2020 (right) datasets

Figure 3: Poincaré diagrams of short-term (a) and long-term (b) ECG recordings. The diagrams plot the normal R-peak intervals (NN intervals).

Figure 4: F1 score vs. CO2 emissions: left, models trained on the CinC 2017 dataset; right, models trained on the CinC 2020 dataset. Dotted red ellipses highlight relatively heavy models.

Figure 5 visualizes the GradCAM output of DenseNet121 on CinC 2017 and shows how this model processes the Poincaré diagram differently for each class. In the Normal diagram, the model focused on the upper-left and lower-right areas while ignoring the shape of the point cloud. In the arrhythmia diagram, the model focused on the point cloud itself, i.e., the dispersion of the data.

Figure 6: Explaining 1D ResNet decisions with GradCAM for normal rhythm and arrhythmia

Table 1: Descriptive statistics for the CinC 2017 and CinC 2020 datasets, by recording source.
The host provided the data in WFDB format, with a .mat file containing the signal data and a .hea file containing a header with basic information, including the ID, recording parameters, and patient information. The CinC 2020 dataset contains 12-lead signals from five sources: the CPSC Database and CPSC-Extra Database, the INCART Database, the PTB and PTB-XL Databases, the Georgia 12-lead ECG Challenge (G12EC) Database, and the Private Database.

Table 2: The hyperparameter search space of XGBoost.

Table 3: Performance on the test datasets for the CinC 2017 and CinC 2020 competitions

Table 4: F1 score on each source of the CinC 2020 dataset.

Table 5: Feature importance score of each feature group in the XGBoost model

Table 6: Inference time of the trained models.