PulseDB: A large, cleaned dataset based on MIMIC-III and VitalDB for benchmarking cuff-less blood pressure estimation methods

There has been growing interest in developing cuff-less blood pressure (BP) estimation methods to enable continuous BP monitoring from electrocardiogram (ECG) and/or photoplethysmogram (PPG) signals. Most of these methods have been evaluated on publicly available datasets; however, studies differ significantly in the size of the data, the number of subjects, and the pre-processing steps applied to the data eventually used for training and testing the models. Such differences make performance comparisons across models largely unfair and mask the generalization capability of the various BP estimation methods. To fill this gap, this paper presents “PulseDB,” the largest cleaned dataset to date for benchmarking BP estimation models, which also fulfills the requirements of standardized testing protocols. PulseDB contains 1) 5,245,454 high-quality 10-s segments of ECG, PPG, and arterial BP (ABP) waveforms from 5,361 subjects retrieved from the MIMIC-III waveform database matched subset and the VitalDB database; 2) subjects’ identification and demographic information, which can be used as additional input features to improve the performance of BP estimation models, or to evaluate the generalizability of the models to data from unseen subjects; and 3) positions of the characteristic points of the ECG/PPG signals, making PulseDB directly usable for training deep learning models with minimal data pre-processing. Additionally, using this dataset, we conduct the first study to provide insights into the performance gap between calibration-based and calibration-free testing approaches for evaluating the generalizability of BP estimation models. We expect PulseDB, as a user-friendly, large, comprehensive, and multi-functional dataset, to serve as a reliable source for the evaluation of cuff-less BP estimation methods.


File Structure
The PulseDB dataset is released as 5,361 MATLAB® data files (MAT-File version 7.3), with each file containing all segments belonging to one subject. The files are organized into two folders, "PulseDB_MIMIC" and "PulseDB_Vital", separating subjects in the MIMIC-III matched subset from subjects in the VitalDB dataset.
Each of these files is a 1-D array of MATLAB structures. Each structure corresponds to a 10-s segment and includes the following fields:
• SubjectID: Takes the format "pXXXXXX", in which the 6 digits after "p" are the original subject ID used by the MIMIC-III matched subset or the VitalDB dataset. Segments in the same data file share the same SubjectID.
• CaseID: Identifier of the record. Segments in the same data file with different CaseID values come from different records belonging to the same subject.
• SegmentID: Sequence number of the segment. For segments with the same SubjectID and CaseID, a segment with a smaller SegmentID occurs earlier in time than a segment with a larger SegmentID.
• ECG_Raw, PPG_Raw, ABP_Raw: Raw, unfiltered physiological signals cropped directly from records in the MIMIC-III matched subset and the VitalDB database. In each segment, the amplitudes of the ECG_Raw and PPG_Raw signals were linearly remapped to the range between 0 and 1, while the amplitude of the ABP_Raw signal was not modified, so that the absolute SBP and DBP values are preserved. These raw signals can be filtered with user-defined settings to produce the inputs or outputs that best fit the desired BP estimation method.
• ECG_F, PPG_F, ABP_F: Filtered ECG, PPG and ABP signals. The ECG signal was band-pass filtered at [0.5, 40] Hz with a Butterworth filter, while the PPG and ABP signals were band-pass filtered at [0.5, 8] Hz with a Chebyshev-II filter, as described in Section 2.4. We suggest using the ECG_F and PPG_F signals as inputs of BP estimation models if the model has no specific requirement for processing its input signals.
• ABP_Lag, PPG_ABP_Corr: ABP_Lag is the lag that yields the highest cross-correlation between ABP_F and PPG_F. The maximum possible lag is limited to ±125 samples, or ±1 s. PPG_ABP_Corr is the value of Pearson's correlation coefficient between ABP_F and PPG_F after alignment using this lag.
• ECG_RPeaks, PPG_SPeaks, PPG_Turns, ABP_SPeaks, ABP_Turns: Positions of the extracted characteristic points, specified as indices. For example, indexing ECG_Raw using ECG_RPeaks returns the amplitudes of all R-peaks of the ECG signal in this segment.
• SegSBP, SegDBP: The segment-averaged SBP and DBP values. We suggest using these values as reference SBP and DBP labels for training sequence-to-label BP estimation models (e.g. CNN), and using ABP_Raw for training sequence-to-sequence models (e.g. LSTM).
• Age, Gender, Height, Weight, BMI: Demographic information of the subject from which the segment was retrieved. The Height, Weight and BMI fields are only available for segments derived from the VitalDB database, since the MIMIC-III matched subset does not record this information.
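The lag and correlation fields described above can be illustrated with a minimal Python sketch. It is not the code used to build PulseDB: the function names are hypothetical, the score used to compare lags is the Pearson coefficient over the overlapping samples (a normalized form of cross-correlation), and the sign convention (a positive lag means ABP is delayed relative to PPG) is an assumption, since this section does not state either detail.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def overlap(abp, ppg, lag):
    """Overlapping samples when ppg[i] is paired with abp[i + lag].

    Assumed convention: lag > 0 means ABP is delayed relative to PPG.
    """
    n = len(ppg)
    if lag >= 0:
        return abp[lag:], ppg[:n - lag]
    return abp[:n + lag], ppg[-lag:]


def best_lag(abp, ppg, max_lag=125):
    """Search lags in [-max_lag, max_lag] and return (lag, correlation).

    max_lag=125 mirrors the limit of +/-125 samples stated in the text.
    """
    best = (0, -math.inf)
    for lag in range(-max_lag, max_lag + 1):
        a, p = overlap(abp, ppg, lag)
        r = pearson(a, p)
        if r > best[1]:
            best = (lag, r)
    return best
```

Calling `best_lag(abp_f, ppg_f)` on two filtered 10-s signals of equal length returns a pair that plays the role of the lag and correlation fields, subject to the assumptions above.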

Proxy Files for Statistical Analysis
To improve data loading efficiency and reduce the memory cost of analyzing a large dataset, proxy files are generated for PulseDB. They include only the demographic information and the reference SBP and DBP labels of each segment, so that basic statistical analysis can be performed without loading the waveform data.
Three types of proxy files are generated and included as part of the dataset. The first type, named "PulseDB_Info", summarizes the demographic information and reference BP values for all subjects in the PulseDB dataset. The second type, comprising "Train_Info", "CalBased_Test_Info", "CalFree_Test_Info", "AAMI_Test_Info" and "AAMI_Cal_Info", corresponds to the training, calibration, and testing subsets of segments summarized in Table 4. Since these proxy files may record information of subjects from both the MIMIC-III matched subset and the VitalDB dataset, the field Subj_Name is used to identify subjects: it consists of the SubjectID used in the data files followed by an additional digit, 0 or 1, that distinguishes subjects in the MIMIC-III matched subset from subjects in the VitalDB dataset. The third type, comprising "VitalDB_Train_Info", "VitalDB_CalBased_Test_Info", "VitalDB_CalFree_Test_Info", "VitalDB_AAMI_Test_Info" and "VitalDB_AAMI_Cal_Info", corresponds to the supplementary subsets generated from only VitalDB subjects, described in Section 4.3 and Table 7.
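As an illustration of the kind of basic analysis the proxy files are meant for, the following Python sketch splits a subject name into the SubjectID and the trailing source digit, then summarizes reference BP labels per subject. The record layout (a list of dictionaries with keys `Subj_Name`, `SegSBP` and `SegDBP`) and both function names are assumptions made for this example. The sketch deliberately does not map the digit to a database, because this section does not state which value denotes which source.

```python
from collections import defaultdict
from statistics import mean


def split_subj_name(subj_name):
    """Split a proxy-file subject name into (SubjectID, source_digit).

    Assumes the name is "p" + 6 digits (the SubjectID) + one digit (0 or 1).
    """
    subject_id, flag = subj_name[:-1], subj_name[-1]
    if flag not in ("0", "1"):
        raise ValueError(f"unexpected source digit in {subj_name!r}")
    if len(subject_id) != 7 or subject_id[0] != "p" or not subject_id[1:].isdigit():
        raise ValueError(f"unexpected SubjectID in {subj_name!r}")
    return subject_id, int(flag)


def summarize_by_subject(records):
    """Count segments and average the reference SBP/DBP labels per subject."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["Subj_Name"]].append(rec)
    summary = {}
    for name, recs in grouped.items():
        summary[name] = {
            "n_segments": len(recs),
            "mean_sbp": mean(r["SegSBP"] for r in recs),
            "mean_dbp": mean(r["SegDBP"] for r in recs),
        }
    return summary
```

Grouping on the full subject name rather than on the SubjectID alone keeps subjects from the two source databases apart even if their numeric IDs coincide.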

Generation of Python-Friendly Training, Calibration and Testing Subsets
A MATLAB function is provided to generate the training, calibration, and testing subsets summarized in Tables 4 and 7 and Section 4.3, using the subset separation settings defined in the proxy files. To maximize training and testing efficiency, the function fetches the required data from the MATLAB data files that store the signal segments and concatenates them into multi-dimensional arrays. These arrays can be loaded efficiently into Python environments for training and evaluating deep learning models. The generated supplementary training, calibration, and testing subsets are available from Kaggle.
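The concatenation step can be pictured with a small Python sketch. This is not the provided MATLAB function: `stack_segments` and `shape` are hypothetical helpers, each segment is assumed to be available as a dictionary, and the keys `ECG_F`, `PPG_F`, `SegSBP` and `SegDBP` assume underscore-joined field names (MATLAB field names cannot contain spaces). The segments-by-channels-by-samples layout is likewise an assumption, not a description of the released arrays.

```python
def stack_segments(segments,
                   signal_fields=("ECG_F", "PPG_F"),
                   label_fields=("SegSBP", "SegDBP")):
    """Gather per-segment fields into nested lists.

    Returns (inputs, labels), where inputs has the layout
    [n_segments][n_channels][n_samples] and labels [n_segments][n_labels].
    """
    if not segments:
        raise ValueError("no segments given")
    n_samples = len(segments[0][signal_fields[0]])
    inputs, labels = [], []
    for seg in segments:
        channels = [list(seg[name]) for name in signal_fields]
        # Every channel of every segment must have the same length,
        # otherwise the result is not a rectangular array.
        if any(len(ch) != n_samples for ch in channels):
            raise ValueError("all signals must share one length")
        inputs.append(channels)
        labels.append([float(seg[name]) for name in label_fields])
    return inputs, labels


def shape(nested):
    """Dimensions of a rectangular nested list, e.g. (3, 2, 1250)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return tuple(dims)
```

The nested lists can be handed to an array library of choice for training; for sequence-to-sequence models, an ABP signal field could be passed as the target in place of the scalar labels.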