Feature evaluation of accelerometry signals for cough detection

Cough is a common symptom of multiple respiratory diseases, such as asthma and chronic obstructive pulmonary disorder. Various research works targeted cough detection as a means for continuous monitoring of these respiratory health conditions. This has been mainly achieved using sophisticated machine learning or deep learning algorithms fed with audio recordings. In this work, we explore the use of an alternative detection method, since audio can generate privacy and security concerns related to the use of always-on microphones. This study proposes the use of a non-contact tri-axial accelerometer for motion detection to differentiate between cough and non-cough events/movements. A total of 43 time-domain features were extracted from the acquired tri-axial accelerometry signals. These features were evaluated and ranked for their importance using six methods with adjustable conditions, resulting in a total of 11 feature rankings. The ranking methods included model-based feature importance algorithms, first principal component, leave-one-out, permutation, and recursive features elimination (RFE). The ranking results were further used in the feature selection of the top 10, 20, and 30 for use in cough detection. A total of 68 classification models using a simple logistic regression classifier are reported, using two approaches for data splitting: subject-record-split and leave-one-subject-out (LOSO). The best-performing model out of the 34 using subject-record-split obtained an accuracy of 92.20%, sensitivity of 90.87%, specificity of 93.52%, and F1 score of 92.09% using only 20 features selected by the RFE method. The best-performing model out of the 34 using LOSO obtained an accuracy of 89.57%, sensitivity of 85.71%, specificity of 93.43%, and F1 score of 88.72% using only 10 features selected by the RFE method. These results demonstrate the ability for future implementation of a motion-based wearable cough detector.


Introduction
Chronic Respiratory Diseases (CRDs) are common health conditions that affect the respiratory tract airways and lungs, resulting in various symptoms such as dyspnea (difficulty breathing), chest pain, cough, sputum, and others.Some diseases that fall under the CRD umbrella are chronic obstructive pulmonary disease (COPD), asthma, pneumonoconiosis, interstitial lung disease, and pulmonary sarcoidosis.These are incurable yet treatable conditions that require continuous monitoring of the disease's progression and an adjustable treatment plan.Among those conditions, the two main contributors to the global burden of chronic respiratory diseases are asthma and COPD (1), with the latter being the third cause of death worldwide (2).Moreover, global health metrics in 2019 concluded that CRDs were responsible for 71.31 million years of life lost (YLLs) and 32.4 million years of healthy life lost due to disability (YLDs) (1).
To help ease this global health burden inflicted by CRDs and to limit patients' deterioration, possible solutions include early detection and continuous monitoring of the disease to allow for proper treatment plans and on-time interventions when needed.To achieve this, many researchers targeted the detection and monitoring of respiratory diseases by leveraging the potential of data-driven artificial intelligence (AI) algorithms.The studies focused on the use of medical symptoms as an indicator to the presence of a respiratory health problem-detection-and as a measure of the progression of the respiratory conditionmonitoring.Some of the symptoms that have been analyzed and studied include respiratory sounds produced due to illness, such as wheeze, stridor, crackle, and gasp (3), and the symptom that has been the center of attention most recently is coughs.
The use of sophisticated machine learning (ML) and deep learning (DL) algorithms for the detection and the diagnosis of respiratory diseases from cough sounds has become a research focus in the last few years.Researchers have employed linear ML methods such as linear regression (LR) (4)(5)(6), non-linear ML methods including both support vector machine (SVM) (7)(8)(9) and k-nearest neighbors (kNN) (10)(11)(12), ensemble methods such as random forest (RF) (13,14), and XGBoost (15), and more complex DL methods such as artificial neural network (ANN) (16) and convolutional neural network (CNN) (17,18) to detect cough sounds.These studies targeted the use of audio recordings to detect the presence of cough events achieving an accuracy or sensitivity above 90%.Similar algorithms were also used for the diagnosis of different respiratory conditions, with more recent studies focusing on the diagnosis of COVID-19 from audio recordings such as the case in Imran et al. and Wei et al. (19,20) using CNN models to differentiate between COVID-19, pertussis, and bronchitis.In addition, these advances in cough detection and diagnosis algorithms were then taken a step further to provide a portable, ubiquitous solution by implementing them within a mobile App.This provided a costeffective, easy-to-use solution that witnessed a rise in its use especially during the COVID-19 pandemic.Some of these mobile Apps are COVID-19 sounds app (21) used for data collection, AI4COVID-19 (19), and QUCoughScope (22) for both data collection and diagnosis.
Though the use of the conventional audio-based method for cough detection and diagnosis has resulted in acceptable performance, it is dependent on capturing the audio during a cough event.This requires the use of always-on microphones that are always recording/listening to conversations and personal information in the wait for a cough event.As such, the dependency on audio for cough detection presents both privacy and security issues where private conversations could be maliciously used.In addition to the privacy invasion, another disadvantage of this portable solution is phone battery depletion, since the App has to be working in the background at all times.Therefore, an alternative solution to the audio-based method is needed for the implementation of a portable wearable cough detector.
In this work, we investigate the use of motion-based detection in place of the audio-based detection method.The rationale behind this is explained by the fact that the audible cough sound that is commonly used for detection is the result of a physical motor pattern of the body.The forced expulsive pattern of cough can be broken down into three phases: (i) inspiratory phase-characterized by typical inspiration (drawing of air into lungs); (ii) compression phase-glottis is completely closed and thorax is compressed to increase subglottic pressure; (iii) expulsive phase-glottis opens rapidly resulting in high-pressure airflow (23,24).These motions leading to a cough can be captured using an accelerometer worn close to the body, allowing the collection of data related to coughing.
Accelerometer sensors have been mostly used in applications related to human activity recognition (HAR) such as fitness tracking, with limited research targeting the use of an accelerometer as a cough sensor.One of the few studies using motion-based cough detection was presented in Mohammadi et al. ( 25) using a dual-axis contact accelerometer placed at the patient's neck to capture epidermal vibrations.A total of 35 features composed of temporal, time-frequency, frequency, and information-theoretic were extracted from the accelerometry signals.A binary genetic algorithm (BGA) was used for feature selection together with an SVM model to differentiate between voluntary cough and rest, achieving an accuracy of 99:26 + 0:12%.Other works targeting the use of only accelerometry signals were presented throughout multiple works (26-28) comparing the performance of cough detection in accordance to the position of the sensor.A total of five different positions were examined: chest, stomach, arm-using a tri-axial accelerometer attached by an elastic belt, ear using a headset with a fastened accelerometer on one side, and finally shirt pocket using a tri-axial accelerometer of the phone placed freely and firmly in the pocket.Raw data of the x, y, and z axes were used to train a CNN model to detect cough events.The best performance was achieved at the ear and chest positions with an accuracy of 97% for both (26).Further development used spectral summation of the tri-axial signal together with a CNN model to boost the performance demonstrating up to approximately 99% accuracy in the ear position and 98.2% accuracy at the chest (27).A different approach was used in Vyas and Doddabasappla (28), where spectrum spread of only the y and z axes were used as input to kNN and SVM models comparing only three positions (chest, stomach, and pocket).The SVM model outperformed the kNN model in detecting cough events, with the best accuracy being achieved in the pocket position at 96.1%, followed by the chest at 94.3%, and then the stomach at 94%.Other studies that are worth mentioning have used an accelerometer together with other sensors, such as in Otoshi et al. (29), that used a strain sensor around the neck with a tri-axial accelerometer around the epigastric region.Another study used a tri-axial accelerometer with an ECG front-end placed around the chest for cough detection (30).
In the reviewed cases of accelerometer-based cough detection, the used accelerometer was a contact sensor that required direct attachment to the user's skin either around the neck, chest, stomach, or ear.The only case with a non-contact accelerometer was the use of a phone placed freely in a shirt's pocket.In both cases, considering the end-user's need for continuous monitoring, the use of a contact-based sensor is most likely to present discomfort in the long run, while the use of a phone or other non-contact accelerometer with free placement in the pocket is not practical and is prone to error due to inconsistent positioning and added noise from the movement of the sensor within the pocket.
This leads to the main purpose behind this work: to the authors' best knowledge, there have not been any evaluation studies of time-domain features extracted from non-contact accelerometry signals for motion-based cough detection.Therefore, the first objective of this research study-an extension of our previous studies (31,32)-is investigating the use of a non-contact accelerometer for motion-based cough detection using data collected from multiple subjects.The second objective is to evaluate the ability of time-domain features to distinguish between cough and non-cough movements.This focus on time-domain features is because the extraction of these features does not require the additional step of computing the signal's frequency components as the case is for frequency-domain features.Hence, the computation of timedomain features requires less computational time and space/ memory.The third objective of the study is to compare different feature importance methods used for feature selection and use their feature ranking results to select the top features.Finally, the fourth objective of the study is to build a simple binary classification model for motion-based cough detection using different number of selected features based on different ranking methods, to find the best model with optimum performance and number of features.
The rest of the paper is organized as follows: the materials used for data acquisition and methods for pre-processing and data preparation are covered in Section 2, together with the description of the extracted time-domain features and the feature evaluation methods used.The data distribution of the extracted features and the results of the feature importance ranking are presented in Section 3, as well as the performance metrics used and the results of the classification models for cough detection.Finally, the results are discussed, and the different models' performances are compared, based on the number of features and the type of ranking method used, in Section 4, with some concluding remarks.

Materials and methods
To correctly classify and differentiate between cough and non-cough samples from accelerometry signals, a set of features was extracted for evaluation and comparison.The computed features were chosen based on the results of the literature review of related accelerometer-based activity recognition research, with a focus on time-domain features.The reason behind these targeted time-domain features is to evaluate and investigate the feasibility of cough detection based on only time-domain extracted features, which require a minimum amount of time and space for computation compared to frequency-domain features.In this section, the methods used for data collection, pre-processing, feature extraction, and feature evaluation are discussed.

Data collection
A total of five subjects (three males and two females) with an average age of 27:6 + 3:1 years participated in the experiment.The study was approved by the Imperial College Research Ethics Committee (ICREC ref.: 6669400), and written informed consent was obtained from all subjects.The subjects had a Nordic Thingy:53-a multi-sensor IoT prototyping platform-clipped to their shirt's collar.Accelerometry signals were recorded while subjects were seated for both cough and non-cough events, resulting in a total of 180 recordings (90 cough and 90 noncough) each of 10 s duration, at a sampling frequency of 62.5 Hz.Four subjects recorded 20 cough sessions and 20 non-cough sessions, while one subject recorded 10 cough sessions and 10 non-cough sessions.The cough sessions required subjects to cough throughout the 10 s recording continuously and repeatedly.Non-cough sessions involved the subject resting, talking, and/or using a mobile phone.

Data pre-processing
The acquired accelerometer recordings were composed of triaxial signals in the x-axis, y-axis, and z-axis indicating sideways movement (right-left), vertical movement (up-down), and outward movement (forward-backward), respectively.The first step in the pre-processing stage was the computation of the accelerometer vector magnitude by combining the raw x, y, and z accelerometry signals using (1).The magnitude vector describes the overall movement and is commonly used in processing accelerometer data.After the computation of the magnitude, each recording became composed of four vectors (accX, accY, accZ, Mag), which were then filtered using a fourthorder Butterworth Bandpass filter (0.5-15 Hz) to filter out movements below 0.5 Hz and any DC offset; and to eliminate the frequencies above 15 Hz, where no peaks were observed in cough recordings.
The implemented pre-processing pipeline is summarized in Figure 1, where after filtration the data were split into training and testing.Records were split based on subjects, where 80% of records from each subject were kept for training and the remaining 20% recordings from the same subject were kept for testing, with a balanced division between cough and non-cough recordings.The division of recordings is visualized in Figure 2.  2, where v 0 is the raw signal window, v is the resultant centered window, and i corresponds to the accelerometer signal vector within a window (i ¼ accX, accY, accZ, Mag).

Feature extraction
The audible cough sound made by a subject is the result of a chest motor pattern that is captured by the acquired accelerometer recordings.An example of the acquired accelerometry signal for a cough and a non-cough window (before centering) from each of the five subjects is presented in Figure 3.The cough accelerometer signal displays a non-periodic, random behavior that can be used, once detected, to differentiate it from the less random non-cough accelerometry signals.Hence, time-domain features measuring the distribution, dispersion, and randomness of the signal were computed to distinguish between cough and non-cough signals.An example illustrating a cough and a non-cough 2 s window of the captured accelerometry signal from each of the five subjects.Following the pre-processing stage, the time-domain features were extracted from the centered windows.A set of statistical features was computed from each of the signals (accX, accY, accZ, Mag) to describe the captured movement during cough and non-cough events.All computations were performed in Python using pandas, scipy.stats,and EntropyHub libraries.The list of the time-domain features used is summarized in Table 1 and a brief description is given as follows: Minimum (Min) is the minimum amplitude value of the accelerometer signal within a window.Maximum (Max) is the maximum amplitude value of the signal within a window.Difference (Diff) is the amplitude range calculated by subtracting the minimum value from the maximum value of the signal within a window.Root mean square (RMS) is the square root of the arithmetic mean of the squared values of the signal within a window.The RMS feature can evaluate the signal amplitude and energy in the time domain.For our centered windows (mean = 0), the RMS is also equal to the standard deviation, hence only RMS as a feature was computed.Variance (Var) is the measure of the data dispersion of the accelerometer signal within a window from the mean, which is also the standard deviation squared.Interquartile range (IQR) is the measure of dispersion and spread of the accelerometer signal within a window, computed by the difference between the 75th and the 25th percentile.It is similar to standard deviation and variance in measuring the dispersion of the signal; however, it is more robust against outliers.Median absolute deviation (MAD) is another measure of the data dispersion that is also more robust against outliers.It computes the median of the absolute deviation from the median of the signal within a window.Skewness is the measure of the symmetry of the distribution of the accelerometer signal within a window.A normal distribution has a skewness of 0, while a negative skewed distribution indicates a distribution with the tail in the negative side, and a positive skewed distribution has a tail in the positive side.
Kurtosis is another measure for the distribution shape of the signal within a window describing the peak of the distribution.Using Fisher's definition of kurtosis-the normal distribution of kurtosis is 0, if the value is less than 0 the distribution has a flatter shape, while if greater the distribution has a sharper shape.Entropy (Ent) is the mean of the computed approximate entropy of the signal within a window.It is used to measure the signal randomness and disorder.Correlation (Corr) is the measure of the pairwise correlation between the tri-axial signals within a window: Corr(accX and accY), Corr(accY and accZ), and Corr(accX and accZ).

Feature importance scoring methods
With the extraction of a total of 43 features from the accelerometry signals, the importance and contribution of individual features toward cough classification must be evaluated.This evaluation ranks the features based on their importance, which later guides the process of feature selection for the final classification model.There are various feature selection methods found in the literature that can be used to rank the feature's importance.In this study, we implemented a total of 11 feature ranking methods using only the training windows.The implemented methods and the description of their feature evaluation approaches are as follows:

Model-based feature importance
Model-based feature importance method is dependent and specific to the machine learning algorithm used.Under this method, three tree-based models were used to rank the feature importance using the built-in importance function of the following models: XGBoost, decision tree (DT), and random forest.The XGBoost model computes feature importance based on "gain," which is the improvement in the model's accuracy when a feature is added to the branch.While both DT and RF models compute importance score based on the reduction of the Gini impurity, where Gini is the criterion used to evaluate the quality of a split.

Principle component analysis
Principal component analysis (PCA) is a common method used for dimensionality reduction rather than feature selection.However, the first principal component (PC1) returned by the analysis can be used to score the importance of features.The rationale behind this is that plotting the cumulative explained variance for each component produced by our training data shows that the first component (PC1)using a single feature, can explain over 80% of the variance in the dataset.Hence, the values of PC1 loadings were used to rank the features based on the correlation between the features and the principal component.

Leave-one-out
As the name suggests, this method iteratively leaves one feature out at a time and evaluates the effect of its removal on the model's accuracy.The importance was measured as the difference between the base accuracy and the new accuracy without the feature; the greater decrease in the model's accuracy reflected the weight and importance of a given feature.Thus, higher importance and ranking were given to features with a greater decrease in the model's accuracy when eliminated.

Permutation
In the permutation importance method, feature ranking was based on the effect a feature had on the model's performance after having its values randomly shuffled.Once a feature was shuffled, the new accuracy was measured and compared to the baseline accuracy; this process was repeated 10 times for each feature and the mean value was measured.The greater the drop in accuracy indicated a higher ranking of the feature.This method was implemented using the permutation importance function in the sklearn:inspection library, with 10 repetitions for each feature permutation.

Recursive feature elimination
In this method, the feature selection function was given a fitted trained model to evaluate and rank features by recursively considering a smaller set of features.This process of feature pruning was repeated until the number of required features was reached.In this study, the method was repeated for four different numbers of selected features, setting the desired number of features to 10, 20, 30, and, finally, 43, i.e., using all features.

Correlation
The pairwise correlation was measured between the individual features and the binary target (cough, non-cough).Having both numerical and categorical variables, Spearman's correlation method was computed, and the absolute value was used to rank the feature importance.

Results
The extracted features were initially assessed for their ability to uniquely identify and separate cough events from non-cough events.This assessment was performed by inspecting individual feature's distribution based on target class (cough, non-cough).Box and whisker plots were chosen for the visual summary of a feature's distribution.Figures 4-7 present the 43 features box and whisker plots.Features extracted from the accelerometer x-axis are given in Figure 4, accelerometer y-axis in Figure 5, accelerometer z-axis in Figure 6, and accelerometer magnitude in Figure 7.It is clear that some features can distinguish between both classes, while other features have some overlap but can still be used to separate the two categories.In a few cases, the features have very close distribution indicating their redundancy for the classification problem.
The visual inspection of the feature distribution indicated the feasibility of using these extracted features for cough detection.As a result, the importance of each individual feature was ranked using the previously discussed feature ranking methods.Some methods, as mentioned earlier, were used multiple times with different parameters.Therefore, a total of 11 feature importance ranking methods were utilized.The features were ranked from 1 to 43, with 1 indicating the highest importance and 43 the lowest.A summary of the ranking is visualized in the heat map shown in Figure 8.The color index represents feature significance, with darker cells indicating higher ranking.

Performance metrics
The feature ranking results for feature selection were evaluated by assessing their performance on the test data set.The performance metrics included accuracy (ACC), sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), false positive rate (FPR), false negative rate (FNR), false discovery rate (FDR), and F1 score.These performance metrics are computed as given in Equations 3-11: In these equations, TP (True Positive) represents the number of correctly identified coughs, TN (True Negative) is the number of correctly labeled non-coughs, FP (False Positive) is the number of non-coughs misclassified as coughs, and FP (False Negative) is the number of coughs mislabeled as non-coughs.

Evaluation of feature importance
To evaluate which set of features provided by these methods perform better in distinguishing cough motions, their performance on the classification problem was assessed.The same classifier model, a binary Logistic Regression model, was used for all cases using two evaluation approaches for data splitting into training and testing.In the first performance evaluation approach, the data were split as was shown earlier in Figure 2, using 80% of the records from each subject for training and the remaining 20% for testing.The model was trained on the training dataset using the "top" features chosen by each ranking method.The classifier's performance was then evaluated on the test dataset using the same set of "top" features selected in the training process.This process was repeated for each of the 11 methods for comparison with the base model, which used all 43 features.
In the second performance evaluation, the model was trained and tested using a leave-one-subject-out (LOSO) data split approach for cross-validation.Using this approach on the collected data from five subjects set the model's cross-validation folds to 5, where in the i fold, the ith subject was used for testing, while the remaining subjects were used for training.The average results across the five folds were then computed and used for the model's performance evaluation.Similar to the first approach, this process was also repeated for each of the 11 methods using the "top" features chosen by each ranking method for comparison with the base model using all 43 features.
The logistic regression model was used to also evaluate the trade-off between the number of selected features and model performance on the test data set.The process was repeated for three cases for the number of selected features, using 10, 20, and 30 features, respectively, from each ranking method for both performance evaluation approaches.Therefore, a total of 66 models (33 models/approach) with their performance metrics were evaluated and compared to the baseline model using all features.The results of the performance metrics of all the classification cases using the first approach of subject-record-split are summarized in Table 2 and visually demonstrated in Figure 9 for the six performance metrics.The results of the performance metrics of all the classification cases using the second approach (LOSO) are summarized in Table 3 and Figure 10.

Discussion
In this study, the use of accelerometry signals acquired from a non-contact tri-axial accelerometer sensor was investigated for their ability to differentiate between cough and non-cough events based on motion.This included the extraction of The visualization of feature distribution using the box plots given in Figures 4-7 provided a primary insight into the feasibility of using detected motion to distinguish cough events.The 11 features extracted from the accX given in Figure 4 presented distributions where the majority of the features have different median, upper quartile, and lower quartile values between both classes of cough and non-cough.This observation indicated that at least 50% of the data points of a particular feature have distinct values for each class, such as Min X, and Diff X.However, there are three features (Skewness X, Kurtosis X, and Ent X), which presented an overlap in their distribution, where though they had different median values for each class, their lower and upper quartile values overlap.In this case, differentiating between cough and non-cough events by simply depending on these features alone would be tricky.
Similar observations were made for the data distributions of features extracted from accY, accZ, and Mag signals.In the case of accY features given in Figure 5 and accZ features in Figure 6, 8 out of the 11 features of each axis have distinct values for the median and quartiles of each class, indicating its possible use in identifying cough motions.The three remaining features of each axis presented with overlapped distribution between cough and non-cough events.These features were the skewness, kurtosis, and correlation features (Skewness Y, Kurtosis Y, Corr yz, Skewness Z, Kurtosis Z, and Corr xz).Nevertheless, they can still be helpful if used together with other features, as they all had different median values for each class.Especially in the case of Corr xz, where there is an overlap, it is only with one quartile-the upper quartile of the cough data with the lower quartile of the non-cough data points.As for the features extracted from the Mag signal in Figure 7, they share the same trend as the previous axes, where 8 out of the 10 features have distinct median, lower quartile, and upper quartile values for each class, while the two features related to skewness and kurtosis presented distinct median values with overlap in quartile values.However, it is noticed that the distribution of data points for the features extracted from the magnitude signals has a more distinctive distribution separating the two classes when compared to the features extracted from the other axes.As observed in the following features (Min Mag, Max Mag, Diff Mag, RMS Mag, Var Mag, and Ent Mag), the full range of the non-cough data points overlap only with part of the whisker of the cough data points, indicating an overlap that is less than 25%.This was also observed in 4 features extracted from the accZ (Min Z, Max Z, Diff Z, and RMS Z), reflecting these 10 Box and whisker plot summarizing the distribution of data for 11 extracted features from the accelerometer y-axis.
Diab and Rodriguez-Villegas 10.3389/fdgth.2024.1368574features' higher potential in distinguishing cough events.As for the presence of outliers, it was expected as data were collected from a total of five subjects, where it is highly unlikely to have exactly similar coughing motions from all subjects of different genders, hence the presence of outliers.
A qualitative analysis of the extracted features was implemented using the 11 feature ranking methods to evaluate each feature's importance toward cough identification.It is noticed from the heat map in Figure 8-summarizing the ranking of all features based on method-that features extracted from the magnitude signal rank higher than other features.The Diff Mag ranked first by four methods and ranked second by another two methods.Min Mag also had one of the top rankings showing a fully dark row across all methods similar to Diff Mag.Other features that ranked first were the Max Z by the RFE_10 and RFE_20 methods as well as the Corr xy by permutation and the Corr xz by LOO method.Corr xy also ranked second by four other methods (XGBoost, DT, RFE_10, and RFE_20) reflecting its importance, where half of the methods ranked it in the top two features.As for the least important features, it is perceived that the upper half of the heat map, relating to features extracted from the accX and some of the accY features, are lighter in color with a ranking above 20.Still, there are some anomalies in this observation, such as the rank 3, 5, and 6 for RMS X by XGBoost, DT, and permutation methods, respectively, as well as the first rank given to Min X by both methods.A simplified summary of the ranking results is given in Table 4, where the feature ranking is set to three groups indicating the rank in the top 10 (rank 1-10), top 20 (rank 11-20), or top 30 (rank [21][22][23][24][25][26][27][28][29][30].This summary focuses more on the presence of a feature in a top category rather than its specific ranking, which helps simplify the analysis and comparison of the feature rankings by the different methods.For example, it is noticed that all ranking methods place Min Mag in the top 10 set of features; thus, from this observation, it can be assumed that Min Mag is the most important feature. Overall, the feature importance methods resulted in different rankings across the features.This is expected as the ranking methods measure feature importance in different ways.Modelbased methods (XGBoost, DT, and RF) depend on the built-in importance of measuring the decrease in impurity.They have different biases than other methods, where tree-based models favor features that minimize the number of splits.While the PCA method relies on the cumulative explained variance of the first principal component and its correlation to the features.The LOO, permutation, and RFE methods measure the importance based on the drop in the prediction accuracy when a feature is removed or shuffled.Each of these three methods uses a different feature elimination approach, hence the difference in ranking results.As for the correlation method, it measures the direct correlation between each feature and the expected output class.Another observation is the relatively close ranking results produced by some methods, such as RFE_30 and RFE_43.An explanation behind this ranking overlap is the algorithm behind this methodology to rank the features.Both are based on the recursive feature elimination (RFE) method, but the first was set for 30 selected features, while the other was set for 43 features.The ranking behind this method relies on recursively searching the feature pool to select the best n features and then prune out the remaining less important features.Thus, when the difference is only the number of selected features, it is likely to have some overlap in the ranking results, which is observed in the results of the four methods based on RFE with n ¼ 10, 20, 30, and 43.This trend was also observed by the three tree-based ranking methods-XGBoost, DT, and RF-where the results do not completely overlap but are very close to each other due to some similarities in these models' architectures.
The performance results of the first approach, summarized in Figure 9 and Table 2 using the subject-record-split, show that the baseline model using all 43 features had already presented good performance.However, as one of the objectives of this study is to assess and examine the trade-off between performance and the number of features used, our target is to find optimum models presenting similar or improved performance to the baseline using a lower number of features.Examining the use of only 10 features, it is noticed that three models outperformed the baseline in terms of accuracy with the best score of 0.9107 achieved by the RFE_10 model.In terms of sensitivity, 9 out of the 11 models achieved better scores with the best being 0.8955 by the RFE_10 model.As for specificity and PPV, the LOO model was the only one that performed better than the baseline with 0.9339 for specificity and 0.9305 for PPV.The NPV improved in eight models while staying the same for the RFE_30 model, with the highest NPV at 0.8986 again by the RFE_10 model, which also reported the highest F1 score at 0.9093.As for the three remaining metrics, the best scores of FPR and FDR were achieved by the LOO model at 0.0661 and 0.0695, respectively, while the best FNR was 0.1045 achieved by the RFE_10 model.Since FPR is complementary to specificity, FNR to sensitivity, and FDR to PPV, only the first six metrics are used for comparison in the rest of the discussion.
When 20 features were selected, the RFE_43 model presented the best performance across all metrics, outperforming the baseline model.It increased the baseline accuracy by 0.0192, sensitivity by 0.33, specificity by 0.0053, PPV by 0.0075, NPV by 0.29, and F1 score by 0.208.Overall, the use of 20 features, in comparison to 10 features or the baseline, in the majority of the models improved the classification accuracy, sensitivity, NPV, Box and whisker plot summarizing the distribution of data for 10 extracted features from the accelerometer magnitude.
Diab and Rodriguez-Villegas 10.3389/fdgth.2024.1368574and F1 score; having at least 8 out of the 11 models with scores above the baseline.As for the specificity and PPV, only two and three models, respectively, scored slightly higher than baseline.
Regarding the use of 30 features for the classification problem, the performance metrics of three methods, namely, RFE_10, RFE_20, and RFE_30, achieved the best accuracy, sensitivity, NPV, and F1 scores, while the correlation model achieved the best specificity and PPV scores.It is also noticed that the three models RFE_10, RFE_20, and RFE_30 achieved the same performance, which leads back to the point discussed earlier, about the close ranking of these methods-where the top 30 features (regardless of rank order) of these models were the same.This can be easily visualized in the summarized Table 4.
As a measure of the generalizability of the proposed cough detection method, the LOSO approach was also used for evaluation.The results of this approach summarized in Figure 10 and Table   reflected an acceptable performance from the baseline using all 43 features.Although these baseline results were different than those reported by the subject-record-split approach, this was expected as the model was tested on unseen subjects and the results did not show major deviation.A slight drop in accuracy (À0:0176), sensitivity (À0:0412), NPV (À0:0112), and F1 score (À0:0289) were noticed while an improvement in specificity (þ0:0061) and PPV (þ0:0135) in the LOSO baseline model were noticed.In this approach, it was observed that the increase in the number of features from 10 to 20 to 30 did not necessarily improve the model's performance.In terms of accuracy, sensitivity, NPV, and F1 score, 6 out of 11 models using only 10 features resulted in notably improved performance, while only 1 model with 20 features and 4 models with 30 features resulted in a slightly improved performance.As for specificity, three models with 10 features, four models with 20 features, and two models with 30 features reported improved performance.The reported PPV scores noted an improvement in seven models with 10 features, three models with 20 features, and only two models with 30 features.It is important to note that the increase in performance scores of both specificity and PPV were borderline as noticed in Figure 10, hence they did not affect the choice of the best-performing model.The best-performing model out of the 33 models in this approach was the RFE_10 model using only 10 features reporting the best accuracy, sensitivity, and F1 score with an increase from the baseline score by 0.0105, 0.0226, and 0.0160, respectively.The general observation of the models' performance shown in Figures 9 and 10 present the following trend, where accuracy, sensitivity, NPV, and F1 score have similar behavior as the number of features increases with dependency on the model used, while specificity and PPV share a closer behavior-both metrics that consider false positives (FP) in their computation.In the subject-  record-split approach of Figure 9, it can be deduced that only a few models using 10 or 20 features were able to improve the model's performance in terms of lowering FPs, while when the number of features was increased to 30, the majority of the models (9 out of 11) achieved higher specificity and PPV scores than the baseline.As for the LOSO approach, it is noticed that more models using 10 features were able to achieve a slightly better performance than the baseline, and as the number of features increased, less number of models were able to improve the performance reaching only two models using 30 features, one of which achieved the best specificity and PPV across all 33 models.As for the four other metrics, they were able to achieve maximum improvement using only 20 Performance results of the first approach's 33 classification cases using 10, 20, and 30 selected features from the 11 ranking methods compared to the baseline classification model using all the features.
Diab and Rodriguez-Villegas 10.3389/fdgth.2024.1368574 features in the subject-record-split approach and only 10 features in the LOSO approach.Therefore, we can conclude that it is possible to use as low as 10 features for cough detection following the results presented by the LOSO approach of model RFE_10 to achieve an accuracy of 0.8957, sensitivity of 0.8571, specificity of 0.9343, PPV of 0.9397, NPV of 0.8843, and F1 score of 0.8872.We can also use up to 20 features based on the results of the subject-record-split best-performing model, the RFE_43 with an accuracy of 0.9220, sensitivity of 0.9087, specificity of 0.9357, PPV, of 0.9334, NPV of 0.9111, and F1 score of 0.9209.In both cases, though the improvement from the baseline model is relatively small, it is however achieved with less than a half (subject-record-split) and/or a quarter (LOSO) of the total number of features, which will eventually reflect on the size of the model, computational time, and space required.
Examining the list of features selected by the best-performing models using the summary in Table 4, it is noticed that eight features are common between the best-performing models of both approaches.In the LOSO approach using the top 10 features, 5 features were related to the Mag signal, 3 to the accZ signal, and 2 to the correlation between axes Corr xy and Corr xz.As for the subject-record-split approach using the top 20 features, 10 features were extracted from the magnitude signal, 8 features from the accZ signal, and Corr xy and Min X features.Among these features, the Min Mag can be considered the top feature as it was ranked in the top 10 by all 11 methods, even though none ranked it as 1.It is followed by Diff Mag, which was ranked in the top 10 by 10 methods, and Diff Z, which was ranked in the top 10 by 9 methods, and then Max Mag and Corr xy.These five top features are common between the two best-performing models.As for features that ranked as least important, IQR Y was only ranked by  two methods, once in the top 20 and another in the top 30, while the remaining 9 methods ranked it above 30.Similarly, MAD X, Skew Y, Kurt Y, Kurt Z, and IQR Z, were only ranked by three methods to be within the top 30 features.Nevertheless, within these three ranking methods, some ranked within the top 10 features, specifically the accZ features, which were both used in the best-performing model.Therefore, it is not enough to only evaluate the ranking of features individually, but rather as a set used for the classification problem.
As this study investigates a relatively new approach to cough detection that has not been explored before, the results of our early-stage investigation have some limitations that need to be Performance results of the second approach's (LOSO) 33 classification cases using 10, 20, and 30 selected features from the 11 ranking methods compared to the baseline classification model using all the features.addressed in future work.One of the main limitations is related to the used dataset, where the data were collected from only five subjects.Another is that the collected data represent cough and non-cough events only when the subjects are seated.These limitations highlight the need for subsequent work to further collect data from a wider population and under different states such as walking and/or lying down.In addition, as there were limited data available, our initial evaluation used a subject-recordsplit approach for data splitting.This facilitated the performance evaluation of the cough detection models using unseen records from trained subjects for testing the models; this reported good performance but questioned the generalizability of the models on unseen subjects.To overcome this issue, the second approach using LOSO was implemented and proved the ability of the proposed cough detection solution to perform well on unseen subjects.However, the results from the LOSO approach can be considered a bit biased as one out of the five subjects had half the number of total records compared to the other four subjects, resulting in better performance during its testing fold.Nevertheless, combining the results from both approaches can compensate for the limitations of individual approaches.
In conclusion, this study presented an overall evaluation of different extracted features for motion-based cough detection.It demonstrated the ability to use as minimum as 10 time-domain features, extracted from the acquired signals of the non-contact accelerometer, to distinguish between cough and non-cough events and reflected the redundancy of some features.It presented 68 different classification models using two evaluation approaches for data splitting, based on 11 different feature ranking methods and 3 sets of number of selected features, which can be used and modified for future improvement.It also paves the way for further development, allowing the use of more sophisticated classifiers and expanding on the type of detected motions such as walking and standing.Overall, this evaluation study serves the main purpose of presenting a guide for future research related to the use of motion-based cough detection.
Afterward, each recording was segmented into windows of size 2s and a sliding window of 200ms resulting in a total of 42 segments/ samples per recording.Window segmentation of training and testing records resulted in a total of 6,048 balanced training windows and 1,512 balanced testing windows.Following record segmentation, each window-composed of four signals-was centered by subtracting the mean of the corresponding signal vector as given in Equation

FIGURE 4
FIGURE 4Box and whisker plot summarizing the distribution of data for 11 extracted features from the accelerometer x-axis.

FIGURE 6
FIGURE 6Box and whisker plot summarizing the distribution of data for 11 extracted features from the accelerometer z-axis.

TABLE 1
List of extracted features for evaluation.

TABLE 2
Summary of classification performance results of the first approach, using features selected by the 11 ranking methods under three cases for the number of features used (n_features ¼ 10, 20, and 30) with the baseline method referring to the use of all 43 features.The highest scores attained for each case, in all metrics, are emphasized in bold.

TABLE 3
Summary of classification performance results of the second approach (LOSO), using features selected by the 11 ranking methods under three cases for the number of features used (n_features ¼ 10, 20, and 30) with the baseline method referring to the use of all 43 features.The highest scores attained for each case, in all metrics, are emphasized in bold.

TABLE 4
(4)mary of features used by each ranking method grouped by their ranking into three categories: Top 10 (), Top 20 (A), and Top 30(4), where empty cells indicate a ranking greater than 30.