Heart Rate Information-Based Machine Learning Prediction of Emotions Among Pregnant Women

In this study, the extent to which different emotions of pregnant women can be predicted from heart rate-relevant information, as indicators of autonomic nervous system functioning, was explored using various machine learning algorithms. Nine heart rate-relevant autonomic indicators, including the coefficient of variation of R-R intervals (CVRR), the standard deviation of all NN intervals (SDNN), and the square root of the mean squared differences of successive NN intervals (RMSSD), were measured using a heart rate monitor (MyBeat). Four emotions, "happy" as a positive emotion and "anxiety," "sad," and "frustrated" as negative emotions, were self-recorded on a smartphone application during 1 week between the 23rd and 32nd weeks of pregnancy in 85 pregnant women. The k-nearest neighbor (k-NN), support vector machine (SVM), logistic regression (LR), random forest (RF), naïve Bayes (NB), decision tree (DT), gradient boosting trees (GBT), stochastic gradient descent (SGD), extreme gradient boosting (XGBoost), and artificial neural network (ANN) machine learning methods were applied to predict the four emotions from the heart rate-relevant information. RF showed a modest area under the receiver operating characteristic curve (AUC-ROC) of 0.70, and GBT displayed the second highest AUC (0.69). CVRR, RMSSD, SDNN, high frequency (HF), and low frequency (LF) power contributed most to the predictions. Comprehensive analyses confirmed the prediction accuracy of the RF and GBT methods and support establishing models that predict emotions from autonomic nervous system indicators, with SDNN, RMSSD, CVRR, LF, and HF as important parameters.

The RRI data first needed to be converted into a format suitable for analysis. A function was then applied to obtain the time-domain features.
(1) Welch's method was used to estimate the power spectral density of the NN intervals:

$$\hat{P}(k) = \frac{1}{S}\sum_{s=1}^{S}\frac{1}{NU}\left|\sum_{m=0}^{N-1} w(m)\,x_s(m)\,e^{-j2\pi km/N}\right|^{2},\qquad U=\frac{1}{N}\sum_{m=0}^{N-1} w(m)^{2}$$

where S represents the number of segments, N represents the number of samples per segment, m represents the index of time (0 ≤ m ≤ N−1), k represents the frequency index (0 ≤ k ≤ N−1), x_s(m) is the s-th segment of the signal, and w(m) represents the window function.
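As an illustration of the Welch estimate above, a minimal implementation can be written in NumPy (a sketch, not the study's analysis code; in practice the RR series is first resampled to an even grid, and scipy.signal.welch offers a complete implementation):

```python
import numpy as np

def welch_psd(x, fs, nperseg=256, overlap=0.5):
    """Welch power spectral density estimate: average the modified
    periodograms of S overlapping, windowed segments."""
    step = int(nperseg * (1 - overlap))
    w = np.hanning(nperseg)                      # window function w(m)
    U = np.sum(w ** 2) / nperseg                 # window power normalization
    periodograms = []
    for start in range(0, len(x) - nperseg + 1, step):
        seg = x[start:start + nperseg]
        seg = seg - seg.mean()                   # remove the DC component
        X = np.fft.rfft(w * seg)                 # windowed DFT of segment s
        periodograms.append(np.abs(X) ** 2 / (nperseg * U * fs))
    psd = np.mean(periodograms, axis=0)          # average over the S segments
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    return freqs, psd
```

Integrating the resulting spectrum over 0.04–0.15 Hz and 0.15–0.40 Hz yields the LF and HF powers used as features.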

Support vector machine (SVM)
SVM is a classifier that can be linear or nonlinear and is an example of supervised learning that focuses on minimizing structural risk (30). SVM has several parameters, for example the regularization parameter C; the strength of the regularization is inversely proportional to C, which must be strictly positive. Rather than requiring every point to be classified correctly, SVM aims to reduce the number of misclassified points as much as possible, so noisy data can be tolerated. To a large extent, this keeps the model from becoming too complex and prevents overfitting, and the resulting classification performance is satisfactory.
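As a sketch of the idea (not the implementation used in the study), a linear soft-margin SVM can be trained by subgradient descent on the regularized hinge loss, with C controlling the regularization strength as described above:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM via subgradient descent on
    0.5*||w||^2 + C * sum(max(0, 1 - y*(Xw + b))), labels y in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                       # points inside the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Points that violate the margin contribute to the gradient; correctly classified points far from the boundary do not, which is why a few noisy samples do not dominate the fit.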

K-nearest neighbor (KNN)
KNN is a classical, simple, and highly robust classification algorithm that compares the similarity between testing and training data (16). The KNN algorithm has some advantages over other algorithms and has been reported to be superior to SVM for multi-class classification problems.
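A minimal NumPy sketch of the KNN vote (illustrative only; the study used a library implementation):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training samples,
    using Euclidean distance as the similarity measure."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]
```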

Logistic regression (LR)
LR is a machine learning algorithm that analyzes the relationship between predictors and a categorical outcome. It is commonly used to solve classification and prediction problems; for example, LR can distinguish between positive and negative emotions. Training minimizes the error between the classification result and the label value for labeled samples, and LR constructs a separating hyperplane between the two classes (31). LR is a widely used classifier and is particularly suitable for disease prediction.
LR can be calculated using:

$$p(y = i \mid x) = \frac{\exp\left(\theta \cdot f(i, x)\right)}{\sum_{j=1}^{K}\exp\left(\theta \cdot f(j, x)\right)}$$

where y is an integer class index among K classes, the weights are encoded into a vector θ, and the features x are re-defined as the result of evaluating the feature functions f_k(y, x), so that the same feature functions can be evaluated for each class value, f_k(y = i, x) or f_k(y = j, x). This form is widely used for the last layer in neural network models and is referred to as the softmax function.
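The softmax form can be written directly in NumPy (an illustrative sketch; subtracting the maximum before exponentiating is a standard guard against overflow):

```python
import numpy as np

def softmax(z):
    """Map a vector of class scores to probabilities that sum to one."""
    z = z - np.max(z)        # numerical stability: shift scores by the max
    e = np.exp(z)
    return e / e.sum()
```

Given per-class scores for one sample, the predicted class is simply the index of the largest probability.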

Random forest (RF)
RF was proposed by Breiman (32). Its prediction is made by synthesizing the prediction results of multiple trees: RF consists of a large number of decision trees (DTs) that choose their own splitting features, built using the classification and regression tree methodology without pruning (32). The prediction is determined by majority voting of the ensemble predictions (33). Important parameters include the number of trees in the forest (n_estimators), the function that measures the quality of a split (criterion; "gini" for the Gini impurity or "entropy" for the information gain), the maximum depth of the tree (max_depth), the minimum number of samples required to split an internal node (min_samples_split), and the minimum number of samples required to be at a leaf node (min_samples_leaf).
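The majority-voting step can be sketched with the standard library (illustrative; the emotion labels in the usage below are hypothetical examples, not study data):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the per-tree class predictions for one sample by majority vote."""
    return Counter(predictions).most_common(1)[0][0]

def forest_predict(tree_preds_per_sample):
    """tree_preds_per_sample: list of lists, each inner list holding the
    predictions of every tree in the ensemble for one sample."""
    return [majority_vote(preds) for preds in tree_preds_per_sample]
```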

Naïve Bayes (NB)
The NB algorithm is a classification algorithm based on Bayes' rule and is particularly suitable when the dimensionality of the inputs is high (20). NB mainly involves preprocessing the data to create a dataset that can be used for classifier training, after which the finished classifier assigns each instance to a specific category.
(1) Classification learning: the probability of the class given an instance is obtained from Bayes' rule:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

where evidence E denotes the instance's non-class attribute values and event H denotes the class value of the instance.
Under the naïve assumption, the evidence splits into parts (i.e., attributes E_1, …, E_n) that are conditionally independent given the class. This means that for n attributes, Bayes' rule can be written using a product of per-attribute probabilities:

$$P(H \mid E) \propto P(H)\prod_{i=1}^{n} P(E_i \mid H)$$
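A minimal sketch of this per-attribute product for categorical features (illustrative only; add-one smoothing is an assumption here to avoid zero probabilities, and log-probabilities are summed for numerical stability):

```python
import math
from collections import Counter, defaultdict

def nb_train(X, y):
    """Count class frequencies P(H) and per-attribute value counts
    for the conditional probabilities P(E_i | H)."""
    classes = Counter(y)
    cond = defaultdict(Counter)          # (attribute index, class) -> value counts
    for xi, yi in zip(X, y):
        for i, v in enumerate(xi):
            cond[(i, yi)][v] += 1
    return classes, cond

def nb_predict(classes, cond, x):
    """Score each class by log P(H) + sum_i log P(E_i | H), take the max."""
    n = sum(classes.values())
    best, best_score = None, -math.inf
    for c, cnt in classes.items():
        score = math.log(cnt / n)
        for i, v in enumerate(x):
            seen = len(cond[(i, c)])     # distinct values seen for this attribute/class
            score += math.log((cond[(i, c)][v] + 1) / (cnt + seen + 1))
        if score > best_score:
            best, best_score = c, score
    return best
```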

Decision tree (DT)
DT is a type of classification and prediction model that contains several improvements, especially for software implementation (34). The classification process intuitively uses probability analysis, searching from the top down along a branch to a leaf node; the label of the leaf node is the final classification category. The DT algorithm controls the randomness of the estimator (random_state), which has to be fixed to an integer to obtain deterministic behavior during fitting. A node will be split if the split induces a decrease in impurity greater than or equal to a given value (min_impurity_decrease). A threshold for early stopping of tree growth can also be set: a node will split if its impurity is above the threshold, and otherwise it is a leaf (min_impurity_split).
The parameter (criterion) is used to determine the calculation method of impurity:
(1) criterion='gini': Gini(t) = 1 − Σ_i p(i|t)²
(2) criterion='entropy': Entropy(t) = −Σ_i p(i|t) log₂ p(i|t)
where t represents a given node, i represents any class label, and p(i|t) represents the proportion of samples with label i in node t.
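Both impurity measures can be sketched in NumPy (illustrative):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 - sum_i p(i|t)^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def entropy(labels):
    """Entropy of a node: -sum_i p(i|t) * log2 p(i|t)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))
```

A pure node (all samples in one class) has impurity 0 under both measures; a 50/50 split of two classes gives Gini 0.5 and entropy 1 bit.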

Gradient boosting trees (GBT)
The GBT iterative DT algorithm can establish a prediction model in the form of an ensemble of weak prediction models and has been used to analyze and classify data (35). The algorithm involves several parameters, for example, the number of boosting stages to perform (n_estimators) and the function that measures the quality of a split (criterion): "friedman_mse" for the mean squared error with improvement score by Friedman, "mse" for the mean squared error, and "mae" for the mean absolute error. Further parameters are the minimum number of samples required to split an internal node (min_samples_split) and the minimum number of samples required to be at a leaf node (min_samples_leaf). A node will be split if the split induces a decrease of the impurity greater than or equal to a given value (min_impurity_decrease).
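The boosting idea, with each stage fitting the residuals of the ensemble so far, can be sketched for squared loss with single-feature stumps (a toy regression sketch under those assumptions, not the library implementation used in the study):

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold split on x that minimizes the squared error
    of the residual; return (threshold, left mean, right mean)."""
    best = None
    for thr in np.unique(x):
        left, right = residual[x <= thr], residual[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, thr, left.mean(), right.mean())
    return best[1:]

def gbt_fit_predict(x, y, n_estimators=20, lr=0.3):
    """Each boosting stage fits a stump to the current residuals; the
    ensemble prediction is the sum of the scaled stage outputs."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(n_estimators):
        thr, lval, rval = fit_stump(x, y - pred)
        pred += lr * np.where(x <= thr, lval, rval)
    return pred
```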

Stochastic Gradient Descent (SGD)
Machine learning algorithms sometimes require a loss function for the original model. The loss function is optimized using an optimization algorithm to identify the optimal parameters that minimize its value. SGD is an iterative method for optimizing an objective function with suitable smoothness properties and can be regarded as a stochastic approximation of gradient descent optimization. This reduces the computational burden, especially in high-dimensional optimization problems, achieving faster iterations at the cost of a lower convergence rate (36, 37).
Gradient descent: (1) given a conditional probability model p(y | x; θ), (2) a parameter vector θ = (θ₁, …, θ_d), and (3) a prior on the parameters p(θ; σ) with hyper-parameter σ, (4) gradient descent with learning rate η can be written as:

$$\theta \leftarrow \theta - \eta\,\nabla_{\theta} L(\theta)$$

For convex models, the change in the loss or the parameters is often monitored, and the algorithm is terminated when it stabilizes.
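The update rule above can be sketched in NumPy for a squared loss (illustrative; the "stochastic" part is that each step computes the gradient on a single randomly chosen sample rather than the full dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_linear_regression(X, y, lr=0.05, epochs=100):
    """Minimize 0.5 * sum_i (x_i . w - y_i)^2 by SGD: one sample per update."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):          # shuffle each epoch
            grad = (X[i] @ w - y[i]) * X[i]        # gradient on sample i only
            w -= lr * grad                         # theta <- theta - eta * grad
    return w
```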

XGBoost (XGB)
XGB is a scalable tree boosting algorithm and an efficient implementation of the gradient boosting algorithm (38). The base learners in XGB can be either CART (classification and regression) trees or linear models. In XGB, a node is split only if the value of the loss function decreases after splitting; gamma specifies the minimum loss reduction required for a node to be split, and the larger the value of this parameter, the more conservative the algorithm. In general, the running speed and accuracy of XGB are better than those of GBT.
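The role of gamma can be illustrated with the split-gain formula from the XGBoost paper (38) (a sketch, not the library's code; G and H denote the sums of first and second derivatives of the loss over the samples in each child, and lam is the L2 regularization term):

```python
def xgb_split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain of a candidate split: the split is worthwhile only when the
    loss reduction exceeds the complexity penalty gamma (gain > 0)."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma
```

With a larger gamma, the same candidate split can flip from positive to negative gain and is therefore rejected, which is why the algorithm becomes more conservative.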

Artificial neural network (ANN)
A multi-layer perceptron (MLP) is also an ANN. In addition to the input and output layers, multiple hidden layers can exist; the simplest MLP contains only one hidden layer, giving a three-layer structure. An ANN model is a mathematical or computational model that imitates the structure and function of biological neural networks. The ANN classification algorithm belongs to the supervised machine learning algorithms. An ANN is characterized by its pattern of connections between neurons (its architecture) and by the determination of the connection weights through a training or learning algorithm (39).
ANN has been widely used to analyze and classify data (40,41).
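A forward pass of the simplest three-layer MLP can be sketched in NumPy (illustrative; the ReLU hidden activation and softmax output are assumptions of this sketch, not the study's configuration):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Three-layer MLP (input, one hidden layer, output): the hidden layer
    applies a ReLU activation, the output layer a softmax."""
    h = np.maximum(0.0, W1 @ x + b1)   # hidden layer activations
    z = W2 @ h + b2                    # output layer scores
    e = np.exp(z - z.max())            # stable softmax
    return e / e.sum()                 # class probabilities
```

Training would adjust W1, b1, W2, b2 by backpropagating a classification loss; only the inference step is shown here.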