Using a Stacked Autoencoder for Mobility and Fall Risk Assessment via Time–Frequency Representations of the Timed Up and Go Test

Fall risk assessment is very important for the graying societies of developed countries. A major contributor to the fall risk of the elderly is mobility impairment. Timely detection of the fall risk can facilitate early intervention to avoid preventable falls. However, continuous fall risk monitoring requires extensive healthcare and clinical resources. Our objective is to develop a method suitable for remote and long-term health monitoring of the elderly for mobility impairment and fall risk without the need for an expert. We employed time–frequency analysis (TFA) and a stacked autoencoder (SAE), which is a deep neural network (DNN)-based learning algorithm, to assess the mobility and fall risk of the elderly according to the criteria of the timed up and go test (TUG). The time series signal of the triaxial accelerometer can be transformed by TFA to obtain richer image information. On the basis of the TUG criteria, the semi-supervised SAE model was able to achieve high predictive accuracies of 89.1, 93.4, and 94.1% for the vertical, mediolateral and anteroposterior axes, respectively. We believe that deep learning can be used to analyze triaxial acceleration data, and our work demonstrates its applicability to assessing the mobility and fall risk of the elderly.


INTRODUCTION
Remote health monitoring has been gaining increased interest as a way to improve the quality and reduce the costs of healthcare, especially for the elderly (Seyfioglu et al., 2017). According to the World Health Organization, a person aged 65 years and over has a fall risk of 28-35%, which increases to 32-42% for those aged over 70 years (World Health Organization [WHO], 2007). According to Letts et al. (2010), 33% of community-dwelling elderly have experienced a fall event, and 50% fall repeatedly. About one-third of elderly people fall every year, and the chance of falling increases with age (World Health Organization [WHO], 2007; Bergland, 2012). Falling can have serious long-term consequences for the elderly, including hospitalization, decreased mobility, fear of falling and even death. Older people with gait, mobility or balance problems are at higher risk of falling in the future (Ganz et al., 2007; Cuevas-Trisan, 2017). To develop an effective fall prevention program, elderly people with a fall risk must first be identified.
Various factors drive the fall risk. Mitchell et al. (2012) showed that sarcopenia, the typical age-related decline in skeletal muscle mass, reduces strength and impairs balance. Poor balance and mobility have been validated as key causes of falls among the elderly. Continuous monitoring could be a practical approach to reduce and prevent falls by providing early warnings to facilitate appropriate interventions (Shany et al., 2012). However, continuous monitoring of gait and postural stability requires extensive healthcare and clinical resources. Limited professional resources (e.g., physical therapists, nurses, and doctors) are insufficient for detecting balance deterioration in a timely fashion, especially as the aged population increases worldwide. This can result in many falls that could have been avoided through continuous monitoring and early intervention. To fill the gap between available resources and care needs, an approach is needed for assessing the balance and mobility of the elderly in a timely manner without involving healthcare professionals.
Wearable systems based on inertial sensors are light, portable, and cheap, and they can be used to quantify body motions. Previous research (Howcroft et al., 2013) on fall risk assessment focused on feature-based methods, in which many related features are derived with domain knowledge. This requires multiple feature engineering steps before the classification or discrimination results can be obtained. The timed up and go test (TUG) is commonly used to evaluate mobility and the fall risk of the elderly in hospital and community environments (Podsiadlo and Richardson, 1991; Barry et al., 2014). Tri-axial acceleration sensors can be used to obtain time-domain signals during TUG (Wu et al., 2019; Lee et al., 2020), which can be transformed through time-frequency analysis (TFA) to extract time-domain, frequency-domain, and spectral energy-related information. Because past studies (Cardozo et al., 2011; Garcia-Retortillo et al., 2020) have investigated the spectral power distribution of muscle activity (using accelerometer data or related physiological parameters, such as EMG) and its response to fatigue and aging in elderly subjects, spectral energy-related information can be used to assess the fall risk of elderly subjects via TUG. Nweke et al. (2018) showed that deep neural network (DNN) methods are being adopted for automatic feature learning in diverse fields such as health and image classification and, recently, for feature extraction and classification in simple and complex human activity recognition with mobile and wearable sensors. They also provided further insights on deep learning based on the decision fusion of human activity recognition for enhanced performance accuracy. Hossain et al. (2018) showed that deep learning architectures have been increasingly used in activity recognition problems, which empower several application domains, and that they require considerably less human supervision in the process. Moreover, they showed that such architectures are gaining increasing popularity for extracting meaningful information from large volumes of data.
DNNs are well suited to classifying TFA output owing to their excellent discrimination of images. The nonstationary nature of the TUG signal indicates that TFA can be used for motion identification in general and fall detection in particular (Jokanovic et al., 2016a, July). Deep learning can be used to capture the detailed and complex properties of the TF signature and feed the learned underlying features to the classifier (Jokanovic et al., 2016b, May). An autoencoder (AE) is a feed-forward neural network that aims to reconstruct the input at the output under certain constraints. Seyfioglu et al. (2018) proposed an unsupervised pre-training algorithm for initializing the AE weights and biases that is highly effective when only a small number of labeled training samples are available. The stacked autoencoder (SAE) is a DNN that can classify highly similar classes of aided and unaided walking, as might be encountered in assisted-living environments for the elderly, and it has been applied in recognizing 12 different gaits (Seyfioglu et al., 2017) as well as in fall detection.
In this paper, we propose a sensor- and DNN-based approach: TFA converts tri-axial accelerometer data into TF images, and an SAE learns latent feature representations from them, yielding a surrogate method for assessing the mobility function and fall risk of the elderly. DNN-based analysis techniques may thus become a practical approach to continuous monitoring in the future.

MATERIALS AND METHODS
We considered two evaluation methods for fall risk: feature-based and DNN-based evaluation. Feature-based evaluation uses traditional statistical features; it combines feature extraction, feature selection and a classifier, and it relies on heuristic handcrafted feature design. By contrast, DNN-based evaluation in this paper is based on the SAE and a softmax classifier layer, and it can automatically learn better feature representations than the handcrafted ones (Ng, 2011). Leave-one-out cross-validation was employed for both evaluation methods to obtain a robust estimate of the classification accuracy. The results of the two evaluation methods were then compared.

Subjects
Our study took place at a hospital in central Taiwan between April 2014 and May 2015. We recruited and selected 44 elderly subjects dwelling in a community. A medical professional team that included rehabilitation physicians, physiotherapists and functional therapists performed TUG to evaluate the mobility function of the subjects. Prior to the evaluation, written consent was obtained from the subjects. The subjects were over 60 years of age, had no history of musculoskeletal injuries or central nervous system problems in the last 3 months and could walk independently without any help. Valid data were obtained for 44 elderly subjects with a mean age of 78.18 ± 7.97 years. There were 14 male subjects with an average age of 80.43 ± 5.60 years and 30 female subjects with an average age of 77.13 ± 8.74 years.

Sensor
As shown in Figure 1, a tri-axial accelerometer (RD3152MMA7260Q, Freescale Semiconductor-NXP, United States) with a sampling rate of 45 Hz was placed at vertebrae L3-L5 on a subject's back for the TUG experiments. L3-L5 correspond to the center of gravity of the human body and are used in most fall risk assessments (Howcroft et al., 2013). The X-, Y-, and Z-axes were aligned with the vertical (V; up: +, down: −), mediolateral (ML; right: +, left: −), and anteroposterior (AP; forward: +, backward: −) directions, respectively.

Timed Up and Go Test
Each subject was asked to perform a TUG. The observer marked the start and end times. As shown in Figure 2, each TUG was divided into five phases or subtasks: from sitting to standing (sit-to-stand), walking forward (walk-F), reaching the 3-m mark and turning around (turning), walking back (walk-B) and reaching the chair and returning to sitting (stand-to-sit). The TUG time was recorded, and a threshold time was used to classify subjects as a fall risk or not a fall risk. Following Alexandre et al. (2012), community-dwelling elderly subjects were considered to be at high fall risk if their TUG time was greater than 12.47 s.
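The labeling rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_tug` and the example times are hypothetical, and treating a time exactly equal to the threshold as no-risk is an assumption, since the text only defines the "greater than" case.

```python
TUG_THRESHOLD_S = 12.47  # cut-off attributed to Alexandre et al. (2012) in the text


def classify_tug(tug_time_s: float, threshold_s: float = TUG_THRESHOLD_S) -> str:
    """Label a subject from the total TUG time in seconds."""
    if tug_time_s <= 0:
        raise ValueError("TUG time must be positive")
    # Assumption: a time exactly at the threshold counts as no-risk,
    # because the text only specifies "greater than" for at-risk.
    return "at-risk" if tug_time_s > threshold_s else "no-risk"


# Made-up example times, not data from the study.
labels = [classify_tug(t) for t in (9.8, 12.47, 15.2)]
```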

Feature-Based Evaluation
For feature-based evaluation, the features of the axial signals were obtained by referring to past literature (Banos et al., 2014). The most widely used features include the mean, standard deviation, maximum, minimum, and mean crossing rate (MCR). The mean and standard deviation are used to express the average and variation of the force for each axial signal. The maximum and minimum express the largest and smallest values of the signal for the entire domain. The MCR is the rate at which data cross the average value, and it has been widely used in signal recognition and physical activity recognition (Gao et al., 2014; Arivu et al., 2018; Bountourakis et al., 2019).
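As a rough illustration of these five features, the following NumPy sketch computes them for one axis. The function name `axis_features` is hypothetical, and two details are assumptions rather than the authors' definitions: the sample standard deviation (`ddof=1`) is used, and the MCR is normalized by the number of consecutive sample pairs.

```python
import numpy as np


def axis_features(signal):
    """Handcrafted features for one acceleration axis (illustrative sketch)."""
    x = np.asarray(signal, dtype=float)
    mean = x.mean()
    # A mean crossing is a sign change of the mean-removed signal between
    # two consecutive samples; samples exactly on the mean are not counted.
    signs = np.sign(x - mean)
    crossings = np.count_nonzero(signs[:-1] * signs[1:] < 0)
    return {
        "mean": mean,
        "std": x.std(ddof=1),             # assumption: sample standard deviation
        "max": x.max(),
        "min": x.min(),
        "mcr": crossings / (x.size - 1),  # assumption: crossings per sample pair
    }


# Toy signal, not study data.
features = axis_features([0.0, 2.0, 0.0, 2.0, 0.0])
```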
Features were selected for the feature-based evaluation according to their significance (Wu et al., 2019; Lee et al., 2020). The significance was assessed with Student's t-test, and a feature was considered significant if p ≤ 0.05. In addition, linear discriminant analysis (LDA) was performed to obtain a confusion matrix for evaluating the performance.
Figure 3 shows the flowchart of the DNN-based evaluation. The input signal was the tri-axial data collected during the TUG experiments. TFA was applied to the data, and the SAE was applied in classifying the signal. Finally, the accuracy and confusion matrix were obtained.

Time-Frequency Analysis
A time-frequency representation (TFR) describes a signal, originally a function of time, in both the time and frequency domains. TFA can be applied to a time series signal to observe the time-domain, frequency-domain and spectral-energy information simultaneously. TFA based on wavelet transform (WT) is widely used in biomedical science for applications such as fall detection (Jokanovic et al., 2016a, July; Jokanovic et al., 2016b, May) and analysis of electroencephalography (Yordanova et al., 2013) and electromyography (Zia ur Rehman et al., 2018). In this study, the Morlet wavelet was used for TFA of the triaxial acceleration signal from the TUG experiments. This method was described in previous literature (Tallon-Baudry et al., 1997). The complex Morlet wavelet $w(t, f_c)$ can be generated in the time domain for different central frequencies $f_c$ as follows:
$$w(t, f_c) = A \exp\left(-\frac{t^2}{2\sigma_t^2}\right) \exp\left(2 i \pi f_c t\right) \tag{1}$$
where $t$ is the time, $\sigma_t$ is the wavelet duration, $A = (\sigma_t \sqrt{\pi})^{-1/2}$ is the normalization factor, $f_c$ is the central frequency and $\sigma_f$ is the width of the Gaussian shape in the frequency domain. A constant ratio of $f_c / \sigma_f = 7$ was used. For each $f_c$, the time and frequency resolutions can be calculated as $2\sigma_t$ and $2\sigma_f$, respectively, where $\sigma_t = 1/(2\pi\sigma_f)$. Finally, the time-varying energy $E(t, f_c)$ of the signal $s(t)$ is calculated by squaring the absolute value of the convolution of the signal with the complex Morlet wavelet:
$$E(t, f_c) = \left| w(t, f_c) * s(t) \right|^2 \tag{2}$$
In this study, the frequency range was swept from 0.05 to 5 Hz, and a TF image was obtained for classification by the SAE.
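The Morlet-based TFA described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `morlet_tfr`, the truncation of each wavelet at three standard deviations (`n_sigma`), the 100-point frequency grid and the synthetic 2 Hz test signal are all hypothetical choices. The sampling rate (45 Hz), the ratio of 7 and the 0.05 to 5 Hz sweep come from the text. Rendering the energy map as a 28 × 28 color image is not shown.

```python
import numpy as np


def morlet_tfr(signal, fs, freqs, ratio=7.0, n_sigma=3.0):
    """Time-varying energy |w * s|^2 for each central frequency (sketch)."""
    s = np.asarray(signal, dtype=float)
    n = s.size
    energy = np.empty((len(freqs), n))
    for k, fc in enumerate(freqs):
        sigma_f = fc / ratio                      # f_c / sigma_f = 7
        sigma_t = 1.0 / (2.0 * np.pi * sigma_f)   # sigma_t = 1 / (2 pi sigma_f)
        half = int(np.ceil(n_sigma * sigma_t * fs))
        t = np.arange(-half, half + 1) / fs
        A = (sigma_t * np.sqrt(np.pi)) ** -0.5
        w = A * np.exp(-t**2 / (2.0 * sigma_t**2)) * np.exp(2j * np.pi * fc * t)
        # Full convolution, then crop so the output is aligned with the input.
        # The wavelet may be longer than the signal at low frequencies, which
        # is why mode="same" is avoided here. No 1/fs scaling is applied.
        full = np.convolve(s, w, mode="full")
        energy[k] = np.abs(full[half:half + n]) ** 2
    return energy


fs = 45.0                                # sampling rate from the text
freqs = np.linspace(0.05, 5.0, 100)      # sweep range from the text, grid assumed
t = np.arange(0.0, 10.0, 1.0 / fs)
toy_signal = np.sin(2.0 * np.pi * 2.0 * t)   # synthetic 2 Hz test tone
tfr = morlet_tfr(toy_signal, fs, freqs)
peak_freq = freqs[np.argmax(tfr[:, t.size // 2])]
```

For the synthetic tone, the energy at the middle of the record should concentrate near 2 Hz, which gives a quick sanity check of the transform.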

Stacked Autoencoder Network Architecture
A neural network with multiple hidden layers can be used to solve classification problems with complex data such as images. Each layer can learn features at a different level of abstraction. However, training a neural network with multiple hidden layers can be difficult. In this paper, we use the SAE structure, which is a DNN based on the AE concept. An AE is a neural network comprising an encoder, followed by a decoder, and it attempts to replicate its input at its output. We used an AE so that the hidden layers can be trained individually in an unsupervised fashion. No labeled data are required for training or learning. The encoder maps the input $x$ to a new representation $z$, which is decoded back at the output to reconstruct the input as $\hat{x}$ (Hinton and Salakhutdinov, 2006; Zia ur Rehman et al., 2018; MATLAB autoencoder, 2021):
$$z = h_1(W_1 x + b_1) \tag{3}$$
$$\hat{x} = h_2(W_2 z + b_2) \tag{4}$$
where $h_1$ and $h_2$ are activation functions, $W_1$ and $W_2$ are weight matrices and $b_1$ and $b_2$ are bias vectors for the encoder and decoder, respectively. If the number of hidden neurons is less than the number of input neurons, then the AE attempts to learn a compressed representation of the input data (Jokanovic et al., 2016a, July). Sparsity can be encouraged for an AE by adding a regulariser to the cost to prevent overfitting (Zia ur Rehman et al., 2018). In this study, the input was a color image with a resolution of 28 × 28 pixels and three channels (28 × 28 × 3 = 2,352 input values). The AE had two hidden layers. The logistic sigmoid was used for both layers in the encoder and decoder.
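The encoder and decoder mappings can be illustrated with a single forward pass in NumPy. This is a sketch under assumptions: the function names, the random weight initialization and the random input are hypothetical, and the hidden size of 300 is borrowed from the first-layer size reported later in the results. The logistic sigmoid and the 2,352-element input follow the text.

```python
import numpy as np


def sigmoid(a):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-a))


def autoencoder_forward(x, W1, b1, W2, b2):
    """One AE pass: encode x to z, then decode z to a reconstruction."""
    z = sigmoid(W1 @ x + b1)        # encoder
    x_hat = sigmoid(W2 @ z + b2)    # decoder
    return z, x_hat


rng = np.random.default_rng(0)
n_in, n_hidden = 28 * 28 * 3, 300   # 2,352 inputs; 300 hidden units (illustrative)
W1 = rng.normal(0.0, 0.01, (n_hidden, n_in))   # hypothetical initialization
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.01, (n_in, n_hidden))
b2 = np.zeros(n_in)

x = rng.random(n_in)                # stand-in for a flattened TF image in [0, 1)
z, x_hat = autoencoder_forward(x, W1, b1, W2, b2)
```

Because the hidden layer is much smaller than the input, `z` acts as a bottleneck that the decoder must expand back to the input size.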
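In an SAE, the output of one AE is fed to the input of another AE, and sparsity is encouraged by adding a regularization term to the cost that depends on the average output activation of each neuron. The average output activation for neuron $i$ can be formulated as (MATLAB autoencoder, 2021):
$$\hat{p}_i = \frac{1}{n} \sum_{j=1}^{n} z_i(x_j) \tag{5}$$
where $i$ is the $i$th neuron, $n$ is the total number of training examples and $j$ is the $j$th training example. A regulariser is introduced to the cost function using the Kullback-Leibler divergence (Kullback, 1997; Zia ur Rehman et al., 2018):
$$\Omega_{sparsity} = \sum_{i=1}^{d} \mathrm{KL}(p \,\|\, \hat{p}_i) = \sum_{i=1}^{d} \left[ p \log\frac{p}{\hat{p}_i} + (1 - p) \log\frac{1 - p}{1 - \hat{p}_i} \right] \tag{6}$$
where $d$ is the total number of neurons in a layer and $p$ is the desired activation value (i.e., sparsity proportion). An L2 regularization term on the weights is also added to the cost function to control the weights:
$$\Omega_{weights} = \frac{1}{2} \sum_{l=1}^{L} \sum_{j=1}^{N} \sum_{i=1}^{K} \left( w_{ji}^{(l)} \right)^2 \tag{7}$$
where $L$ is the number of hidden layers, $N$ is the total number of observations and $K$ is the number of features within an observation.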
By inserting the regularization terms from Eqs 6, 7 into the mean squared error of the reconstruction, the cost function can be formulated as follows:
$$E = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( x_{kn} - \hat{x}_{kn} \right)^2 + \lambda \, \Omega_{weights} + \beta \, \Omega_{sparsity} \tag{8}$$
where $x_{kn}$ and $\hat{x}_{kn}$ are the $k$th feature of the $n$th observation and its reconstruction, $\Omega_{weights}$ and $\Omega_{sparsity}$ are the L2 and sparsity regularization terms, λ is the coefficient for L2 regularization to prevent overfitting and β is the coefficient for sparsity regularization that controls the sparsity penalty term (MATLAB autoencoder, 2021). Ju et al. (2015) and Coates et al. (2011) showed that the number of neurons in the hidden layer of a DNN may be more important than the feature-learning algorithm and model depth. In addition, the combinatorial space required to explore all possible combinations of hyperparameters is huge (Tsinalis et al.). Therefore, we focused on locally optimizing the number of neurons for two layers and obtained the minimum mean squared error according to Eq. 8. The other parameters were taken from MATLAB: λ was set to 0.004 and 0.002 for the first and second hidden layers, respectively, β = 4 for both hidden layers and p was 0.015 and 0.01, respectively. After unsupervised training, the decoder was removed from the network, and the remaining encoder components were trained in a supervised manner by adding a softmax classifier with two neurons after the encoder. The softmax classifier is a multiclass generalization of logistic regression and is often used in the final layer of a neural network. Finally, the SAE was obtained.
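The cost described above (reconstruction error plus L2 weight and Kullback-Leibler sparsity penalties) can be sketched in NumPy for a single AE. This follows the general form documented for sparse autoencoders and is an illustration only: the function names, the toy dimensions and the random data are hypothetical, and the penalty here is summed over one hidden layer. The values λ = 0.004, β = 4 and p = 0.015 are the first-layer settings quoted in the text.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def kl_sparsity(p, p_hat):
    """Sum over hidden neurons of KL(p || p_hat_i) for Bernoulli activations."""
    return np.sum(p * np.log(p / p_hat)
                  + (1.0 - p) * np.log((1.0 - p) / (1.0 - p_hat)))


def sparse_ae_cost(X, W1, b1, W2, b2, lam, beta, p):
    """Cost of one sparse AE; X has shape (K features, N observations)."""
    Z = sigmoid(W1 @ X + b1[:, None])        # hidden activations, (d, N)
    X_hat = sigmoid(W2 @ Z + b2[:, None])    # reconstructions, (K, N)
    mse = np.sum((X - X_hat) ** 2) / X.shape[1]
    p_hat = Z.mean(axis=1)                   # average activation per neuron
    omega_weights = 0.5 * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    omega_sparsity = kl_sparsity(p, p_hat)
    return mse + lam * omega_weights + beta * omega_sparsity


# Toy dimensions and random data, chosen only to exercise the function.
rng = np.random.default_rng(1)
K, d, N = 12, 4, 5
X = rng.random((K, N))
W1, b1 = rng.normal(0.0, 0.1, (d, K)), np.zeros(d)
W2, b2 = rng.normal(0.0, 0.1, (K, d)), np.zeros(K)

cost_plain = sparse_ae_cost(X, W1, b1, W2, b2, lam=0.0, beta=0.0, p=0.015)
cost_full = sparse_ae_cost(X, W1, b1, W2, b2, lam=0.004, beta=4.0, p=0.015)
```

Both penalty terms are non-negative, so the regularized cost is never below the plain reconstruction error; the sparsity term vanishes only when every neuron's average activation equals p.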

RESULTS AND DISCUSSION
Subjects were considered a fall risk if their TUG time was greater than 12.47 s and not a fall risk if the TUG time was less than 12.47 s. Table 1 lists the demographic data of the at-risk subjects (n = 22) and no-risk subjects (n = 22).
Frontiers in Physiology | www.frontiersin.org

Deep Neural Network-Based Analysis of Timed Up and Go Test Results
Analysis of TF Images
Figure 4 shows examples of tri-axial acceleration signals in the time-domain for subjects with and without a fall risk and their corresponding TF images.
(1) The X-axis corresponds to the vertical direction. Figures 4A,G show the vertical acceleration signals in the time domain for the no-risk and at-risk subjects, respectively, and Figures 4D,J show the corresponding TF images obtained through TFA. Figure 4D clearly shows that the no-risk subject had two regions of interest in zones II and IV of the TF image corresponding to the walk-F and walk-B phases. The TF energy was 10-12, and the frequency was 1.5-2.5 Hz. Similarly, Figure 4J shows that the at-risk subject had regions of interest in zones II and IV corresponding to the walk-F and walk-B phases. The TF energy was 0.5-3, and the frequency was 1.5-2.5 Hz. Additionally, the turning phase showed an obvious difference in zone III between Figures 4D,J. The TF energy was 6-8, and the frequency was 1.5-2.0 Hz for the no-risk subject. By contrast, the TF energy was relatively low for the at-risk subject. This is consistent with previous studies (Drover et al., 2017; Wu et al., 2019), which noted that turn-based features are important predictors because they contain useful biomechanical information.
TABLE 4 | Mean squared errors for different combinations of neuron numbers in the first and second layers of the two-layer AE for the X-axis (V), Y-axis (ML), and Z-axis (AP).
(2) The Y-axis corresponds to the mediolateral direction. Figures 4B,H show the mediolateral acceleration signals in the time domain for the no-risk and at-risk subjects, respectively, and Figures 4E,K show the corresponding TF images obtained through TFA. Figure 4E shows that the no-risk subject had two regions of interest in zones II and IV of the TF image corresponding to the walk-F and walk-B phases. The regions had high TF energies of 5-8 and 4-6, respectively, corresponding to frequencies of 2.5-3.5 and 1-1.3 Hz, respectively. Similarly, Figure 4K shows that the at-risk subject had regions of interest in zones II and IV corresponding to the walk-F and walk-B phases. Only one region had a high TF energy of 4-6 with a frequency of 1-1.3 Hz. In the walk-F and walk-B phases, the TF image showed the energy of mobility, which is presumably related to body and arm swing when walking. During walking, arm swing is associated with postural stability (Meyns et al., 2013) and can enhance gait stability (Bruijn et al., 2010).
(3) The Z-axis corresponds to the anteroposterior direction. Figures 4C,I show the anteroposterior acceleration signals in the time domain for the no-risk and at-risk subjects, respectively, and Figures 4F,L show the corresponding TF images obtained through TFA. Figure 4F shows that the no-risk subject had two regions of interest in zones II and IV of the TF image corresponding to the walk-F and walk-B phases. The TF energy was 6-9, and the frequency was 1.5-2.5 Hz. Similarly, Figure 4L shows that the at-risk subject had regions of interest in zones II and IV corresponding to the walk-F and walk-B phases. The TF energy was 2.5-4, and the frequency was 1.5-2.5 Hz. Because the body moves forward while maintaining balance during walking, the AP axis is seemingly an important axis.
In summary, the no-risk subjects had higher TF energy than the at-risk subjects in zones II and IV corresponding to the walk-F and walk-B phases for all three axes. It is reasonable to assume that no-risk subjects have greater muscle strength or energy when walking than at-risk subjects. In addition, the no-risk subjects had obviously higher TF energy than the at-risk subjects did during the sit-to-stand (zone I) and stand-to-sit (zone V) phases in the Z-axis. These transition subtasks involve standing up and sitting down, and these two abilities are largely related to the strength and power of the lower extremities (Weiss et al., 2013). Moreover, the body must bend in forward-backward displacement. It is also reasonable to infer that the no-risk subjects had more energy to stand up or sit down than the at-risk subjects did. Regarding the differences located between approximately 1 and 3 Hz, past studies (Schneider et al., 2010; Kline et al., 2016) have shown that the frequency of movements along the longitudinal axis during running peaks at approximately 3 Hz, both in the activity and viewed movement conditions. They reported that a strong relationship exists between intrinsic and extrinsic oscillation patterns during exercise. A frequency of approximately 3 Hz seems to be dominant in different physiological systems (e.g., heart rate and brain cortical activity). Additionally, Wagenaar et al. (2011) reported that an activity was identified as walking when the step frequency fell in the range of 0.5-3 Hz. In light of these results, we suggest that TF images may be used as an auxiliary tool to support medical professionals in clinically assessing fall risk.

Parameter Optimization for AE and Reconstruction
The number of neurons was chosen with a grid search strategy to minimize the mean squared error (Hinton and Salakhutdinov, 2006). The number of neurons in the first layer ranged from 100 to 500 in intervals of 100, and the number of neurons in the second layer ranged from 10 to 30 in intervals of 5. The mean squared error was obtained by averaging ten runs. As presented in Table 4, the minimum mean squared errors for the X-, Y- and Z-axes were 15.34, 12.03, and 9.73, respectively. These corresponded to 300 and 30 neurons in the first and second layers, respectively, for all three axes. Image reconstruction was carried out with 300-30 neurons for the encoder and 30-300 neurons for the decoder. As shown in Figure 5, the reconstructed image closely restored the original image. Latent features learned in this way are useful for object recognition and other visual tasks (Ng, 2011).
The accuracy, sensitivity, and specificity are average ± SD values for ten runs.
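The grid search over layer sizes can be sketched as a simple loop. The grid values and the ten-run averaging follow the text; everything else is hypothetical. In particular, `toy_mse` is a stand-in scoring function with a built-in minimum, used only so the example runs; it does not train an AE and does not reproduce the reported errors.

```python
from itertools import product

FIRST_LAYER = range(100, 501, 100)   # 100, 200, ..., 500 neurons
SECOND_LAYER = range(10, 31, 5)      # 10, 15, ..., 30 neurons


def grid_search(evaluate_mse, n_runs=10):
    """Return (mean MSE, n1, n2) for the layer sizes with the lowest mean MSE."""
    best = None
    for n1, n2 in product(FIRST_LAYER, SECOND_LAYER):
        mean_mse = sum(evaluate_mse(n1, n2, run) for run in range(n_runs)) / n_runs
        if best is None or mean_mse < best[0]:
            best = (mean_mse, n1, n2)
    return best


def toy_mse(n1, n2, run):
    # Hypothetical placeholder for "train the AE and return its MSE".
    # It is constructed to be smallest at (300, 30) purely for illustration.
    return abs(n1 - 300) / 100 + (30 - n2) / 5 + 0.01 * run


best = grid_search(toy_mse)
n_combinations = len(list(product(FIRST_LAYER, SECOND_LAYER)))
```

With five candidate sizes per layer, the search evaluates 25 combinations, each averaged over ten runs.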

Analysis of the Stacked Autoencoder
As shown in Figure 6, the SAE had an input of 2,352 pixels with an encoder of 300-30 neurons and a softmax classifier layer with two classes. Table 5 presents the classification results of the SAE. The classification accuracies were 89.1, 93.4, and 94.1% along the X-, Y-, and Z-axes, respectively. The sensitivities were 85.5, 94.1, and 94.6%, respectively. The specificities were 92.7, 92.7, and 93.6%, respectively. The SAE performed better along the Y- and Z-axes than along the X-axis. Thus, the latent features of the Y- and Z-axes may offer more predictive ability for DNN-based evaluation. Tables 3, 5 indicate that the DNN-based evaluation performed much better than the feature-based evaluation. Thus, it is a viable approach for fall risk assessment. In addition, the Y- and Z-axes are both important for classification. With regard to the Y-axis, arm swing is associated with postural stability and can enhance gait stability (Wu et al., 2019) and mobility function. The Z-axis is important for transitions involving standing up or sitting down, where the body must bend in forward-backward displacement. These results are similar to those of a previous study (Wu et al., 2019), which identified features extracted along the Z-axis for TUG tasks as significant; the Z-axis is seemingly an important axis.
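The three reported metrics follow directly from a two-class confusion matrix, as in the sketch below. The function name and the counts are made up for illustration, and treating the at-risk group as the positive class is an assumption. With balanced classes (22 at-risk and 22 no-risk subjects), accuracy equals the mean of sensitivity and specificity, which agrees with the reported values, for example (85.5 + 92.7) / 2 = 89.1 for the X-axis.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # at-risk subjects correctly flagged
    specificity = tn / (tn + fp)   # no-risk subjects correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


# Made-up counts for 22 at-risk and 22 no-risk subjects, not study results.
acc, sens, spec = binary_metrics(tp=19, fn=3, tn=20, fp=2)
```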

CONCLUSION
In this paper, tri-axial accelerometer data were collected from a cheap wearable sensor, and TFA was used to convert the data into TFRs. These TF images offered abundant and discriminative information such as time, frequency and spectral energy-related power in the five phases of TUG, which clarified which specific TUG subtasks were impaired. For the no-risk subjects, high energy-related power can be observed clearly in the TF images in both walk phases (walk-F and walk-B) for all three axes and in the transition phases (sit-to-stand and stand-to-sit) for the AP axis. We also applied the SAE model, a DNN-based evaluation, to classify the TFRs of elderly subjects for assessing mobility and fall risk. Experimental results show that the DNN-based evaluation offers considerably high accuracy, sensitivity and specificity. Moreover, the results indicated the superior performance of DNN-based evaluation over feature-based evaluation. Further, the Y and Z axes seem to be more important for discrimination than the X axis.
In the future, we will continue to work on DNN-based evaluation of fall risk for the elderly. This innovative method based on artificial intelligence technology, i.e., DNN-based evaluation, can be widely used in wearable sensing technology, smart home development and continuous monitoring technologies for real-time measurement and recording of various physiological signals. We trust it will improve the accessibility and convenience of people's medical care.

DATA AVAILABILITY STATEMENT
The data analyzed in this study is subject to the following licenses/restrictions: we have signed a confidentiality agreement with the hospital. Requests to access these datasets should be directed to C-HL, sweat0430@mail.ntust.edu.tw.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Tsaotun Psychiatric Center, Ministry of Health and Welfare (IRB No. 104013). The patients/participants provided their written informed consent to participate in this study.