Machine vision-based gait scan method for identifying cognitive impairment in older adults

Objective Early identification of cognitive impairment in older adults could reduce the burden of age-related disabilities. Gait parameters are associated with and predictive of cognitive decline. Although a variety of sensors and machine learning analysis methods have been used in cognitive studies, a deeply optimized machine vision-based method for analyzing gait to identify cognitive decline is needed. Methods This study used a walking footage dataset of 158 adults named West China Hospital Elderly Gait, which was labelled by performance on the Short Portable Mental Status Questionnaire. We proposed a novel recognition network, Deep Optimized GaitPart (DO-GaitPart), based on silhouette and skeleton gait images. Three improvements were applied: a short-term temporal template generator (STTG) in the template generation stage to decrease computational cost and minimize loss of temporal information; a depth-wise spatial feature extractor (DSFE) to extract both global and local fine-grained spatial features from gait images; and multi-scale temporal aggregation (MTA), a temporal modeling method based on an attention mechanism, to improve the distinguishability of gait patterns. Results An ablation test showed that each component of DO-GaitPart was essential. DO-GaitPart achieved the best performance in the backpack walking scenario on the CASIA-B dataset and outperformed the comparison methods GaitSet, GaitPart, MT3D, 3D Local, TransGait, CSTL, GLN, GaitGL, and SMPLGait on the Gait3D dataset. The proposed machine vision gait feature identification method achieved a receiver operating characteristic/area under the curve (ROCAUC) of 0.876 (95% CI 0.852–0.900) on the cognitive state classification task. Conclusion The proposed method performed well in identifying cognitive decline from the gait video datasets, making it a promising prototype tool for cognitive assessment.


Introduction
Cognitive impairment, characterized by altered performance in specific cognitive tasks such as orientation, attention, comprehension, memory, reasoning, problem-solving, organizational skills, processing speed, perseverance, and motivation (Allain et al., 2007), can affect multiple domains of cognition simultaneously or consecutively, either gradually or abruptly. Cognitive impairment and dementia are the primary causes of disability in older adults, and promoting healthy brain aging is considered a critical element in reducing the burden of age-related disabilities (Lisko et al., 2021). It is estimated that 40% of dementia cases might be prevented or delayed by modifying risk factors and improving activities of daily living (Livingston et al., 2020; Yun and Ryu, 2022). Routine, non-cognitive evaluations alone are insufficient for physicians to accurately predict patients' cognitive function. Therefore, cognitive assessment facilitates the diagnosis and potential intervention of disorders that impair thinking (Woodford and George, 2007).
The association between motor function and cognition can be understood, in part, in the context of the evolution of human bipedalism (Leisman et al., 2016). Bipedalism served as a significant basis for the evolution of the human neocortex, as it is among the most complex and sophisticated of all movements. Gait is no longer regarded as a purely motor task but as a complex set of sensorimotor behaviors heavily affected by cognitive and affective aspects (Horst et al., 2019). This may partially explain the sensitivity of gait to subtle neuronal dysfunction, why gait and postural control are associated with global cognitive function in very old people, and why they can predict the development of diseases such as diabetes, dementia, or Parkinson's disease years before clinical diagnosis (Ohlin et al., 2020).
Previous studies reported that slower walking speeds and a greater decline in speed over time are correlated with a greater risk of developing dementia independent of changes in cognition, supporting the role of gait speed as a possible subclinical marker of cognitive impairment (Hackett et al., 2018). Furthermore, spatial, temporal, and spatiotemporal measures of gait and greater variability of gait parameters are associated with and predictive of both global and domain-specific cognitive decline (Savica et al., 2017).
A variety of sensors and machine learning analysis methods have been used in cognitive studies. Chen et al. (2020), for example, used a portable gait analysis system and collected gait parameters that were used in a machine learning classification model based on support vector machine and principal component analysis. Zhou et al. (2022) collected 23 dynamic gait variables using three-dimensional (3D) accelerometer data and used random forest and artificial neural network models to classify cognitive impairment.
The purpose of this study was to develop a machine vision-based gait identification method for geriatric diseases without using contact sensors or indexes, and to explore its potential as a cognitive impairment screening tool that is convenient, objective, rapid, and non-contact. To this end, a series of hyperparameters in machine vision networks for gait feature extraction and identification were deeply optimized to produce a method called Deep Optimized GaitPart (DO-GaitPart), and both the optimized components and the full DO-GaitPart model were evaluated. Performance for dementia and mild cognitive impairment (MCI) identification was assessed by receiver operating characteristic/area under the curve (ROCAUC). These methods may be suitable for community screening and may generalize to other gait-related approaches to disease identification.

Participants
The current research was a cross-sectional analysis of part of the baseline data from the West China Health and Aging Trend study, an observational study designed to evaluate factors associated with healthy aging among community-dwelling adults aged 50 years and older in western China. In 2019, we included a subset of 158 participants in Sichuan province. All participants (or their proxy respondents) were recruited by convenience sampling and provided written informed consent to the researchers, and our institutional ethics review boards approved the study. All researchers followed local law and protocol to protect the rights of privacy and likeness and other interests of participants in this study.

Definition of cognitive impairment
The Short Portable Mental Status Questionnaire (SPMSQ), a widely employed cognitive assessment tool that encompasses location, character orientation, and calculation, was applied. The established cutoff for differentiating between healthy participants and those with mild or more severe cognitive impairment was more than 3 errors on the 10 questions (Pfeiffer, 1975).
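The labelling rule above can be sketched as a small helper. This is a minimal illustration, not code from the study; the function name and label strings are our own.

```python
# Hedged sketch of the SPMSQ labelling rule: more than 3 errors out of the
# 10 questions marks a participant as having mild or worse cognitive
# impairment (Pfeiffer, 1975). Names here are illustrative.

def spmsq_label(error_count: int) -> str:
    """Return the cognitive-status label for one participant."""
    if not 0 <= error_count <= 10:
        raise ValueError("SPMSQ error count must be between 0 and 10")
    return "impaired" if error_count > 3 else "healthy"
```

Under this rule a participant with exactly 3 errors is still labelled healthy; only 4 or more errors crosses the cutoff.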

Recording of walking video
The recording setup was similar to that used in our previous research (Liu et al., 2021). Gait videos were shot in spacious, warm, level, well-lit indoor environments. A complete recording of each participant included six 4 m walking sequences, with three synchronized video segments shot using three different cameras (F = 4 mm, DS-IPC-B12V2-I, Hikvision, Zhejiang, China) for each sequence. The cameras were mounted approximately 1.3 m above the ground, and their angles were adjusted to ensure that the participant's whole body was filmed for the entire gait process between benchmarks. Data were stored by the recorder (DS-7816N-R2/8P, Hikvision, Zhejiang, China) in MP4 format at 1080p resolution.

Pretreatment of recording footage and data set
The video files of each walking sequence were converted into static image frames (Figure 1A). The raw silhouette of each walking participant was obtained with the RobustVideoMatting method (Lin et al., 2022) (Figure 1B). The FindContours function of the OpenCV library in Python was used to segment the minimum external rectangle of the largest silhouette to obtain a more refined silhouette; after the participant image was centralized and normalized to 256 × 256, the gait silhouette sequence was generated (Figure 1C). Skeleton points were extracted from the gait silhouette sequence with HRNet (Sun et al., 2019) (Figure 1D). Our dataset, named West China Hospital Elderly Gait (WCHEG), was used to validate the model along with two open gait video databases, CASIA-B and Gait3D. CASIA-B (Yu et al., 2006) includes data from 124 participants, with 6 normal walking sequences, 2 long-clothing sequences, and 2 backpacking sequences per participant. Gait3D (Zhu et al., 2021) is a large-scale outdoor dataset of 5,000 participants, with 1,090 total hours of gait video. The WCHEG dataset was used to test the effectiveness of the model in recognizing cognitive impairment. Each dataset uses gait skeleton images and silhouette images as model inputs, both of which have a size of 128 × 128.
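The silhouette refinement step can be sketched as follows. This NumPy-only version mimics the bounding-box crop and centring geometry; the actual pipeline uses OpenCV's FindContours on the largest contour and resizes to 256 × 256, so treat this as an illustration of the geometry only.

```python
import numpy as np

# Illustrative sketch of silhouette refinement: locate the tight bounding
# box of the foreground pixels in a binary mask, crop it, and centre the
# crop on a square canvas. (The paper's pipeline additionally selects the
# largest contour and resizes; both are omitted here for brevity.)

def crop_and_center(mask: np.ndarray, size: int = 256) -> np.ndarray:
    ys, xs = np.nonzero(mask)                     # all foreground pixels
    if ys.size == 0:                              # empty frame: blank canvas
        return np.zeros((size, size), dtype=mask.dtype)
    crop = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = crop.shape
    canvas = np.zeros((size, size), dtype=mask.dtype)
    top, left = (size - h) // 2, (size - w) // 2  # assumes crop fits canvas
    canvas[top:top + h, left:left + w] = crop
    return canvas
```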

Machine vision approach and analysis
Our gait dataset WCHEG included more than 400,000 frames of raw static images and corresponding silhouette and skeleton gait images. The main purpose of our optimized design was to balance the computational power consumption and classification accuracy of the model. A temporal part-based module, GaitPart (Fan et al., 2020), which was designed based on the idea that local short-range spatiotemporal features (micro-motion patterns) are the most discriminative characteristics of human gait, was applied as the original analysis framework in the current study. To better adapt this method to the mission of cognitive impairment assessment, three novel components were designed in our analysis pipeline to achieve the proposed DO-GaitPart (Figure 2): the short-term temporal template generator (STTG), the depth-wise spatial feature extractor (DSFE), and multi-scale temporal aggregation (MTA).

STTG
To ensure that the input gait sequence contains a complete gait cycle with less computational cost and minimal loss of temporal information, we designed the STTG. We grouped the input dual-channel gait sequence X_in into M sets frame by frame and created a short-term temporal template using systematic random sampling. Most previous work (Fan et al., 2020; Huang X. et al., 2021; Kaur et al., 2023) directly input gait sequences into the network frame by frame, with each input gait sequence including at least one gait cycle, which meant that the mean sequence size was usually 30 frames, equivalent to more than 1 s. Because some of our participants had cognitive impairment as well as a low stride frequency, a gait cycle often contained far more than 30 frames. As shown in Figure 3A, adjacent frames are highly similar, which generates a large amount of redundant information; the STTG (Figure 3E) can avoid the disadvantages of the alternative template generating methods in Figures 3B–D. In the current study, we compared M = 2, 3, 4, and 5 and found that the best results were achieved at M = 4.
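Our reading of the STTG sampling scheme (Figure 3E) can be sketched as follows: the frame indices are partitioned into M interleaved sets, and one set is chosen at random, so one template frame is kept out of every M consecutive, highly similar frames. The function name and signature are illustrative assumptions.

```python
import numpy as np

# Sketch of systematic random sampling as used by the STTG: pick a random
# offset in [0, M), then take every M-th frame from that offset. This keeps
# the template spread across the whole sequence while cutting the frame
# count (and hence compute) by a factor of M.

def sttg_sample(num_frames: int, M: int = 4, rng=None) -> np.ndarray:
    rng = np.random.default_rng() if rng is None else rng
    offset = int(rng.integers(M))            # systematic random start
    return np.arange(offset, num_frames, M)  # indices of template frames
```

For an 80-frame input with M = 4 this yields a 20-frame template, one quarter of the original sequence length.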

DSFE
We developed the DSFE to extract both global and local fine-grained spatial features from gait images. Many previous models (Huang X. et al., 2021; Li et al., 2023) used only basic convolutional neural network (CNN) modules to extract spatial features from gait images, which fails to capture all gait details. Some networks, such as GaitPart (Fan et al., 2020), developed a component focal convolutional network (FConv) to extract part features, but then simply combined those part features and, as a result, ignored the connections between them. In contrast, the DSFE extracts partial spatial features while keeping the relations between part features. The DSFE consists of three blocks. The first block contains one two-dimensional convolutional network (Conv2d) layer and one depth-wise spatial Conv2d (DS-Conv2d) layer. The following two blocks contain two Conv2d layers each. The specific network structure is shown in Table 1. For the DSFE module, we compared the location and quantity of replacing Conv2d with DS-Conv2d in Block 1, Block 2, and Block 3, respectively. We found that using DS-Conv2d in the second layer of Block 1 had the best performance.
The structure of the DS-Conv2d module is shown in Figure 4 and can be expressed as Equation (1):

DS-Conv2d(X) = DW-D-Conv2d(DW-Conv2d(X)),  (1)

where depth-wise two-dimensional convolutional network (DW-Conv2d) represents depth-wise convolution (Guo et al., 2023). As shown in Figure 4, depth-wise convolution extracts local features from a single-channel spatial feature map; each convolutional kernel performs convolution operations on only a single channel. Depth-wise dilated two-dimensional convolutional network (DW-D-Conv2d) is a special type of depth-wise convolution that introduces dilated convolution to increase the model's receptive field and extract long-range features from a single spatial feature map. The combination of the two parts takes into account local contextual information, enlarges the receptive field, and enables the extraction of richer spatial information from the gait sequence. Leaky rectified linear unit (LeakyReLU) is the activation function, which can be expressed as Equation (2):

LeakyReLU(x) = x if x ≥ 0, and αx otherwise, where α is a small positive slope.  (2)
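A minimal NumPy sketch of this structure, assuming DS-Conv2d composes a depth-wise convolution with a depth-wise dilated convolution as described above (3 × 3 kernels, dilation 2, per Figure 4). A real implementation would use a deep learning framework's grouped convolutions; this loop-based version only illustrates the per-channel arithmetic.

```python
import numpy as np

def depthwise_conv2d(x, k, dilation=1):
    """Per-channel 2D convolution with 'same' zero padding.
    x: (C, H, W) feature map; k: (C, kh, kw), one kernel per channel."""
    C, H, W = x.shape
    kh, kw = k.shape[1:]
    ph, pw = dilation * (kh - 1) // 2, dilation * (kw - 1) // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros((C, H, W))
    for c in range(C):
        for i in range(H):
            for j in range(W):
                patch = xp[c,
                           i:i + dilation * (kh - 1) + 1:dilation,
                           j:j + dilation * (kw - 1) + 1:dilation]
                out[c, i, j] = np.sum(patch * k[c])  # single-channel conv
    return out

def leaky_relu(x, alpha=0.01):
    # Equation (2): identity for x >= 0, small slope alpha otherwise
    return np.where(x >= 0, x, alpha * x)

def ds_conv2d(x, k_dw, k_dwd):
    # Our reading of Equation (1): DW-Conv2d followed by DW-D-Conv2d
    return depthwise_conv2d(depthwise_conv2d(x, k_dw), k_dwd, dilation=2)
```

Because every kernel touches only its own channel, depth-wise convolution costs C times fewer multiply-adds than a standard Conv2d with the same kernel size, which is why it suits the paper's compute/accuracy trade-off.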

MTA
MTA is composed of multiple parallel multi-scale temporal modules (MTMs), each of which is responsible for extracting features from the corresponding part of the gait sequence, acquiring multi-scale temporal features. The output of the DSFE module passes through the horizontal pooling (HP) module to obtain the features F^HP_j of the j-th horizontal part. Each part is then input into an MTM, as shown in Figure 5, which extracts frame-level features f^f_j and long short-term features f^ls_j and aggregates them into multi-scale temporal features f^MTA_j, where f^f_j, f^ls_j, and f^MTA_j represent the frame-level temporal features, long short-term temporal features, and multi-scale temporal features of the j-th horizontal part, respectively; the data are normalized to a mean of 0 and a standard deviation of 1 by batch normalization. BiLSTM(·) is a special type of long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) known as bi-directional LSTM (BiLSTM), which is capable of accessing both past and future information in a time series, introducing more contextual dependencies and performing well in extracting short-term and long-term relationships. Concat(·) represents the concatenation operation, connecting the frame-level feature F^f with the long short-term feature F^ls along the channel dimension. Attention(·) uses the attention mechanism of SENet (Hu et al., 2018), which focuses on the relationships between channels and performs feature weighting on the channel dimension; the greater the weight, the higher the correlation between the channel and key temporal information. Meanwhile, we introduce the Dropout (Srivastava et al., 2014) technique in Attention(·), which mitigates the overfitting phenomenon and enhances the model's ability to generalize to new data. TP(·) represents temporal pooling, and according to previous research (Fan et al., 2020), selecting TP(·) = max(·) yields better results.
We compared the classification results of the frame-level features, long short-term features, and multi-scale aggregated features. We found that the long short-term features performed better than the frame-level features, and the multi-scale aggregated features achieved the best classification results. By extracting frame-level and long short-term temporal features, the MTM captures abstract features at different scale levels in the gait sequence and then uses an attention mechanism to aggregate more distinctive temporal information.
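The MTM data flow described above (long short-term extraction, channel concatenation, SE-style attention, temporal max pooling) can be sketched schematically. For simplicity the BiLSTM stage is passed in as a callable stand-in and the SE weights are random, so this shows the shapes and flow only, not the trained behavior.

```python
import numpy as np

def se_attention(f, w1, w2):
    """SENet-style squeeze-and-excitation over channels. f: (C, S)."""
    z = f.mean(axis=1)                                         # squeeze: (C,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(0.0, w1 @ z))))  # excitation
    return f * s[:, None]                                      # re-weight channels

def mtm(frame_feats, bilstm_like, rng=None):
    """One multi-scale temporal module for one horizontal part.
    frame_feats: (C, S); bilstm_like: stand-in for the BiLSTM stage."""
    rng = np.random.default_rng(0) if rng is None else rng
    long_short = bilstm_like(frame_feats)               # long short-term feats
    multi = np.concatenate([frame_feats, long_short])   # Concat on channel dim
    c2 = multi.shape[0]
    w1 = rng.standard_normal((c2 // 2, c2)) * 0.1       # SE bottleneck weights
    w2 = rng.standard_normal((c2, c2 // 2)) * 0.1
    return se_attention(multi, w1, w2).max(axis=1)      # temporal max pooling
```

Each horizontal part of the feature map would run through its own MTM instance, and the pooled vectors are what the downstream classifier consumes.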

Loss function and sample
During the training stage, both the separate batch all (BA+) triplet loss (Hermans et al., 2017) and the label smoothing cross entropy loss (Szegedy et al., 2016) were used to achieve more effective training results. The multiple loss function L_mul can be defined as L_mul = λ_tri · L_tri + λ_cro · L_cro, where L_tri and L_cro represent the BA+ triplet loss and the label smoothing cross entropy loss, respectively, and λ_tri and λ_cro represent the weight coefficients of the loss functions. Here, λ_tri = 1.0 and λ_cro = 0.2. The batch size was set to (p, k) = (4, 6), which means that every batch includes p participants, and k gait image sequences are picked from every participant's footage. The length of the analyzed sequence is 80 frames. If the length of the original sequence is less than 15 frames, it is discarded; if the length is between 15 and 80 frames, it is repeatedly sampled.
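The combined objective can be sketched as follows. The triplet term here is a simplified single-triplet version rather than the full BA+ loss, which mines all anchor/positive/negative combinations per batch, so treat it as illustrative only.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Simplified single-triplet hinge loss on embedding distances
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

def label_smoothing_ce(logits, target, eps=0.1):
    # Cross entropy against a smoothed target distribution
    n = logits.size
    shifted = logits - logits.max()                  # numerically stable
    log_p = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    smooth = np.full(n, eps / n)
    smooth[target] += 1.0 - eps
    return float(-(smooth * log_p).sum())

def l_mul(l_tri, l_cro, lam_tri=1.0, lam_cro=0.2):
    # L_mul = lambda_tri * L_tri + lambda_cro * L_cro, weights as in the paper
    return lam_tri * l_tri + lam_cro * l_cro
```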

Comparison, ablation, and classification
CASIA-B and Gait3D were used to compare individual recognition accuracy between previous gait analysis methods and DO-GaitPart. To determine which components of our model led to better adaptation to the gait analysis mission, components were removed from the total pipeline in a process known as ablation. We set eight groups of different hyperparameters for experiments and compared accuracy with that of GaitPart (composed of three Block + HP + temporal pooling modules, where each layer includes two convolutional layers and one maximum pooling layer) as the baseline in the individual recognition task. A two-class classification of mild-or-worse cognitive impairment gait versus healthy gait features was designed to evaluate the performance of the models as cognitive classifiers on the WCHEG dataset. The ground-truth state for all gait features in this experiment was labelled using the previously performed SPMSQ assessment.

Ablation study
We found that each component of our model is essential, and the addition of each component provides a positive gain in the identification results on both datasets. The best performance of the model was achieved when the three components were deployed simultaneously (Table 2). Furthermore, we conducted ablation studies on the specific parameters of each module.

Analysis of different M numbers of STTG
The ablation experiments were designed to determine the most appropriate parameter choice for the STTG (Table 3). The inter-frame similarity of the gait decreases as the value of M increases, so the same number of frames can contain more gait information, reaching an optimum at M = 4. However, when the value of M is too large, continuity between frames decreases, which affects the learning of the complete gait action. Meanwhile, on the WCHEG dataset, the introduction of the STTG shows a more significant performance improvement because the STTG allows the input to contain more complete gait cycles.

Analysis of different insertion positions of DS-Conv in DSFE
We conducted the ablation study by replacing the second Conv layer with DS-Conv in each of the three Blocks of the DSFE (Table 4). By comparison, adding DS-Conv in Block 1 has the best performance, because no pooling operation has been performed at that point, which avoids the effects of input distortion and information loss and better fuses contextual information with large-receptive-field information. Meanwhile, overuse of this module can lead to the loss of fine-grained information, which in turn leads to poorer model performance.

Effectiveness of MTA
To validate the effectiveness of the MTA, we set up ablation experiments (Table 5). BiLSTM obtains better results than LSTM for extracting long and short-term features because its bidirectional computation acquires more comprehensive temporal features. Meanwhile, the use of Attention better fuses the multi-scale features and reduces the risk of overfitting via the dropout method.

Characterization of participants
We compared the background information of participants between the training/validation and test sets (Table 7). We found no statistically significant differences between the sets.

Classification
Table 8 presents a comparison of predictive performance among various methods for cognitive state classification, with a focus on gait features. Machine vision-based classification techniques, specifically DO-GaitPart, GaitSet, and GaitPart, exhibit notably superior performance compared to approaches using age, grip strength, and walking time characteristics. The significance levels for all methods, except for age and 3 m re-entry time, are less than 0.001, providing statistical evidence for the potential of these methods in identifying cognitive impairment. Notably, among these gait-based methods, DO-GaitPart achieves the highest ROCAUC value (0.876, Figure 6), with a 95% confidence interval of 0.852–0.900, indicating its robust predictive capability for cognitive impairment.
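A confidence interval of this form is typically obtained by bootstrapping the test set. The paper does not describe its exact CI procedure, so the rank-based AUC and percentile bootstrap below are assumptions offered for illustration.

```python
import numpy as np

def roc_auc(labels, scores):
    """Rank-based (Mann-Whitney) ROCAUC: the fraction of
    (positive, negative) pairs ranked correctly, ties counting 0.5."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap 95% CI for the ROCAUC point estimate."""
    rng = np.random.default_rng(seed)
    labels, scores = np.asarray(labels), np.asarray(scores)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(len(labels), size=len(labels))
        if labels[idx].min() == labels[idx].max():
            continue  # resample lacks one class; AUC undefined, skip
        aucs.append(roc_auc(labels[idx], scores[idx]))
    return np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
```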

Discussion
In the current study, a machine vision method based on visible-light camera footage of walking was implemented to identify mild and worse cognitive impairment among older adults. First, a walking video dataset of 158 participants aged 50 and older, labelled using a cutoff of three errors on the SPMSQ, was created. All images of the gait sequences were segmented, normalized, and refined. Skeleton point information was extracted from the sequences with HRNet. Gait skeleton points and silhouette information were used in a trained recognition network, DO-GaitPart. To decrease computational cost and minimize the loss of temporal information, the STTG was applied in the template generation stage. The DSFE was used to extract more spatial features while keeping the relations between features. The attention mechanism-based MTA extracted more multi-scale temporal features, including frame-level and long short-term temporal features, and aggregated more characteristic features.
After training, the machine vision methods achieved better predictive performance globally than age, grip strength, or 4 m walking time in the healthy versus cognitive impairment classification task. Although silhouettes contain information regarding variation in walking appearance and movement, long clothing and carrying a backpack could mislead the feature extraction process in silhouette-only methods. Here, both skeleton points and silhouette information were used to generate gait features, as skeleton points characterize human joint movement and decrease the impact of clothing and carried objects. The data input into the analysis model should contain a full gait cycle, which has a large computational cost. Compared with the previous sampling method, random sampling, the STTG greatly increases the information entropy that the input sequence contains while maintaining the same computational cost. GaitPart developed FConv to extract part features, but it ignored the connections between part features. With the applied depth-wise convolution and depth-wise dilated convolution, the DSFE comprehensively extracted contextual information and long-range features. GaitPart considered long-range features to have little effect and provided a micro-motion capture module to extract short-range features. In our experiments, long-range features also have unique advantages in gait recognition compared with short-range features. Therefore, we designed the MTA module to aggregate multi-scale temporal features, including frame-level features, short-term features, and long-term features. Although DO-GaitPart exhibited good performance in the cognitive identification task, long clothing that covered the participant's body could decrease the precision of skeleton point identification and segmentation, thus influencing the performance of the overall method. Like most nonlinear regression algorithms, part of the analysis process in the current study was not interpretable, understandable, or straightforward (Liang et al., 2022).
Research on MCI and Alzheimer's disease increasingly emphasizes the application of machine vision and modal fusion algorithms. Key techniques, including prior-guided adversarial learning, brain structure-function fusion, and multimodal representation learning, are being actively explored to improve diagnostic precision and enable earlier predictions of cognitive decline (Zuo et al., 2021, 2023, 2024). As these techniques evolve, they are poised to significantly advance our comprehension and treatment of neurodegenerative conditions. However, the performance of our method in cognitive impairment classification tasks is still limited by the dataset size and the uncertainty of cognitive impairment labels. In future work, expanding the dataset and incorporating additional cognitive function screening scales, such as the MMSE and MoCA, will ensure more accurate and stable data labeling. Additionally, the analysis of gait features should be extended to improve the model's ability to recognize different levels of cognitive impairment.

FIGURE 6
Receiver operating characteristic/area under the curve (ROCAUC) of the test set via DO-GaitPart, GaitPart, GaitSet, grip strength, age, 4 m walking time, and 3 m re-entry time.

Conclusion
This study introduces DO-GaitPart, a machine vision method for identifying cognitive impairment in the elderly from walking videos, featuring three key advancements: STTG, DSFE, and MTA. Addressing the global challenge of managing progressive cognitive decline (Jia et al., 2021), this non-invasive, cost-effective tool optimizes elder healthcare by conserving manpower and broadening its scope (Newey et al., 2015; Reynolds et al., 2022). Utilizing affordable cameras, it enables high-frequency, long-term cognitive assessments, potentially inspiring self-reporting tests and telemedicine for cognitive health (Charalambous et al., 2020; Hernandez et al., 2022). The method's machine learning algorithms also show promise for detecting other geriatric conditions, enhancing the toolkit for geriatric care.

FIGURE 2
Overview of the proposed gait analysis model. Extract the original gait sequence from the raw gait footage, which includes silhouette and skeleton gait images. Then, input the gait sequence into the STTG to generate the template sequence, and input it into the DSFE to extract depth-wise spatial features. Then, horizontally cut the output into n parts to obtain depth-wise part features. Furthermore, input each part into an MTM separately to obtain the output multi-scale spatial-temporal features. Obtain the feature matrix through full connection and batch normalization, train the model through a series of loss functions such as triplet loss and cross entropy loss, and test through evaluation indicators such as ROC to achieve cognitive assessment.

FIGURE 3
Different temporal template generating methods, with M = 4: (A) raw image sequence X = {x_i | i = 1, 2, …, t}; (B) gait energy image method; (C) equidistant sampling method, sampling images with equal spacing of M − 1 from the beginning; (D) simple random sampling of every M images; and (E) short-term temporal template generator, which divides the whole gait sequence into M sets and randomly selects a set at a time.

FIGURE 4
The convolution part of Block 1 in the frame of the DSFE, including Conv2d, DS-Conv2d, and LeakyReLU. The DS-Conv2d convolution operation processes a pixel (pink cube) of a three-dimensional feature map of a single frame (the whole cube). The information (all colored cubes) contained in the receptive field is weighted and aggregated into the pink cube. The H, W, and C of the cube represent the height, width, and channel dimensions of the feature map. The dark cubes indicate the position of the convolution kernel. The kernel sizes of Conv2d, DW-Conv2d, and DW-D-Conv2d are all 3 × 3, and the dilation rate of DW-D-Conv2d is 2. Note: the operation process omits the zero filling.

FIGURE 5
The calculation process of the MTA and the details of the MTM. The input is the three-dimensional gait feature maps, where P represents the component dimension, S represents the time dimension, and C represents the channel dimension; a semi-transparent cube represents the omission of feature maps. Along the component dimension, F^HP is input into the MTM modules to obtain multi-scale temporal features.

TABLE 1
Detailed parameters for the depth-wise spatial feature extractor (DSFE). Conv2d, two-dimensional convolutional network; In C, input channels; Out C, output channels; kernel, kernel size; dilation, dilation rate; padding, zero padding.

TABLE 2
Accuracy comparison (%) with different additions of the three components of our model on CASIA-B and WCHEG. NM, normal walking; BG, carrying bags; CL, wearing coats or jackets. Bold values indicate the best-performing method, model, module, or algorithm in each comparison.

TABLE 3
Accuracy comparison (%) with different M numbers of STTG on CASIA-B and WCHEG.

TABLE 4
Accuracy comparison (%) when replacing Conv with DS-Conv in different blocks of the DSFE on CASIA-B and WCHEG.
Bold values indicate the best-performing method, model, module, or algorithm in each comparison.

TABLE 5
Accuracy comparison (%) with different algorithms used by the MTA on CASIA-B and WCHEG. Bold values indicate the best-performing method, model, module, or algorithm in each comparison.

TABLE 6
Accuracy comparison with previous gait identification methods on CASIA-B and Gait3D. NM, normal walking; BG, carrying bags; CL, wearing coats or jackets. Bold values indicate the best-performing method, model, module, or algorithm in each comparison.

TABLE 7
Characterization and cognitive status of participants among 158 older adults.

DO-GaitPart stands significantly ahead of the other methods, as evidenced by the substantially lower significance values. Moreover, DO-GaitPart operates with remarkable efficiency, consuming a mere 0.013 s per gait sequence, ensuring a swift response to gait-related information. Conversely, methods relying on age and grip strength exhibit comparatively lower ROCAUC values, signaling their limited effectiveness in cognitive state classification. In summary, these results underscore the efficacy of machine vision-based gait feature classification, particularly DO-GaitPart, in predicting cognitive impairment.

TABLE 8
Predictive performance of cognitive state classification via different methods. ROCAUC, receiver operating characteristic/area under the curve. Bold values indicate the best-performing method, model, module, or algorithm in each comparison.