CSL-SHARE: A Multimodal Wearable Sensor-Based Human Activity Dataset

In this digital age, human activity recognition (HAR) plays an increasingly important role in improving people’s quality of life in almost all of its aspects, such as auxiliary medical care, rehabilitation technology, and interactive entertainment. Besides external sensing, sensor-based internal sensing for HAR is also intensively studied. A large body of research involves recognizing various kinds of everyday human activities, including walking, standing, jumping, and performing gestures. HAR research relies on large amounts of data, which includes the collection of laboratory data that meet in-house research goals, as well as the use of external and public databases to verify models and methods. Data collection is therefore an essential part of our entire HAR research work, and we detail this extensive process in this article. Many public HAR datasets are available online, providing various sorts of collected data, some of which share similarities with our in-house data acquisition in terms of purpose, sensor selection, or protocol design. For instance, the Opportunity benchmark database (Chavarriaga et al., 2013) contains naturalistic daily living activities recorded with a large set of on-body sensors. The UniMiB SHAR dataset (Micucci et al., 2017) includes 11,771 samples of both human activities and falls divided into 17 fine-grained classes. The GaitAnalysisDataBase (Loose et al., 2020) contains 3D walking kinematics and muscle activity data from healthy adults walking at normal, slow, or fast pace on flat ground or at incremental speeds on a treadmill. The RealWorld dataset (Sztyler and Stuckenschmidt, 2016) covers acceleration, GPS, gyroscope, light, magnetic field, and sound level data of the activities climbing stairs down and up, jumping, lying, standing, sitting, running/jogging, and walking of 15 subjects.
The FORTH-TRACE dataset (Karagiannaki et al., 2016) was collected from 15 participants wearing five Shimmer wearable sensor nodes on the left/right wrist, the torso, the right thigh, and the left ankle. The ENABL3S dataset (Hu et al., 2018) contains bilateral electromyography (EMG) and joint and limb kinematics recorded from wearable sensors for ten able-bodied individuals as they freely transitioned between sitting, standing, and five walking-related activities. In this article, we disclose our in-house collected sensor-based dataset, CSL-SHARE (Cognitive Systems Lab Sensor-based Human Activity REcordings). Building on improvements to the recording plan and organization gained from collecting the pilot datasets CSL17 (1 subject, 7 activities of daily living, 15 minutes) and CSL18 (4 subjects, 21 activities of daily living and sports, 90 minutes), the CSL-SHARE dataset covers 22 types of activities of daily living and sports from 20 subjects in a total time of 691 minutes, of which 363 minutes are segmented and annotated. In this dataset, we used two triaxial accelerometers, two triaxial gyroscopes, four surface electromyography (sEMG) sensors, one biaxial electrogoniometer, and one airborne microphone integrated into a knee bandage, bringing the total number of channels to 19; according to existing studies, such as Whittle (1996), Rowe et al. (2000), Mathie et al. (2003), Kwapisz et al. (2010), Rebelo et al. (2013), and Teague et al. (2016), these sensors provide usable and reliable biosignals for HAR research, gait analysis, and health assessment. We also tried to use a piezoelectric microphone and a force sensor for sensing the acoustic and physical pressure signals from the knee during the acquisition. Nevertheless, in subsequent analysis and research, we found no evidence to support their contribution to HAR research and therefore removed these two signal channels from the public dataset. In addition, although the two pilot datasets mentioned above, CSL17 and CSL18, are not publicly available due to their relatively small data volume, they can be obtained from us for scientific research purposes.

Devices, Sensors, and Sensor Placement
We chose the biosignalsplux Researcher Kit, which is supplied together with various types of sensors. One hub from the kit records biosignals from eight channels, each at up to 16 bits, simultaneously. Since we needed to record over 20 channels, we connected three hubs via synchronization cables, which automatically synchronize all channels across the hubs at the beginning of each recording session and thereby ensure synchronicity throughout the entire session.
EMG and microphone signals were recorded at a sampling rate of 1,000 Hz and all other signals at 100 Hz. The low-rate channels at 100 Hz were up-sampled to 1,000 Hz to synchronize and align them with the high-rate channels. All channels have a quantization resolution of 16 bits.
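The article does not state which interpolation method was used for the up-sampling, so the following sketch (function name and method choices are ours) illustrates two common options, sample-and-hold repetition and linear interpolation, for a factor-10 up-sampling from 100 Hz to 1,000 Hz:

```python
import numpy as np

def upsample_to_1khz(signal_100hz, method="repeat"):
    """Up-sample a 100 Hz channel to 1,000 Hz (factor 10).

    This is an illustrative sketch, not the pipeline's actual code:
    'repeat' holds each sample for 10 ms (sample-and-hold), while
    'linear' interpolates between neighboring samples.
    """
    factor = 10
    x = np.asarray(signal_100hz, dtype=float)
    if method == "repeat":
        # Each 100 Hz sample becomes 10 identical 1 kHz samples.
        return np.repeat(x, factor)
    if method == "linear":
        # Interpolate on a 10x denser time grid.
        t_old = np.arange(len(x))
        t_new = np.arange(len(x) * factor) / factor
        return np.interp(t_new, t_old, x)
    raise ValueError(f"unknown method: {method}")
```

Either variant keeps all channels sample-aligned at 1,000 Hz, so one sample index corresponds to 1 ms across all 19 channels.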

Software for Data Acquisition
We developed a software package called Activity Signal Kit (ASK) with a Graphical User Interface (GUI) and multiple functionalities, using the driver library provided by biosignalsplux, as introduced in Liu and Schultz (2018). ASK automatically connects and synchronizes several recording hubs, then collects up to 24 channels of sensor data from all hubs simultaneously and continuously. All recorded data are archived automatically in HDF5 files named with dates and timestamps for further research.
A protocol-for-pushbutton mechanism for segmentation and annotation has been implemented in the ASK software, which is introduced in Section 2.3. Moreover, the baseline ASK software also provides functionalities for digital signal processing, feature extraction, modeling, training, and recognition by applying our in-house developed HMM-based decoder BioKIT (Telaar et al., 2014).

Annotation and Segmentation
The task of segmentation in HAR research is to split a relatively long sequence of activities into several single-activity segments, while annotation is the process of labeling each segment, such as "walk," "run," or "stand-to-sit." Segmentation, which can be performed manually (Rebelo et al., 2013), semi-supervised (Barbič et al., 2004), or automatically (Guenterberg et al., 2009; Micucci et al., 2017), is undoubtedly a prerequisite for annotation, and its output is the input for digital signal processing and feature extraction. Annotation, which can be performed directly after each segmentation subtask, serves two follow-up operations: training and evaluation.
In our research, we applied the pushbutton of the biosignalsplux Researcher Kit in our proposed semi-automated segmentation and annotation solution. In subsequent research, the applicability of the semi-automatically segmented data has been verified for our research purposes in numerous experiments (see Section 4), so we have been applying this mechanism to our successively acquired datasets.
The so-called protocol-for-pushbutton mechanism of segmentation and annotation has been implemented in the ASK software (Liu and Schultz, 2018). When the "segmentation and annotation" mode is switched on during data acquisition, a predefined activity sequence protocol is loaded into the software, which prompts the user to perform the activities one after the other. Each activity is displayed on the screen one by one while the user controls the activity recording by pushing, holding, and releasing the pushbutton, following the software's instructions step by step. For example, when the prompted activity is "walk," the user sees the instruction "Please hold the pushbutton and do: walk." The user prepares, then pushes the button and starts to walk, keeps holding the pushbutton while walking for a duration at will, and finally releases the pushbutton to finish this activity. Upon release, the system displays the next activity instruction, e.g., "stand-to-sit," and the process continues until the predefined acquisition protocol is fully processed.
The ASK software records the timestamps/sample numbers of each button push and release during the data recording. These data are archived in CSV files as the segmentation and annotation results for each activity. Since all data are synchronized at 1,000 Hz, each sample represents 1 ms. For example, the line "sit, 3647, 6163" in a CSV file means that the activity segment labeled "sit" lasts 2,516 samples from timestamp 3,647 to 6,162, which corresponds to 2.516 seconds. These 2,516 samples form one segment for training the activity model "sit" or for evaluating recognition results. The protocol-for-pushbutton mechanism was implemented to reduce the time and labor costs of manual annotation. The resulting segmentations are of high quality, required little to no manual correction, and lay a good foundation for subsequent research. Nevertheless, this mechanism has some limitations:
• The mechanism can only be applied during acquisition and is incapable of segmenting archived data
• Clear activity start-/endpoints need to be defined, which is impossible in cases like field studies
• Activities requiring both hands are not possible because participants hold the pushbutton
• The pushbutton operation may consciously or subconsciously affect the activity execution
• A participant forgetting to push or release the button results in subsequent segmentation errors.
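A minimal sketch (ours, not part of ASK) of how such a CSV annotation line can be parsed, assuming the format "label, start, end" with an inclusive start sample, an exclusive end sample, and the 1,000 Hz synchronization rate:

```python
def parse_segment(line, fs=1000):
    """Parse one annotation line, e.g. 'sit, 3647, 6163'.

    The start index is inclusive and the end index exclusive, so the
    segment length in samples is end - start; at fs = 1,000 Hz each
    sample corresponds to 1 ms.
    """
    label, start, end = (field.strip() for field in line.split(","))
    start, end = int(start), int(end)
    n_samples = end - start
    return {
        "label": label,
        "start": start,              # first sample index (inclusive)
        "end": end,                  # one past the last sample
        "n_samples": n_samples,
        "duration_s": n_samples / fs,
    }
```

For the example above, the parser yields a 2,516-sample segment of 2.516 seconds labeled "sit."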
None of these limitations, except forgetting to release the pushbutton, hold in a laboratory setting with clear instructions and protocols. Hence, misapplication of the pushbutton was addressed by real-time human monitoring of the incoming sensor signals, including the pushbutton channel, during acquisition. Additionally, a mobile phone video camera for post verification and adjustments was used (see Section 2.4).

Post Verification
Although the "segmentation and annotation" mode of the ASK software was switched on to segment and annotate the recorded data efficiently, a mobile phone video camera additionally recorded the entire biosignal acquisition sessions so that misoperations of pushing/holding/releasing could be corrected manually after the data recording.
After each recording event with one subject, the collected data and the automatically generated segments with annotation labels were examined thoroughly based on the video. Segments with minor human-factor errors were corrected by manually shifting the start-/endpoints forward or backward a short distance, while segments with problems that could not easily be corrected were discarded, which is one of the reasons for the slight divergence among the activity occurrences in Table 1 (see Section 2.8 for another reason). A script to automatically detect activity-length outliers was also implemented to assist the post verification. After finishing the correction and verification, we deleted all recorded videos to preserve privacy.
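The exact outlier criterion of that script is not disclosed; as an illustration, a simple z-score rule over per-activity segment durations might look like the following sketch (function name and threshold are our assumptions, not the original implementation):

```python
import statistics

def length_outliers(durations, z_thresh=1.5):
    """Flag segment durations that deviate strongly from the
    per-activity mean -- a simple stand-in for the (undisclosed)
    activity-length outlier check used during post verification.

    Returns the indices of durations whose z-score exceeds z_thresh.
    """
    mean = statistics.fmean(durations)
    std = statistics.pstdev(durations)
    if std == 0:
        return []  # all segments equally long, nothing to flag
    return [i for i, d in enumerate(durations)
            if abs(d - mean) / std > z_thresh]
```

Flagged segments would then be checked against the video before correction or removal, rather than discarded automatically.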
The number of repetitions of each to-record activity per protocol follows a pre-designed plan. In the post verification (see Section 2.4), a few non-conforming and erroneous segments were removed.
The meaning of most activities in the CSL-SHARE dataset is self-explanatory from their names or from the descriptions in the protocols. The "spin-left/right" activity can be understood as the "left/right face" action in the army (but performed in everyday situations, not as strictly as in military training). The "V-cut" activity is a step in which a body rotation (instead of a directional change) takes place, as shown in Figures 1F, G. Some activities in the CSL-SHARE dataset are subdivisions of original activities. For example, "spin-left" is divided into "spin-left-left-first" and "spin-left-right-first," denoting which foot is moved first. Similarly, "spin-right," "V-cut-left," and "V-cut-right" are also divided into two activities each with regard to the first-moved foot. These activities are subdivided because they involve only one gait cycle, and we only use the sensors placed on the right-leg-worn bandage; therefore, "left foot first" and "right foot first" lead to very different signal patterns. By contrast, for activities involving multiple (three) steps/gait cycles, such as "walk," "walk-curve-left/right," "walk-upstairs," "walk-downstairs," "run," and "shuffle-left/right," we did not subdivide further. Instead, we restricted the number of gait cycles per segment of these activities to three in the protocols and defined the left foot as the start.

Subjects
Twenty subjects without any gait impairments, five female and fifteen male, aged between 23 and 43 (30.5 ± 5.8), participated in the data collection events, among which one subject had knee inflammation and could not perform certain activities. Each subject's participation time is approximately 2 h, including announcement and precautions, questions and answers, equipment wearing and adjusting, software preparation and test-running, acquisition following all protocols, taking breaks, and equipment release.

Privacy Preservation and Data Security
All subjects signed a written informed consent form, and the study was conducted in accordance with the WMA (World Medical Association) Declaration of Helsinki (World Medical Association, 2013). According to the consent form, we kept only the pseudonymized wearable sensor data and did not retain any identifying information about the participants. The shared CSL-SHARE dataset is available in an anonymized form. As mentioned in Section 2.4, we used videos to verify the segmentation and annotation, and all videos were deleted after the post verification to protect privacy.
In addition, the consent form stipulates that the use of the data is limited to non-commercial research purposes and that data users guarantee not to attempt to identify the participating persons. Furthermore, data users guarantee to pass on the data (or data derived from it) only to third parties who are bound by the same rules of use (non-commercial research purposes only, no identification attempts, restricted disclosure). Data users who violate the usage regulations mentioned above bear the legal consequences themselves; the dataset publisher takes no responsibility.

Data Format
We provide our CSL-SHARE dataset in an anonymized form in the following directory structure and file format: the root directory contains 20 sub-directories numbered 1-20, representing the data of the 20 subjects. Each sub-directory contains 34 files. The seventeen .H5 files, named by protocol number, store the raw recorded data of the seventeen protocols in HDF5 format, while the seventeen .CSV files contain the corresponding annotation results.
Each row in the .H5 files is according to the following sensor order: EMG 1, EMG 2, EMG 3, EMG 4, airborne microphone, accelerometer upper X, accelerometer upper Y, accelerometer upper Z, electrogoniometer X, accelerometer lower X, accelerometer lower Y, accelerometer lower Z, electrogoniometer Y, gyroscope upper X, gyroscope upper Y, gyroscope upper Z, gyroscope lower X, gyroscope lower Y, gyroscope lower Z.
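The channel order above can be captured as a simple lookup table; the channel names below are copied from the text, while the helper function is ours for illustration. With such a table, a row of the data matrix in an .H5 file can be addressed by name (e.g., via h5py), although the internal HDF5 dataset layout is not specified here:

```python
# Row order of the 19 channels in each .H5 file, as listed in the
# article (list index = row number in the recorded data matrix).
CHANNELS = [
    "EMG 1", "EMG 2", "EMG 3", "EMG 4",
    "airborne microphone",
    "accelerometer upper X", "accelerometer upper Y", "accelerometer upper Z",
    "electrogoniometer X",
    "accelerometer lower X", "accelerometer lower Y", "accelerometer lower Z",
    "electrogoniometer Y",
    "gyroscope upper X", "gyroscope upper Y", "gyroscope upper Z",
    "gyroscope lower X", "gyroscope lower Y", "gyroscope lower Z",
]

def channel_index(name):
    """Map a channel name to its row index in the .H5 data matrix."""
    return CHANNELS.index(name)
```

Combined with the segment boundaries from the .CSV files, this allows slicing any labeled activity segment out of any named channel.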
There are three sub-directories/sub-datasets with exceptions:
• Sub-directory 2: The 02.CSV and 05.CSV files differ from protocols 2 and 5: their labels are mixed with each other because Subject 2 turned through wrong angles between activities. We were not aware of this during the data collection, and the problem was first discovered through the video in the post verification. However, this mixture affects neither the integrity of the dataset nor the number of times each activity occurs
• Sub-directory 11: Protocol 13 is divided into two parts due to a device communication breakdown
• Sub-directory 16: Not all activities were performed due to the subject's knee inflammation, which is one of the reasons for the slight divergence among the activity occurrences in Table 1 (see Section 2.4 for another reason).

STATISTICAL ANALYSIS
The 22-activity CSL-SHARE dataset contains 11.52 hours of data (of which 6.03 hours have been segmented and annotated) from 20 subjects, 5 female and 15 male. Table 1 gives the number of activity segments, the total effective length over all segments, and the minimal/maximal/mean length of the 22 activities. By analyzing the duration distribution of each activity over all subjects in histograms, we find that each activity's duration over all segments approximately follows a normal distribution. The distributions of the activities "sit" and "stand" deviate slightly, as these activities can last arbitrarily long.

CONCLUSION
We share our in-house collected dataset CSL-SHARE (Cognitive Systems Lab Sensor-based Human Activity REcordings) in this article and introduce its recording procedure and technical details. This 19-channel, 22-activity, 20-subject dataset applies two triaxial accelerometers, two triaxial gyroscopes, four EMG sensors, one biaxial electrogoniometer, and one airborne microphone with sampling rates up to 1,000 Hz and uses a knee bandage as a novel wearable sensor carrier. Six hours of the total 11.52-hour recording are well segmented, annotated, and post-verified. The reliability and applicability of the CSL-SHARE dataset and its preceding pilot data collections can be observed in the literature across various research aspects, such as the HAR research pipeline (Liu and Schultz, 2018), a real-time end-to-end HAR system (Liu and Schultz, 2019), visualized verification of multimodal feature extraction (Barandas et al., 2020), feature dimensionality studies (Hartmann et al., 2020; Hartmann et al., 2021), and human activity modeling, among others.
To the best of our knowledge, CSL-SHARE is the first publicly available dataset recorded with sensors integrated into a knee bandage and one of the most comprehensive HAR datasets with an ample number of sensors, activities, and subjects, as well as complete synchronization, segmentation, and annotation.
Building on the dataset's robustness, we publish the CSL-SHARE dataset as an open sensor-based biosignal dataset for HAR, hoping to contribute research material to researchers in the same or similar fields.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: https://www.uni-bremen.de/en/csl/research/motion-recognition/csl-share-corpus.

ETHICS STATEMENT
Ethical approval was not provided for this study on human participants because the study was conducted in accordance with the WMA (World Medical Association) Declaration of Helsinki. The participants provided their written informed consent to participate in this study. According to the consent form, we kept only the pseudonymized wearable sensor data and did not retain any identifying information about the participants. The dataset is shared in an anonymized form.

AUTHOR CONTRIBUTIONS
HL developed the software Activity Signal Kit (ASK), designed the protocols, organized the recording events, and processed the recorded data afterward. YH helped collect the pilot dataset CSL18 at the collaborating laboratory at the Karlsruhe Institute of Technology and is responsible for disclosing and maintaining the data. HL and YL cooperatively researched human activity modeling, real-time recognition, feature dimensionality, and feature selection based on the published dataset and verified its completeness, correctness, and applicability. TS supervised and supported the work, advised on the manuscript, and provided critical feedback. All authors read and approved the final version.

FUNDING
Open access was supported by the Open Access Initiative of the University of Bremen and the DFG via SuUB Bremen.