Move With the Theremin: Body Posture and Gesture Recognition Using the Theremin in Loose-Garment With Embedded Textile Cables as Antennas

We present a novel intelligent garment design approach for body posture/gesture detection in the form of a loose-fitting blazer prototype, “the MoCaBlazer.” The design is realized by leveraging conductive textile antennas with the capacitive sensing modality, supported by an open-source electronic theremin system (OpenTheremin). The use of soft textile antennas as the sensing element allows flexible garment design and seamless tech-garment integration for the specific structure of different clothes. Our novel approach is evaluated through two experiments involving defined movements (20 arm/torso gestures and eight dance movements). In cross-validation, the classification model yields up to 97.18% average accuracy and 92% f1-score, respectively. We have also explored real-time inference enabled by a radio frequency identification (RFID) synchronization method, yielding an f1-score of 82%. Our approach opens a new paradigm for designing motion-aware smart garments with soft conductive textiles beyond traditional approaches that rely on tight-fitting flexible sensors or rigid motion sensor accessories.


INTRODUCTION
Human activity recognition (HAR) is an umbrella term that gives shelter to various specific applications to understand human behavior. An essential piece of HAR is body postures and gestures (BPG) recognition. The popularity of BPG recognition is well earned due to the ability to describe human activities by a sequence of changing postures or by detecting specific gestures (Ding et al., 2020). BPG detection could lead to the generation of emotion and personality profiles (Noroozi et al., 2018;Junior et al., 2019), to understand implicit social interactions (Gaschler et al., 2012;Guedjou et al., 2016), to aid in sign language communication (Enikeev and Mustafina, 2020), and to predict people's intentions (Sanghvi et al., 2011).
Many wearable sensing applications have found their purpose in BPG recognition, delivering highly developed solutions such as commercial motion capture systems (Schepers et al., 2018). The commercial and research markets for BPG recognition are mainly dominated by wearable techniques based on inertial measurement units (IMU) (Harms et al., 2008;Sarangi et al., 2015;Butt et al., 2019), and on the textile side by stretch or pressure sensors (Chander et al., 2020).
Most of the current solutions for BPG recognition share a baseline requirement: the sensors need to be firmly attached to the body using tight garments or dedicated accessories, such as bracelets and straps. Therefore, we could argue that a reliable method for BPG recognition with loose garments remains a largely open problem. The present study further explores the simple and novel method for BPG detection in Bello et al. (2021), which proposes a loose garment solution based on non-contact capacitive sensing with off-the-shelf components. In addition to our previous study in Bello et al. (2021), a set of dance movements is evaluated to demonstrate the potential use case of our design as a sophisticated/elegant game controller. The inspiration came from the Nintendo Wii game Rayman Raving Rabbids®: TV Party-ShakeTV (Wii, 2009).
The main component of our system is a modified electronic musical instrument, the theremin (Skeldon et al., 1998) for BPG recognition. The well-known musical instrument usually consists of one or two long metal rod/loop antennas emitting sub-MHz frequencies. As the thereminist moves inside the antennas' range, volume and pitch can be controlled by their hand's position. The theremin antennas are metallic, but any conductive wire/textile can be used as an antenna due to its intrinsic capacitive sensing. We substituted the metal rod with soft wires and integrated them inside a loose-fitting garment.
Our experiment design validates our system within two discrete gesture dictionaries: a first dictionary of 20 generic/typical upper-body postures and gestures and a second dictionary with eight dance movements. A distinct aspect of our approach is that the theremin's antennas move with the wearer's body motion, changing the signals. Our main contribution is to expand our wearable and loose-fitting solution for BPG recognition in Bello et al. (2021). The prototype is based on off-the-shelf components, such as a modified electronic musical instrument, the "OpenTheremin" (Gaudenz, 2016), which, in conjunction with textile antennas, is embedded into a loose men's jacket. The system shows accuracy above 90% with an evaluation based on several deep neural network models. In this extended study, the "MoCaBlazer" was fused with Radio Frequency Identification (RFID) synchronization for real-time, wireless recognition for one participant and six classes of a dance movements dictionary, obtaining an f1-score = 82%. Hence, our "MoCaBlazer" could be a promising alternative for an elegant/sophisticated game controller.
Our article is structured as follows. Section 2 introduces related work in the areas of loose-fitting wearables for BPG and capacitive sensing-based solutions. Next, Section 3 provides a detailed description of the electronic prototype, including details about the data collection options. Then, Section 4 describes the experimental design to evaluate our system. Subsequently, Section 5 illustrates the strategy for the evaluation of the system within the two experiment scenarios: a general gestures dictionary and a dance movements dictionary. Next, Sections 6 and 7 present the results and discussion of the deep learning models used to verify the feasibility of our method. Finally, in Section 8, we conclude our study and discuss further ideas.

Loose Fitting Wearables for BPG
Inertial measurement units (IMU) distributed in clothing or accessories for BPG recognition is a widely used technique (Harms et al., 2008;Sarangi et al., 2015;Butt et al., 2019). Another relevant approach for BPG analysis is called kinesiological electromyography (EMG) (Clarys and Cabri, 1993;Zhang et al., 2019). Such approaches are reliable and robust solutions with accuracy above 90%. One of the limitations they share is the need for stable sensor positions to avoid the effect of noise and motion artifacts on the signals. Furthermore, the placement of discrete and rigid sensors around the joints could be uncomfortable for the user. In Loke et al. (2021), the authors employed 100 microchips with memory and temperature sensors interconnected in a flexible fiber on a T-shirt, which is a solution to increase the flexibility and comfort of the user while wearing discrete sensors, a promising idea to explore in the future.
On the other hand, stretchable garments with strain-based or pressure sensing methods have been studied by many researchers (Boyali et al., 2012;Jung et al., 2015;Zhou et al., 2017;Skach et al., 2018;Mokhlespour Esfahani and Nussbaum, 2019;Lin et al., 2020;Ramalingame et al., 2021;Shin et al., 2021), who demonstrate their value in textile-based BPG recognition. Fiber optics embedded in a jacket and pants were proposed in a limited study (one person) (Koyama et al., 2018); the transmitted light changes with the wearer's movements, creating a time series pattern due to the bending of the fiber optics. Wearable optical technology is growing rapidly, with multiple hardware designs being proposed by Koyama et al. (2016), Abro et al. (2018), Koyama et al. (2018), Zeng et al. (2018), Leal-Junior et al. (2020), Swaminathan et al. (2020), and Li et al. (2021). A fabric-based triboelectric sleeve is proposed in Kiaghadi et al. (2018). Wang et al. (2016) placed four Radio Frequency Identification (RFID) tags on the back, chest, and feet, over the person's clothes and shoes, to recognize a total of eight activities (standing, sitting, and walking, among others). The piezoelectric effect was employed in Cha et al. (2018), where four flexible piezoelectric sensors were placed on the knee and the hip in slack pants to detect walking, standing, and sitting activities. Table 1 shows a detailed comparison of state-of-the-art on-garment sensing solutions for activity recognition. At the bottom of the table, our system shows a quick and easy option to integrate e-textile components in loose-fitting garments such as the "MoCaBlazer." The "MoCaBlazer" uses commercial conductive textile parts as the antennas of a modified off-the-shelf theremin (OpenTheremin) based on capacitive sensing.

Capacitive Sensing
Capacitive sensing is a well-developed technology, available in our everyday life since the invention of the first cellphone with a touch screen (Johnson, 1965). In the cellphone touch screen case, capacitive technology estimates touch or deformation caused by fingers. A capacitance measurement quantifies the electric charge storage between two or more conductors, called electrodes. The electrodes are conductive plates that form a chamber, and when they are at different electric potentials (voltages), an electric field is generated. The ratio between the charge (Q) and the differential electric potential is called capacitance. Although the electrodes are usually made of metal, any two plates of conductive material like inks, foils, indium tin oxide (ITO), plastics, textiles, and even the human body can be used to build a capacitor (Grosse-Puppendahl et al., 2017). Capacitance is measured by frequency or duty cycle, which fluctuates when external electrodes disturb the status quo. Another method is by quantifying the charge balance, or with rising or falling time measurements (Perme, 2007). In wearable and ubiquitous computing for HAR, capacitive sensing has extensively proven its importance (Braun et al., 2015b;Ye et al., 2020). The applications extend from capacitive furniture (Wimmer et al., 2007;Braun et al., 2015a,b;Liu et al., 2019), to capacitive wristbands (Cohn et al., 2012a;Pouryazdan et al., 2016;Bian et al., 2019a,b), rings (Wilhelm et al., 2015), clothes (Holleis et al., 2008;Singh et al., 2015), collars (Cheng et al., 2010, 2013), and prostheses (Zheng and Wang, 2016), up to an entire wall painted as a capacitive array for posture and gesture detection.
A textile design was evaluated as an on-body capacitance system by Cheng et al. (2010). The authors validated the technology for eating, head inclination, and arm/leg movements; sensors were placed on the neck, wrist, upper leg, and forearm, though not a loose-fitting solution. In Singh et al. (2015), a flexible textile capacitive matrix was placed on the volunteer's upper leg. The goal was to recognize swipe and hover gestures of paralysis patients. In Cohn et al. (2012b), a capacitive backpack was worn by eight volunteers for posture recognition. The bag works as a receiver of electromagnetic (EM) noise from the power lines and electronic devices inside a room. By measuring the disturbances on the EM field caused by the wearer's movements, the system achieved 93% accuracy for 12 gestures.
The above sensing studies (Cheng et al., 2010;Cohn et al., 2012b;Singh et al., 2015;Zhang et al., 2018) mainly employed tightly coupled or stationary electrodes. In this study, we proposed to use textile theremin antennas in a loose-fitting garment, the "MoCaBlazer" for BPG.

ELECTRONICS AND GARMENT PROTOTYPE
The principal component in our electronic garment prototype is an off-the-shelf electronic musical instrument, "The OpenTheremin V3" (Gaudenz, 2016). The theremin produces musical notes based on the frequency fluctuation of its antennas caused by the proximity of a person's hands. A theremin has two antennas, one for volume (loop antenna) and another for pitch control (rod antenna) (Skeldon et al., 1998). Capacitive sensing is the physical principle governing the behavior of the theremin. The human body could be modeled as a capacitor plate virtually connected to the earth and, in conjunction with the theremin's antennas (second plate), completes a capacitor (Singh et al., 2015). Thus, human proximity changes the effective capacitance of the Clapp LC oscillator in Figure 1D, affecting its frequency. Therefore, we could infer that relative differences between body parts and theremin's antennas could be used to distinguish body postures. In the present study, the pitch and volume antennas were embedded in a tailored garment (men's blazer); thus, the person's body moves with the theremin and "makes music" with different postures and gestures (frequency profiles).
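The dependence of the oscillator frequency on capacitance can be illustrated with the ideal LC resonance formula, f = 1/(2π√(LC)). The sketch below is only a rough illustration: the inductance, tank capacitance, and added body capacitance are assumed values, not taken from the OpenTheremin schematic, and the series capacitor network of a real Clapp oscillator is reduced to a single effective capacitance.

```python
import math

def lc_frequency(inductance_h, capacitance_f):
    """Resonant frequency of an ideal LC tank: f = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))

# Assumed, illustrative values (NOT the OpenTheremin component values).
L_TANK = 1e-3      # 1 mH
C_TANK = 100e-12   # 100 pF effective tank capacitance
C_BODY = 1e-12     # 1 pF added when a body part approaches the antenna

f_rest = lc_frequency(L_TANK, C_TANK)
f_near = lc_frequency(L_TANK, C_TANK + C_BODY)
print(f"rest: {f_rest:.0f} Hz, near: {f_near:.0f} Hz, shift: {f_rest - f_near:.0f} Hz")
```

With these assumed values the oscillator sits in the sub-MHz range, and a small increase in capacitance lowers the frequency, which is the quantity the frequency counter observes.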
To test our approach, we designed a prototype, the "MoCaBlazer," as shown in Figure 1. We employed a Tom Tailor® L/52 size blazer (best suited for 184 cm tall persons). In Figures 1A,C, the positions and patterns of our four antennas are depicted. The antennas cover the chest, a small part of the shoulders, the arms, and the back, as seen in Figure 1A. This setting was appropriate for detecting upper-body postures and gestures without altering the tailored garment's main structure or hindering the wearer's motion.
The back antennas (standard 28 AWG cables) (AMPHENOL, 2015) in Figure 1A (Arm-Left, Arm-Right) start from the side pockets and, following a curving pattern (simulating a volume antenna), pass over the latissimus dorsi muscles toward the deltoids; they then turn sharply to go along the outer sleeve lines and terminate before the cuff buttons. The front antennas (TWC24004B textile cables) (Wear, 2021) in Figure 1C (Collar-Left, Collar-Right) were sewn inside the lining without modifying the structural design of the blazer (refer to Figure 1B). The Collar-Left and Collar-Right antennas were arranged to simulate a theremin's pitch antenna as closely as possible. Thus, they begin at the side pockets and go to the front-top button, then turn to align with the inner crease of the lapels and reach the notch; from there, they lead out of the crease, climb around the shoulder to the back, and end at the middle edge of the shoulder pad. The antennas' lengths are 80 cm (front) and 100 cm (back) for this particular blazer size (L/52).
Two "OpenTheremin" boards were placed inside the side pockets of the "MoCaBlazer" (refer to Figure 1) to handle four channels. The channel frequencies were modified by changing the capacitor (C2) in the Clapp oscillator circuit to minimize cross-talk between them, as depicted in Figure 1D. Then, the channels were sampled (frequency-count; Stoffregen, 2014) at 100 Hz by the Teensy® 4.1 (Stoffregen, 2020). Two options are available for data collection: a wired UART serial link (115,200 baud) and a wireless Bluetooth serial link (9,600 baud). In the case of the wired alternative, the data is received through the serial port (USB) of a computer. The computer runs a Python script with a graphical user interface (GUI) developed using Tkinter (Lundh, 1999), as depicted in the upper element of Figure 1E. For the wireless option, the data of the four channels is sent to a smartphone using the Bluetooth serial protocol of the Huzzah-ESP32 (Fried, 2022), located in the upper pocket. The smartphone runs an Android application developed using the Flutter framework (Napoli, 2019), as shown in the lower element of Figure 1E.

EXPERIMENT DESIGN
Two experiments were conducted with our garment prototype, the "MoCaBlazer." The experiments were carried out in an office without user calibration, i.e., without tuning the antennas' base frequencies to reduce the impact of different body capacitances. Inside the office, there were few metal objects nearby; such objects are known to affect capacitive sensing (Osoinach, 2007). All participants signed an agreement following the policies of the university's committee for the protection of human subjects and in accordance with the Declaration of Helsinki. The experiment was video recorded for further confidential analysis. The observer and participant followed an ethical/hygienic protocol following the mandatory public health guidelines at the date of the experiment. The first experiment scenario was based on a general dictionary of postures and gestures, shown in Figure 2. The second one was inspired by dance movements from Rayman Raving Rabbids: TV Party for the Nintendo Wii®, as depicted in Figure 3.

General Dictionary Experiment
To study the flexibility of our system to adapt to a broad range of gestures, a general dictionary of 20 upper-body postures/gestures was defined (see Figure 2). Fourteen participants mimicked the postures defined in the dictionary in a random sequence per session while wearing the unbuttoned "MoCaBlazer." The "MoCaBlazer" is based on a size L/52 blazer (Tom Tailor®), a size recommended for 184 cm tall persons. All participants performed five sessions. One session consisted of four random appearances of each gesture in the dictionary, giving 400 instances per volunteer over the five sessions. The starting and ending points of a gesture were marked by the null position (standing position). The volunteers' resting period between sessions was at least 20 min (without wearing the blazer). For some volunteers, the experiment was completed over 2 days. The volunteers were seven women, 24-64 years old and 157-183 cm in height, and seven men, 25-34 years old and 178-183 cm in height.

Dance Movements Experiment
As an application-specific experiment, a dance movements dictionary containing the eight postures depicted in Figure 3 was defined. It is essential to highlight that the data transmission from the "MoCaBlazer" for this experiment was wireless. Therefore, the capacitive channels were floating (not connected to the ground). The dance movements were selected from the game Rayman Raving Rabbids: TV Party for the Nintendo Wii® in order to test the feasibility of using the system as a sophisticated game controller. Three volunteers were asked to imitate the eight movements using the buttoned "MoCaBlazer." Three sessions were recorded per volunteer; each session contained five random appearances of each gesture in the dictionary, for a total of 120 instances per participant. The volunteers were asked to rest (without wearing the blazer) for at least 10 min between sessions. The participants were two men and one woman, 26-30 years old and 160-183 cm in height.
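The instance counts of both experiments follow directly from the session design. As a quick consistency check using only the figures stated in the two experiment descriptions (the helper name is hypothetical):

```python
def instances_per_participant(sessions, repeats_per_gesture, gestures):
    """Number of recorded gesture instances for one participant."""
    return sessions * repeats_per_gesture * gestures

# General dictionary: 5 sessions, 4 appearances of each of the 20 gestures.
general_per_person = instances_per_participant(5, 4, 20)
# Dance dictionary: 3 sessions, 5 appearances of each of the 8 movements.
dance_per_person = instances_per_participant(3, 5, 8)

general_total = 14 * general_per_person   # 14 participants
dance_total = 3 * dance_per_person        # 3 participants
print(general_per_person, general_total, dance_per_person, dance_total)
# prints: 400 5600 120 360
```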

EVALUATION
As shown in Figure 1, the Clapp oscillators generated four data channels. The wearer's movements alter the channels' fundamental frequency. The channels' data is processed as a time sequence. The granularity of the evaluation was a complete gesture/instance. An instance was completed when it included a change from the standing position (starting point) and a return to the standing position (ending point). Furthermore, the impact of common and subtle disturbances on the four capacitive channels was reduced by normalizing the gesture/posture. The digital signal processing was slightly different for the two types of experiments. The videos of both experiments were used as ground truth in a manual labeling procedure.

General Dictionary Experiment Evaluation
The fundamental frequencies of the channels could be seen as a bias difference between the four channels. A normalization procedure was performed to remove these biases and reduce the reliance of the capacitive sensing modality on the ground. The normalization consisted of subtracting the average of the gesture's first (starting point) and last (ending point) values. Then, the normalized time sequences of the four channels for each posture/gesture were fed to a fourth-order Butterworth band-pass filter with a pass band between 1 and 10 Hz. The duration of the gestures was not constant, which led to variations in the number of samples per instance. The average duration of a gesture was around 2 s (200 samples at 100 Hz). A window of 4 s (400 samples at 100 Hz) was selected to guarantee the activity's capture. The signals were dilated or contracted depending on whether the gesture contained fewer or more than 400 samples. Due to the dynamic nature of the applied resampling procedure (dilation or contraction), this is called time-warping (Goldenstein and Gomes, 1999). The signals, dynamically resampled (upsampled or downsampled) to 400 time steps, provided a fixed-size input for the neural network.
The time-warping process was based on the Fourier method (Laird et al., 2004) implemented in the SciPy library (Virtanen et al., 2020). The normalization procedure forced each gesture to start and end at approximately the same value. Hence, the Fourier method could be employed without a window function, which is customarily applied to avoid ringing artifacts.
In total, 5,600 gesture instances of the dictionary in Figure 2 (14 participants) were processed.
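The preprocessing chain described above (baseline removal, band-pass filtering, Fourier resampling to 400 time steps) could be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name is hypothetical, zero-phase filtering with `sosfiltfilt` is an assumption, and SciPy's order argument `N=4` is only one possible reading of "fourth-order" (for band-pass designs SciPy doubles the prototype order).

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample

FS = 100          # sampling rate in Hz, as stated in the text
TARGET_LEN = 400  # fixed number of time steps fed to the network

def preprocess_gesture(x, fs=FS, target_len=TARGET_LEN):
    """x: array of shape (n_samples, n_channels) holding one gesture instance."""
    x = np.asarray(x, dtype=float)
    # Remove the per-channel bias: average of the first and last sample.
    x = x - 0.5 * (x[0] + x[-1])
    # 1-10 Hz Butterworth band-pass (order convention is an assumption).
    sos = butter(4, [1.0, 10.0], btype="bandpass", fs=fs, output="sos")
    x = sosfiltfilt(sos, x, axis=0)   # zero-phase filtering is an assumption
    # Fourier-method resampling to a fixed length, without a window function.
    return resample(x, target_len, axis=0)

# Synthetic 2.3 s gesture on four channels, each with a different bias.
t = np.arange(230) / FS
demo = np.stack([np.sin(2 * np.pi * 3 * t) + 5.0 + k for k in range(4)], axis=1)
out = preprocess_gesture(demo)
print(out.shape)  # (400, 4)
```

`scipy.signal.resample` stretches or compresses each channel in the frequency domain, so shorter and longer gestures both end up with the same 400-step length.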

Deep Learning Model
Deep learning models such as 1D-LeNet5 (LeCun et al., 1998;Sornam et al., 2017), DeepConvLSTM (Ordóñez and Roggen, 2016), and Conv2D (Khan et al., 2018;Shiranthika et al., 2020) were evaluated. The best trade-off between performance, parameters, and training time was obtained from a modified 1D-LeNet5 model (refer to Table 2). The modified 1D-LeNet5 was defined as a sequence of convolution (conv)-max pooling (maxpool)-conv-maxpool-conv-fully connected (fc)-fc-softmax layers, with batch normalization (Ioffe and Szegedy, 2015) and dropout (Srivastava et al., 2014) on the convolution layers. Leave-recording out (LRO) and Leave-person out (LPO) schemes were used, as depicted in Figure 4A. The LRO paradigm studies the method's performance for a known group of people, while LPO evaluates the model's performance in the case of unknown persons. Within each run, we iterated over all person permutations or recording combinations and accumulated the results into a single confusion matrix. That means a complete run of LRO has 5 and a complete run of LPO has 14 × 13 train-valid-test cycles. Training ran for up to 500 epochs and was stopped at signs of overfitting. The three convolution layers use a kernel size of 41 and the ReLU activation function. Max pooling reduced the output of the first convolution from (400, 40) to (40, 40) and the output of the second convolution from (40, 40) to (4, 40). The third convolution had an output of size (4, 40) without pooling. A flattening layer of 160 was followed by a fully connected layer of 100. The twenty outputs for the different activities in Figure 2 are then converted into probabilities by a fully connected layer and a softmax function. The categorical cross-entropy loss function and the Adam optimizer (Kingma and Ba, 2017) were used in the optimization of the neural network.
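The cycle counts of the two schemes can be reproduced by enumerating the splits. The sketch below assumes that each LPO cycle holds out one test person and one validation person, which is consistent with the 14 × 13 count; the rule for choosing the validation session in LRO is not stated in the text, so the rotation used here is an assumption, and both function names are hypothetical.

```python
from itertools import permutations

def lpo_splits(persons):
    """Yield (train, valid, test): one test person and one validation person per cycle."""
    for test, valid in permutations(persons, 2):
        train = [p for p in persons if p not in (test, valid)]
        yield train, valid, test

def lro_splits(sessions):
    """Yield (train, valid, test): one cycle per held-out test session."""
    n = len(sessions)
    for i, test in enumerate(sessions):
        valid = sessions[(i + 1) % n]   # assumed choice of validation session
        train = [s for s in sessions if s not in (test, valid)]
        yield train, valid, test

persons = list(range(1, 15))   # 14 participants
sessions = list(range(1, 6))   # 5 sessions per participant
n_lpo = len(list(lpo_splits(persons)))
n_lro = len(list(lro_splits(sessions)))
print(n_lpo, n_lro)  # 182 5
```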

Dance Movements Experiment Evaluation
In this experiment, the time sequences of the four channels were resampled/time-warped to 400 time steps using the same methodology described above in Section 5.1. The signals were normalized between 0 and 1 as $x_{\mathrm{norm}} = \frac{x - \min(X)}{\max(X) - \min(X)}$, where $x$ is one time step, $X$ is a sequence of 400 time steps, and $x_{\mathrm{norm}}$ is the normalized time step.
In total, four deep learning models were generated. Three individual models, one per volunteer, were trained; two sessions from the same person were used for training, and the third session was used for testing. Moreover, a fourth model was developed using two sessions from each of the three participants for training and the third session of each for testing, as shown in Figure 4B. In total, 360 gestures of the dictionary in Figure 3 (three participants) were fed into a one-dimensional convolutional neural network, as shown in Figure 5. The neural network's input layer was a time series of 400 samples for each of the four channels/antennas (400,4,1). This was followed by two convolutional layers, each with a max pooling of 10, batch normalization, and a dropout of 20%. A third convolutional layer was added but without max pooling. Next, a flattening layer of 160 was followed by a fully connected layer of 100. The eight outputs for the different activities in Figure 3 were converted to probabilities using a fully connected layer and a softmax function. The training consisted of 500 epochs for all the models. The optimization of the neural network used the categorical cross-entropy loss function and the stochastic gradient descent (SGD) (Ruder, 2016) optimizer with learning-rate = 0.005 and momentum = 0.001.
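The min-max normalization of one 400-step sequence can be written in a few lines. This is a small sketch: whether the normalization was applied per channel or jointly over all channels is not stated, so per-sequence scaling is assumed here, and the handling of a constant sequence (where the formula is undefined) is likewise an assumption.

```python
import numpy as np

def min_max_normalize(seq):
    """Scale a 1-D sequence X to [0, 1]: (x - min(X)) / (max(X) - min(X))."""
    seq = np.asarray(seq, dtype=float)
    span = seq.max() - seq.min()
    if span == 0:
        # Constant sequence: formula undefined; returning zeros is an assumption.
        return np.zeros_like(seq)
    return (seq - seq.min()) / span

normalized = min_max_normalize([2.0, 4.0, 6.0])
print(normalized.tolist())  # [0.0, 0.5, 1.0]
```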
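The layer sizes quoted above can be cross-checked by tracing the time dimension through the network. The sketch assumes "same" padding in the convolutions, a pooling stride equal to the pool size, and 40 filters per convolution; the filter count is inferred from the flattening size of 160 rather than stated for this network, and the function name is hypothetical.

```python
def trace_shapes(time_steps=400, filters=40, pools=(10, 10, None)):
    """Return the (time, filters) shape after each conv stage with optional pooling."""
    shapes = []
    t = time_steps
    for pool in pools:
        # A 'same'-padded convolution keeps the time length unchanged.
        if pool is not None:
            t //= pool   # max pooling with stride equal to the pool size
        shapes.append((t, filters))
    return shapes

shapes = trace_shapes()
flat = shapes[-1][0] * shapes[-1][1]
print(shapes, flat)  # [(40, 40), (4, 40), (4, 40)] 160
```

Under these assumptions, 400 time steps shrink to 40 and then to 4, and 4 × 40 features give the flattening layer of 160.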

Real-Time Recognition With RFID Synchronization
Following the training and testing paradigms in Figure 4B, a group model was built for the three participants in the dance experiment. The resulting model considers a gesture complete when the person follows the sequence standing-gesture-standing. Thus, this sequence needs to be matched to perform a real-time evaluation. We propose to use Radio Frequency Identification (RFID) as a synchronization technique to signal the starting and ending points of the gesture. RFID synchronization was previously employed in the calibration of atmospheric pressure sensors to estimate the vertical position of the hand in Bello et al. (2019). An RFID system comprises two parts: the reader and the tag. The most commonly used extension of RFID is near field communication (NFC), which is available in most smartphones to make over-the-air payments. In Bello et al. (2019), the reader was on the wrist, and the tag was around the pocket to simulate NFC systems.
FIGURE 5 | Structure of the 1DConv neural network model used for the data of the eight dance movements. Input shape (time-steps, channels, 1) = (400,4,1) and output shape = 8 classes.
It should be noted that there is already an NFC system in our smartphones and that the pocket is a common position to carry a phone. In addition, RFID stickers are nowadays a commonly used solution for tracking merchandise in stores in a ubiquitous and unobtrusive manner. Hence, we propose the setting shown in Figure 6A for the real-time evaluation. The wrist was the selected position for the reader, and the side pocket of the "MoCaBlazer" was the position for the RFID tag (Mifare Classic 13.56 MHz). Figure 6B shows a volunteer wearing the synchronization system. The RFID signal and the four-channel outputs of the "MoCaBlazer" were sent using Bluetooth serial (wireless) to a Python script running the TensorFlow model. The Python script follows the flow diagram in Figure 6C. The real-time evaluation was performed with participant number two from the pool of three participants. The participant was asked to do five repetitions per dance gesture (40 motions).
It is worth mentioning that the real-time recognition with RFID synchronization did not include any pre-training stage with the RFID signal. The model used here was generated from the offline data without RFID. The input data to the offline model was manually labeled with a granularity of 50 fps (recorded video).
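One way to picture the synchronization is a small segmenter that collects samples while the wrist reader is away from the pocket tag and emits the segment once the tag is read again. This is a hypothetical sketch, not the flow diagram of Figure 6C: the class name, the presence-based logic, and the minimum segment length are all assumptions.

```python
class GestureSegmenter:
    """Collect samples while the RFID tag is out of range (assumed gesture phase)."""

    def __init__(self, min_len=20):
        self.min_len = min_len   # assumed minimum length (0.2 s at 100 Hz)
        self.buffer = []

    def update(self, sample, tag_in_range):
        """Feed one 4-channel sample; return a finished segment or None."""
        if not tag_in_range:
            # Hand has left the pocket area: the gesture is in progress.
            self.buffer.append(sample)
            return None
        # Tag is read again: the wearer is back in the standing position.
        segment, self.buffer = self.buffer, []
        return segment if len(segment) >= self.min_len else None

# Simulated stream: standing (tag read), gesture (no tag), standing again.
flags = [True] * 5 + [False] * 30 + [True] * 5
segmenter = GestureSegmenter()
segments = []
for i, flag in enumerate(flags):
    result = segmenter.update((i, i, i, i), flag)
    if result is not None:
        segments.append(result)
print(len(segments), len(segments[0]))  # 1 30
```

Each emitted segment would then be resampled to 400 time steps and passed to the offline-trained model, so the segment boundaries play the role of the manual labels used during training.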

General Dictionary Experiment Results
In Table 2, the results for the three models, 1D-LeNet5, DeepConvLSTM, and Conv2D, are compared. There is no remarkable variation across the models. The confusion matrices using Conv2D for the Leave-recording out (LRO) and Leave-person out (LPO) schemes are depicted in Figure 7. The results confirmed a robust recognition of the dictionary of 20 postures/gestures. The LRO or user-dependent case gave an average accuracy of 95% (refer to Figure 7A). There was a decrease of around 10% for the LPO or user-independent case, shown in Figure 7B. For LPO, we achieved an average accuracy of 86.25%, with nine classes out of the 20 returning above 95% accuracy. Hence, we could conclude that these results are good enough to expect that our model will perform well for the stranger case, i.e., for people not included in its training phase.

Dance Movements Experiment Results
Four models were generated using the neural network structure in Figure 5. The results for the three individual models are shown in the confusion matrices in Figure 8. Figure 8A presents the recognition for the first model, trained (2 sessions) and tested (1 session) with the data from volunteer number one. The first participant obtained the lowest performance, f1-score = 93%. Figure 8B is the result for Leave-recording out (LRO) of the second volunteer, showing an f1-score = 100%. For the third participant, the results are only 5% less than the perfect f1-score. With this performance, our design successfully recognized the dance movements dictionary in Figure 3.
The data partition (train and test) of the fourth model is in Figure 4B, and the result is illustrated in Figure 9A with an f1-score = 92%. The fourth model was tested in real-time in conjunction with RFID synchronization and gave an f1-score = 82% for six classes, as shown in Figure 9B. In the confusion matrix in Figure 9B, the classes 4-5 and the classes 6-7 were merged, which gives a total of six classes. In the case of merged classes 4-5, the fourth class was completely confused, with half of its instances being recognized as class number 1 and the other half as class number 5. Moreover, in the case of the merged classes 6-7, the seventh class was recognized consistently as class number 6. The above indicates that the dance movements 4 and 7 in Figure 3 could not be recognized correctly with the combination of the fourth deep learning model (offline) and the RFID synchronization (online). Despite the negative cases of classes 4 and 7, the real-time recognition with RFID synchronization shows decent performance for the merged classes (six classes in total).
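Merging classes in a confusion matrix amounts to summing the corresponding rows and columns. The sketch below shows the operation on a synthetic 8-class matrix that is not the paper's data; the macro-averaged f1-score is an assumption, since the averaging used for the reported scores is not stated, and both function names are hypothetical.

```python
import numpy as np

def merge_classes(cm, groups):
    """Sum rows (true labels) and columns (predictions) per group of class indices."""
    cm = np.asarray(cm)
    rows = np.stack([cm[g].sum(axis=0) for g in groups])
    return np.stack([rows[:, g].sum(axis=1) for g in groups], axis=1)

def macro_f1(cm):
    """Macro-averaged F1 from a confusion matrix (rows: true, columns: predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)
    recall = tp / np.maximum(cm.sum(axis=1), 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return f1.mean()

# Synthetic example (NOT the paper's data): class 7 is always predicted as class 6.
cm = 5 * np.eye(8, dtype=int)
cm[7] = 0
cm[7, 6] = 5
groups = [[0], [1], [2], [3], [4, 5], [6, 7]]   # merge classes 4-5 and 6-7
merged = merge_classes(cm, groups)
print(merged.shape, round(macro_f1(cm), 3), round(macro_f1(merged), 3))
```

In this toy matrix the confusion between classes 6 and 7 disappears after merging, which illustrates why a score over the six merged classes is higher than one over the original eight.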

General Dictionary Experiment Discussion
To discuss our results, the confusion matrices in Figure 7 and the dictionary of 20 gestures/postures in Figure 2 will be referenced together. In the case of the Leave-recording out results in Figure 7A, the accuracy was above 90% for the 20 classes. On the other hand, in Figure 7B, the result for the Leave-person out scheme is depicted, and we could observe several pairs of false recognition. For the pairs of arms-up (12)/open-arms (10) and forearms-block (9)/frame-picture (19), the arm motions and directions are physically similar. For the case of lean-forward (1)/frame-picture (19), the similarity is seen in the signals in Figure 2; we believe it is a negative effect of participants with different body shapes wearing the same size blazer (L/52), which leads to a misclassification of 11%. Nonetheless, for the forearms-block (9)/hands-on-head (11) pair, with similar signal patterns and elbow flexion, the misclassification is only 5%. It is worth noticing that the activities with shoulder motion, such as shrug (7), forearms-block (9), hands-on-head (11), arms-up (12), and frame-picture (19), show a reduction in accuracy in the Leave-person out (LPO) result compared to the Leave-recording out (LRO) case. The confusion could be due to the lack of antennas covering the shoulders of the "MoCaBlazer" and to the fact that all fourteen volunteers (of different body shapes) were wearing the same one-size blazer.

Dance Movements Experiment Discussion
The result of the individual model of participant number one shows some misclassification (refer to Figure 8A). For the classes/dance movements 1 and 4, 20% of the gestures are confused; these two gestures have in common that the arms move to the same side of the body trunk but at a different height. The same happens for participant number three, as seen in Figure 8C. These two participants are both men, and their heights differ by 8 cm. In the triplet consisting of dance movements 5, 6, and 7, the seventh and fifth gestures were falsely identified as number six for participant number one. For the third participant, movement number seven has 20% of its instances confused with the fifth movement. These gestures include moving both arms in between the legs. A significant difference between the activities is how the legs move (left/right leg in the air, or both feet on the ground with the knee bent) and how the shoulders move. The lack of antennas on the shoulder blades and on the lower part of the body could be the source of the misclassification. For the second participant, an f1-score = 100% was achieved. This volunteer is a woman with a height of 160 cm. The "MoCaBlazer" was looser on the second participant, which gives the blazer more flexibility and could translate into more wrinkles in the garment during the movements.
The fourth model was developed using the LRO scheme depicted in Figure 4B. With this model, two tests were performed: an offline LRO test with the three participants, whose confusion matrix is shown in Figure 9A, and a real-time (online) test with RFID synchronization, whose performance is shown in Figure 9B.
For the first test of the fourth model (offline), the highest recognition error was observed for two pairs of classes, 4/1 and 3/0, with 13% of the instances being wrongly recognized. These two pairs of classes consist of both arms moving from the standing position (starting point) to the right/left, with the main difference being how high the arms reach, including a visually distinctive shoulder movement. As seen in the individual models in Figure 8, classes 5, 6, and 7 are confused with one another, which also occurs in the group model (fourth model), so this was expected. An f1-score = 92% for the recognition of the gestures in the dance movements dictionary makes our system a good solution for a sophisticated and elegant dance game controller.
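As a reading aid for the percentages quoted from the confusion matrices, the following sketch shows how row-normalized confusion rates and an f1-score can be derived from true and predicted labels. It assumes a macro-averaged f1-score, which the text does not specify, and the label lists are made-up toy data rather than results from the experiments.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix: rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def row_normalize(cm):
    """Turn counts into per-class rates (each non-empty row sums to 1)."""
    rates = []
    for row in cm:
        total = sum(row)
        rates.append([c / total if total else 0.0 for c in row])
    return rates

def macro_f1(cm):
    """Unweighted mean of the per-class f1-scores (assumed averaging)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(n)) - tp
        fn = sum(cm[k]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n

# Toy labels for three classes; one instance of class 0 is mistaken for class 1.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2]

cm = confusion_matrix(y_true, y_pred, 3)
rates = row_normalize(cm)   # rates[0][1] is the share of class 0 confused with class 1
score = macro_f1(cm)
```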
The second test, the real-time test with RFID synchronization in Figure 9B, shows perfect recognition for dance movements 0, 1, and 2. This is not the case for movement number 3, with 40% of its instances being confused with the merged class 6-7. The merged class 6-7 can be considered as activity number 6 in Figure 3G, because dance movement number 7 is consistently recognized as dance movement number 6. Therefore, the comparison between dance gesture number 3 in Figure 3D and gesture number 6 in Figure 3G applies. We suspect two reasons for the 40% of wrongly recognized instances. The first could be the slight height difference in the arms' positions, combined with the absence of antennas on the shoulders or around the legs. Second, this confusion is not present in the offline results, which suggests that our solution depends highly on accurate labeling of each gesture's starting and ending points.
The offline results were obtained by labeling the starting and ending points with high accuracy using a 50 fps camera. The RFID labeling of the starting and ending points has an intrinsic error caused by the slight movement of the hand (where the RFID reader is located) needed to get close enough to detect the RFID tag (on the side pocket). In addition, the RFID solution has a granularity of seconds instead of milliseconds (as in the video-based labeling of the offline case).
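The gap between the two labeling resolutions can be made concrete with a small worked example. A 50 fps camera gives one frame every 1/50 s = 20 ms, so a boundary snapped to the nearest frame is off by at most 10 ms. The sketch below assumes, purely for illustration, that RFID events are snapped to the nearest whole second; the boundary times are hypothetical.

```python
def quantize(t_seconds, step_seconds):
    """Snap a timestamp to the nearest multiple of step_seconds."""
    return round(t_seconds / step_seconds) * step_seconds

VIDEO_STEP = 1 / 50   # 50 fps camera from the text: 20 ms per frame
RFID_STEP = 1.0       # assumption: RFID events resolved to whole seconds

# Hypothetical true gesture boundaries in seconds.
true_start, true_end = 3.373, 5.814

video_label = (quantize(true_start, VIDEO_STEP), quantize(true_end, VIDEO_STEP))
rfid_label = (quantize(true_start, RFID_STEP), quantize(true_end, RFID_STEP))

# Worst boundary error of each labeling method for this gesture.
video_err = max(abs(video_label[0] - true_start), abs(video_label[1] - true_end))
rfid_err = max(abs(rfid_label[0] - true_start), abs(rfid_label[1] - true_end))
```

In this toy case the RFID boundary error is a sizeable fraction of a second, so the window handed to the classifier can include part of the standing phase or cut off part of the gesture, on top of the hand-movement error described above.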
The merged class 4-5 has a 33% misclassification with class number 1, and this confusion can also be observed in the offline result of the three-volunteer model. Although the RFID synchronization used to signal a gesture sequence "standing-gesture-standing" is far from perfect in comparison with the offline version (which is accurate to the order of milliseconds), we consider it a promising technique for real-time recognition. One way to improve the RFID fusion results could be to train the model with data synchronized through the RFID in-situ labeling.
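The merged classes 4-5 and 6-7 reduce the eight dance movements to six classes. A minimal sketch of such a relabeling is given below; the mapping onto the lower class index, the point in the pipeline where merging happens, and the toy label lists are assumptions made for illustration only.

```python
# Assumed mapping: each merged class keeps the lower of its two indices.
MERGE = {5: 4, 7: 6}

def merge_labels(labels, mapping=MERGE):
    """Replace every label listed in mapping; leave all other labels unchanged."""
    return [mapping.get(label, label) for label in labels]

# Toy labels for the eight dance movements (0 to 7).
y_true = [0, 1, 2, 3, 4, 5, 6, 7]
y_pred = [0, 1, 2, 6, 1, 4, 6, 6]

merged_true = merge_labels(y_true)
merged_pred = merge_labels(y_pred)

n_classes = len(set(merged_true))
accuracy = sum(t == p for t, p in zip(merged_true, merged_pred)) / len(merged_true)
```

After merging, a prediction of 6 for a true movement 7 counts as correct, while confusions such as 3 with 6-7 or 4-5 with 1 remain visible.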

CONCLUSION AND OUTLOOK
This article has explored a method for posture and body gesture recognition based on a commercially available electronic theremin, the "OpenTheremin," which, together with conductive textile antennas, was embedded in a loose-fitting garment, the "MoCaBlazer." Our solution can be integrated into loose garments quickly and in a fashionable manner. The "MoCaBlazer" was evaluated with fourteen participants (gender-balanced) mimicking a general dictionary of 20 upper-body movements. Additionally, as an application-specific evaluation, a pool of three volunteers mimicked a dictionary of eight dance movements inspired by the Rayman Raving Rabbids: TV Party game for the Nintendo Wii®.
For the 20 gestures dictionary, different deep learning models were selected, such as 1D-LeNet5, DeepConvLSTM, and Conv2D. For the eight dance movements dictionary, a one-dimensional convolutional neural network was selected. In both evaluations, the system offered competitive performance compared to the state of the art in loose garments for BPG detection. The experiment design enforced repeated wearing of the "MoCaBlazer" (per session) to make the results robust against the disturbances of re-wearing.
Our chosen sensing modality, the non-contact capacitive method, has the advantage of being independent of muscular strength/pressure and therefore does not require tight or elastic garments. In addition, it is relatively insensitive to sweat or skin dryness (Zheng and Wang, 2016). A limitation of the capacitive sensing modality is that it is sensitive to conductors, which includes persons/objects in close range with different dielectric properties compared to the antennas (Osoinach, 2007). To avoid the effect of environmental disturbances as much as possible, we normalized our data per gesture window, removing the dependency on absolute values, and built our system upon the relative differences between capacitive channels.
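The effect of per-window normalization can be sketched as follows. The exact normalization is not specified in the text, so this example assumes per-channel min-max scaling within each gesture window; the function name `normalize_window` and the sample values are hypothetical.

```python
def normalize_window(window):
    """Min-max scale each channel of one gesture window to the range [0, 1].

    window: list of channels, each a list of raw capacitive samples.
    A constant channel is mapped to all zeros to avoid division by zero.
    """
    normalized = []
    for channel in window:
        lo, hi = min(channel), max(channel)
        span = hi - lo
        if span == 0:
            normalized.append([0.0] * len(channel))
        else:
            normalized.append([(x - lo) / span for x in channel])
    return normalized

# Hypothetical two-channel gesture window (raw sensor values).
window = [
    [1000.0, 1010.0, 1020.0, 1010.0],
    [500.0, 480.0, 500.0, 520.0],
]
# The same gesture with a constant offset added, e.g. from baseline drift.
drifted = [[x + 250.0 for x in channel] for channel in window]

a = normalize_window(window)
b = normalize_window(drifted)
```

Both windows yield the same normalized signal, which is the sense in which the dependency on absolute values is removed. Other choices, such as z-score scaling or scaling over all channels jointly, would remove the offset in a similar way.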
The "MoCaBlazer" data for the dance gesture experiment were transmitted wirelessly to an Android phone application. With an f1-score = 92% for eight classes with wirelessly collected data, our design demonstrated robustness against drifting capacitive channel values due to floating ground conditions (a typical case in wearables). Moreover, a real-time test with RFID synchronization (wireless, online) was carried out for one volunteer, with an f1-score = 82% for six classes.
Our "MoCaBlazer" evaluation has shown promising results for loose garments as a body posture detection method. Hence, we will continue developing more elaborate garment integration, with miniaturized sensing modules, more channels, stretchable antennas, and different antenna pattern designs. In the future, fusion with other sensors such as IMUs for continuous posture detection will be an exciting field to explore, in addition to real-time system deployment/evaluation at the edge (embedded devices).

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://github.com/HymalaiDFKI/MoveWithTheTheremin.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the University of Kaiserslautern and DFKI. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
HB, BZ, and PL: conceptualization. HB and LS: data curation. HB and BZ: formal analysis, methodology, validation, and visualization. PL: funding acquisition and supervision. HB: investigation, data collection software/hardware, and writing original draft. HB, BZ, and SS: data analysis. HB, BZ, SS, and PL: writing review and editing. All authors have read and agreed to the published version of the manuscript.

FUNDING
This work has been partially supported by BMBF (German Federal Ministry of Education and Research) in the project SocialWear.