<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Comput. Sci.</journal-id>
<journal-title>Frontiers in Computer Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Comput. Sci.</abbrev-journal-title>
<issn pub-type="epub">2624-9898</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fcomp.2022.806027</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Computer Science</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Exploiting the Nature of Repetitive Actions for Their Effective and Efficient Recognition</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Bacharidis</surname> <given-names>Konstantinos</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1422401/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Argyros</surname> <given-names>Antonis</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1102272/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Computer Science Department, University of Crete</institution>, <addr-line>Heraklion</addr-line>, <country>Greece</country></aff>
<aff id="aff2"><sup>2</sup><institution>Institute of Computer Science, Foundation for Research and Technology - Hellas (FORTH)</institution>, <addr-line>Heraklion</addr-line>, <country>Greece</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Juergen Gall, University of Bonn, Germany</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Yazan Abu Farha, University of Bonn, Germany; Sovan Biswas, Intel, Germany</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Konstantinos Bacharidis <email>kbach&#x00040;ics.forth.gr</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Computer Vision, a section of the journal Frontiers in Computer Science</p></fn></author-notes>
<pub-date pub-type="epub">
<day>07</day>
<month>03</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>4</volume>
<elocation-id>806027</elocation-id>
<history>
<date date-type="received">
<day>31</day>
<month>10</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>31</day>
<month>01</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Bacharidis and Argyros.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Bacharidis and Argyros</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license></permissions>
<abstract>
<p>In the field of human action recognition (HAR), the recognition of long-duration actions is hindered by the limited memorization capacity of the standard probabilistic and recurrent neural network (RNN) approaches used for temporal sequence modeling. The simplest remedy is to employ methods that reduce the input sequence length by performing window sampling, pooling, or key-frame extraction. However, due to the nature of the frame selection criteria or the employed pooling operations, the majority of these approaches do not guarantee that the useful, discriminative information is preserved. In this work, we focus on the case of repetitive actions. In such actions, a discriminative, core execution motif is maintained throughout each repetition, with slight variations in execution style and duration. Additionally, scene appearance may change as a consequence of the action. We exploit these two key observations on the nature of repetitive actions to build a compact and efficient representation of long actions by maintaining the discriminative sample information and removing the redundant information that is due to task repetitiveness. We show that by partitioning an input sequence based on repetition and by treating each repetition as a discrete sample, HAR models can achieve an increase of up to 4% in action recognition accuracy. Additionally, we investigate how dataset and action set attributes relate to this strategy and explore the conditions under which the utilization of repetitiveness for input sequence sampling is a useful preprocessing step in HAR. Finally, we suggest deep NN design directions that enable the effective exploitation of the distinctive action-related information found in repetitiveness, and evaluate them with a simple deep architecture that follows these principles.</p></abstract>
<kwd-group>
<kwd>deep learning</kwd>
<kwd>repetition localization</kwd>
<kwd>video understanding</kwd>
<kwd>human activity recognition (HAR)</kwd>
<kwd>action recognition</kwd>
</kwd-group>
<counts>
<fig-count count="8"/>
<table-count count="6"/>
<equation-count count="2"/>
<ref-count count="33"/>
<page-count count="13"/>
<word-count count="9225"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Human activity analysis and understanding in videos is an important task in the field of computer vision. Its applications range from Human-Robot Collaboration (HRC) and assistive technologies for daily living to surveillance and entertainment. The importance of these tasks is accompanied by significant challenges, due to the high dimensionality of video data and to changes in appearance characteristics caused by scene/context and viewpoint variations (Herath et al., <xref ref-type="bibr" rid="B14">2017</xref>). These challenges become severe when discriminating among different <italic>fine-grained</italic> actions in action sets with high intra- and low inter-class appearance and motion similarities (Aggarwal and Ryoo, <xref ref-type="bibr" rid="B2">2011</xref>).</p>
<p>One particular challenge that becomes more evident as we proceed to model complex and/or fine-grained activities consisting of multiple actions or action steps is the temporal execution extent of the action. Action execution duration varies for each human. The simpler the action, the more temporally constrained its duration, and thus the temporal extents of the short- and long-term information that a model needs to assimilate. As the complexity increases, so do the duration and the execution variations. A robust action recognition model has to be able to model both short- and long-term appearance/motion information of the action execution (Aggarwal and Ryoo, <xref ref-type="bibr" rid="B2">2011</xref>; Kang and Wildes, <xref ref-type="bibr" rid="B15">2016</xref>).</p>
<p>Robust short-term modeling has been achieved in the last decades with elaborate hand-engineered and, more recently, deep learning-based feature descriptors. Long-term modeling is still an issue in the deep learning era, since the generation of robust representations through the hierarchical correlation of deep features does not scale well as the duration of an action increases (Zhu et al., <xref ref-type="bibr" rid="B33">2020</xref>). This has an additional impact on the computational cost of both training and inference. Therefore, it is important to investigate strategies that can provide a compact and discriminative representation of a long input sequence demonstrating an action, either by selecting the most informative action-specific temporal segments of the sequence or by leveraging cost-efficient and easy-to-compute aggregation of information along the action duration. Existing approaches revolve around sparse sampling, clip cropping, or segment-wise processing and aggregation of the input sequence, favoring either short- or long-term dependencies for the sake of computational efficiency.</p>
<p>One action/activity category that does not benefit from sparse sampling or clip cropping is <italic>repetitive actions</italic>. This is because, for these actions, such approaches lead to temporal ordering disruption and/or redundant information processing. Repetitive actions are quite common, especially in daily living (e.g., cooking, physical exercise, etc.), with the core execution task being repeated with slight variations. Due to their nature, these actions contain redundant information regarding coarser appearance and motion characteristics. Moreover, in such actions, we can pinpoint the gradual effect of the repetitive task on the scene, if any. As a gradual effect, we define any change that the repetitive action causes on the appearance state of the object(s) in use, of the scene, or of the actor. The objective of this article is to explore and exploit the nature of repetitive tasks as a way to (a) reduce sequence length by removing the repetitive executions, resulting in a more distinct and compact representation of the action pattern, and (b) highlight the gradual effects of the repetitive action on the scene and objects, allowing the model to consider them as an action-related attribute during learning. To the best of our knowledge, this work is the first to study the characteristics of repetitive actions within the scope of HAR, and to propose a first pipeline that enables the effective distillation and exploitation of such information.</p></sec>
<sec id="s2">
<title>2. Related Work</title>
<p><bold>Input sequence sampling</bold>: HAR methodologies use two main strategies to perform sequence sampling: (a) randomly cropping a clip from the sequence and (b) uniformly splitting the sequence into snippets and sampling a key-frame from each, either raw or after applying some pooling or temporal ranking operation. These techniques are applied in both hand-crafted and deep learning HAR approaches. Our analysis focuses on deep learning HAR due to its prevalence in the field.</p>
<p>Regardless of whether they perform random cropping or uniform splitting, existing deep learning approaches usually end up with sampled sequences of 16, 32, or 64 frames. Random cropping has strong short-term but weak long-term information content, whereas uniform splitting has the opposite profile. In both cases, due to the partial observation of the action under these two sampling schemes, researchers have designed models capable of highlighting and exploiting dependencies between sparse input sequences. This is achieved either with two-stream CNN models using appearance (RGB) and motion (optical flow) inputs (Simonyan and Zisserman, <xref ref-type="bibr" rid="B25">2014</xref>; Feichtenhofer et al., <xref ref-type="bibr" rid="B12">2016</xref>), sometimes combined with memorization RNN cells to increase the long-range modeling capabilities (Donahue et al., <xref ref-type="bibr" rid="B10">2015</xref>; Varol et al., <xref ref-type="bibr" rid="B27">2017</xref>), or with the use of 3D convolutions, along with pooling operations, to directly learn spatio-temporal representations (Tran et al., <xref ref-type="bibr" rid="B26">2015</xref>; Carreira and Zisserman, <xref ref-type="bibr" rid="B6">2017</xref>). In complex action or activity cases, both short-term and long-term dependencies are important. Thus, increasing the portion of the input sequence to be processed becomes a necessity. Recent methods apply temporal pooling or deep encoding on snippets of the sequence to encode the short-term dependencies, and use these encoded segments as the input sequence components.
Large-scale recognition is then performed in two ways: either by applying consensus criteria to per-snippet action estimates derived from each short-term temporal encoding (Wang et al., <xref ref-type="bibr" rid="B29">2016</xref>), or by working with the short-term, snippet-driven feature maps and applying to them temporal convolutional operations (Bai et al., <xref ref-type="bibr" rid="B4">2018</xref>; Zhang et al., <xref ref-type="bibr" rid="B31">2020</xref>) or generic non-local operations (Wang et al., <xref ref-type="bibr" rid="B30">2018</xref>).</p>
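<p>To make the snippet-based strategy concrete, the following sketch shows uniform splitting into snippets with center key-frames, followed by a simple averaging consensus over per-snippet class scores, in the spirit of the consensus criteria of Wang et al. (2016). The frame counts and logits are hypothetical toy values, not the output of any cited model.</p>

```python
import numpy as np

def uniform_snippet_indices(num_frames, num_snippets):
    """Split [0, num_frames) into equal snippets; take each snippet's center frame."""
    bounds = np.linspace(0, num_frames, num_snippets + 1)
    return ((bounds[:-1] + bounds[1:]) / 2).astype(int)

def segmental_consensus(per_snippet_logits):
    """Average per-snippet class scores into a single video-level prediction."""
    return np.mean(per_snippet_logits, axis=0)

# Toy example: a 100-frame video, 4 snippets, 3 action classes.
idx = uniform_snippet_indices(100, 4)          # center frames of the 4 windows
logits = np.array([[0.1, 0.7, 0.2],            # hypothetical per-snippet scores
                   [0.2, 0.6, 0.2],
                   [0.1, 0.8, 0.1],
                   [0.0, 0.9, 0.1]])
video_scores = segmental_consensus(logits)
```

<p>The averaging step is what makes the scheme robust to a single uninformative snippet, at the cost of blurring temporal order across snippets.</p>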
<p><bold>Periodicity estimation and repetition counting</bold>: Repetition localization is achieved <italic>via</italic> robust periodicity estimation in time series. For video sequences, periodicity detection is performed by examining spatio-temporal feature correlations in a self-similarity assessment fashion, with the most successful strategy being to create a Temporal Self-similarity Matrix (TSM) using hand-crafted, motion-related (Panagiotakis et al., <xref ref-type="bibr" rid="B21">2018</xref>) or deep learning-based (Karvounas et al., <xref ref-type="bibr" rid="B16">2019</xref>; Dwibedi et al., <xref ref-type="bibr" rid="B11">2020</xref>) frame-wise representations. The identification of periodicity in a sequence allows us to count the repetitions of the task using period length predictions. Existing works on repetition counting (Levy and Wolf, <xref ref-type="bibr" rid="B18">2015</xref>; Runia et al., <xref ref-type="bibr" rid="B23">2018</xref>; Dwibedi et al., <xref ref-type="bibr" rid="B11">2020</xref>) formulate the problem as a multi-class classification task, with each class corresponding to a different period length. Repetition counting is then performed by evaluating the entropy of the per-frame period length predictions (Levy and Wolf, <xref ref-type="bibr" rid="B18">2015</xref>) as well as the per-frame periodicity predictions (Dwibedi et al., <xref ref-type="bibr" rid="B11">2020</xref>). We build upon the work of Dwibedi et al. (<xref ref-type="bibr" rid="B11">2020</xref>) and utilize the counting process to localize and segment each repetition sequence.</p></sec>
<sec id="s3">
<title>3. Repetitiveness in Action Recognition</title>
<p>In repetitive actions, each repetition sequence preserves the core action motif, while deviations mainly concern the execution tempo and the action's impact on the scene and objects. This means that we can get a better understanding of the core pattern of the action and of the action's effects on the surrounding environment by exploring each repetition sequence separately.</p>
<sec>
<title>3.1. Characteristics of Repetitive Actions</title>
<p>We pinpoint three characteristics of repetitive actions that are important when trying to exploit action repetitiveness for HAR. These are (a) the <italic>number of repetitions</italic>, (b) the <italic>variability of the repetitions</italic>, and (c) the <italic>presence/absence of action-imposed gradual effects on the surrounding scene</italic>.</p>
<p><bold>Number of repetitions:</bold> It is likely that as the number of repetitions increases, the information redundancy in the repetitive segments also increases. Thus, in the case of few repetitions, it is more likely that the entire sequence needs to be modeled. In the limit, a single, non-repetitive action requires full modeling.</p>
<p><bold>Variability of repetitions:</bold> The number of repetitions is a simple indicator of information redundancy, which, nevertheless, needs to be accompanied by a measure of the variability of the repetitive segments. Repetitions are completely redundant if they are identical, independently of their number. The larger the variability among different repetitions, the more information content they offer and the greater the need to model them.</p>
<p><bold>Gradual effects:</bold> Repetitive actions may (or may not) have an effect on the actor and/or the surrounding scene. For example, actions such as <italic>clapping</italic> do not impact the surrounding scene. Such repetitive tasks may still exhibit variability (as explained above) due to, e.g., tempo changes, differences in execution style, etc. On the other hand, the action of <italic>slicing a fruit/vegetable</italic> has an additional gradual effect on the element/object it is applied to. Importantly, the nature of these gradual effects is quite characteristic of the action and may have strong discriminative power, especially for the disambiguation of actions that share similar motion, such as <italic>cutting in slices</italic> and <italic>cutting in cubes</italic>. Based on this, to exploit repetitiveness, we need to consider (a) the <italic>definition of the core execution motif of the activity</italic> and (b) its <italic>gradual effects</italic>, i.e., the impact on the surrounding space.</p></sec>
<sec>
<title>3.2. Highlighting Action-Related Effects With Repetitiveness</title>
<p>The key advantage of sequence splitting based on repetitiveness is that it allows us to decouple and highlight the gradual variations that may occur in the surrounding scene due to the effect of the performed action. In addition, it makes it easier to detect slight variations in the way the action is performed, such as tempo changes. As we previously mentioned, this information can be important for actions with similar appearance/motion characteristics (low inter-class variance) which, however, differ in the gradual changes that the action imposes on the scene or on the object-of-interest. For such cases, our proposal is to work with the repetition sequences with the goal of highlighting the gradual action effects. We can consider the core execution pattern as the short-term action dynamics and the gradual changes to the object-of-interest or scene as the long-range ones. Ideally, we would like an HAR model to be able to access both information sources at the same time without any information loss; however, this is not feasible due to hardware limitations and model footprint (Zhu et al., <xref ref-type="bibr" rid="B33">2020</xref>). As a solution, we propose to restrict the repetition sequence lengths by employing sequence summarization or temporal encoding and rank pooling methods, such as Dynamic Images (DIs) (Fernando et al., <xref ref-type="bibr" rid="B13">2015</xref>; Bilen et al., <xref ref-type="bibr" rid="B5">2017</xref>), Motion History Images (MHIs) (Ahad et al., <xref ref-type="bibr" rid="B3">2012</xref>), or a deep encoder network (Wang et al., <xref ref-type="bibr" rid="B29">2016</xref>, <xref ref-type="bibr" rid="B28">2021</xref>). The resulting embedding encodes, in a compact way, the temporal action dynamics as well as the action's impact on the affected scene elements. For example, in the <italic>slicing onion</italic> case, DIs or MHIs capture the effect of the action on the onion, as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
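<p>As an illustration of such temporal encodings, the following sketch computes an approximate Dynamic Image using the closed-form per-frame weighting of approximate rank pooling; this is the commonly used approximation rather than the full learning-to-rank formulation, and the clip below is a random hypothetical tensor rather than real video data.</p>

```python
import numpy as np

def dynamic_image(frames):
    """Approximate rank pooling: collapse a (T, H, W, C) clip into a single
    'dynamic image' via a fixed per-frame weighting (closed-form
    approximation in the style of Bilen et al.)."""
    T = len(frames)
    # Harmonic numbers H_0..H_T, with H_0 = 0.
    harmonic = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, T + 1))])
    t = np.arange(1, T + 1)
    # alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}); the weights sum to zero.
    alpha = 2 * (T - t + 1) - (T + 1) * (harmonic[T] - harmonic[t - 1])
    return np.tensordot(alpha, np.asarray(frames, dtype=float), axes=1)

clip = np.random.rand(8, 4, 4, 3)   # hypothetical 8-frame RGB clip
di = dynamic_image(clip)            # one (4, 4, 3) summary image
```

<p>Because the weights sum to zero, with early frames tending toward negative weights and later frames toward positive ones, a static clip cancels out while appearance changes over time (e.g., the progressively sliced onion) are emphasized.</p>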
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Repetitive action effect encoding using Dynamic Images (DIs) and Motion History Images (MHIs). <bold>(A)</bold> Keyframe from the first repetition, <bold>(B,C)</bold> DIs of 1st and 2nd repetition segments, and <bold>(D,E)</bold> MHIs of the 1st and 2nd repetition segments.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-04-806027-g0001.tif"/>
</fig>
<p>Under the above considerations, for actions with <italic>no action-invoked gradual effects</italic>, we expect that a repetition-based feature encoding would result in representations that are mapped tightly/closely in the feature space. On the other hand, for actions with <italic>action-invoked gradual effects</italic>, we expect the mappings to be sparser. To verify this hypothesis, we select two videos from the action classes (a) <italic>running on treadmill</italic> (no gradual effects) and (b) <italic>slicing onion</italic> (with gradual effects). We use a simple pre-trained I3D to generate the repetition segment-based temporal encodings, resulting in a 1 &#x000D7; 2,048 feature vector per repetition segment. To visualize these representations, we applied Principal Component Analysis (PCA) (Pearson, <xref ref-type="bibr" rid="B22">1901</xref>; Abdi and Williams, <xref ref-type="bibr" rid="B1">2010</xref>). <xref ref-type="fig" rid="F2">Figure 2</xref> illustrates the feature space defined by the first three principal components. We also provide the corresponding DIs purely for visualization purposes. As can be seen, the illustrated mappings confirm the presence of information redundancy for actions with no or subtle gradual effects on the surrounding space, as well as the presence of discriminative elements among the repetition segments of actions with gradual effects.</p>
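<p>The visualization step can be sketched as follows. The per-repetition 2,048-D encodings below are random stand-ins for the I3D features, and the PCA projection is implemented directly via an SVD of the centered data matrix.</p>

```python
import numpy as np

def pca_project(features, n_components=3):
    """Project row-wise feature vectors onto their first principal components."""
    X = features - features.mean(axis=0)            # center the data
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T                  # (n_samples, n_components)

# Hypothetical per-repetition I3D encodings: 6 repetitions x 2048-D each.
rng = np.random.default_rng(0)
reps = rng.normal(size=(6, 2048))
coords = pca_project(reps)                          # 3-D points for plotting
```

<p>Tight clusters of the resulting 3-D points would indicate redundant repetitions, while spread-out points would suggest action-invoked gradual effects.</p>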
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Example of potential information redundancy due to high inter-repetition segment similarity (blue case) and presence of repetition segment-wise discriminative information due to potential action-invoked effects on the surrounding environment (orange case). Each point corresponds to a repetition segment of the specific video sample.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-04-806027-g0002.tif"/>
</fig></sec></sec>
<sec id="s4">
<title>4. RepDI-Net: A Deep Architecture to Exploit Action Repetitiveness</title>
<p>In order to effectively and simultaneously model the core execution pattern and the gradual scene changes due to repetitiveness, we propose an HAR deep neural network architecture, dubbed <italic>RepDI-Net</italic>. The proposed architecture comprises two modules. The first is a data pre-processing module, whose goal is threefold: (a) to identify a reference execution of the action, (b) to estimate a set of coefficients that express the underlying similarity between the repetition segments, and (c) to generate a sequence consisting of temporal encodings of the repetition segments [for this, our model exploits the temporal rank-pooling approach of <italic>Dynamic Images</italic> (DIs) (Fernando et al., <xref ref-type="bibr" rid="B13">2015</xref>)]. The second module is a two-branch spatio-temporal sequence modeling DNN that utilizes the aforementioned information streams to perform the action classification task. An overview of the architecture is shown in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p><italic>RepDI-Net</italic>: a deep HAR architecture to exploit the repetitive action segments. <bold>(A)</bold> Input sequence preprocessing module and <bold>(B)</bold> Spatio-temporal sequence modeling and classification module. Regarding the temporal encoding/ranking examples, each Dynamic Image (DI) and Motion History Image (MHI) corresponds to a single repetition.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-04-806027-g0003.tif"/>
</fig>
<p>In this section, we examine the key elements of the data preprocessing module of <italic>RepDI-Net</italic>, highlighting their role and importance in the effective modeling of a repetitive action in the second module of the proposed architecture. Details regarding the specifications for the spatio-temporal sequence modeling DNN module can be found in section Experiments.</p>
<sec>
<title>4.1. Reference RGB Execution Sequence</title>
<p>The input to the first sub-net of <italic>RepDI-Net</italic>&#x00027;s spatio-temporal sequence modeling module consists of the core temporal appearance information of the action, i.e., the raw RGB sequence of a reference execution of the action (this can be the first or any other repetition). The use of appearance information is important for distinguishing actions, as it provides scene-centered, texture-related information. Any execution of the task can be used as the reference appearance input source. However, we would like the reference execution sequence to be (a) free from background clutter or occlusions and (b) as similar as possible to the rest of the executions (repetitions). To resolve this, we construct a <italic>Temporal Self-similarity Matrix</italic> (TSM), which comprises the pairwise Euclidean distances of the repetition DIs. We define as the reference execution instance the one that, on average, is most similar to all the others. Thus, we select the one that has the minimum average Euclidean distance to the rest of the instances/repetitions.</p></sec>
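<p>The reference selection step above can be sketched as follows; the toy encodings stand in for the per-repetition DIs, flattened to vectors.</p>

```python
import numpy as np

def select_reference(rep_encodings):
    """Pick the repetition whose encoding has the minimum average Euclidean
    distance to all other repetitions (TSM row-mean criterion)."""
    X = np.asarray(rep_encodings, dtype=float)
    # Pairwise Euclidean distances = the Temporal Self-similarity Matrix.
    tsm = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = len(X)
    # Exclude the zero self-distance from each row's average.
    row_means = tsm.sum(axis=1) / (n - 1)
    return int(np.argmin(row_means))

# Toy DI encodings: the middle one lies between the other two, so it wins.
encodings = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
ref_idx = select_reference(encodings)   # -> 1
```
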
<sec>
<title>4.2. Distilling Information From Repetitions</title>
<p>The input of the second sub-net consists of the temporal encodings of the repetition sequences. As hinted earlier, in our experiments we use DIs for this task, due to the richer encoding attributes they possess compared to MHIs (Bilen et al., <xref ref-type="bibr" rid="B5">2017</xref>). However, any temporal encoding or ranking approach can be employed (Wang et al., <xref ref-type="bibr" rid="B29">2016</xref>; Cherian et al., <xref ref-type="bibr" rid="B7">2017</xref>; Diba et al., <xref ref-type="bibr" rid="B9">2017</xref>; Lin et al., <xref ref-type="bibr" rid="B20">2018</xref>). The role of the second sub-net is to distill the characteristic information present in the repetition sequences from the redundant information (the almost identical appearance and motion features shared between repetitions). To teach the model when it is useful to focus on this aspect, we include two factors as additional inputs to the second sub-net, which act as feature scaling components in the last Fully-Connected (FC) layer, <italic>f</italic><sub><italic>DI</italic></sub>, of the repetition sub-net:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>D</mml:mi><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>D</mml:mi><mml:mi>I</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>In Equation (1), <inline-formula><mml:math id="M2"><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>D</mml:mi><mml:mi>I</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>D</mml:mi><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x000D7;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, where <italic>d</italic> is the dimensionality of the output of the FC layer. Equation (1) takes into account two of the characteristics of repetitive actions identified in section Repetitiveness in Action Recognition, namely (a) the <italic>number of repetitions</italic> or <italic>repetition count</italic>, <italic>N</italic><sub><italic>rep</italic></sub>, and (b) a measure of the <italic>variability of repetitions</italic> expressed by the mean repetition similarity &#x003BC;<sub><italic>sim</italic></sub>.</p>
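<p>The scaling of Equation (1) is a one-line operation; the sketch below applies it to a hypothetical feature vector, with the scalar inputs standing in for the quantities estimated by the pre-processing module.</p>

```python
import numpy as np

def scale_repetition_features(f_di, mu_sim, n_rep):
    """Equation (1): scale the repetition sub-net's last FC-layer output
    by the factor (1 - mu_sim / N_rep)."""
    return (1.0 - mu_sim / n_rep) * np.asarray(f_di, dtype=float)

f = np.ones(4)                                              # hypothetical f_DI
scaled = scale_repetition_features(f, mu_sim=0.8, n_rep=4)  # factor 0.8
```
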
<p>For the estimation of the number of repetitions, we exploit the repetition count estimate of <italic>Rep-Net</italic>, by Dwibedi et al. (<xref ref-type="bibr" rid="B11">2020</xref>). Regarding the localization of the repetition segments, the boundaries are defined at the frame indices at which a change in the repetition count occurs, i.e., when <italic>Rep-Net</italic> detects another repeated instance of the action.</p>
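<p>Given a per-frame cumulative count signal (here a hypothetical hand-written sequence, not actual Rep-Net output), the boundary localization reduces to detecting the indices where the count increases:</p>

```python
def repetition_boundaries(per_frame_counts):
    """Place segment boundaries at the frame indices where the cumulative
    per-frame repetition count increases."""
    return [i for i in range(1, len(per_frame_counts))
            if per_frame_counts[i] > per_frame_counts[i - 1]]

# Hypothetical per-frame counts: 3 repetitions over 12 frames.
counts = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3]
bounds = repetition_boundaries(counts)   # -> [3, 7, 10]
```

<p>The resulting boundaries partition the video into per-repetition segments, e.g., frames [0, 3), [3, 7), [7, 10), and [10, 12) above.</p>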
<p>As for the computation of &#x003BC;<sub><italic>sim</italic></sub>, this is performed by transforming the TSM to an affinity matrix, <italic>Sim</italic>. Each cell (<italic>i, j</italic>) of <italic>Sim</italic> expresses the similarity of the encoded repetitions <italic>i</italic>, <italic>j</italic> as:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>S</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:msup><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>E</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>E</italic>(<italic>i, j</italic>) is the Euclidean distance between the representations of the encoded repetitions <italic>i, j</italic>. The row-wise mean &#x003BC;<sub><italic>i</italic></sub> expresses the mean similarity of the <italic>i</italic>th repetition to the rest. The mean similarity between all repetitions, &#x003BC;<sub><italic>sim</italic></sub>, is computed as the mean of the &#x003BC;<sub><italic>i</italic></sub> values.</p>
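<p>Equation (2) and the &#x003BC;<sub><italic>sim</italic></sub> statistic can be sketched as follows. The toy encodings are hypothetical, and the row-wise means here include the self-similarity diagonal, a simplifying assumption on a detail the text leaves open.</p>

```python
import numpy as np

def mean_repetition_similarity(rep_encodings):
    """Equation (2): Sim(i, j) = exp(-E(i, j)^2 / 2) from pairwise Euclidean
    distances E; mu_sim is the mean of the row-wise means of Sim."""
    X = np.asarray(rep_encodings, dtype=float)
    E = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sim = np.exp(-0.5 * E ** 2)          # affinity matrix Sim
    mu_i = sim.mean(axis=1)              # mean similarity of repetition i
    return float(mu_i.mean())            # mu_sim

# Identical repetitions yield maximal similarity (mu_sim = 1).
mu_sim = mean_repetition_similarity([[1.0, 2.0], [1.0, 2.0]])
```
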
<p>Intuitively, the number of repetitions <italic>N</italic><sub><italic>rep</italic></sub> highlights the potential presence of information redundancy due to several instances of the same repetitive segment. The mean repetition similarity &#x003BC;<sub><italic>sim</italic></sub> solidifies this by exploring the inter-repetition segment similarities. A high number of repetitions with high inter-repetition segment similarity indicates that no additional information gains can be obtained by modeling the entire repetition segment set. In this case, we are dealing with repetitive actions that have minimal or no impact on the scene, such as jumping jacks or clapping, in which the repetitions bear little or no additional information about the action compared to the initial execution. This is manifested by the high similarity among the encoded repetition segments. In such cases, we do not need to pay attention to the features produced by the second sub-net and should instead shift our interest to the spatio-temporal features that are generated from the RGB sequence of the initial repetition.</p>
<p>On the contrary, a low inter-repetition segment similarity indicates the potential presence of action-invoked effects on the surrounding scene and objects, as in the actions of wood chopping or onion slicing, for which the effect of the repetitive task (e.g., on the wood plank or on the onion) can be considered a highly discriminative element. In such cases, our model should consider the features produced by the second sub-net, which is responsible for modeling the inter-repetition segment (long-term) differences.</p></sec></sec>
<sec id="s5">
<title>5. Experiments</title>
<p>The performed experiments aim to evaluate (a) the effect of utilizing repetitions as a means to augment the data bank of an HAR model and constrain the input sequence, (b) the contribution of repetitiveness-based sequence splitting in datasets of repetitive actions with a variety of characteristics, and, finally, (c) the accuracy improvement due to the exploitation of the information regarding gradual scene changes caused by repetitiveness. To account for hardware limitations, we sample each sequence using two widely employed window sampling approaches, (a) <italic>window-based uniform sampling</italic> (WS) and (b) <italic>random clip crop sampling</italic> (RCC). In WS, we select the center frame of each window as its key-frame. For both input sampling schemes, we only utilize the RGB frames, without any embedding generation stage, and consider sequence lengths of {10, 25, 35, 64} frames for the generated input sequence.</p>
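The two sampling schemes can be sketched as follows (a minimal Python illustration with hypothetical function names; the treatment of sequences shorter than the target length is our assumption):

```python
import numpy as np

def window_sample(frames, length):
    """Window-based uniform sampling (WS): split the sequence into `length`
    equal windows and keep each window's center frame as its key-frame."""
    n = len(frames)
    edges = np.linspace(0, n, length + 1)
    centers = ((edges[:-1] + edges[1:]) / 2).astype(int)  # center frame per window
    return [frames[i] for i in centers]

def random_clip_crop(frames, length, rng=np.random):
    """Random clip crop sampling (RCC): take a contiguous clip of `length`
    frames starting at a random position in the sequence."""
    n = len(frames)
    start = rng.randint(0, max(1, n - length + 1))
    return frames[start:start + length]
```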
<sec>
<title>5.1. Datasets</title>
<p>The contribution of the proposed methodology is expected to be more evident in datasets with actions involving repetitive tasks. However, the amount of repetitive actions in the existing datasets varies depending on the complexity and action topic. We examine datasets that consist of (a) repetitive activities only, with a relatively high repetition number, and (b) a small percentage of repetitive actions with a low repetition number. Both manifest in unconstrained recording conditions.</p>
<p><bold><italic>Countix</italic></bold> (Dwibedi et al., <xref ref-type="bibr" rid="B11">2020</xref>): This is the largest repetitive actions dataset, consisting of in-the-wild videos with challenging recording conditions, such as camera motion, diverse periodicity and repetition ranges (<italic>min</italic> 2 and <italic>max</italic> 73 repetitions), and an average of approximately 7 repetitions per video. In our work, we generated two subsets of <italic>Countix</italic>, with the goal of evaluating (a) the contribution of repetition segmentation as a pre-processing module in HAR and (b) the effect of repetition count and the repetitive action characteristics (presence/absence of gradual effects), on the performance of the proposed HAR model, <italic>RepDI-Net</italic>.</p>
<list list-type="bullet">
<list-item><p><italic>CountixHAR</italic> was generated with a strict repetition count margin (actions with a minimum of 2 and a maximum of 10 repetitions) under the constraint that each action class included in the dataset should contain at least 5 samples, in order to ensure that a sufficient number of training data will be available for each action class. The resulting <italic>CountixHAR</italic> set comprises 28 action classes and consists of 718 training and 262 test videos<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref>. The performed data-pool augmentation (i.e., the consideration of each repetition as a discrete sample) increases the number of training samples to 3,284, indicating that we get on average 4.6 repetitions per sample. This dataset is used to evaluate the contribution of repetition-centered input sequence segmentation in HAR. <xref ref-type="fig" rid="F4">Figure 4B</xref> presents the change in the sequence length distribution of the samples (train &#x00026; test) due to the repetition-based splitting.</p></list-item>
<list-item><p><italic>CountixEffects</italic> was generated to evaluate (a) the impact of the number of repetitions and (b) the contribution of repetition-based segmentation in repetitive actions that impose gradual effects on the environment. <italic>CountixEffects</italic> expands the repetition count range to actions that exhibit up to 20 repetitions, and consists of 5 action classes. Two of them (<italic>sawing wood, slicing onion</italic>) produce gradual effects. The remaining 3 actions (<italic>headbanging, doing aerobics, running on treadmill</italic>) do not produce gradual effects. This specific subset of <italic>Countix</italic> action classes was carefully selected so that (a) each action class contains samples for the majority of repetition counts (above 60%) and (b) there is at least one sample per repetition count. These conditions allowed the generation of a dataset that is fairly balanced with respect to the repetition count. It is noted that the original <italic>Countix</italic> dataset does not possess this characteristic, as it exhibits a right-skewed sample-per-repetition class distribution [<bold>Figure 6</bold> in Dwibedi et al. (<xref ref-type="bibr" rid="B11">2020</xref>)]. The resulting <italic>CountixEffects</italic> dataset consists of 322 training and 100 test videos. Based on the ground-truth repetition counts provided in the original <italic>Countix</italic> dataset, the generated <italic>CountixEffects</italic> subset exhibits an average of 9.67 repetitions per sample (in the training subset) and an augmented set of 3,124 training samples. An overview of the training sample distribution per repetition count class for <italic>CountixEffects</italic> is presented in <xref ref-type="fig" rid="F5">Figure 5</xref>.</p></list-item>
</list>
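The data-pool augmentation used for both subsets, i.e., treating each segmented repetition as a discrete training sample, can be sketched as follows (the tuple-based sample layout is a hypothetical simplification):

```python
def augment_with_repetitions(videos):
    """Treat each detected repetition segment as a discrete training sample.

    `videos` is a list of (frames, label, segments) tuples, where `segments`
    holds the (start, end) frame indices of every detected repetition.
    """
    samples = []
    for frames, label, segments in videos:
        for start, end in segments:
            samples.append((frames[start:end], label))  # one sample per repetition
    return samples
```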
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Sequence length distributions according to sequence splitting approaches for <bold>(A)</bold> <italic>HMDB-51</italic> and <bold>(B)</bold> <italic>CountixHAR</italic>. Purple: Initial (whole) sequence, Pink: First execution only, Yellow: All repetitions as distinct samples.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-04-806027-g0004.tif"/>
</fig>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Distribution of video samples in <italic>CountixEffects</italic>, with respect to repetition count.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-04-806027-g0005.tif"/>
</fig>
<p><bold><italic>HMDB</italic></bold> (Kuehne et al., <xref ref-type="bibr" rid="B17">2011</xref>): This dataset contains 51 action categories, consisting of both repetitive and non-repetitive tasks. The action clips were sourced from movies, YouTube and a number of additional public video repositories. The dataset issues 3 official splits; however, in this work, we only report the top-1 classification accuracy on split 1. This dataset serves as a repetition count ground-truth-agnostic case, in order to evaluate RepNet&#x00027;s generalization. The initial set consists of 4,697 training videos, 2,054 test videos and 51 action classes. The augmentation of the data-pool by considering each repetition as a discrete sample increases the number of training samples to 8,053, indicating that the initial dataset consisted of action samples with, on average, 2 repetitions per sample. <xref ref-type="fig" rid="F4">Figure 4A</xref> presents the change in the sequence length distribution of the samples (train &#x00026; test) due to the repetition-based splitting.</p></sec>
<sec>
<title>5.2. Spatio-Temporal DNN Specifications</title>
<p>We utilize the original I3D (Carreira and Zisserman, <xref ref-type="bibr" rid="B6">2017</xref>) design with the weights pre-trained on ImageNet (Deng et al., <xref ref-type="bibr" rid="B8">2009</xref>) and Kinetics (Carreira and Zisserman, <xref ref-type="bibr" rid="B6">2017</xref>), up to the last receptive field up-sampling layer-block. As a top level, we include a Convolutional 3D layer (Conv3D) followed by an FC layer with a ReLU activation function, plus Batch Normalization, and, finally, a soft-max activation layer, in order to fine-tune the model on the new classification task for the new datasets.</p>
<p>Given the above, the spatio-temporal sequence modeling module of <italic>RepDI-Net</italic> is a two-stream, two-branch NN architecture. Both branches follow almost the same design specifications as the baseline model, with the following differences: (a) the number of channels of the Conv3D layer for the sub-net that uses the encoded repetition sequences is 1/4 of that used in the reference execution sub-network, (b) the output tensors of their FC layers (ReLU) are concatenated and the resulting tensor is passed into a set of two FC layers (ReLU and soft-max), and (c) the output tensor of the FC layer (ReLU) of the encoded repetitions sub-net is passed through a feature scaling layer (multiplication layer) that utilizes the scaling factors mentioned in section REPDI-NET: A Deep Architecture to Exploit Action Repetitiveness.</p></sec>
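A minimal PyTorch sketch of the two-branch top level described above is given below; the concrete layer widths, kernel size and pooling are illustrative assumptions, not the authors' exact hyper-parameters:

```python
import torch
import torch.nn as nn

class RepDIHead(nn.Module):
    """Two-branch top level: a reference-execution branch and an encoded-
    repetitions (DI) branch with 1/4 of its Conv3D channels. Layer sizes
    here are illustrative assumptions."""

    def __init__(self, feat_ch=1024, num_classes=28):
        super().__init__()
        # Branch A: reference RGB execution features (on top of the I3D backbone).
        self.ref_conv = nn.Conv3d(feat_ch, 256, kernel_size=1)
        self.ref_fc = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.BatchNorm1d(256))
        # Branch B: encoded-repetitions features, with 1/4 of the channels.
        self.rep_conv = nn.Conv3d(feat_ch, 64, kernel_size=1)
        self.rep_fc = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.BatchNorm1d(64))
        # The concatenated tensor goes through two FC layers (ReLU and soft-max).
        self.classifier = nn.Sequential(
            nn.Linear(256 + 64, 128), nn.ReLU(), nn.Linear(128, num_classes))

    def forward(self, ref_feat, rep_feat, scale):
        # Global average pooling over the spatio-temporal dimensions.
        a = self.ref_fc(self.ref_conv(ref_feat).mean(dim=(2, 3, 4)))
        b = self.rep_fc(self.rep_conv(rep_feat).mean(dim=(2, 3, 4)))
        b = b * scale  # feature scaling (multiplication) layer
        return self.classifier(torch.cat([a, b], dim=1)).softmax(dim=1)
```

For example, with (batch, channels, time, height, width) feature maps of shape (2, 1024, 4, 7, 7) from both branches, the head returns per-class probabilities of shape (2, num_classes).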
<sec>
<title>5.3. Training Configurations</title>
<p>For repetition counting/segmentation we exploited the <italic>RepNet</italic> model (Dwibedi et al., <xref ref-type="bibr" rid="B11">2020</xref>) with its off-the-shelf weights, without any dataset-specific fine-tuning. It should be noted that the documentation of <italic>Countix</italic> does not provide the starting and ending frame indices of each repetition segment. The only available information is the repetition counts. According to Dwibedi et al. (<xref ref-type="bibr" rid="B11">2020</xref>), the performance of RepNet in estimating the correct repetition count, under the Off-by-One (OBO) repetition count error metric, leads to a 0.3034 misclassification error for the <italic>Countix</italic> test set. For the <italic>CountixHAR</italic> subset, the misclassification error has been found to be 0.4030 for the combined train and test splits. For <italic>HMDB</italic>, RepNet was applied in a repetition-agnostic manner.</p>
<p>For <italic>HMDB</italic> we applied the standard training/validation/test splits followed in the HAR literature. For <italic>CountixHAR</italic> and <italic>CountixEffects</italic>, we defined the training/test split relying on the training/validation/test splits provided in the original <italic>Countix</italic> dataset, with the difference that the validation set was used in the place of the test set, since the test set of <italic>Countix</italic> does not provide any action labels. This resulted in a train-test split without a validation set.</p>
<p>The action recognition DNNs use the Adadelta optimizer with a learning rate of 0.01, a learning rate decay of 1<italic>e</italic> &#x02212; 4, and a batch size of 8 for sequence lengths of 10, 25 and 35, and a batch size of 4 for a sequence length of 64. The input sequence length for the encoded repetition sub-net is set to 10 frames (the maximum repetition number). For samples with fewer repetitions, we duplicate (not loop) the DIs to reach the desired length. We did not utilize standard data augmentation schemes, such as horizontal flipping, zooming or region cropping. During testing, with the exception of <italic>RepDI-Net</italic>, we use the original test sets, without repetition localization and segmentation. For <italic>RepDI-Net</italic>, test samples were segmented based on repetition, and then used for the computation of the repetition segment DIs.</p></sec></sec>
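The duplication of DIs for samples with fewer than 10 repetitions could look as follows; the text specifies duplication (not looping), while the exact duplication order is our assumption:

```python
def pad_by_duplication(dis, target_len=10):
    """Extend a short list of per-repetition Dynamic Images to `target_len`
    by duplicating each entry in place, e.g., [A, B, C] ->
    [A, A, A, A, B, B, B, C, C, C], rather than looping the sequence
    ([A, B, C, A, B, C, ...])."""
    n = len(dis)
    if n >= target_len:
        return dis[:target_len]
    base, extra = divmod(target_len, n)  # duplicates per DI, remainder spread first
    out = []
    for i, di in enumerate(dis):
        out.extend([di] * (base + (1 if i < extra else 0)))
    return out
```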
<sec id="s6">
<title>6. Experimental Results</title>
<p>We present an evaluation of the impact of repetition segmentation in HAR in relation to the characteristics of the repetitive actions of the employed datasets. We proceed with a series of experiments that demonstrate the importance of correct repetition localization, when exploiting action repetitiveness. Finally, a series of experiments are presented that examine the contribution of key components in the proposed repetition-centered HAR deep architecture, to highlight the benefits and constraints of the proposed pipeline.</p>
<sec>
<title>6.1. Effect of Repetition Segmentation on HAR Accuracy</title>
<p>We present experiments performed on <italic>CountixHAR</italic> and <italic>HMDB</italic> to evaluate the utilization of a repetition-centered segmentation module as a pre-processing step of input sequence configuration, and its impact on HAR. In <xref ref-type="table" rid="T1">Table 1</xref>, we observe that considering only the initial action execution (1st, 5th rows) reduces the processing cost with an accuracy loss of around 1% for sparsely and 2 &#x02212; 3% for densely sampled sequences, as opposed to using the entire sequence (2nd, 6th rows). This result indicates that, for repetitive tasks, each action repetition contains similar information regarding the general action pattern. The score difference between the cases where the entire sequence or only the first execution is considered can potentially be attributed to the long-term action effects on the scene or the object of interest. In addition, the utilization of all repetition segments as discrete samples (3rd, 7th rows) allows for an increase of 2 &#x02212; 4% in recognition accuracy, but with an additional computational cost during learning. This strategy is more beneficial for datasets with highly repetitive actions (i.e., <italic>CountixHAR</italic>), with a stronger contribution when using sliding window sampling, as opposed to a random clip cropping strategy for input sequence configuration. This is attributed to the fact that, in this sampling strategy, when sampling the entire video, the sparseness of the sampling process is likely to disrupt the temporal ordering of the action steps due to the repetitive nature of the action. The impact of this is more severe in datasets with fewer repetitions per action (i.e., <italic>HMDB</italic>), due to the additional temporal ordering disruptions produced by potentially erroneous repetition segmentations. The effect of the latter factor on the overall HAR accuracy is further assessed later in this section.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Accuracy (%) for HAR in HMDB / CountixHAR for different methodological variants.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th/>
<th valign="top" align="left"><bold>HMDB / CountixHAR</bold></th>
<th valign="top" align="center"><bold><italic>10 frms</italic></bold></th>
<th valign="top" align="center"><bold><italic>25 frms</italic></bold></th>
<th valign="top" align="center"><bold><italic>35 frms</italic></bold></th>
<th valign="top" align="center"><bold><italic>64 frms</italic></bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="left"><italic><italic>Rep</italic><sub>0</sub>-WS</italic></td>
<td valign="top" align="center">51.38 / 50.35</td>
<td valign="top" align="center">57.63 / 59.22</td>
<td valign="top" align="center">61.82 / 58.87</td>
<td valign="top" align="center">63.39 / 62.95</td>
</tr>
<tr>
<td valign="top" align="left">2</td>
<td valign="top" align="left"><italic>All frames-WS</italic></td>
<td valign="top" align="center">52.18 / 51.79</td>
<td valign="top" align="center">58.39 / 59.60</td>
<td valign="top" align="center">57.59 / 61.51</td>
<td valign="top" align="center"><bold>66.09</bold> / 63.36</td>
</tr>
<tr>
<td valign="top" align="left">3</td>
<td valign="top" align="left"><italic><italic>Rep</italic><sub><italic>all</italic></sub>-WS</italic></td>
<td valign="top" align="center">52.08 / 53.25</td>
<td valign="top" align="center">59.21 / <bold>63.32</bold></td>
<td valign="top" align="center"><bold>63.03</bold> / <bold>64.41</bold></td>
<td valign="top" align="center">64.20 / <bold>67.04</bold></td>
</tr>
<tr>
<td valign="top" align="left">4</td>
<td valign="top" align="left"><italic><italic>R</italic><sub><italic>euc</italic></sub>, <italic>DIs, R</italic><sub><italic>sc</italic></sub></italic>-WS</td>
<td valign="top" align="center"><bold>53.45</bold> / <bold>54.92</bold></td>
<td valign="top" align="center"><bold>59.24</bold> / 60.22</td>
<td valign="top" align="center">62.17 / 62.87</td>
<td valign="top" align="center">63.46 / 63.47</td>
</tr>
<tr>
<td valign="top" align="left">5</td>
<td valign="top" align="left"><italic><italic>Rep</italic><sub>0</sub>-RCC</italic></td>
<td valign="top" align="center">49.14 / 53.40</td>
<td valign="top" align="center">58.74 / 59.44</td>
<td valign="top" align="center">60.17 / 57.39</td>
<td valign="top" align="center">63.05 / 58.75</td>
</tr>
<tr>
<td valign="top" align="left">6</td>
<td valign="top" align="left"><italic>All frames-RCC</italic></td>
<td valign="top" align="center"><bold>52.08</bold> / 55.29</td>
<td valign="top" align="center"><bold>60.36</bold> / 60.22</td>
<td valign="top" align="center"><bold>62.24</bold> / 62.12</td>
<td valign="top" align="center"><bold>65.32</bold> / 64.39</td>
</tr>
<tr>
<td valign="top" align="left">7</td>
<td valign="top" align="left"><italic><italic>Rep</italic><sub><italic>all</italic></sub>-RCC</italic></td>
<td valign="top" align="center">51.80 / <bold>57.95</bold></td>
<td valign="top" align="center">59.28 / <bold>64.15</bold></td>
<td valign="top" align="center">61.91 / <bold>64.59</bold></td>
<td valign="top" align="center">63.62 / <bold>65.48</bold></td>
</tr>
<tr>
<td valign="top" align="left">8</td>
<td valign="top" align="left"><italic><italic>R</italic><sub><italic>euc</italic></sub>, <italic>DIs, R</italic><sub><italic>sc</italic></sub></italic>-RCC</td>
<td valign="top" align="center">49.61 / 56.18</td>
<td valign="top" align="center">58.78 / 62.35</td>
<td valign="top" align="center">59.66 / 63.87</td>
<td valign="top" align="center">61.83 / 63.89</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>WS, window sampling; RCC, random clip crop. Columns refer to the sampled input sequence length for the reference execution and the initial sequence. Bold values indicate the best performing method</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>Moreover, as shown in <xref ref-type="table" rid="T1">Table 1</xref>, the utilization of a dual-stream HAR DNN, in which the first branch receives the most representative execution segment of the repetitive action sample (RGB sequence), and the second a sequence of encoded frames, each corresponding to a summarization of a single repetition, allows the discriminative information of the repetitive action to be represented more effectively. Specifically, in <xref ref-type="table" rid="T1">Table 1</xref> (4th, 8th rows), we observe that this strategy improves accuracy by 1 &#x02212; 3% compared to the usage of the entire sequence (2nd, 6th rows), for small to mid-range inputs. The improvement in recognition accuracy is observed for both sampling schemes used in this study (uniform window sampling&#x02014;WS, random clip cropping&#x02014;RCC) for datasets that exhibit a moderate to high number of repetitions, such as <italic>CountixHAR</italic>, and for sparse and moderate sampling densities of the input sequence. In datasets with a low number of repetitions, such as <italic>HMDB</italic>, the proposed approach improves recognition accuracy, compared to the utilization of a naive sampling strategy on the entire sequence, only for the uniform window-based sampling scheme, with the improvement being observed for sparse to moderate sampling densities. When a random clip cropping sampling scheme is followed, the proposed approach exhibits lower performance. This is attributed to errors in the repetition segmentation process.</p>
<p>In the case of highly repetitive actions, <italic>RepDI-Net</italic> is capable of decoupling the main action pattern and the action-invoked effect on the scene/objects during learning. This is illustrated in <xref ref-type="fig" rid="F6">Figure 6</xref>, where we used Grad-CAM (Zhou et al., <xref ref-type="bibr" rid="B32">2016</xref>; Selvaraju et al., <xref ref-type="bibr" rid="B24">2017</xref>) to visualize the activation maps of <italic>RepDI-Net</italic> for each input source, as well as those of the baseline model that uses a sampled version of the entire video sequence. In the two Countix-sampled cases, the <italic>RepDI-Net</italic>&#x00027;s RGB input corresponding to a reference action execution focuses on the main motion pattern of the action, whereas the repetition-oriented part focuses on regions around the main motion pattern region. We would expect the HAR model to focus on regions that the action explicitly affects (around the object of interest) and not on action-unrelated regions. This is indeed true for actions such as onion slicing (<xref ref-type="fig" rid="F6">Figure 6D</xref>). Our model is not able to exhibit similar behavior for sequences with sudden viewpoint changes and severe occlusions, such as the wood sawing sequence in <italic>CountixHAR</italic> (<xref ref-type="fig" rid="F6">Figure 6H</xref>). To explain this behavior, we should consider that the original <italic>Countix</italic> dataset contains real-world videos with samples that exhibit sudden viewpoint changes and severe occlusions. In such cases, the examined repetition summarization technique (DIs) reduces but does not eliminate these effects. This &#x0201C;noise&#x0201D; in the data leads the DI sub-net to expand the range of the regions it focuses on.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Visualization of class activations for 1) <italic>Onion slicing</italic> <bold>(A&#x02013;D)</bold>, 2) <italic>Wood sawing</italic> <bold>(E&#x02013;H)</bold>. In each case, <italic>1st image</italic>: original frame, <italic>2nd image</italic>: using a window sampled sequence of the original video, and <italic>3rd, 4th images</italic>: reference RGB execution and DIs sources, respectively.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-04-806027-g0006.tif"/>
</fig>
</sec>
<sec>
<title>6.2. Impact of Repetition Segmentation Performance on Recognition Accuracy</title>
<p>The performance of an HAR model that exploits the repetitive nature of certain actions is expected to depend on the accuracy of the repetition segmentation task. Incorrect segmentation of the input actions can potentially disrupt the action step ordering that is encapsulated within each segmented repetition. As stated earlier, the <italic>Countix</italic> dataset provides only the estimated repetition counts of the included action samples, and does not include any information on the repetition segment start/end frame indices. Under those conditions, we design our experiments by focusing on the correctness of the repetition count estimate, considering that in each case there exist discrepancies between the estimated and expected repetition segment boundaries. Consequently, the cases that are considered are the following (see <xref ref-type="fig" rid="F7">Figure 7</xref> for a graphical illustration):</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Possible outcomes in repetition detection. A <italic>yellow box</italic> denotes a background action class segment (i.e., an action that is not part of the repetitive action), a <italic>red box</italic> denotes an erroneously segmented part, and a <italic>green box</italic> denotes a correct detection. First line: the ground truth on the number of the segments and their start/end frames. Second line: the number of repetitions is correctly estimated but the repetition segment boundaries are not. Third line: Both the number and the boundaries of repetition segments are not correct.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-04-806027-g0007.tif"/>
</fig>
<p><bold>Correct repetition count, unverified segment duration</bold>: To simulate the effect of unverified and potentially erroneous repetition segmentation and assess its impact on the proposed DNN architecture for the case of <italic>CountixHAR</italic>, we uniformly segment the sample sequences into the estimated number of repetition segments provided in the dataset documentation. By comparing the performance of <italic>RepDI-Net</italic> using this segmentation strategy against its performance with <italic>RepNet</italic> as the repetition segmentation approach, we observe that erroneous localizations have a negative effect on recognition accuracy, resulting in a decrease of between 1.5 and 3%, as shown in <xref ref-type="table" rid="T2">Table 2</xref> (row 1 vs. row 2, row 4 vs. row 5). This accuracy drop is more pronounced for sparsely sampled sequences, for which redundant or absent action data has a larger impact on the classification, since discriminative keyframes can be discarded.</p>
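The uniform segmentation used in this scenario can be sketched as follows (hypothetical function name):

```python
import numpy as np

def uniform_repetition_split(num_frames, rep_count):
    """Uniformly segment a sequence into `rep_count` equal-length repetition
    segments: correct count, but unverified (and possibly wrong) boundaries."""
    edges = np.linspace(0, num_frames, rep_count + 1).astype(int)
    return [(int(edges[i]), int(edges[i + 1])) for i in range(rep_count)]
```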
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Accuracy (%) for CountixHAR (1, 4) using <italic>RepDI</italic>, (2, 5) correct repetition count but unverified repetition segment duration, and (3, 6) incorrect repetition counts and unverified repetition segment duration.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th/>
<th valign="top" align="left"><bold>Repetition segmentation scheme</bold></th>
<th valign="top" align="center"><bold><italic>10 frms</italic></bold></th>
<th valign="top" align="center"><bold><italic>25 frms</italic></bold></th>
<th valign="top" align="center"><bold><italic>35 frms</italic></bold></th>
<th valign="top" align="center"><bold><italic>64 frms</italic></bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="left">RepNet Dwibedi et al. (<xref ref-type="bibr" rid="B11">2020</xref>)-WS</td>
<td valign="top" align="center"><bold>54.92</bold></td>
<td valign="top" align="center"><bold>60.22</bold></td>
<td valign="top" align="center"><bold>62.87</bold></td>
<td valign="top" align="center"><bold>63.47</bold></td>
</tr>
<tr>
<td valign="top" align="left">2</td>
<td valign="top" align="left">Correct rep. count/Unverified segment-WS</td>
<td valign="top" align="center">52.27</td>
<td valign="top" align="center">58.84</td>
<td valign="top" align="center">60.98</td>
<td valign="top" align="center">61.60</td>
</tr>
<tr>
<td valign="top" align="left">3</td>
<td valign="top" align="left">Incorrect rep. count/Unverified segment-WS</td>
<td valign="top" align="center">46.39</td>
<td valign="top" align="center">46.15</td>
<td valign="top" align="center">47.23</td>
<td valign="top" align="center">52.49</td>
</tr>
<tr>
<td valign="top" align="left">4</td>
<td valign="top" align="left">RepNet Dwibedi et al. (<xref ref-type="bibr" rid="B11">2020</xref>)-RCC</td>
<td valign="top" align="center"><bold>56.18</bold></td>
<td valign="top" align="center"><bold>62.35</bold></td>
<td valign="top" align="center"><bold>63.87</bold></td>
<td valign="top" align="center"><bold>63.89</bold></td>
</tr>
<tr>
<td valign="top" align="left">5</td>
<td valign="top" align="left">Correct rep. count/Unverified segment-RCC</td>
<td valign="top" align="center">53.51</td>
<td valign="top" align="center">59.44</td>
<td valign="top" align="center">59.09</td>
<td valign="top" align="center">62.12</td>
</tr>
<tr>
<td valign="top" align="left">6</td>
<td valign="top" align="left">Incorrect rep. count/Unverified segment-RCC</td>
<td valign="top" align="center">44.30</td>
<td valign="top" align="center">48.92</td>
<td valign="top" align="center">55.09</td>
<td valign="top" align="center">57.38</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Results are provided for both the WS (1, 2, 3) and RCC (4, 5, 6) schemes. Bold values indicate the best performing method</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p><bold>Incorrect repetition count, unverified segment duration</bold>: In this case, we randomly assign a value for the repetition count, sampled within the range of estimated repetition counts of each class. Moreover, to further increase the possibility of erroneous repetition segment duration estimates, we split the sequence into the assigned number of repetitions with variable repetition segment sizes. As previously, we compare the performance of <italic>RepDI-Net</italic> under this splitting strategy against that obtained with <italic>RepNet</italic>-based segmentation. Our results, shown in <xref ref-type="table" rid="T2">Table 2</xref> (row 1 vs. row 3, row 4 vs. row 6), indicate a decrease in performance of between 6 and 15%. Higher performance drops are observed for more sparsely sampled input sequences, where an erroneous segmentation increases the possibility of selecting a frame that does not maintain the temporal order consistency of the action steps. Both experimental scenarios highlight the importance of an accurate repetition segmentation methodology.</p>
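The incorrect-count splitting strategy can be sketched as follows (hypothetical function name; the way the variable segment sizes are drawn is our assumption):

```python
import numpy as np

def random_repetition_split(num_frames, count_range, rng=None):
    """Split a sequence into a randomly drawn number of repetitions with
    variable segment sizes (incorrect count, unverified boundaries)."""
    rng = rng or np.random.default_rng()
    # Draw a repetition count within the class's estimated count range.
    rep_count = int(rng.integers(count_range[0], count_range[1] + 1))
    # Random interior cut points yield variable-length repetition segments.
    cuts = np.sort(rng.choice(np.arange(1, num_frames), size=rep_count - 1, replace=False))
    edges = np.concatenate([[0], cuts, [num_frames]])
    return [(int(edges[i]), int(edges[i + 1])) for i in range(rep_count)]
```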
<p>From the aforementioned results, it can be observed that we obtain better results when relying on the (possibly wrong) number of repetitions estimated by RepNet than when applying naive segmentation and sampling approaches. Therefore, it is expected that an improvement in the performance of the repetition count estimation and segmentation module will further improve the accuracy of an HAR model.</p></sec>
<sec>
<title>6.3. Effect of Repetition Temporal Encoding on Recognition Accuracy</title>
<p>The effectiveness of the repetition-driven sub-net in the proposed pipeline depends on the ability of the employed temporal encoding method to represent the discriminative elements of each repetition. To better examine the importance of this factor, we compared the contribution of the Dynamic Images (DIs) encoding against that of a deep learning encoder, by examining the effect on recognition accuracy. In the place of the deep temporal encoder, I3D was exploited, following similar directions in the HAR literature that exploit 3D Convolution-based encoders (Wang et al., <xref ref-type="bibr" rid="B29">2016</xref>; Lin et al., <xref ref-type="bibr" rid="B19">2019</xref>). Experimental results shown in <xref ref-type="table" rid="T3">Table 3</xref> (row 1 vs. row 3, row 4 vs. row 6) indicate that the repetition-oriented sub-net benefits from a more informative representation, with improvements in the range of 1 &#x02212; 4%.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Accuracy (%) for CountixHAR, due to (A) scaling factor absence (<italic>No</italic> <italic>R</italic><sub><italic>sc</italic></sub>), and (B) substitution of Dynamic Images (DIs) with a deep-based (I3D) repetition temporal encoder (<italic>I3D Enc</italic>).</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th/>
<th valign="top" align="left"><bold>CountixHAR</bold></th>
<th valign="top" align="center"><bold><italic>10 frms</italic></bold></th>
<th valign="top" align="center"><bold><italic>25 frms</italic></bold></th>
<th valign="top" align="center"><bold><italic>35 frms</italic></bold></th>
<th valign="top" align="center"><bold><italic>64 frms</italic></bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="left"><italic><italic>R</italic><sub><italic>euc</italic></sub>, <italic>DIs, R</italic><sub><italic>sc</italic></sub></italic>-WS</td>
<td valign="top" align="center">54.92</td>
<td valign="top" align="center">60.22</td>
<td valign="top" align="center">62.87</td>
<td valign="top" align="center">63.47</td>
</tr>
<tr>
<td valign="top" align="left">2</td>
<td valign="top" align="left"><italic><italic>R</italic><sub><italic>euc</italic></sub>, <italic>DIs, NoR</italic><sub><italic>sc</italic></sub></italic>-WS</td>
<td valign="top" align="center">50.02</td>
<td valign="top" align="center">57.91</td>
<td valign="top" align="center">59.22</td>
<td valign="top" align="center">60.89</td>
</tr>
<tr>
<td valign="top" align="left">3</td>
<td valign="top" align="left"><italic><italic>R</italic><sub><italic>euc</italic></sub>, <italic>I</italic>3<italic>DEnc, R</italic><sub><italic>sc</italic></sub></italic>-WS</td>
<td valign="top" align="center"><bold>55.27</bold></td>
<td valign="top" align="center"><bold>62.12</bold></td>
<td valign="top" align="center"><bold>63.66</bold></td>
<td valign="top" align="center"><bold>64.36</bold></td>
</tr>
<tr>
<td valign="top" align="left">4</td>
<td valign="top" align="left"><italic><italic>R</italic><sub><italic>euc</italic></sub>, <italic>DIs, R</italic><sub><italic>sc</italic></sub></italic>-RCC</td>
<td valign="top" align="center">56.18</td>
<td valign="top" align="center">62.35</td>
<td valign="top" align="center">63.87</td>
<td valign="top" align="center">63.89</td>
</tr>
<tr>
<td valign="top" align="left">5</td>
<td valign="top" align="left"><italic><italic>R</italic><sub><italic>euc</italic></sub>, <italic>DIs, NoR</italic><sub><italic>sc</italic></sub></italic>-RCC</td>
<td valign="top" align="center">53.72</td>
<td valign="top" align="center">58.09</td>
<td valign="top" align="center">61.57</td>
<td valign="top" align="center">61.66</td>
</tr>
<tr>
<td valign="top" align="left">6</td>
<td valign="top" align="left"><italic><italic>R</italic><sub><italic>euc</italic></sub>, <italic>I</italic>3<italic>DEnc, R</italic><sub><italic>sc</italic></sub></italic>-RCC</td>
<td valign="top" align="center"><bold>61.60</bold></td>
<td valign="top" align="center"><bold>63.33</bold></td>
<td valign="top" align="center"><bold>64.39</bold></td>
<td valign="top" align="center"><bold>65.52</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Bold values indicate the best performing method</italic>.</p>
</table-wrap-foot>
</table-wrap></sec>
<sec>
<title>6.4. Effect of Scaling Factor on Recognition Accuracy</title>
<p><xref ref-type="table" rid="T3">Table 3</xref> (row 1 vs. row 2 and row 4 vs. row 5) presents the effect of the scaling factor on the recognition accuracy. We observe that when this scaling factor is not employed, accuracy decreases significantly (by 2.3 to 4.9%). The proposed scaling factor tunes the representation learned by the model according to the variability of the repetitive segments and the potential presence of discriminative information among them. This feature can be important when learning to recognize actions that have gradual effects on the scene. To better evaluate the importance of <italic>f</italic><sub><italic>DIsc</italic></sub> with respect to the repetition count range, we formulated our experimental setup around <italic>CountixEffects</italic> as follows. We split <italic>CountixEffects</italic> into two subsets based on different repetition ranges: (a) <italic>limited repetitiveness</italic>, consisting of samples whose repetition count lies in the range [2, 7], and (b) <italic>moderate to high repetitiveness</italic>, consisting of samples whose repetition count lies in the range [8, 20]. This split resulted in the 2&#x02013;7 subset containing 134 training samples and the 8&#x02013;20 subset containing 189 training videos. For both subsets, a common test set of 100 videos with repetition counts in the range 2&#x02013;20 was created. Specifically, 64 test samples lie within the 2&#x02013;7 repetition range and 36 samples within the 8&#x02013;20 range. Moreover, by leveraging the action set layout of <italic>CountixEffects</italic>, we examine the performance of <italic>RepDI-Net</italic> for repetitive actions with (a) <italic>no notable effect on the scene</italic> and (b) <italic>a gradual effect on the scene</italic>.
Our experimental results, shown in <xref ref-type="table" rid="T4">Table 4</xref>, indicate that the inclusion of the repetition count and similarity-driven scaling factor in the repetition-based DI branch is beneficial for actions that impose a gradual effect on the scene. Moreover, the scaling factor improves recognition accuracy in both action subsets and for both repetition count ranges.</p>
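<p>The subset construction described above can be sketched as follows (a minimal Python sketch; the sample records and their <italic>rep_count</italic> field are hypothetical stand-ins for the actual dataset annotations, which are documented in the authors' repository):</p>

```python
# Minimal sketch of the CountixEffects split described above (hypothetical
# sample records; the real annotations come from the authors' repository).

def split_by_repetitiveness(samples, low=(2, 7), high=(8, 20)):
    """Partition samples into a limited-repetitiveness subset (counts in
    [2, 7]) and a moderate-to-high subset (counts in [8, 20])."""
    limited = [s for s in samples if low[0] <= s["rep_count"] <= low[1]]
    moderate = [s for s in samples if high[0] <= s["rep_count"] <= high[1]]
    return limited, moderate

samples = [
    {"video_id": "a", "rep_count": 3},
    {"video_id": "b", "rep_count": 12},
    {"video_id": "c", "rep_count": 7},
    {"video_id": "d", "rep_count": 20},
]
limited, moderate = split_by_repetitiveness(samples)
# limited contains videos "a" and "c"; moderate contains "b" and "d"
```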
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Accuracy on <italic>CountixEffects</italic> in the absence (<italic>NoR</italic><sub><italic>sc</italic></sub>) and presence (<italic>R</italic><sub><italic>sc</italic></sub>) of the scaling factor <italic>f</italic><sub><italic>DIsc</italic></sub>, for actions (A) without and (B) with effects on an object due to repetitiveness.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>CountixEffects action subset</bold></th>
<th valign="top" align="center"><bold><italic>NoR</italic><sub><italic>sc</italic></sub></bold></th>
<th valign="top" align="center"><bold><italic>R</italic><sub><italic>sc</italic></sub></bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">No gradual effect set 2-7</td>
<td valign="top" align="center">81.00</td>
<td valign="top" align="center">86.33</td>
</tr>
<tr>
<td valign="top" align="left">Gradual effect set 2-7</td>
<td valign="top" align="center">73.00</td>
<td valign="top" align="center">79.50</td>
</tr>
<tr>
<td valign="top" align="left">No gradual effect set 8-20</td>
<td valign="top" align="center">80.33</td>
<td valign="top" align="center">88.67</td>
</tr>
<tr>
<td valign="top" align="left">Gradual effect set 8-20</td>
<td valign="top" align="center">76.00</td>
<td valign="top" align="center">78.00</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The table also depicts the effect of the repetition count (2&#x02013;7, 8&#x02013;20) on each subset</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p><xref ref-type="table" rid="T5">Table 5</xref> shows the accuracy on <italic>CountixEffects</italic> using a 10-frame input sequence and a uniform sampling scheme, (A) using all frames, (B) using the <italic>RepDI</italic> approach, and (C) using all repetitions as samples. Results are reported for actions in two different repetition ranges (2&#x02013;7 and 8&#x02013;20). As can be seen, the application of a repetition-based segmentation stage [columns (B) and (C)] improves recognition accuracy compared to a naive use of the input sequence [column (A)]. Moreover, considering each repetition as a distinct sample performs better than the <italic>RepDI</italic> approach. However, as presented in the following section, <italic>RepDI</italic> provides a more compact representation of the input sequences and, as such, is computationally more efficient than using all repetitions, since in the latter case the training time is proportional to the repetition count.</p>
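<p>The two repetition-exploiting strategies compared above can be contrasted with a minimal sketch (not the authors' implementation: frames are represented as flat pixel lists, and the rank-pooled Dynamic Image computation is approximated here by a per-segment temporal mean purely for illustration):</p>

```python
# Illustrative contrast between Rep_all, which turns every localized
# repetition segment into a distinct training sample, and RepDI, which
# summarizes each segment into one compact representation. In the paper the
# summary is a rank-pooled Dynamic Image; a temporal mean stands in for it.

def rep_all_samples(frames, boundaries):
    """One training sample (sub-sequence) per repetition segment."""
    return [frames[s:e] for s, e in boundaries]

def rep_di_samples(frames, boundaries):
    """One summarized frame per repetition segment (mean over time here)."""
    def summarize(segment):
        n = len(segment)
        return [sum(pixel) / n for pixel in zip(*segment)]
    return [summarize(frames[s:e]) for s, e in boundaries]

frames = [[0, 0], [2, 2], [4, 4], [6, 6]]       # 4 frames of 2 "pixels"
boundaries = [(0, 2), (2, 4)]                   # two localized repetitions
all_reps = rep_all_samples(frames, boundaries)  # 2 samples, 2 frames each
summaries = rep_di_samples(frames, boundaries)  # 2 samples, 1 "frame" each
```

<p>With this layout, the number of <italic>Rep</italic><sub><italic>all</italic></sub> samples (and hence the per-epoch cost) grows linearly with the repetition count, whereas <italic>RepDI</italic> keeps a single summary per segment.</p>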
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Accuracy on <italic>CountixEffects</italic>, for lower (2&#x02013;7) and higher (8&#x02013;20) repetition counts, using a 10-frame input sequence, and a uniform sampling scheme, (A) using all frames (B) using the <italic>RepDI</italic> approach and (C) using all repetitions as samples.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>CountixEffects action subset</bold></th>
<th valign="top" align="center"><bold>(A) <italic>All frames</italic></bold></th>
<th valign="top" align="center"><bold>(B) <italic>RepDI</italic></bold></th>
<th valign="top" align="center"><bold>(C) <italic>Rep</italic><sub><italic>all</italic></sub></bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Repetition count 2-7</td>
<td valign="top" align="center">75.67</td>
<td valign="top" align="center">82.91</td>
<td valign="top" align="center">85.40</td>
</tr>
<tr>
<td valign="top" align="left">Repetition count 8-20</td>
<td valign="top" align="center">80.80</td>
<td valign="top" align="center">83.34</td>
<td valign="top" align="center">88.60</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Approaches (B) and (C) exploit repetitiveness and outperform (A), which does not</italic>.</p>
</table-wrap-foot>
</table-wrap></sec>
<sec>
<title>6.5. Learning Efficiency When Exploiting Action Repetitiveness</title>
<p>The previous experiments make it clear that exploiting action repetitiveness to better configure the input sequence can improve model performance in HAR. Based on the obtained experimental results, the most effective approach is to consider all repetitions as distinct training samples. However, this approach is not the most efficient, since the training time per epoch increases proportionally to the repetition count. As shown in <xref ref-type="table" rid="T6">Table 6</xref> for the case of <italic>CountixHAR</italic>, a mean of 4.7 repetitions per sample leads to a &#x000D7;4 increase in per-epoch computation time compared to exploiting the repetitions with the proposed deep pipeline, and this discrepancy grows as the number of repetitions increases. Moreover, <xref ref-type="fig" rid="F8">Figure 8</xref> shows that the proposed repetition segment summarization scheme achieves the best trade-off between efficiency and efficacy in the learning process when dealing with repetitive actions.</p>
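<p>As a quick sanity check, the per-epoch speedups implied by the Table 6 timings can be computed directly (values copied from the table):</p>

```python
# Per-epoch training times (seconds) reported in Table 6.
epoch_times = {
    "CountixHAR 2-10":     {"Rep_all": 548, "RepDI": 125},
    "CountixEffects 2-20": {"Rep_all": 521, "RepDI": 62},
}

# RepDI speedup over Rep_all for each dataset.
speedups = {name: t["Rep_all"] / t["RepDI"] for name, t in epoch_times.items()}
# CountixHAR: ~4.4x, matching the ~x4 figure quoted in the text;
# CountixEffects: ~8.4x
```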
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Time per epoch (sec&#x02014;mean duration over 5 epochs).</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Dataset, Learning method</bold></th>
<th valign="top" align="center"><bold>Time (sec) per epoch</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">CountixHAR 2-10, <italic>Rep</italic><sub><italic>all</italic></sub></td>
<td valign="top" align="center">548</td>
</tr>
<tr>
<td valign="top" align="left">CountixHAR 2-10, RepDI</td>
<td valign="top" align="center">125</td>
</tr>
<tr>
<td valign="top" align="left">CountixEffects 2-20, <italic>Rep</italic><sub><italic>all</italic></sub></td>
<td valign="top" align="center">521</td>
</tr>
<tr>
<td valign="top" align="left">CountixEffects 2-20, RepDI</td>
<td valign="top" align="center">62</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Training was performed on an RTX 3070 GPU, with batch size 8, learning rate 0.8, the Adadelta optimizer, an input length of 10 frames, and RCC sampling</italic>.</p>
</table-wrap-foot>
</table-wrap>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Training accuracy change with (A) no exploitation of the repetitiveness (red), (B) exploitation of all repetitions as distinct samples (blue), and (C) using the proposed method (green).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-04-806027-g0008.tif"/>
</fig></sec></sec>
<sec id="s7">
<title>7. Discussion and Future Work</title>
<p>We considered and evaluated the repetitive nature of certain actions in HAR from two perspectives. First, we investigated the effect of redundant information, present due to task repetitiveness, on the ability to learn discriminative action-specific representations using common sampling techniques. Additionally, we proposed ways to highlight, <italic>via</italic> effective repetition sequence localization and processing, the gradual effects of the repetitive action on the actor or on the involved objects, and evaluated their contribution to the action recognition task. Our findings indicate that for actions exhibiting a moderate to high number of repetitions, localizing and using repetitions allows a deep learning HAR model to access more informative and discriminative representations, thus improving recognition performance. Exploiting repetitions as distinct samples slows down training, but allows the model to better capture the temporal ordering of the action as well as the scene/actor/object-of-interest appearance changes.</p>
<p>Repetitions can also be used to highlight the gradual effects of the action on the scene, an ability that can be useful for discriminating between <italic>fine-grained</italic> actions that exhibit high appearance and motion similarities. When adopting this strategy, it is evident that HAR should focus on the action-affected regions. Our findings indicate that background motions or occlusions unrelated to the action tend to be captured by the summarization methods and are therefore treated as action-induced consequences. A remedy could be to focus on the action-related objects and to generate temporal summarizations/encodings only for these regions. Such an encoding scheme, combined with more recent state-of-the-art deep HAR models, would allow for more informative representations and is expected to increase the effectiveness of exploiting repetitiveness in HAR.</p>
<p>One of the most important issues in exploiting repetitiveness for HAR is the accuracy of repetition localization. This remains open for improvement, since only a few works have tackled the problem, all of them from the perspective of periodicity estimation and repetition counting. As indicated by our experiments, an HAR model that exploits repetitiveness is expected to benefit from more robust repetition localization and repetition count estimation methods.</p>
<sec sec-type="data-availability" id="s8">
<title>Data Availability Statement</title>
<p>In this study, the authors generated and analyzed two subsets of Countix (Dwibedi et al., <xref ref-type="bibr" rid="B11">2020</xref>), dubbed CountixHAR and CountixEffects. Guidelines for generating the datasets, as well as all relevant documentation, can be found in the GitHub repository Repetitive-Action-Recognition: <ext-link ext-link-type="uri" xlink:href="https://github.com/Bouclas/Repetitive-Action-Recognition">https://github.com/Bouclas/Repetitive-Action-Recognition</ext-link>.</p>
<sec id="s9">
<title>Author Contributions</title>
<p>KB conceived the study, implemented the software, performed the experiments, and wrote most of the paper. AA helped refine the methodology, validated the results, and contributed to the writing of the paper. All authors have read and agreed to the published version of the manuscript.</p>
<sec sec-type="funding-information" id="s10">
<title>Funding</title>
<p>This research project was supported by the Hellenic Foundation for Research and Innovation (H.F.R.I) under the 1st Call for H.F.R.I Research Projects to support Faculty members and Researchers and the procurement of high-cost research equipment Project I.C.Humans, Number: 91.</p></sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p></sec>
<sec sec-type="disclaimer" id="s11">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p></sec>
</body>
<back>
<ack><p>We gratefully acknowledge the support of NVIDIA Corporation with the donation of a GPU. The authors would like to thank Kostas Papoutsakis and Aggeliki Tsoli for their helpful suggestions.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abdi</surname> <given-names>H.</given-names></name> <name><surname>Williams</surname> <given-names>L. J.</given-names></name></person-group> (<year>2010</year>). <article-title>Principal component analysis</article-title>. <source>Wiley Interdiscipl. Rev. Comput. Stat.</source> <volume>2</volume>, <fpage>433</fpage>&#x02013;<lpage>459</lpage>. <pub-id pub-id-type="doi">10.1002/wics.101</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Aggarwal</surname> <given-names>J. K.</given-names></name> <name><surname>Ryoo</surname> <given-names>M. S.</given-names></name></person-group> (<year>2011</year>). <article-title>Human activity analysis: a review</article-title>. <source>ACM Comput. Surveys (CSUR)</source> <volume>43</volume>, <fpage>1</fpage>&#x02013;<lpage>43</lpage>. <pub-id pub-id-type="doi">10.1145/1922649.1922653</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ahad</surname> <given-names>M. A. R.</given-names></name> <name><surname>Tan</surname> <given-names>J. K.</given-names></name> <name><surname>Kim</surname> <given-names>H.</given-names></name> <name><surname>Ishikawa</surname> <given-names>S.</given-names></name></person-group> (<year>2012</year>). <article-title>Motion history image: its variants and applications</article-title>. <source>Mach. Vis. Appl.</source> <volume>23</volume>, <fpage>255</fpage>&#x02013;<lpage>281</lpage>. <pub-id pub-id-type="doi">10.1007/s00138-010-0298-4</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bai</surname> <given-names>S.</given-names></name> <name><surname>Kolter</surname> <given-names>J. Z.</given-names></name> <name><surname>Koltun</surname> <given-names>V.</given-names></name></person-group> (<year>2018</year>). <article-title>An empirical evaluation of generic convolutional and recurrent networks for sequence modeling</article-title>. <source>arXiv:1803.01271</source>.</citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bilen</surname> <given-names>H.</given-names></name> <name><surname>Fernando</surname> <given-names>B.</given-names></name> <name><surname>Gavves</surname> <given-names>E.</given-names></name> <name><surname>Vedaldi</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>Action recognition with dynamic image networks</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>40</volume>, <fpage>2799</fpage>&#x02013;<lpage>2813</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2017.2769085</pub-id><pub-id pub-id-type="pmid">29990080</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Carreira</surname> <given-names>J.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>Quo vadis, action recognition? a new model and the kinetics dataset,</article-title> in <source>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source> (<publisher-loc>Honolulu, HI</publisher-loc>), <fpage>4724</fpage>&#x02013;<lpage>4733</lpage>.</citation></ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cherian</surname> <given-names>A.</given-names></name> <name><surname>Fernando</surname> <given-names>B.</given-names></name> <name><surname>Harandi</surname> <given-names>M.</given-names></name> <name><surname>Gould</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). <article-title>Generalized rank pooling for activity recognition,</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu, HI</publisher-loc>), <fpage>3222</fpage>&#x02013;<lpage>3231</lpage>.</citation></ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Dong</surname> <given-names>W.</given-names></name> <name><surname>Socher</surname> <given-names>R.</given-names></name> <name><surname>Li</surname> <given-names>L.-J.</given-names></name> <name><surname>Li</surname> <given-names>K.</given-names></name> <name><surname>Fei-Fei</surname> <given-names>L.</given-names></name></person-group> (<year>2009</year>). <article-title>Imagenet: a large-scale hierarchical image database,</article-title> in <source>2009 IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Miami, FL</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>248</fpage>&#x02013;<lpage>255</lpage>.</citation></ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Diba</surname> <given-names>A.</given-names></name> <name><surname>Sharma</surname> <given-names>V.</given-names></name> <name><surname>Van Gool</surname> <given-names>L.</given-names></name></person-group> (<year>2017</year>). <article-title>Deep temporal linear encoding networks,</article-title> in <source>Proceedings of the IEEE conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu, HI</publisher-loc>), <fpage>2329</fpage>&#x02013;<lpage>2338</lpage>.</citation></ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Donahue</surname> <given-names>J.</given-names></name> <name><surname>Anne Hendricks</surname> <given-names>L.</given-names></name> <name><surname>Guadarrama</surname> <given-names>S.</given-names></name> <name><surname>Rohrbach</surname> <given-names>M.</given-names></name> <name><surname>Venugopalan</surname> <given-names>S.</given-names></name> <name><surname>Saenko</surname> <given-names>K.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Long-term recurrent convolutional networks for visual recognition and description,</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Boston, MA</publisher-loc>), <fpage>2625</fpage>&#x02013;<lpage>2634</lpage>.<pub-id pub-id-type="pmid">27608449</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dwibedi</surname> <given-names>D.</given-names></name> <name><surname>Aytar</surname> <given-names>Y.</given-names></name> <name><surname>Tompson</surname> <given-names>J.</given-names></name> <name><surname>Sermanet</surname> <given-names>P.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2020</year>). <article-title>Counting out time: class agnostic video repetition counting in the wild,</article-title> in <source>IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source> (<publisher-loc>Seattle, WA</publisher-loc>).</citation></ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Feichtenhofer</surname> <given-names>C.</given-names></name> <name><surname>Pinz</surname> <given-names>A.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <article-title>Convolutional two-stream network fusion for video action recognition,</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>), <fpage>1933</fpage>&#x02013;<lpage>1941</lpage>.</citation></ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Fernando</surname> <given-names>B.</given-names></name> <name><surname>Gavves</surname> <given-names>E.</given-names></name> <name><surname>Oramas</surname> <given-names>J. M.</given-names></name> <name><surname>Ghodrati</surname> <given-names>A.</given-names></name> <name><surname>Tuytelaars</surname> <given-names>T.</given-names></name></person-group> (<year>2015</year>). <article-title>Modeling video evolution for action recognition,</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Boston, MA</publisher-loc>), <fpage>5378</fpage>&#x02013;<lpage>5387</lpage>.<pub-id pub-id-type="pmid">27030844</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Herath</surname> <given-names>S.</given-names></name> <name><surname>Harandi</surname> <given-names>M.</given-names></name> <name><surname>Porikli</surname> <given-names>F.</given-names></name></person-group> (<year>2017</year>). <article-title>Going deeper into action recognition: a survey</article-title>. <source>Image Vis. Comput.</source> <volume>60</volume>, <fpage>4</fpage>&#x02013;<lpage>21</lpage>. <pub-id pub-id-type="doi">10.1016/j.imavis.2017.01.010</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Kang</surname> <given-names>S. M.</given-names></name> <name><surname>Wildes</surname> <given-names>R. P.</given-names></name></person-group> (<year>2016</year>). <source>Review of action recognition and detection methods. arXiv</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1610.06906">https://arxiv.org/abs/1610.06906</ext-link></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Karvounas</surname> <given-names>G.</given-names></name> <name><surname>Oikonomidis</surname> <given-names>I.</given-names></name> <name><surname>Argyros</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>Reactnet: temporal localization of repetitive activities in real-world videos</article-title>. <source>arXiv preprint</source> arXiv:1910.06096.</citation></ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kuehne</surname> <given-names>H.</given-names></name> <name><surname>Jhuang</surname> <given-names>H.</given-names></name> <name><surname>Garrote</surname> <given-names>E.</given-names></name> <name><surname>Poggio</surname> <given-names>T.</given-names></name> <name><surname>Serre</surname> <given-names>T.</given-names></name></person-group> (<year>2011</year>). <article-title>Hmdb: a large video database for human motion recognition,</article-title> in <source>2011 International Conference on Computer Vision</source> (<publisher-loc>Barcelona</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2556</fpage>&#x02013;<lpage>2563</lpage>.</citation></ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Levy</surname> <given-names>O.</given-names></name> <name><surname>Wolf</surname> <given-names>L.</given-names></name></person-group> (<year>2015</year>). <article-title>Live repetition counting,</article-title> in <source>2015 IEEE International Conference on Computer Vision (ICCV)</source> (<publisher-loc>Santiago</publisher-loc>), <fpage>3020</fpage>&#x02013;<lpage>3028</lpage>.</citation></ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>J.</given-names></name> <name><surname>Gan</surname> <given-names>C.</given-names></name> <name><surname>Han</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>Tsm: temporal shift module for efficient video understanding,</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Seoul</publisher-loc>), <fpage>7083</fpage>&#x02013;<lpage>7093</lpage>.<pub-id pub-id-type="pmid">33035158</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>R.</given-names></name> <name><surname>Xiao</surname> <given-names>J.</given-names></name> <name><surname>Fan</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>Nextvlad: an efficient neural network to aggregate frame-level features for large-scale video classification,</article-title> in <source>Proceedings of the European Conference on Computer Vision (ECCV) Workshops</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://openaccess.thecvf.com/content_eccv_2018_workshops/w22/html/Lin_NeXtVLAD_An_Efficient_Neural_Network_to_Aggregate_Frame-level_Features_for_ECCVW_2018_paper.html">https://openaccess.thecvf.com/content_eccv_2018_workshops/w22/html/Lin_NeXtVLAD_An_Efficient_Neural_Network_to_Aggregate_Frame-level_Features_for_ECCVW_2018_paper.html</ext-link></citation></ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Panagiotakis</surname> <given-names>C.</given-names></name> <name><surname>Karvounas</surname> <given-names>G.</given-names></name> <name><surname>Argyros</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>Unsupervised detection of periodic segments in videos,</article-title> in <source>2018 25th IEEE International Conference on Image Processing (ICIP)</source> (<publisher-loc>Athens</publisher-loc>), <fpage>923</fpage>&#x02013;<lpage>927</lpage>.</citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pearson</surname> <given-names>K.</given-names></name></person-group> (<year>1901</year>). <article-title>On lines and planes of closest fit to systems of points in space</article-title>. <source>London Edinburgh Dublin Philosoph. Mag. J. Sci.</source> <volume>2</volume>, <fpage>559</fpage>&#x02013;<lpage>572</lpage>.</citation></ref>
<ref id="B23">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Runia</surname> <given-names>T. F. H.</given-names></name> <name><surname>Snoek</surname> <given-names>C. G. M.</given-names></name> <name><surname>Smeulders</surname> <given-names>A. W. M.</given-names></name></person-group> (<year>2018</year>). <article-title>Real-world repetition estimation by div, grad and curl,</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://openaccess.thecvf.com/content_cvpr_2018/html/Runia_Real-World_Repetition_Estimation_CVPR_2018_paper.html">https://openaccess.thecvf.com/content_cvpr_2018/html/Runia_Real-World_Repetition_Estimation_CVPR_2018_paper.html</ext-link></citation></ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Selvaraju</surname> <given-names>R. R.</given-names></name> <name><surname>Cogswell</surname> <given-names>M.</given-names></name> <name><surname>Das</surname> <given-names>A.</given-names></name> <name><surname>Vedantam</surname> <given-names>R.</given-names></name> <name><surname>Parikh</surname> <given-names>D.</given-names></name> <name><surname>Batra</surname> <given-names>D.</given-names></name></person-group> (<year>2017</year>). <article-title>Grad-cam: visual explanations from deep networks via gradient-based localization,</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision</source> (<publisher-loc>Venice</publisher-loc>), <fpage>618</fpage>&#x02013;<lpage>626</lpage>.</citation></ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Simonyan</surname> <given-names>K.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2014</year>). <article-title>Two-stream convolutional networks for action recognition in videos</article-title>. <source>arXiv preprint</source> arXiv:1406.2199.<pub-id pub-id-type="pmid">33291759</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tran</surname> <given-names>D.</given-names></name> <name><surname>Bourdev</surname> <given-names>L.</given-names></name> <name><surname>Fergus</surname> <given-names>R.</given-names></name> <name><surname>Torresani</surname> <given-names>L.</given-names></name> <name><surname>Paluri</surname> <given-names>M.</given-names></name></person-group> (<year>2015</year>). <article-title>Learning spatiotemporal features with 3d convolutional networks,</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision</source> (<publisher-loc>Santiago</publisher-loc>), <fpage>4489</fpage>&#x02013;<lpage>4497</lpage>.</citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Varol</surname> <given-names>G.</given-names></name> <name><surname>Laptev</surname> <given-names>I.</given-names></name> <name><surname>Schmid</surname> <given-names>C.</given-names></name></person-group> (<year>2017</year>). <article-title>Long-term temporal convolutions for action recognition</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>40</volume>, <fpage>1510</fpage>&#x02013;<lpage>1517</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2017.2712608</pub-id><pub-id pub-id-type="pmid">28600238</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Tong</surname> <given-names>Z.</given-names></name> <name><surname>Ji</surname> <given-names>B.</given-names></name> <name><surname>Wu</surname> <given-names>G.</given-names></name></person-group> (<year>2021</year>). <article-title>Tdn: temporal difference networks for efficient action recognition,</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source> (<publisher-loc>Nashville, TN</publisher-loc>), <fpage>1895</fpage>&#x02013;<lpage>1904</lpage>.</citation></ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Xiong</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Qiao</surname> <given-names>Y.</given-names></name> <name><surname>Lin</surname> <given-names>D.</given-names></name> <name><surname>Tang</surname> <given-names>X.</given-names></name> <name><surname>Van Gool</surname> <given-names>L.</given-names></name></person-group> (<year>2016</year>). <article-title>Temporal segment networks: towards good practices for deep action recognition,</article-title> in <source>European Conference on Computer Vision</source> (<publisher-loc>Amsterdam</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>20</fpage>&#x02013;<lpage>36</lpage>.</citation></ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>Gupta</surname> <given-names>A.</given-names></name> <name><surname>He</surname> <given-names>K.</given-names></name></person-group> (<year>2018</year>). <article-title>Non-local neural networks,</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>), <fpage>7794</fpage>&#x02013;<lpage>7803</lpage>.</citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>S.</given-names></name> <name><surname>Guo</surname> <given-names>S.</given-names></name> <name><surname>Huang</surname> <given-names>W.</given-names></name> <name><surname>Scott</surname> <given-names>M. R.</given-names></name> <name><surname>Wang</surname> <given-names>L.</given-names></name></person-group> (<year>2020</year>). <article-title>V4d: 4d convolutional neural networks for video-level representation learning</article-title>. <source>arXiv preprint</source> arXiv:2002.07442.</citation></ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>B.</given-names></name> <name><surname>Khosla</surname> <given-names>A.</given-names></name> <name><surname>Lapedriza</surname> <given-names>A.</given-names></name> <name><surname>Oliva</surname> <given-names>A.</given-names></name> <name><surname>Torralba</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <article-title>Learning deep features for discriminative localization,</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>), <fpage>2921</fpage>&#x02013;<lpage>2929</lpage>.</citation></ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhu</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>X.</given-names></name> <name><surname>Liu</surname> <given-names>C.</given-names></name> <name><surname>Zolfaghari</surname> <given-names>M.</given-names></name> <name><surname>Xiong</surname> <given-names>Y.</given-names></name> <name><surname>Wu</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>A comprehensive study of deep video action recognition</article-title>. <source>arXiv preprint</source> arXiv:2012.06567.</citation></ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>We only used publicly available YouTube videos.</p></fn>
</fn-group>
</back>
</article>