<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurorobot.</journal-id>
<journal-title>Frontiers in Neurorobotics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurorobot.</abbrev-journal-title>
<issn pub-type="epub">1662-5218</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnbot.2023.1275645</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Sports competition tactical analysis model of cross-modal transfer learning intelligent robot based on Swin Transformer and CLIP</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Jiang</surname> <given-names>Li</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2397996/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/data-curation/"/>
<role content-type="https://credit.niso.org/contributor-roles/funding-acquisition/"/>
<role content-type="https://credit.niso.org/contributor-roles/investigation/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/project-administration/"/>
<role content-type="https://credit.niso.org/contributor-roles/resources/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Lu</surname> <given-names>Wang</given-names></name>
<role content-type="https://credit.niso.org/contributor-roles/project-administration/"/>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/supervision/"/>
<role content-type="https://credit.niso.org/contributor-roles/visualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff><institution>School of Physical Education of Yantai University</institution>, <addr-line>Yantai</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Federica Verdini, Marche Polytechnic University, Italy</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Ferozkhan A. B, C. Abdul Hakeem College of Engineering and Technology, India; Danfeng Hong, Chinese Academy of Sciences (CAS), China; Yaodong Gu, Ningbo University, China</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Li Jiang <email>tyxy&#x00040;ytu.edu.cn</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>30</day>
<month>10</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>17</volume>
<elocation-id>1275645</elocation-id>
<history>
<date date-type="received">
<day>10</day>
<month>08</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>09</day>
<month>10</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2023 Jiang and Lu.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Jiang and Lu</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<sec>
<title>Introduction</title>
<p>This paper presents an innovative Intelligent Robot Sports Competition Tactical Analysis Model that leverages multimodal perception to tackle the pressing challenge of analyzing opponent tactics in sports competitions. The current landscape of sports competition analysis necessitates a comprehensive understanding of opponent strategies. However, traditional methods are often constrained to a single data source or modality, limiting their ability to capture the intricate details of opponent tactics.</p></sec>
<sec>
<title>Methods</title>
<p>Our system integrates the Swin Transformer and CLIP models, harnessing cross-modal transfer learning to enable a holistic observation and analysis of opponent tactics. The Swin Transformer is employed to acquire knowledge about opponent action postures and behavioral patterns in basketball or football games, while the CLIP model enhances the system&#x00027;s comprehension of opponent tactical information by establishing semantic associations between images and text. To address potential imbalances and biases between these models, we introduce a cross-modal transfer learning technique that mitigates modal bias issues, thereby enhancing the model&#x00027;s generalization performance on multimodal data.</p></sec>
<sec>
<title>Results</title>
<p>Through cross-modal transfer learning, tactical information learned from images by the Swin Transformer is effectively transferred to the CLIP model, providing coaches and athletes with comprehensive tactical insights. Our method is rigorously tested and validated using the SportVU, Sports-1M, HMDB51, and NTU RGB&#x0002B;D datasets. Experimental results demonstrate the system&#x00027;s impressive performance in terms of prediction accuracy, stability, training time, inference time, number of parameters, and computational complexity. Notably, the system outperforms other models, with a remarkable 8.47% lower prediction error (MAE) on the Kinetics dataset, accompanied by a 72.86-second reduction in training time.</p></sec>
<sec>
<title>Discussion</title>
<p>The presented system proves to be highly suitable for real-time sports competition assistance and analysis, offering a novel and effective approach for an Intelligent Robot Sports Competition Tactical Analysis Model that maximizes the potential of multimodal perception technology. By harnessing the synergies between the Swin Transformer and CLIP models, we address the limitations of traditional methods and significantly advance the field of sports competition analysis. This innovative model opens up new avenues for comprehensive tactical analysis in sports, benefiting coaches, athletes, and sports enthusiasts alike.</p></sec></abstract>
<kwd-group>
<kwd>intelligent robot</kwd>
<kwd>multimodal perception</kwd>
<kwd>Swin Transformer</kwd>
<kwd>CLIP model</kwd>
<kwd>cross-modal transfer learning</kwd>
</kwd-group>
<counts>
<fig-count count="8"/>
<table-count count="4"/>
<equation-count count="11"/>
<ref-count count="46"/>
<page-count count="16"/>
<word-count count="8115"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>With the advancement of sports competition levels, in-depth analysis of the opponent&#x00027;s tactics has become the key to winning games. A profound understanding of each other&#x00027;s strategies provides a more effective competitive strategy (Pan, <xref ref-type="bibr" rid="B23">2022</xref>). However, current analysis methods are primarily based on a single data source, such as video replays or simple statistics, often failing to provide a comprehensive tactical portrait of the opponent. Additionally, traditional analysis methods often overlook the value of multi-modal data, such as text descriptions and athlete action data, which can offer rich contextual information for tactical analysis. Due to these limitations, current tactical analysis methods often fall short of meeting the demands of high-level competitive sports. With the rapid development of artificial intelligence technology, innovative and practical research approaches in the field of tactical analysis have emerged, driving the development and application of intelligent sports assistance (Olan et al., <xref ref-type="bibr" rid="B22">2022</xref>).</p>
<p>In past research, scholars have explored different deep learning and machine learning models to construct sports competition tactical analysis models. For instance, Wenninger et al. (<xref ref-type="bibr" rid="B40">2020</xref>) employed Convolutional Neural Networks (CNNs) to recognize players&#x00027; poses in basketball games, assisting coaches in tactical analysis and decision-making. However, this method exhibits limitations in handling complex scenarios and multimodal information, resulting in inaccuracies due to inadequate consideration of player interactions. To address these shortcomings, Tabrizi et al. (<xref ref-type="bibr" rid="B30">2020</xref>) proposed an improved LSTM model for an intelligent robot motion-assistance training system: trained and tested on table tennis players&#x00027; forehand stroke signals, the model predicts a player&#x00027;s next stroke state. Although this method predicts players&#x00027; common stroke states to some extent, it is inefficient when processing long sequences and performs poorly on large volumes of image data.</p>
<p>In recent years, researchers have explored the application of Transformer models in the Intelligent Robot Sports Assistant Training System. Yuan et al. (<xref ref-type="bibr" rid="B44">2021</xref>) introduced the Vision Transformer (ViT) model, transforming image data into sequences for processing and achieving excellent image feature representation. However, this method faced computational and storage resource pressures when dealing with large-sized images, limiting its practical application in real sports competition scenarios.</p>
<p>To overcome these challenges, this paper proposes an intelligent robot sports competition tactical analysis model based on multi-modal perception. Firstly, we introduce the Swin Transformer (Liu et al., <xref ref-type="bibr" rid="B17">2021</xref>) and CLIP models (Park et al., <xref ref-type="bibr" rid="B25">2023</xref>) to achieve comprehensive observation and analysis of opponent tactics through multi-modal perception techniques. Secondly, we adopt cross-modal transfer learning (Wang and Yoon, <xref ref-type="bibr" rid="B38">2021</xref>) to transfer opponent tactical information learned from images to the text modality, thereby enhancing the system&#x00027;s semantic understanding between images and texts. Finally, we establish a multi-modal tactical analysis and reasoning framework to predict opponent strategies and behavior patterns, providing coaches and athletes with richer and more accurate tactical decision support.</p>
<p>The contributions of this paper are as follows:
<list list-type="bullet">
<list-item><p>Introducing multi-modal perception techniques to enhance observation and analysis of opponent tactics.</p></list-item>
<list-item><p>Adopting cross-modal transfer learning to improve the semantic understanding between images and texts.</p></list-item>
<list-item><p>Establishing a multi-modal tactical analysis and reasoning framework, providing coaches and athletes with more accurate tactical decision support.</p></list-item>
</list></p>
<p>Through these efforts, we aim to offer new insights and methods for the development and application of the Intelligent Robot Sports Assistant Training System, driving continuous improvement in intelligent sports competition levels.</p></sec>
<sec id="s2">
<title>2. Related work</title>
<p>Compared to methods based on graph node-edge processing (Yun et al., <xref ref-type="bibr" rid="B45">2019</xref>; Kong et al., <xref ref-type="bibr" rid="B9">2022</xref>) and multi-view approaches, methods based on Graph Neural Networks (GNNs; Ning et al., <xref ref-type="bibr" rid="B20">2023</xref>) directly utilize graphs to capture relationships and interactions between entities in a given domain. GNNs can be employed for comprehensive analysis of context-aware motion data (Sanford et al., <xref ref-type="bibr" rid="B26">2020</xref>; Ning et al., <xref ref-type="bibr" rid="B20">2023</xref>), providing a deeper understanding of opponent movements and deployed strategies, ultimately supporting better decision-making. They exhibit high flexibility and scalability (Victor et al., <xref ref-type="bibr" rid="B36">2021</xref>), making them suitable for capturing complex and dynamic interactions in various sports competitions. However, the performance of GNNs heavily relies on the completeness and quality of graph data (Maglo et al., <xref ref-type="bibr" rid="B19">2022</xref>). Without a clear, complete, and accurate graphical representation, the model may fail to capture key inter-entity relationships, which may make it difficult for the model to understand the tactical relationships between opposing players.</p>
<p>Recently, Generative Adversarial Networks (GANs) have shown significant potential in fields like computer vision, improving the performance of action recognition models through the generation of realistic synthetic motion videos (Wang et al., <xref ref-type="bibr" rid="B39">2019</xref>). GANs have the capability to generate simulated game scenarios, demonstrating strong generalization ability (Dash et al., <xref ref-type="bibr" rid="B5">2021</xref>; Hong et al., <xref ref-type="bibr" rid="B6">2021</xref>), and providing valuable analysis imagery for tactical analysis. However, they suffer from the issue of &#x0201C;mode collapse,&#x0201D; where the generator may continuously produce highly similar outputs, limiting the diversity of generated data (Liu et al., <xref ref-type="bibr" rid="B14">2020</xref>), which could hinder the understanding of tactical relationships between opposing players.</p>
<p>Furthermore, recent approaches utilize Transformer-like networks (Nweke et al., <xref ref-type="bibr" rid="B21">2018</xref>) to capture critical information through self-attention, enabling them to capture temporal and spatial dependencies within video frames and enhance action recognition performance in dynamic motion scenes (Li et al., <xref ref-type="bibr" rid="B13">2020</xref>). The Attention Mechanism for Action Recognition (ATTET) is capable of handling multi-modal input information and efficiently integrating information from these diverse sources (Pareek and Thakkar, <xref ref-type="bibr" rid="B24">2021</xref>; Chen and Ho, <xref ref-type="bibr" rid="B3">2022</xref>), further improving the accuracy of action recognition in sports videos. The temporal attention mechanism in ATTET ensures that the model focuses on the most relevant frames, making it robust to changes in action speed and duration commonly encountered in sports competitions (Chen et al., <xref ref-type="bibr" rid="B2">2021</xref>; Yao et al., <xref ref-type="bibr" rid="B43">2023</xref>). The spatial attention mechanism allows ATTET to selectively concentrate on relevant regions within video frames, effectively reducing noise and improving the model&#x00027;s discriminative power (Liu Z. et al., <xref ref-type="bibr" rid="B16">2022</xref>; Li et al., <xref ref-type="bibr" rid="B12">2023</xref>). However, introducing attention mechanisms in ATTET may increase computational complexity, limiting the model&#x00027;s transparency, especially in low-quality or complex background scenarios, making it challenging to apply the model to sports competition videos (Ma et al., <xref ref-type="bibr" rid="B18">2022</xref>).</p></sec>
<sec sec-type="methods" id="s3">
<title>3. Methodology</title>
<sec>
<title>3.1. Overview of our network</title>
<p>We propose an intelligent robot sports competition tactical analysis model based on multimodal perception; the overall process is depicted in <xref ref-type="fig" rid="F1">Figure 1</xref>. This system leverages the Swin Transformer and CLIP models and employs cross-modal transfer learning to observe and analyze opponent tactics in sports competitions. The Swin Transformer is utilized to learn tactical information from the opponent&#x00027;s dynamic video images, capturing their movement postures and behavior patterns in basketball or football games. Meanwhile, CLIP establishes semantic associations between images and texts in a shared latent space, enhancing the system&#x00027;s ability to understand and analyze the opponent&#x00027;s tactical information. By utilizing cross-modal transfer learning, the tactical information learned by the Swin Transformer from images is effectively transferred to the CLIP model, providing coaches and athletes with comprehensive tactical insights.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Overall flow chart of the model.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1275645-g0001.tif"/>
</fig>
<p>Swin Transformer, equipped with a layered attention mechanism, effectively captures local and global information from images, empowering robust feature extraction. In our system, Swin Transformer learns the opponent&#x00027;s movement postures and behavior patterns in basketball or football games. CLIP, pre-trained on a text description dataset, establishes semantic associations between images and text in a shared latent space. The CLIP model successfully maps images and texts to the same space, enabling semantic retrieval and matching. It plays a crucial role in fusing text information to further enhance the system&#x00027;s ability to understand and analyze opponent tactical information. Through cross-modal transfer learning, the opponent&#x00027;s tactical information learned from images by Swin Transformer is transmitted to the CLIP model, significantly enhancing CLIP&#x00027;s ability to understand the association between images and text. This process allows the system to more effectively analyze the opponent&#x00027;s tactical strategy and behavior patterns.</p>
<p>We integrate the trained Swin Transformer and CLIP models into an auxiliary training system for intelligent robot sports competitions. The system receives image and text data from sports competition scenes, extracting image features through the Swin Transformer and using CLIP to establish semantic associations between images and text. This integration enables the system to effectively observe and analyze opponent tactics. By conducting comprehensive analysis of image and text data, coaches and athletes gain valuable insights into the opponent&#x00027;s possible tactical strategies and behavior patterns, providing robust support for decision-making and response during competitions.</p>
<sec>
<title>3.2. Swin transformer</title>
<p>Swin Transformer is a deep learning model based on the Transformer architecture, specifically designed for image processing tasks. In contrast to the traditional Transformer model, Swin Transformer introduces a layered image processing strategy utilizing block and window methods to efficiently handle large-size images (Li and Bhanu, <xref ref-type="bibr" rid="B11">2023</xref>). This approach significantly improves calculation speed and performance, particularly when processing high-resolution images, while maintaining memory efficiency. An overview of the Swin Transformer process can be seen in <xref ref-type="fig" rid="F2">Figure 2</xref>.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Flow chart of the Swin Transformer model. <bold>(A)</bold> Architecture of Swin Transformer (Swin-T). <bold>(B)</bold> Two consecutive Swin Transformer blocks. 3D W-MSA and 3D SW-MSA are multi-head self-attention modules with regular and shifted window configurations respectively.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1275645-g0002.tif"/>
</fig>
<p>The fundamental principle of Swin Transformer lies in achieving image feature extraction and representation learning through a multi-layer Self-Attention mechanism. It divides the image into fixed-size blocks and conducts Self-Attention operations within each block to capture local image features. Subsequently, interaction between different blocks is achieved through windowing, enabling the extraction of global image features. This multi-layer process occurs within a Transformer encoder, gradually learning higher-level image representations.</p>
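The block-and-window principle described above can be sketched in a few lines of NumPy. This is an illustrative fragment, not the authors' implementation; the function name <monospace>window_partition</monospace>, the window size of 4, and the channel count of 96 are assumptions chosen for demonstration.

```python
import numpy as np

def window_partition(feature_map, window_size):
    """Split an (H, W, C) feature map into non-overlapping square windows.

    Self-attention is then computed independently inside each window, so
    the cost grows with image size far more gently than attending over
    all positions at once.
    """
    h, w, c = feature_map.shape
    assert h % window_size == 0 and w % window_size == 0
    x = feature_map.reshape(h // window_size, window_size,
                            w // window_size, window_size, c)
    # -> (num_windows, window_size * window_size, channels)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size * window_size, c)

feat = np.zeros((8, 8, 96))          # an 8x8 feature map with 96 channels
windows = window_partition(feat, 4)  # four 4x4 windows
print(windows.shape)                 # (4, 16, 96)
```

Shifting the window grid between consecutive blocks (the "shifted window" configuration in Figure 2B) then lets information flow across window boundaries.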
<p>Within the intelligent robot sports competition tactical analysis model, Swin Transformer is employed to learn the opponent&#x00027;s image tactical information, such as observing the opponent&#x00027;s movement posture and behavior patterns in basketball or football games. Leveraging its efficient and high-performance features, Swin Transformer adeptly processes a substantial amount of image data and extracts rich image features, providing robust support for observing and analyzing opponent tactics.</p>
<p>The Swin Transformer model is represented by the following equation (Chen and Mo, <xref ref-type="bibr" rid="B4">2023</xref>):
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">Multi-head Self-Attention</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext class="textrm" mathvariant="normal">Query</mml:mtext><mml:mo>,</mml:mo><mml:mtext class="textrm" mathvariant="normal">Key</mml:mtext><mml:mo>,</mml:mo><mml:mtext class="textrm" mathvariant="normal">Value</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;softmax</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mtext class="textrm" mathvariant="normal">Query</mml:mtext><mml:mo>&#x000B7;</mml:mo><mml:msup><mml:mrow><mml:mtext class="textrm" mathvariant="normal">Key</mml:mtext></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mtext>Value</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
Here, Query, Key, and Value represent the input Query, Key, and Value vectors, respectively. <italic>d</italic><sub><italic>k</italic></sub> denotes the dimension of the Query and Key vectors, and softmax refers to the softmax function. Specifically, Query serves as the query vector to find the relevant Key and Value, while Key acts as the key vector to compute the relevance score between the Query and Key vectors. The Value vector is then weighted according to the relevance score to obtain the final output.</p>
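The attention operation in Equation (1) can be sketched directly in NumPy. This is a minimal illustration of scaled dot-product attention as just described; the token count and feature dimension are arbitrary choices for the example.

```python
import numpy as np

def scaled_dot_product_attention(query, key, value):
    """softmax(Q K^T / sqrt(d_k)) V, as in Equation (1)."""
    d_k = query.shape[-1]
    scores = query @ key.T / np.sqrt(d_k)           # (N, N) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ value                          # weighted average of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 32))  # 6 tokens, d_k = 32
K = rng.standard_normal((6, 32))
V = rng.standard_normal((6, 32))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (6, 32)
```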
<p>In the Swin Transformer, the Multi-head Self-Attention is a crucial step in implementing the self-attention mechanism. It calculates the relevance score between Query and Key vectors and then uses this score to perform a weighted average of the Value vectors, yielding the final output. Through multiple layers of self-attention operations, the Swin Transformer can capture both local and global features of images, achieving efficient and accurate image feature extraction. The formula for Multi-head Self-Attention is as follows:</p>
<p>The input Query, Key, and Value are represented as <inline-formula><mml:math id="M3"><mml:mi>Q</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula><mml:math id="M4"><mml:mi>K</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, and <inline-formula><mml:math id="M5"><mml:mi>V</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, respectively, where <italic>N</italic> denotes the sequence length, and <italic>d</italic><sub><italic>q</italic></sub>, <italic>d</italic><sub><italic>k</italic></sub>, and <italic>d</italic><sub><italic>v</italic></sub> represent the feature dimensions of Query, Key, and Value. The multi-head attention mechanism maps the input Query, Key, and Value to <italic>h</italic> subspaces, where self-attention calculations are performed for each subspace. 
Assuming the dimension of each subspace is <inline-formula><mml:math id="M6"><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">head</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula>, the computation formula for Multi-head Self-Attention is as follows:
<disp-formula id="E3"><label>(2)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">Multi-head Self-Attention</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Q</mml:mi><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mi>V</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">&#x000A0;Concat</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mtext class="textrm" mathvariant="normal">head</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext class="textrm" mathvariant="normal">head</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mtext class="textrm" mathvariant="normal">head</mml:mtext></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:msup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>O</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
Here, <inline-formula><mml:math id="M9"><mml:msub><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">head</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">Attention</mml:mtext></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Q</mml:mi><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mi>V</mml:mi><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> represents the attention calculation for the <italic>i</italic>-th subspace, where <inline-formula><mml:math id="M10"><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">head</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula><mml:math 
id="M11"><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">head</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, and <inline-formula><mml:math id="M12"><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">head</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> are the weight matrices for linear mapping in the <italic>i</italic>-th subspace, and <inline-formula><mml:math id="M13"><mml:msup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>O</mml:mi></mml:mrow></mml:msup><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">head</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> represents the output mapping weight 
matrix.</p>
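A minimal sketch of the multi-head computation in Equation (2) follows. For brevity, the per-head projections W<sub>i</sub><sup>Q</sup>, W<sub>i</sub><sup>K</sup>, W<sub>i</sub><sup>V</sup> are replaced here by slices of the input and W<sup>O</sup> by an identity matrix; these are simplifying assumptions for illustration, whereas a trained model would use learned weight matrices.

```python
import numpy as np

def multi_head_self_attention(Q, K, V, h):
    """Concat(head_1, ..., head_h) · W^O, as in Equation (2).

    Illustrative only: the per-head projections are taken as slices of the
    input and W^O as the identity, rather than learned weights.
    """
    n, d_k = Q.shape
    d_head = d_k // h                   # d_head = d_k / h, as defined above
    w_o = np.eye(d_k)                   # stand-in for the learned W^O
    heads = []
    for i in range(h):
        s = slice(i * d_head, (i + 1) * d_head)
        q, k, v = Q[:, s], K[:, s], V[:, s]
        scores = q @ k.T / np.sqrt(d_head)
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        heads.append(w @ v)             # one head of Scaled Dot-Product Attention
    return np.concatenate(heads, axis=-1) @ w_o

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 32))
out = multi_head_self_attention(X, X, X, h=4)  # 4 heads of dimension 8
print(out.shape)  # (6, 32)
```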
<p>Attention(<italic>Q, K, V</italic>) denotes the standard Scaled Dot-Product Attention calculation, which is formulated as:
<disp-formula id="E5"><label>(3)</label><mml:math id="M14"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">Attention</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Q</mml:mi><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mi>V</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mtext class="textrm" mathvariant="normal">softmax</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>Q</mml:mi><mml:msup><mml:mrow><mml:mi>K</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">head</mml:mtext></mml:mrow></mml:msub></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>V</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
In the Swin Transformer, the parallel computation through the multi-head attention mechanism effectively captures both local and global features of the image, thereby improving the efficiency and accuracy of feature extraction.</p></sec>
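<p>The scaled dot-product attention of Equation (3) can be sketched in a few lines. Below is a minimal single-head NumPy illustration, not the windowed implementation actually used inside the Swin Transformer (which restricts attention to local windows and adds a relative position bias):</p>

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_head)) V, as in Eq. (3).
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)   # pairwise query-key similarities
    return softmax(scores, axis=-1) @ V  # attention-weighted sum of value rows
```

<p>Each output row is a convex combination of the rows of <italic>V</italic>; multi-head attention runs <italic>h</italic> such computations in parallel on linearly projected subspaces and concatenates the results, as in Equation (2).</p>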
<sec>
<title>3.3. CLIP model</title>
<p>CLIP (Contrastive Language-Image Pretraining; Tevet et al., <xref ref-type="bibr" rid="B34">2022</xref>) is a multimodal learning model introduced by OpenAI. Its fundamental principle is contrastive learning between images and texts, which embeds both modalities in a shared space for cross-modal semantic understanding and matching (Wang et al., <xref ref-type="bibr" rid="B37">2022</xref>). The primary objective of CLIP is to pull images and texts of the same semantic category closer together in this shared embedding space while pushing images and texts of different semantic categories farther apart. This semantic alignment allows CLIP to convert and match images and texts effectively. An overview of the CLIP process can be seen in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Flow chart of the CLIP model.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1275645-g0003.tif"/>
</fig>
<p>In the intelligent robot sports competition tactical analysis model, CLIP plays a crucial role in aligning the opponent&#x00027;s tactical information learned from images with the tactical information acquired from text, thereby enhancing the understanding and analysis of opponent tactics. Leveraging CLIP, intelligent robots can achieve semantic matching between images and texts, enabling inference of opponents&#x00027; likely tactical strategies and behavior patterns. Consequently, coaches and athletes receive richer and more accurate tactical decision support. The formulation of CLIP is as follows:</p>
<p>The input image feature is represented by <italic>I</italic> &#x02208; &#x0211D;<sup><italic>N</italic>&#x000D7;<italic>d</italic></sup>, where <italic>N</italic> denotes the number of images, and <italic>d</italic> represents the dimension of the image feature. Similarly, the input text features are denoted by <italic>T</italic> &#x02208; &#x0211D;<sup><italic>M</italic>&#x000D7;<italic>d</italic></sup>, where <italic>M</italic> signifies the number of texts, and <italic>d</italic> indicates the dimension of the text features. CLIP aims to minimize the contrast loss between images and texts, facilitating the proximity of images and texts from the same semantic category in the embedding space, and ensuring a larger distance between images and texts from different semantic categories.</p>
<p>For image and text features, CLIP employs a standard contrastive loss function as follows (Shen S. et al., <xref ref-type="bibr" rid="B29">2021</xref>):
<disp-formula id="E6"><label>(4)</label><mml:math id="M15"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">CLIP</mml:mtext></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mo class="qopname">log</mml:mo><mml:mfrac><mml:mrow><mml:mo class="qopname">exp</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mstyle displaystyle="true"><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:mo class="qopname">exp</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" 
accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mo class="qopname">log</mml:mo><mml:mfrac><mml:mrow><mml:mo class="qopname">exp</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mstyle displaystyle="true"><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup></mml:mstyle><mml:mo class="qopname">exp</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
Here, <inline-formula><mml:math id="M16"><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:math></inline-formula> represents the cosine similarity between the image feature <italic>I</italic><sub><italic>i</italic></sub> and the text feature <italic>T</italic><sub><italic>j</italic></sub>.</p>
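<p>For a paired batch (<italic>N</italic> = <italic>M</italic>, with the <italic>i</italic>-th image matching the <italic>i</italic>-th text), the symmetric loss of Equation (4) can be sketched as follows; the function name and batch layout are illustrative, not taken from the original implementation:</p>

```python
import numpy as np

def clip_contrastive_loss(I, T):
    # L2-normalize rows so that s_ij = I_i . T_j / (||I_i||_2 ||T_j||_2).
    I = I / np.linalg.norm(I, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    S = I @ T.T  # cosine similarity matrix, S[i, j] = s_ij
    # Image-to-text term: -1/N sum_i log( exp(s_ii) / sum_j exp(s_ij) )
    img2txt = -np.mean(np.log(np.exp(np.diag(S)) / np.exp(S).sum(axis=1)))
    # Text-to-image term: -1/M sum_j log( exp(s_jj) / sum_i exp(s_ij) )
    txt2img = -np.mean(np.log(np.exp(np.diag(S)) / np.exp(S).sum(axis=0)))
    return img2txt + txt2img
```

<p>Each term is a cross-entropy over the similarity rows (or columns), so matched pairs on the diagonal are pulled together while mismatched pairs are pushed apart.</p>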
<p>CLIP minimizes the contrastive loss function to optimize the semantic matching between images and texts. This process ensures a reasonable distribution of distances between them in the shared embedding space, facilitating semantic alignment, and matching of multimodal information.</p></sec>
<sec>
<title>3.4. Cross-modal transfer learning</title>
<p>Cross-modal transfer learning is a multi-modal transfer learning method that transfers knowledge learned in one modality, such as an image-based model, to another modality through a shared representation, thereby improving the modeling capability and performance of the target task (Zhen et al., <xref ref-type="bibr" rid="B46">2020</xref>). An overview of the cross-modal transfer learning process can be seen in <xref ref-type="fig" rid="F4">Figure 4</xref>.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Flow chart of the Cross-Transfer Learning model.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1275645-g0004.tif"/>
</fig>
<p>The image features are denoted as <inline-formula><mml:math id="M17"><mml:mi>I</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, where <italic>N</italic> is the number of images, and <italic>d</italic><sub>1</sub> represents the image feature dimension. The text features are represented by <inline-formula><mml:math id="M18"><mml:mi>T</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, where <italic>M</italic> is the number of texts, and <italic>d</italic><sub>2</sub> signifies the text feature dimension. Since image and text features can be represented in a shared embedding space, a linear projection matrix <inline-formula><mml:math id="M19"><mml:mi>W</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> is used to map the text features into the image feature space; feature transfer is then achieved by minimizing the distance between the image features and the projected text features.</p>
<p>The Cross-modal transfer learning loss function is defined as follows:
<disp-formula id="E7"><label>(5)</label><mml:math id="M20"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">cross-transfer</mml:mtext></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mi>W</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msubsup><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
Here, <italic>I</italic><sub><italic>i</italic></sub> represents the feature of the <italic>i</italic>-th image, <italic>T</italic><sub><italic>ij</italic></sub> represents the text feature paired with the <italic>i</italic>-th image, and <italic>W</italic> is the linear mapping matrix to be learned.</p>
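<p>Equation (5) is an ordinary least-squares objective over paired features. The sketch below computes the loss and, as an illustration, recovers its closed-form minimizer (in practice <italic>W</italic> is learned by gradient descent; the function names are illustrative):</p>

```python
import numpy as np

def cross_transfer_loss(I, T, W):
    # L_cross-transfer = 1/N * sum_i || I_i - W . T_i ||_2^2, as in Eq. (5),
    # with W of shape (d1, d2) mapping text features into the image space.
    residual = I - T @ W.T  # one residual row per image-text pair
    return np.mean(np.sum(residual**2, axis=1))

def fit_projection(I, T):
    # Closed-form least-squares minimizer: solve T W^T ~= I.
    Wt, *_ = np.linalg.lstsq(T, I, rcond=None)
    return Wt.T
```
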
<p>By minimizing the Cross-modal transfer learning loss function, feature transfer from images to texts is achieved, enhancing the system&#x00027;s semantic understanding of the relationship between images and texts. Consequently, this improvement enhances the observation and analysis capabilities of the intelligent robot sports assistant training system regarding opponent tactics.</p></sec></sec>
<sec id="s4">
<title>4. Experiment</title>
<sec>
<title>4.1. Datasets</title>
<p>This section provides an overview of the datasets used in the cross-modal transfer learning algorithm, along with details of their preprocessing.</p>
<p>The Sports-1M dataset (Carreira and Zisserman, <xref ref-type="bibr" rid="B1">2017</xref>) is an extensive collection of sports video clips covering a wide array of sports disciplines, including basketball, football, soccer, and tennis. It primarily supports tasks related to sports action recognition, behavior analysis, and tactical comprehension. Each video clip encompasses diverse sports actions, such as passing, shooting, running, and defending. Preliminary data preprocessing involved selecting relevant clips and standardizing resolution and format. The Sports-1M dataset serves as a foundational resource for the cross-modal transfer learning algorithm, offering a diverse range of sports scenarios and annotated actions for model training and evaluation.</p>
<p>The SportVU dataset (Korbar et al., <xref ref-type="bibr" rid="B10">2019</xref>) is an extensive sports tracking dataset that leverages high-resolution cameras and multi-sensor tracking technology. Although it covers various sports, the primary focus centers on basketball due to its comprehensive representation within the dataset. In preparation for the cross-modal transfer learning algorithm, detailed information pertaining to player positions, movements, trajectories, velocities, accelerations, and more was extracted from the raw data. This involved meticulous data alignment and synchronization procedures. The dataset&#x00027;s high spatial and temporal resolution empowers fine-grained analyses of player actions, tactical patterns, and team strategies, making it a valuable asset for the cross-modal transfer learning approach.</p>
<p>The NPU RGB&#x0002B;D dataset (Yang et al., <xref ref-type="bibr" rid="B42">2019</xref>) is a unique multi-modal dataset that combines RGB (color) and depth information for sports action analysis across various sports, including basketball and football. Data preparation steps encompassed the synchronization of RGB videos with corresponding depth maps, ensuring temporal alignment. The incorporation of depth data within the dataset enhances the accuracy and robustness of action recognition algorithms. The NPU RGB&#x0002B;D dataset plays a pivotal role in the study, enabling exploration of the potential of depth-based features within the domain of sports-related tasks.</p></sec>
<sec>
<title>4.2. Experimental details</title>
<p>In this paper, the datasets described above are used for training, and the training process is as follows:</p>
<p><bold>Step 1:</bold> Data preprocessing</p>
<p>In the sports competition video dataset, the presence of noisy, missing, or inconsistent data is common. Data cleaning is essential to address these issues and ensure the overall quality and consistency of the dataset. This involves deduplicating records, handling missing values, and correcting data errors, among other tasks. Additionally, the data is formatted into a standardized structure to facilitate subsequent processing and model training.</p>
<p>Sports competition video data typically encompass a wealth of both image and text information. During the data preprocessing stage, relevant features must be extracted from the raw data and preprocessed to meet the model&#x00027;s input requirements. For image data, techniques such as image enhancement, cropping, and scaling are employed to derive valuable image features. Similarly, text information undergoes processing steps such as text cleaning, word segmentation, and encoding to facilitate subsequent cross-modal transfer learning and model training.</p>
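<p>The text side of this preprocessing (text cleaning, word segmentation, and encoding) can be sketched as below; the vocabulary scheme is a simplified stand-in for whatever tokenizer is actually used in the pipeline:</p>

```python
import re

def clean_and_encode(texts, vocab=None):
    # Cleaning: lowercase and strip punctuation; segmentation: whitespace split.
    tokenized = [re.sub(r"[^\w\s]", " ", t.lower()).split() for t in texts]
    # Build an integer vocabulary on first use; id 0 is reserved for unknown words.
    if vocab is None:
        words = sorted({w for toks in tokenized for w in toks})
        vocab = {w: i + 1 for i, w in enumerate(words)}
    # Encoding: map every token to its integer id.
    return [[vocab.get(w, 0) for w in toks] for toks in tokenized], vocab
```

<p>Returning the vocabulary alongside the encoded sequences lets the same mapping be reused at evaluation time, so identical words receive identical ids across splits.</p>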
<p>By addressing these two aspects of data preprocessing, we can ensure data quality and availability, providing well-suited inputs for subsequent model training and evaluation. This proves to be pivotal in constructing an effective auxiliary training system for intelligent robot sports competitions.</p>
<p><bold>Step 2:</bold> Model training</p>
<p>Upon defining the architecture of the combined model, we proceed with the model training process. This comprehensive procedure involves loading the pre-trained parameters of the Swin Transformer and CLIP modules, as well as performing cross-modal transfer learning.</p>
<p>Initially, the pre-trained parameters of both the Swin Transformer and CLIP modules are loaded. These models have undergone training on large-scale datasets, acquiring rich representations from images and text. Subsequently, we meticulously prepare the training dataset, encompassing both image and text data, and ensure proper formatting and pre-processing before feeding it into the model. In this crucial step, we execute cross-modal transfer learning, unifying knowledge from the Swin Transformer and CLIP modules. Specifically, the Swin Transformer processes image data, while the CLIP module processes text data. The outputs from both modules are then fused and mapped, generating a joint representation that amalgamates image and text information.</p>
<p>Once the cross-modal transfer learning is accomplished, we compile the combined model with an appropriate loss function, optimizer, and evaluation metrics. The loss function serves as a guide, minimizing the discrepancy between predicted outputs and ground truth labels during training. The training process utilizes the prepared dataset to train the combined model. During this phase, the data flows through the Swin Transformer and CLIP modules, as well as the cross-modal transfer learning module. The resulting outputs are then combined and forwarded to the output layer for prediction. Finally, upon completing the training process, the trained combined model is saved to disk for subsequent use in sports competition assistance and analysis.</p>
<p><bold>Step 3:</bold> Model evaluation</p>
<p>After completing the model training, the next crucial step is the comprehensive evaluation of the model&#x00027;s performance. We employ a range of metrics to assess the accuracy and stability of the model&#x00027;s predictions. The key evaluation metrics include Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE %), Root Mean Squared Error (RMSE), and Mean Squared Error (MSE). These metrics enable us to quantify the prediction errors and provide valuable insights into the model&#x00027;s predictive capabilities.</p>
<p>In addition to the above metrics, we also measure the model&#x00027;s training time, which denotes the duration required to train the model on the training dataset. Furthermore, we evaluate the inference time, which represents the time taken by the model to make predictions on new data or perform inference tasks. These time measurements offer valuable information about the model&#x00027;s efficiency in real-time applications. Moreover, we assess the model&#x00027;s parameter count, which indicates the number of learnable parameters in the model. A lower parameter count suggests a more compact and potentially more interpretable model. Lastly, we analyze the computational complexity, which gives us insights into the amount of computational resources required during both model training and inference. Lower computational complexity signifies higher efficiency and scalability, making the model more feasible for practical deployment.</p>
<p>By conducting a comprehensive evaluation with a diverse set of metrics, we gain a thorough understanding of the model&#x00027;s performance, robustness, and efficiency, enabling us to make informed decisions for sports competition assistance and analysis applications.</p>
<p><bold>Step 4:</bold> Result analysis</p>
<p>The experiments encompassed a comparison of different models, including Swin Transformer, CLIP, and the cross-modal transfer learning model. Several evaluation indicators were employed to assess the models&#x00027; performance, namely, Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE %), Root Mean Squared Error (RMSE), and Mean Squared Error (MSE).</p>
<p>The meaning and formulas of these evaluation indicators are as follows:</p>
<p>1. MAE (Mean Absolute Error):
<disp-formula id="E8"><label>(6)</label><mml:math id="M21"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>M</mml:mi><mml:mi>A</mml:mi><mml:mi>E</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>|</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
MAE measures the average absolute difference between the predicted values (<inline-formula><mml:math id="M22"><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula>) and the true values (<italic>y</italic><sub><italic>i</italic></sub>). It evaluates the model&#x00027;s prediction accuracy.</p>
<p>2. MAPE (%) (Mean Absolute Percentage Error):
<disp-formula id="E9"><label>(7)</label><mml:math id="M23"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>M</mml:mi><mml:mi>A</mml:mi><mml:mi>P</mml:mi><mml:mi>E</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mo>|</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo>|</mml:mo><mml:mo>&#x000D7;</mml:mo><mml:mn>100</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
MAPE calculates the average absolute percentage error between the predicted values and the true values. It assesses the model&#x00027;s relative accuracy.</p>
<p>3. RMSE (Root Mean Squared Error):
<disp-formula id="E10"><label>(8)</label><mml:math id="M24"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>R</mml:mi><mml:mi>M</mml:mi><mml:mi>S</mml:mi><mml:mi>E</mml:mi><mml:mo>=</mml:mo><mml:msqrt><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:msqrt></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
RMSE is the square root of the average squared difference between the predicted values and the true values. Because errors are squared before averaging, RMSE penalizes large errors more heavily than MAE, reflecting the model&#x00027;s prediction stability.</p>
<p>4. MSE (Mean Squared Error):
<disp-formula id="E11"><label>(9)</label><mml:math id="M25"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>M</mml:mi><mml:mi>S</mml:mi><mml:mi>E</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
MSE calculates the average of the squared differences between the predicted values and the true values. It provides insights into the model&#x00027;s prediction accuracy and stability.</p>
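<p>The four indicators of Equations (6)&#x02013;(9) can be computed together. A small helper, assuming no zero-valued targets (MAPE is undefined there):</p>

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    # MAE, MAPE (%), RMSE, and MSE as defined in Eqs. (6)-(9).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err**2)
    return {
        "MAE": np.mean(np.abs(err)),
        "MAPE": np.mean(np.abs(err / y_true)) * 100,  # undefined if y_true has zeros
        "RMSE": np.sqrt(mse),
        "MSE": mse,
    }
```
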
<p>The impact of this research is significant for the application of intelligent robot sports competition assistant training systems. By effectively leveraging multi-modal perception, the proposed models have the potential to improve tactical analysis, behavior recognition, and overall performance assessment in various sports competitions such as basketball and football. The fusion of visual and textual information enhances the models&#x00027; ability to understand opponents&#x00027; tactics and strategies, providing valuable support for coaches, athletes, and analysts in their decision-making processes.</p>
<p><xref ref-type="table" rid="T5">Algorithm 1</xref> represents the algorithm flow of the training in this paper:</p>
<table-wrap position="float" id="T5">
<label>Algorithm 1</label>
<caption><p>SC-transfer net training.</p></caption>
<graphic xlink:href="fnbot-17-1275645-i0001.tif"/>
</table-wrap>
</sec>
<sec>
<title>4.3. Experimental results and analysis</title>
<p>This study aims to investigate an intelligent robot sports competition tactical analysis model. By integrating the Swin Transformer and CLIP models through cross-modal transfer learning, we can enhance the system&#x00027;s ability to analyze tactics and predict opponents&#x00027; behaviors in sports competitions. The experiment utilizes multiple datasets, including the SportVU, Sports-1M, and NPU RGB&#x0002B;D datasets, to compare and analyze the performance of different models. The comparison metrics include the number of model parameters, floating-point operations (FLOPs), inference time, and training time.</p>
<p>As shown in <xref ref-type="table" rid="T1">Table 1</xref>, our proposed method performs exceptionally well across multiple metrics. Compared to the other methods, our model significantly reduces the number of model parameters, FLOPs, and inference time, indicating its advantages in complexity and computational efficiency. Moreover, our method exhibits a fast training time, a critical factor for rapid retraining and real-time applications.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Comparison of different metrics for different models.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center" colspan="12"><bold>Dataset</bold></th>
</tr>
</thead>
<tbody>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<td/>
<td valign="top" align="center" colspan="4"><bold>SportVU dataset</bold> (Korbar et al., <xref ref-type="bibr" rid="B10">2019</xref>)</td>
<td valign="top" align="center" colspan="4"><bold>Sports-1M dataset</bold> (Carreira and Zisserman, <xref ref-type="bibr" rid="B1">2017</xref>)</td>
<td valign="top" align="center" colspan="4"><bold>NPU RGB&#x0002B;D dataset</bold> (Yang et al., <xref ref-type="bibr" rid="B42">2019</xref>)</td>
</tr>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<td/>
<td valign="top" align="center"><bold>Parameters (M)</bold></td>
<td valign="top" align="center"><bold>Flops (G)</bold></td>
<td valign="top" align="center"><bold>Inference time (ms)</bold></td>
<td valign="top" align="center"><bold>Training time (s)</bold></td>
<td valign="top" align="center"><bold>Parameters (M)</bold></td>
<td valign="top" align="center"><bold>Flops (G)</bold></td>
<td valign="top" align="center"><bold>Inference time (ms)</bold></td>
<td valign="top" align="center"><bold>Training time (s)</bold></td>
<td valign="top" align="center"><bold>Parameters (M)</bold></td>
<td valign="top" align="center"><bold>Flops (G)</bold></td>
<td valign="top" align="center"><bold>Inference time (ms)</bold></td>
<td valign="top" align="center"><bold>Training time (s)</bold></td>
</tr> <tr>
<td valign="top" align="left">Tang et al. (<xref ref-type="bibr" rid="B32">2023</xref>)</td>
<td valign="top" align="center">261.59</td>
<td valign="top" align="center">369.99</td>
<td valign="top" align="center">235.75</td>
<td valign="top" align="center">400.23</td>
<td valign="top" align="center">286.6</td>
<td valign="top" align="center">258.05</td>
<td valign="top" align="center">351.41</td>
<td valign="top" align="center">269.71</td>
<td valign="top" align="center">275.22</td>
<td valign="top" align="center">307.57</td>
<td valign="top" align="center">255.67</td>
<td valign="top" align="center">295.99</td>
</tr> <tr>
<td valign="top" align="left">Shen Z. et al. (<xref ref-type="bibr" rid="B28">2021</xref>)</td>
<td valign="top" align="center">325.76</td>
<td valign="top" align="center">228.23</td>
<td valign="top" align="center">375.04</td>
<td valign="top" align="center">252.45</td>
<td valign="top" align="center">297.53</td>
<td valign="top" align="center">356.96</td>
<td valign="top" align="center">380.21</td>
<td valign="top" align="center">229.22</td>
<td valign="top" align="center">223.92</td>
<td valign="top" align="center">267.42</td>
<td valign="top" align="center">230.64</td>
<td valign="top" align="center">312.63</td>
</tr> <tr>
<td valign="top" align="left">Vajsbaher et al. (<xref ref-type="bibr" rid="B35">2020</xref>)</td>
<td valign="top" align="center">388.3</td>
<td valign="top" align="center">395.13</td>
<td valign="top" align="center">294.53</td>
<td valign="top" align="center">357.9</td>
<td valign="top" align="center">276.66</td>
<td valign="top" align="center">370.87</td>
<td valign="top" align="center">223.4</td>
<td valign="top" align="center">257.63</td>
<td valign="top" align="center">216.01</td>
<td valign="top" align="center">296.76</td>
<td valign="top" align="center">265.99</td>
<td valign="top" align="center">217.68</td>
</tr> <tr>
<td valign="top" align="left">Liu Y. et al. (<xref ref-type="bibr" rid="B15">2022</xref>)</td>
<td valign="top" align="center">224.04</td>
<td valign="top" align="center">326.75</td>
<td valign="top" align="center">354.66</td>
<td valign="top" align="center">202.41</td>
<td valign="top" align="center">276.1</td>
<td valign="top" align="center">311.78</td>
<td valign="top" align="center">348.7</td>
<td valign="top" align="center">239.71</td>
<td valign="top" align="center">281.24</td>
<td valign="top" align="center">279.27</td>
<td valign="top" align="center">214.39</td>
<td valign="top" align="center">376.94</td>
</tr> <tr>
<td valign="top" align="left">Tao et al. (<xref ref-type="bibr" rid="B33">2020</xref>)</td>
<td valign="top" align="center">221.55</td>
<td valign="top" align="center">394.95</td>
<td valign="top" align="center">211.96</td>
<td valign="top" align="center">331.1</td>
<td valign="top" align="center">319.51</td>
<td valign="top" align="center">304.83</td>
<td valign="top" align="center">377.45</td>
<td valign="top" align="center">392.73</td>
<td valign="top" align="center">301.11</td>
<td valign="top" align="center">313.94</td>
<td valign="top" align="center">224.79</td>
<td valign="top" align="center">236.93</td>
</tr> <tr>
<td valign="top" align="left">Ji et al. (<xref ref-type="bibr" rid="B7">2019</xref>)</td>
<td valign="top" align="center">218.36</td>
<td valign="top" align="center">347.66</td>
<td valign="top" align="center">310.89</td>
<td valign="top" align="center">260.44</td>
<td valign="top" align="center">242.79</td>
<td valign="top" align="center">382.54</td>
<td valign="top" align="center">346.76</td>
<td valign="top" align="center">298.68</td>
<td valign="top" align="center">223.35</td>
<td valign="top" align="center">253.05</td>
<td valign="top" align="center">281.21</td>
<td valign="top" align="center">259.92</td>
</tr> <tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="center">145.5</td>
<td valign="top" align="center">189.94</td>
<td valign="top" align="center">145.99</td>
<td valign="top" align="center">111.25</td>
<td valign="top" align="center">212.24</td>
<td valign="top" align="center">180.03</td>
<td valign="top" align="center">138.05</td>
<td valign="top" align="center">173.19</td>
<td valign="top" align="center">139.67</td>
<td valign="top" align="center">230.96</td>
<td valign="top" align="center">212.47</td>
<td valign="top" align="center">231.15</td>
</tr></tbody>
</table>
</table-wrap>
<p>The visualization results of <xref ref-type="table" rid="T1">Table 1</xref> are shown in <xref ref-type="fig" rid="F5">Figure 5</xref>. Among the comparison methods, the method of Tang et al. (<xref ref-type="bibr" rid="B32">2023</xref>) excels on the Sports-1M dataset, with a low parameter count and short inference time, but performs comparatively worse on the other datasets. The approach of Shen Z. et al. (<xref ref-type="bibr" rid="B28">2021</xref>) performs better on the SportVU dataset but falls short of our method on the other datasets. The method of Vajsbaher et al. (<xref ref-type="bibr" rid="B35">2020</xref>) shows promise on the NPU RGB&#x0002B;D dataset, but its training time is longer. The method of Liu Y. et al. (<xref ref-type="bibr" rid="B15">2022</xref>) performs well in FLOPs but lags on the other metrics. The method of Tao et al. (<xref ref-type="bibr" rid="B33">2020</xref>) performs well on the Sports-1M and NPU RGB&#x0002B;D datasets but struggles on the others. Similarly, the method of Ji et al. (<xref ref-type="bibr" rid="B7">2019</xref>) delivers strong results on the Sports-1M dataset but is mediocre elsewhere.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Comparison of different metrics for different models.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1275645-g0005.tif"/>
</fig>
<p>Based on the comparative results and experimental principles, our method leverages cross-modal transfer learning to combine the Swin Transformer and CLIP models, enabling joint inference of the opponent&#x00027;s tactical strategy and behavior patterns from image and text information. Our model demonstrates robust performance across multiple datasets, indicating its versatility and scalability for sports competition tactical analysis.</p>
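<p>The cross-modal matching idea described above can be sketched in a few lines: score a visual embedding against text embeddings of candidate tactic descriptions and select the best match, in the spirit of CLIP-style zero-shot classification. The sketch below is a minimal pure-Python illustration only; the 4-dimensional embeddings and tactic labels are hypothetical, whereas real Swin Transformer and CLIP features are high-dimensional learned vectors.</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_tactics(image_embedding, text_embeddings):
    """Rank candidate tactic descriptions by their similarity to the
    visual embedding, mimicking CLIP-style zero-shot matching."""
    scores = {label: cosine_similarity(image_embedding, emb)
              for label, emb in text_embeddings.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical low-dimensional embeddings, for illustration only.
image_emb = [0.9, 0.1, 0.2, 0.4]
tactic_embs = {
    "fast break": [0.8, 0.2, 0.1, 0.5],
    "zone defense": [0.1, 0.9, 0.7, 0.0],
}
ranking = rank_tactics(image_emb, tactic_embs)
print(ranking[0][0])  # best-matching tactic label
```

<p>In the full model, both embedding spaces are aligned during cross-modal transfer learning, so the same similarity ranking can be read as a tactical-strategy prediction.</p>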
<p>This experiment aims to investigate the auxiliary effect of the intelligent robot sports competition training system by conducting a comprehensive comparison of different models on multiple datasets. The evaluation is based on several key metrics, including Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Root Mean Square Error (RMSE), and Mean Square Error (MSE), to assess the model&#x00027;s performance across diverse datasets.</p>
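<p>For reference, the four error metrics used throughout this evaluation can be computed as follows; this is a standard textbook sketch with illustrative toy values, not the article&#x00027;s evaluation code.</p>

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent (assumes no zero targets)."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Square Error: average of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of MSE, in the target's units."""
    return math.sqrt(mse(y_true, y_pred))

# Toy targets and predictions, for illustration only.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]
print(round(mae(y_true, y_pred), 3), round(rmse(y_true, y_pred), 3))
```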
<p>The experimental results, presented in <xref ref-type="table" rid="T2">Table 2</xref>, demonstrate that our proposed method excels in all metrics, showing both high accuracy and stability. Compared with the alternative methods, our model achieves lower MAE, MAPE, RMSE, and MSE values, indicating a superior ability to predict sports competition outcomes with enhanced reasoning and predictive capabilities. Furthermore, our model consistently maintains low error levels across the datasets, confirming its robustness and adaptability to different scenarios. The visualization results of <xref ref-type="table" rid="T2">Table 2</xref> are shown in <xref ref-type="fig" rid="F6">Figure 6</xref>.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Comparison of different metrics for different models.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center" colspan="12"><bold>Datasets</bold></th>
</tr>
</thead>
<tbody>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<td/>
<td valign="top" align="center" colspan="4"><bold>SportVU dataset</bold></td>
<td valign="top" align="center" colspan="4"><bold>Sports-1M dataset</bold></td>
<td valign="top" align="center" colspan="4"><bold>NPU RGB&#x0002B;D dataset</bold></td>
</tr>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<td/>
<td valign="top" align="center"><bold>MAE</bold></td>
<td valign="top" align="center"><bold>MAPE (%)</bold></td>
<td valign="top" align="center"><bold>RMSE</bold></td>
<td valign="top" align="center"><bold>MSE</bold></td>
<td valign="top" align="center"><bold>MAE</bold></td>
<td valign="top" align="center"><bold>MAPE (%)</bold></td>
<td valign="top" align="center"><bold>RMSE</bold></td>
<td valign="top" align="center"><bold>MSE</bold></td>
<td valign="top" align="center"><bold>MAE</bold></td>
<td valign="top" align="center"><bold>MAPE (%)</bold></td>
<td valign="top" align="center"><bold>RMSE</bold></td>
<td valign="top" align="center"><bold>MSE</bold></td>
</tr> <tr>
<td valign="top" align="left">Tang et al. (<xref ref-type="bibr" rid="B32">2023</xref>)</td>
<td valign="top" align="center">34.16</td>
<td valign="top" align="center">10.18</td>
<td valign="top" align="center">7.11</td>
<td valign="top" align="center">24.79</td>
<td valign="top" align="center">41.38</td>
<td valign="top" align="center">10.44</td>
<td valign="top" align="center">4.89</td>
<td valign="top" align="center">29.09</td>
<td valign="top" align="center">37.13</td>
<td valign="top" align="center">13.7</td>
<td valign="top" align="center">4.4</td>
<td valign="top" align="center">13.84</td>
</tr> <tr>
<td valign="top" align="left">Shen Z. et al. (<xref ref-type="bibr" rid="B28">2021</xref>)</td>
<td valign="top" align="center">45.89</td>
<td valign="top" align="center">11.82</td>
<td valign="top" align="center">4.91</td>
<td valign="top" align="center">18.64</td>
<td valign="top" align="center">47.84</td>
<td valign="top" align="center">9.14</td>
<td valign="top" align="center">7.55</td>
<td valign="top" align="center">18.01</td>
<td valign="top" align="center">35.44</td>
<td valign="top" align="center">10.94</td>
<td valign="top" align="center">5.25</td>
<td valign="top" align="center">13.78</td>
</tr> <tr>
<td valign="top" align="left">Vajsbaher et al. (<xref ref-type="bibr" rid="B35">2020</xref>)</td>
<td valign="top" align="center">40.78</td>
<td valign="top" align="center">13.28</td>
<td valign="top" align="center">7.68</td>
<td valign="top" align="center">29.71</td>
<td valign="top" align="center">46.51</td>
<td valign="top" align="center">13.37</td>
<td valign="top" align="center">5.15</td>
<td valign="top" align="center">14.83</td>
<td valign="top" align="center">25.83</td>
<td valign="top" align="center">8.58</td>
<td valign="top" align="center">8.27</td>
<td valign="top" align="center">17.32</td>
</tr> <tr>
<td valign="top" align="left">Liu Y. et al. (<xref ref-type="bibr" rid="B15">2022</xref>)</td>
<td valign="top" align="center">35.19</td>
<td valign="top" align="center">10.2</td>
<td valign="top" align="center">5.86</td>
<td valign="top" align="center">14.76</td>
<td valign="top" align="center">24.37</td>
<td valign="top" align="center">11.14</td>
<td valign="top" align="center">6.33</td>
<td valign="top" align="center">22.32</td>
<td valign="top" align="center">34.75</td>
<td valign="top" align="center">10.33</td>
<td valign="top" align="center">6.19</td>
<td valign="top" align="center">22.47</td>
</tr> <tr>
<td valign="top" align="left">Tao et al. (<xref ref-type="bibr" rid="B33">2020</xref>)</td>
<td valign="top" align="center">27.57</td>
<td valign="top" align="center">9.42</td>
<td valign="top" align="center">5.76</td>
<td valign="top" align="center">24.04</td>
<td valign="top" align="center">38.53</td>
<td valign="top" align="center">9.39</td>
<td valign="top" align="center">6.73</td>
<td valign="top" align="center">23.63</td>
<td valign="top" align="center">20.78</td>
<td valign="top" align="center">9.07</td>
<td valign="top" align="center">5.53</td>
<td valign="top" align="center">29.93</td>
</tr> <tr>
<td valign="top" align="left">Ji et al. (<xref ref-type="bibr" rid="B7">2019</xref>)</td>
<td valign="top" align="center">21.27</td>
<td valign="top" align="center">8.89</td>
<td valign="top" align="center">6.54</td>
<td valign="top" align="center">17.17</td>
<td valign="top" align="center">34.74</td>
<td valign="top" align="center">13.91</td>
<td valign="top" align="center">7.22</td>
<td valign="top" align="center">26.78</td>
<td valign="top" align="center">27.95</td>
<td valign="top" align="center">9.58</td>
<td valign="top" align="center">4.8</td>
<td valign="top" align="center">27.43</td>
</tr> <tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="center">12.8</td>
<td valign="top" align="center">7.04</td>
<td valign="top" align="center">4.13</td>
<td valign="top" align="center">11.23</td>
<td valign="top" align="center">14.92</td>
<td valign="top" align="center">7.35</td>
<td valign="top" align="center">5.42</td>
<td valign="top" align="center">11.58</td>
<td valign="top" align="center">14.1</td>
<td valign="top" align="center">8.37</td>
<td valign="top" align="center">3.47</td>
<td valign="top" align="center">8.57</td>
</tr></tbody>
</table>
</table-wrap>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Comparison of different metrics for different models.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1275645-g0006.tif"/>
</fig>
<p>Notably, our method exhibits exceptional performance on the Sports-1M dataset, a large-scale sports video dataset featuring complex scenes and diverse actions, where our model achieves the lowest error values. This success underscores our model&#x00027;s adaptability and generalization capabilities in challenging sports competition scenes.</p>
<p>Additionally, our method not only excels in accuracy but also achieves significant advantages in computational efficiency. Compared to the other methods, our model requires fewer parameters, fewer FLOPs, and less inference time, delivering efficient performance in computationally demanding environments. This efficiency translates into robust support for rapid training and real-time applications in practical sports competition scenarios.</p>
<p>As shown in <xref ref-type="table" rid="T3">Table 3</xref> and <xref ref-type="fig" rid="F7">Figure 7</xref>, we investigated the impact of the Swin-transformer module on the performance of the intelligent robot sports competition training system. Through comprehensive evaluations on multiple datasets, we compared the proposed module with other models using the MAE, MAPE, RMSE, and MSE metrics.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Comparison of different metrics for different models.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center" colspan="12"><bold>Datasets</bold></th>
</tr>
</thead>
<tbody>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<td/>
<td valign="top" align="center" colspan="4"><bold>SportVU dataset</bold></td>
<td valign="top" align="center" colspan="4"><bold>Sports-1M dataset</bold></td>
<td valign="top" align="center" colspan="4"><bold>NPU RGB&#x0002B;D dataset</bold></td>
</tr>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<td/>
<td valign="top" align="center"><bold>MAE</bold></td>
<td valign="top" align="center"><bold>MAPE (%)</bold></td>
<td valign="top" align="center"><bold>RMSE</bold></td>
<td valign="top" align="center"><bold>MSE</bold></td>
<td valign="top" align="center"><bold>MAE</bold></td>
<td valign="top" align="center"><bold>MAPE (%)</bold></td>
<td valign="top" align="center"><bold>RMSE</bold></td>
<td valign="top" align="center"><bold>MSE</bold></td>
<td valign="top" align="center"><bold>MAE</bold></td>
<td valign="top" align="center"><bold>MAPE (%)</bold></td>
<td valign="top" align="center"><bold>RMSE</bold></td>
<td valign="top" align="center"><bold>MSE</bold></td>
</tr> <tr>
<td valign="top" align="left">ViT (Khan et al., <xref ref-type="bibr" rid="B8">2022</xref>)</td>
<td valign="top" align="center">31.61</td>
<td valign="top" align="center">11.24</td>
<td valign="top" align="center">6.14</td>
<td valign="top" align="center">13.62</td>
<td valign="top" align="center">46.93</td>
<td valign="top" align="center">11.41</td>
<td valign="top" align="center">6.89</td>
<td valign="top" align="center">16.03</td>
<td valign="top" align="center">22.85</td>
<td valign="top" align="center">9.29</td>
<td valign="top" align="center">4.7</td>
<td valign="top" align="center">14.34</td>
</tr> <tr>
<td valign="top" align="left">EfficientNet (Tan et al., <xref ref-type="bibr" rid="B31">2020</xref>)</td>
<td valign="top" align="center">37.24</td>
<td valign="top" align="center">10.47</td>
<td valign="top" align="center">8</td>
<td valign="top" align="center">28.73</td>
<td valign="top" align="center">37.49</td>
<td valign="top" align="center">12.75</td>
<td valign="top" align="center">7.22</td>
<td valign="top" align="center">27.74</td>
<td valign="top" align="center">42.05</td>
<td valign="top" align="center">9.65</td>
<td valign="top" align="center">7.26</td>
<td valign="top" align="center">15.62</td>
</tr> <tr>
<td valign="top" align="left">ResNet50 (Shao et al., <xref ref-type="bibr" rid="B27">2019</xref>)</td>
<td valign="top" align="center">25.61</td>
<td valign="top" align="center">10.31</td>
<td valign="top" align="center">5.21</td>
<td valign="top" align="center">14.94</td>
<td valign="top" align="center">30.07</td>
<td valign="top" align="center">9.59</td>
<td valign="top" align="center">5.58</td>
<td valign="top" align="center">29.59</td>
<td valign="top" align="center">46.87</td>
<td valign="top" align="center">12.75</td>
<td valign="top" align="center">4.41</td>
<td valign="top" align="center">22.97</td>
</tr>
<tr>
<td valign="top" align="left">Swin-transformer (Liu et al., <xref ref-type="bibr" rid="B17">2021</xref>)</td>
<td valign="top" align="center">12.05</td>
<td valign="top" align="center">6.06</td>
<td valign="top" align="center">4.17</td>
<td valign="top" align="center">8.8</td>
<td valign="top" align="center">13.79</td>
<td valign="top" align="center">8.44</td>
<td valign="top" align="center">4.15</td>
<td valign="top" align="center">7.27</td>
<td valign="top" align="center">17.76</td>
<td valign="top" align="center">8.39</td>
<td valign="top" align="center">4.34</td>
<td valign="top" align="center">8.73</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Ablation experiments on the Swin-transformer module.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1275645-g0007.tif"/>
</fig>
<p>The Swin-transformer module demonstrated impressive results, achieving superior performance across all evaluation metrics. These results validate its effectiveness in accurately predicting sports competition outcomes and enhancing reasoning and prediction capabilities. Moreover, the module maintains consistently low error levels on different datasets, affirming its robustness and adaptability across various sports competition scenarios.</p>
<p>Of particular note is the exceptional performance of the Swin-transformer module on the Sports-1M dataset, characterized by complex sports scenes and diverse actions, where it achieved the lowest error values. This further supports the module&#x00027;s adaptability and generalization abilities in handling challenging sports competition scenes. Beyond its superior accuracy, the Swin-transformer module also demonstrates significant advantages in computational efficiency. Compared to alternative models such as ViT, EfficientNet, and ResNet50, it requires fewer parameters, fewer FLOPs, and less inference time. This computational efficiency is essential for fast training and real-time applications in practical sports competition scenarios.</p>
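<p>The efficiency gap over ViT has a simple structural explanation: the Swin Transformer restricts self-attention to local windows instead of attending globally. A back-of-envelope count of attended token pairs, assuming the standard Swin configuration (a 56 &#x000D7; 56 first-stage feature map with 7 &#x000D7; 7 windows) and ignoring shifted windows and projection costs, illustrates the saving.</p>

```python
def global_attention_pairs(n_tokens):
    """Token pairs scored by full (global) self-attention: O(n^2)."""
    return n_tokens ** 2

def window_attention_pairs(n_tokens, window_size):
    """Token pairs scored by window attention: each token attends only
    within its window of `window_size` tokens, so the cost n * w is
    linear in n for a fixed window size."""
    return n_tokens * window_size

# 56x56 feature map (3136 tokens) with 7x7 windows (49 tokens each),
# as in the standard Swin-Transformer first stage.
n = 56 * 56
w = 7 * 7
print(global_attention_pairs(n) // window_attention_pairs(n, w))  # 64x fewer pairs
```

<p>The attention cost thus drops by a factor of n/w (here 3136/49 = 64) per layer, which is consistent with the lower FLOPs and inference times reported for the module in <xref ref-type="table" rid="T4">Table 4</xref>.</p>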
<p><xref ref-type="table" rid="T4">Table 4</xref> presents the results of the ablation experiments on the Swin-transformer module. The experiments aimed to analyze the impact of the module on various performance metrics across different datasets. Four key metrics, namely Parameters (M), FLOPs (G), Inference Time (ms), and Training Time (s), were considered to assess the efficiency and effectiveness of the models.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Ablation experiments on the Swin-transformer module.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center" colspan="12"><bold>Datasets</bold></th>
</tr>
</thead>
<tbody>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<td/>
<td valign="top" align="center" colspan="4"><bold>SportVU dataset</bold></td>
<td valign="top" align="center" colspan="4"><bold>Sports-1M dataset</bold></td>
<td valign="top" align="center" colspan="4"><bold>NPU RGB&#x0002B;D dataset</bold></td>
</tr>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<td/>
<td valign="top" align="center"><bold>Parameters (M)</bold></td>
<td valign="top" align="center"><bold>FLOPs (G)</bold></td>
<td valign="top" align="center"><bold>Inference time (ms)</bold></td>
<td valign="top" align="center"><bold>Training time (s)</bold></td>
<td valign="top" align="center"><bold>Parameters (M)</bold></td>
<td valign="top" align="center"><bold>FLOPs (G)</bold></td>
<td valign="top" align="center"><bold>Inference time (ms)</bold></td>
<td valign="top" align="center"><bold>Training time (s)</bold></td>
<td valign="top" align="center"><bold>Parameters (M)</bold></td>
<td valign="top" align="center"><bold>FLOPs (G)</bold></td>
<td valign="top" align="center"><bold>Inference time (ms)</bold></td>
<td valign="top" align="center"><bold>Training time (s)</bold></td>
</tr> <tr>
<td valign="top" align="left">ViT (Khan et al., <xref ref-type="bibr" rid="B8">2022</xref>)</td>
<td valign="top" align="center">393.33</td>
<td valign="top" align="center">343.10</td>
<td valign="top" align="center">280.35</td>
<td valign="top" align="center">296.75</td>
<td valign="top" align="center">219.80</td>
<td valign="top" align="center">257.06</td>
<td valign="top" align="center">290.09</td>
<td valign="top" align="center">233.55</td>
<td valign="top" align="center">286.15</td>
<td valign="top" align="center">258.38</td>
<td valign="top" align="center">298.82</td>
<td valign="top" align="center">377.39</td>
</tr> <tr>
<td valign="top" align="left">EfficientNet (Tan et al., <xref ref-type="bibr" rid="B31">2020</xref>)</td>
<td valign="top" align="center">367.31</td>
<td valign="top" align="center">251.95</td>
<td valign="top" align="center">252.23</td>
<td valign="top" align="center">231.69</td>
<td valign="top" align="center">314.28</td>
<td valign="top" align="center">265.14</td>
<td valign="top" align="center">287.33</td>
<td valign="top" align="center">231.92</td>
<td valign="top" align="center">310.51</td>
<td valign="top" align="center">266.80</td>
<td valign="top" align="center">377.48</td>
<td valign="top" align="center">202.38</td>
</tr> <tr>
<td valign="top" align="left">ResNet50 (Shao et al., <xref ref-type="bibr" rid="B27">2019</xref>)</td>
<td valign="top" align="center">400.19</td>
<td valign="top" align="center">390.65</td>
<td valign="top" align="center">288.16</td>
<td valign="top" align="center">369.92</td>
<td valign="top" align="center">334.04</td>
<td valign="top" align="center">238.58</td>
<td valign="top" align="center">341.51</td>
<td valign="top" align="center">303.19</td>
<td valign="top" align="center">281.65</td>
<td valign="top" align="center">212.73</td>
<td valign="top" align="center">329.58</td>
<td valign="top" align="center">225.63</td>
</tr>
<tr>
<td valign="top" align="left">Swin-transformer (Liu et al., <xref ref-type="bibr" rid="B17">2021</xref>)</td>
<td valign="top" align="center">193.19</td>
<td valign="top" align="center">230.56</td>
<td valign="top" align="center">149.09</td>
<td valign="top" align="center">140.12</td>
<td valign="top" align="center">106.37</td>
<td valign="top" align="center">166.73</td>
<td valign="top" align="center">210.37</td>
<td valign="top" align="center">185.59</td>
<td valign="top" align="center">192.90</td>
<td valign="top" align="center">165.18</td>
<td valign="top" align="center">233.64</td>
<td valign="top" align="center">152.24</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In the SportVU dataset, the Swin-transformer module demonstrated remarkable efficiency, with fewer parameters (193.19 M) and FLOPs (230.56 G) than methods such as ViT (393.33 M, 343.10 G) and EfficientNet (367.31 M, 251.95 G), together with lower inference time (149.09 ms) and training time (140.12 s). Similar trends were observed on the Sports-1M dataset, where the Swin-transformer module outperformed the other models on all metrics, with 106.37 M parameters, 166.73 G FLOPs, 210.37 ms inference time, and 185.59 s training time. Likewise, on the NPU RGB&#x0002B;D dataset, the Swin-transformer module continued to show superior efficiency, with fewer parameters (192.90 M) and FLOPs (165.18 G) than ViT and EfficientNet, and its inference time (233.64 ms) and training time (152.24 s) were also the lowest among the compared models.</p>
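<p>To quantify the gap described above, the relative savings can be computed directly from the SportVU figures in <xref ref-type="table" rid="T4">Table 4</xref>. The helper function and dictionary names below are illustrative only; the numbers are the reported ViT and Swin-transformer values.</p>

```python
def percent_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100 * (baseline - value) / baseline

# SportVU figures from Table 4: ViT vs. the Swin-transformer module.
vit = {"params_M": 393.33, "flops_G": 343.10, "infer_ms": 280.35, "train_s": 296.75}
swin = {"params_M": 193.19, "flops_G": 230.56, "infer_ms": 149.09, "train_s": 140.12}

# Reduction of each cost metric, rounded to one decimal place.
reductions = {k: round(percent_reduction(vit[k], swin[k]), 1) for k in vit}
print(reductions)
```

<p>On SportVU this amounts to roughly a 50.9% reduction in parameters, 32.8% in FLOPs, 46.8% in inference time, and 52.8% in training time relative to ViT.</p>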
<p><xref ref-type="fig" rid="F8">Figure 8</xref> visually represents the trends and highlights the significant efficiency and effectiveness advantages of the Swin-transformer module in the ablation experiments. The results indicate that the Swin-transformer module achieves impressive performance while requiring fewer parameters and computational resources, making it a highly efficient and effective choice for various sports-related applications.</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Ablation experiments on the Swin-transformer module.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1275645-g0008.tif"/>
</fig></sec></sec>
<sec id="s5">
<title>5. Conclusion and discussion</title>
<p>This study proposes an intelligent robot sports competition tactical analysis model based on multimodal perception. The experimental section of the article evaluates the performance of several state-of-the-art models on three sports competition datasets. The experimental results indicate that Tang et al.&#x00027;s method performs well on the Sports-1M dataset, while Shen et al.&#x00027;s method excels on the SportVU dataset. Furthermore, although our model does not significantly outperform these advanced models in terms of raw accuracy, it exhibits clear advantages in inference time and training time. The Swin-transformer module in our model performs exceptionally well in ablation experiments, confirming its effectiveness in enhancing model performance.</p>
<p>In conclusion, this paper has introduced an intelligent robot sports competition tactical analysis model based on multimodal perception. Leveraging Swin Transformer and CLIP models along with cross-modal transfer learning, this system observes and analyzes opponent tactics in sports competitions. The proposed method has shown promise, demonstrating high prediction accuracy, efficiency, and suitability for real-time sports competition assistance and analysis.</p>
<p>However, the development and application of such technology come with ethical responsibilities. It is imperative to obtain informed consent, safeguard privacy, and address modality bias in data representation. Responsible resource allocation is necessary to ensure accessibility, particularly in resource-constrained settings. The introduction of real-time interaction capabilities should prioritize the integrity of sports competitions and inclusivity for all stakeholders.</p>
<p>Looking forward, there are exciting opportunities for further research to enhance the model&#x00027;s capabilities while addressing its limitations. These include mitigating modality bias, expanding the model&#x00027;s ability to process diverse data, improving efficiency, and exploring real-time feedback mechanisms. Additionally, integrating domain-specific knowledge and investigating human-robot collaboration in sports analysis present intriguing avenues for future work. Overall, this research contributes positively to the advancement of intelligent sports competition, fostering responsible development, and application.</p></sec>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.</p></sec>
<sec sec-type="author-contributions" id="s7">
<title>Author contributions</title>
<p>LJ: Data curation, Funding acquisition, Investigation, Methodology, Project administration, Resources, Writing&#x02014;original draft. WL: Project administration, Software, Supervision, Visualization, Writing&#x02014;review and editing.</p></sec>
</body>
<back>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Carreira</surname> <given-names>J.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Quo vadis, action recognition? A new model and the kinetics dataset,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu, HI</publisher-loc>), <fpage>6299</fpage>&#x02013;<lpage>6308</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2017.502</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>D.</given-names></name> <name><surname>Yao</surname> <given-names>L.</given-names></name> <name><surname>Guo</surname> <given-names>B.</given-names></name> <name><surname>Yu</surname> <given-names>Z.</given-names></name> <name><surname>Liu</surname> <given-names>Y.</given-names></name></person-group> (<year>2021</year>). <article-title>Deep learning for sensor-based human activity recognition: overview, challenges, and opportunities</article-title>. <source>ACM Comput. Surveys</source> <volume>54</volume>, <fpage>1</fpage>&#x02013;<lpage>40</lpage>. <pub-id pub-id-type="doi">10.1145/3447744</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>J.</given-names></name> <name><surname>Ho</surname> <given-names>C. M.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;MM-VIT: Multi-modal video transformer for compressed video action recognition,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source> (<publisher-loc>Waikoloa, HI</publisher-loc>), <fpage>1910</fpage>&#x02013;<lpage>1921</lpage>. <pub-id pub-id-type="doi">10.1109/WACV51458.2022.00086</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>T.</given-names></name> <name><surname>Mo</surname> <given-names>L.</given-names></name></person-group> (<year>2023</year>). <article-title>Swin-fusion: swin-transformer with feature fusion for human action recognition</article-title>. <source>Neural Process. Lett</source>. <fpage>1</fpage>&#x02013;<lpage>22</lpage>. <pub-id pub-id-type="doi">10.1007/s11063-023-11367-1</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dash</surname> <given-names>A.</given-names></name> <name><surname>Ye</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>G.</given-names></name></person-group> (<year>2021</year>). <article-title>A review of generative adversarial networks (GANs) and its applications in a wide variety of disciplines-from medical to remote sensing</article-title>. <source>arXiv preprint arXiv:2110.01442</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2110.01442</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hong</surname> <given-names>D.</given-names></name> <name><surname>Gao</surname> <given-names>L.</given-names></name> <name><surname>Yao</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>B.</given-names></name> <name><surname>Plaza</surname> <given-names>A.</given-names></name> <name><surname>Chanussot</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>Graph convolutional networks for hyperspectral image classification</article-title>. <source>IEEE Trans. Geosci. Remote Sens</source>. <volume>59</volume>, <fpage>5966</fpage>&#x02013;<lpage>5978</lpage>. <pub-id pub-id-type="doi">10.1109/TGRS.2020.3015157</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ji</surname> <given-names>Y.</given-names></name> <name><surname>Yang</surname> <given-names>Y.</given-names></name> <name><surname>Shen</surname> <given-names>F.</given-names></name> <name><surname>Shen</surname> <given-names>H. T.</given-names></name> <name><surname>Li</surname> <given-names>X.</given-names></name></person-group> (<year>2019</year>). <article-title>A survey of human action analysis in HRI applications</article-title>. <source>IEEE Trans. Circuits Syst. Video Technol</source>. <volume>30</volume>, <fpage>2114</fpage>&#x02013;<lpage>2128</lpage>. <pub-id pub-id-type="doi">10.1109/TCSVT.2019.2912988</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Khan</surname> <given-names>S.</given-names></name> <name><surname>Naseer</surname> <given-names>M.</given-names></name> <name><surname>Hayat</surname> <given-names>M.</given-names></name> <name><surname>Zamir</surname> <given-names>S. W.</given-names></name> <name><surname>Khan</surname> <given-names>F. S.</given-names></name> <name><surname>Shah</surname> <given-names>M.</given-names></name></person-group> (<year>2022</year>). <article-title>Transformers in vision: a survey</article-title>. <source>ACM Comput. Surveys</source> <volume>54</volume>, <fpage>1</fpage>&#x02013;<lpage>41</lpage>. <pub-id pub-id-type="doi">10.1145/3505244</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kong</surname> <given-names>L.</given-names></name> <name><surname>Pei</surname> <given-names>D.</given-names></name> <name><surname>He</surname> <given-names>R.</given-names></name> <name><surname>Huang</surname> <given-names>D.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name></person-group> (<year>2022</year>). <article-title>Spatio-temporal player relation modeling for tactic recognition in sports videos</article-title>. <source>IEEE Trans. Circuits Syst. Video Technol</source>. <volume>32</volume>, <fpage>6086</fpage>&#x02013;<lpage>6099</lpage>. <pub-id pub-id-type="doi">10.1109/TCSVT.2022.3156634</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Korbar</surname> <given-names>B.</given-names></name> <name><surname>Tran</surname> <given-names>D.</given-names></name> <name><surname>Torresani</surname> <given-names>L.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;SCSampler: sampling salient clips from video for efficient action recognition,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Seoul</publisher-loc>), <fpage>6232</fpage>&#x02013;<lpage>6242</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2019.00633</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>R.</given-names></name> <name><surname>Bhanu</surname> <given-names>B.</given-names></name></person-group> (<year>2023</year>). <article-title>Energy-motion features aggregation network for players&#x00027; fine-grained action analysis in soccer videos</article-title>. <source>IEEE Trans. Circuits Syst. Video Technol</source>. <fpage>1</fpage>&#x02013;<lpage>1</lpage>. <pub-id pub-id-type="doi">10.1109/TCSVT.2023.3288565</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>C.</given-names></name> <name><surname>Zhang</surname> <given-names>B.</given-names></name> <name><surname>Hong</surname> <given-names>D.</given-names></name> <name><surname>Yao</surname> <given-names>J.</given-names></name> <name><surname>Chanussot</surname> <given-names>J.</given-names></name></person-group> (<year>2023</year>). <article-title>LRR-Net: an interpretable deep unfolding network for hyperspectral anomaly detection</article-title>. <source>IEEE Trans. Geosci. Remote Sens</source>. <volume>61</volume>, <fpage>1</fpage>&#x02013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1109/TGRS.2023.3279834</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>J.</given-names></name> <name><surname>Liu</surname> <given-names>X.</given-names></name> <name><surname>Zhang</surname> <given-names>M.</given-names></name> <name><surname>Wang</surname> <given-names>D.</given-names></name></person-group> (<year>2020</year>). <article-title>Spatio-temporal deformable 3D convnets with attention for action recognition</article-title>. <source>Pattern Recogn</source>. <volume>98</volume>:<fpage>107037</fpage>. <pub-id pub-id-type="doi">10.1016/j.patcog.2019.107037</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>H.</given-names></name> <name><surname>Yao</surname> <given-names>L.</given-names></name> <name><surname>Zheng</surname> <given-names>Q.</given-names></name> <name><surname>Luo</surname> <given-names>M.</given-names></name> <name><surname>Zhao</surname> <given-names>H.</given-names></name> <name><surname>Lyu</surname> <given-names>Y.</given-names></name></person-group> (<year>2020</year>). <article-title>Dual-stream generative adversarial networks for distributionally robust zero-shot learning</article-title>. <source>Inform. Sci</source>. <volume>519</volume>, <fpage>407</fpage>&#x02013;<lpage>422</lpage>. <pub-id pub-id-type="doi">10.1016/j.ins.2020.01.025</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>Z.</given-names></name> <name><surname>Jiang</surname> <given-names>Z.</given-names></name> <name><surname>He</surname> <given-names>Y.</given-names></name></person-group> (<year>2022</year>). <article-title>Prospects for multi-agent collaboration and gaming: challenge, technology, and application</article-title>. <source>Front. Inform. Technol. Electron. Eng</source>. <volume>23</volume>, <fpage>1002</fpage>&#x02013;<lpage>1009</lpage>. <pub-id pub-id-type="doi">10.1631/FITEE.2200055</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Cheng</surname> <given-names>J.</given-names></name> <name><surname>Liu</surname> <given-names>L.</given-names></name> <name><surname>Ren</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>Q.</given-names></name> <name><surname>Song</surname> <given-names>C.</given-names></name></person-group> (<year>2022</year>). <article-title>Dual-stream cross-modality fusion transformer for RGB-D action recognition</article-title>. <source>Knowl. Based Syst</source>. <volume>255</volume>:<fpage>109741</fpage>. <pub-id pub-id-type="doi">10.1016/j.knosys.2022.109741</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Lin</surname> <given-names>Y.</given-names></name> <name><surname>Cao</surname> <given-names>Y.</given-names></name> <name><surname>Hu</surname> <given-names>H.</given-names></name> <name><surname>Wei</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>&#x0201C;Swin transformer: hierarchical vision transformer using shifted windows,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Montreal, QC</publisher-loc>), <fpage>10012</fpage>&#x02013;<lpage>10022</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV48922.2021.00986</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ma</surname> <given-names>N.</given-names></name> <name><surname>Wu</surname> <given-names>Z.</given-names></name> <name><surname>Cheung</surname> <given-names>Y.-M.</given-names></name> <name><surname>Guo</surname> <given-names>Y.</given-names></name> <name><surname>Gao</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>A survey of human action recognition and posture prediction</article-title>. <source>Tsinghua Sci. Technol</source>. <volume>27</volume>, <fpage>973</fpage>&#x02013;<lpage>1001</lpage>. <pub-id pub-id-type="doi">10.26599/TST.2021.9010068</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Maglo</surname> <given-names>A.</given-names></name> <name><surname>Orcesi</surname> <given-names>A.</given-names></name> <name><surname>Pham</surname> <given-names>Q.-C.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Efficient tracking of team sport players with few game-specific annotations,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>New Orleans, LA</publisher-loc>), <fpage>3461</fpage>&#x02013;<lpage>3471</lpage>. <pub-id pub-id-type="doi">10.1109/CVPRW56347.2022.00390</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ning</surname> <given-names>X.</given-names></name> <name><surname>Tian</surname> <given-names>W.</given-names></name> <name><surname>He</surname> <given-names>F.</given-names></name> <name><surname>Bai</surname> <given-names>X.</given-names></name> <name><surname>Sun</surname> <given-names>L.</given-names></name> <name><surname>Li</surname> <given-names>W.</given-names></name></person-group> (<year>2023</year>). <article-title>Hyper-sausage coverage function neuron model and learning algorithm for image classification</article-title>. <source>Pattern Recogn</source>. <volume>136</volume>:<fpage>109216</fpage>. <pub-id pub-id-type="doi">10.1016/j.patcog.2022.109216</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nweke</surname> <given-names>H. F.</given-names></name> <name><surname>Teh</surname> <given-names>Y. W.</given-names></name> <name><surname>Al-Garadi</surname> <given-names>M. A.</given-names></name> <name><surname>Alo</surname> <given-names>U. R.</given-names></name></person-group> (<year>2018</year>). <article-title>Deep learning algorithms for human activity recognition using mobile and wearable sensor networks: state of the art and research challenges</article-title>. <source>Expert Syst. Appl</source>. <volume>105</volume>, <fpage>233</fpage>&#x02013;<lpage>261</lpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2018.03.056</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Olan</surname> <given-names>F.</given-names></name> <name><surname>Arakpogun</surname> <given-names>E. O.</given-names></name> <name><surname>Suklan</surname> <given-names>J.</given-names></name> <name><surname>Nakpodia</surname> <given-names>F.</given-names></name> <name><surname>Damij</surname> <given-names>N.</given-names></name> <name><surname>Jayawickrama</surname> <given-names>U.</given-names></name></person-group> (<year>2022</year>). <article-title>Artificial intelligence and knowledge sharing: contributing factors to organizational performance</article-title>. <source>J. Bus. Res</source>. <volume>145</volume>, <fpage>605</fpage>&#x02013;<lpage>615</lpage>. <pub-id pub-id-type="doi">10.1016/j.jbusres.2022.03.008</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pan</surname> <given-names>H.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Research on assistant application of artificial intelligence robot coach in university sports courses,&#x0201D;</article-title> in <source>Proceedings of the 11th International Conference on Computer Engineering and Networks</source> (<publisher-loc>Singapore</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>229</fpage>&#x02013;<lpage>237</lpage>. <pub-id pub-id-type="doi">10.1007/978-981-16-6554-7_27</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pareek</surname> <given-names>P.</given-names></name> <name><surname>Thakkar</surname> <given-names>A.</given-names></name></person-group> (<year>2021</year>). <article-title>A survey on video-based human action recognition: recent updates, datasets, challenges, and applications</article-title>. <source>Artif. Intell. Rev</source>. <volume>54</volume>, <fpage>2259</fpage>&#x02013;<lpage>2322</lpage>. <pub-id pub-id-type="doi">10.1007/s10462-020-09904-8</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Park</surname> <given-names>J.</given-names></name> <name><surname>Yoon</surname> <given-names>T.</given-names></name> <name><surname>Hong</surname> <given-names>J.</given-names></name> <name><surname>Yu</surname> <given-names>Y.</given-names></name> <name><surname>Pan</surname> <given-names>M.</given-names></name> <name><surname>Choi</surname> <given-names>S.</given-names></name></person-group> (<year>2023</year>). <article-title>&#x0201C;Zero-shot active visual search (ZAVIS): intelligent object search for robotic assistants,&#x0201D;</article-title> in <source>2023 IEEE International Conference on Robotics and Automation (ICRA)</source> (<publisher-loc>London</publisher-loc>), <fpage>2004</fpage>&#x02013;<lpage>2010</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA48891.2023.10161345</pub-id> <ext-link ext-link-type="uri" xlink:href="https://ieeexplore.ieee.org/abstract/document/10161345">https://ieeexplore.ieee.org/abstract/document/10161345</ext-link></citation></ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sanford</surname> <given-names>R.</given-names></name> <name><surname>Gorji</surname> <given-names>S.</given-names></name> <name><surname>Hafemann</surname> <given-names>L. G.</given-names></name> <name><surname>Pourbabaee</surname> <given-names>B.</given-names></name> <name><surname>Javan</surname> <given-names>M.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Group activity detection from trajectory and video data in soccer,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops</source> (<publisher-loc>Seattle, WA</publisher-loc>), <fpage>898</fpage>&#x02013;<lpage>899</lpage>. <pub-id pub-id-type="doi">10.1109/CVPRW50498.2020.00457</pub-id></citation></ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Shao</surname> <given-names>S.</given-names></name> <name><surname>Li</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>T.</given-names></name> <name><surname>Peng</surname> <given-names>C.</given-names></name> <name><surname>Yu</surname> <given-names>G.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>&#x0201C;Objects365: a large-scale, high-quality dataset for object detection,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Seoul</publisher-loc>), <fpage>8430</fpage>&#x02013;<lpage>8439</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2019.00852</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shen</surname> <given-names>Z.</given-names></name> <name><surname>Elibol</surname> <given-names>A.</given-names></name> <name><surname>Chong</surname> <given-names>N. Y.</given-names></name></person-group> (<year>2021</year>). <article-title>Multi-modal feature fusion for better understanding of human personality traits in social human-robot interaction</article-title>. <source>Robot. Auton. Syst</source>. <volume>146</volume>:<fpage>103874</fpage>. <pub-id pub-id-type="doi">10.1016/j.robot.2021.103874</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shen</surname> <given-names>S.</given-names></name> <name><surname>Li</surname> <given-names>L. H.</given-names></name> <name><surname>Tan</surname> <given-names>H.</given-names></name> <name><surname>Bansal</surname> <given-names>M.</given-names></name> <name><surname>Rohrbach</surname> <given-names>A.</given-names></name> <name><surname>Chang</surname> <given-names>K.-W.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>How much can CLIP benefit vision-and-language tasks?</article-title> <source>arXiv preprint arXiv:2107.06383</source>.</citation></ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tabrizi</surname> <given-names>S. S.</given-names></name> <name><surname>Pashazadeh</surname> <given-names>S.</given-names></name> <name><surname>Javani</surname> <given-names>V.</given-names></name></person-group> (<year>2020</year>). <article-title>Comparative study of table tennis forehand strokes classification using deep learning and SVM</article-title>. <source>IEEE Sensors J</source>. <volume>20</volume>, <fpage>13552</fpage>&#x02013;<lpage>13561</lpage>. <pub-id pub-id-type="doi">10.1109/JSEN.2020.3005443</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tan</surname> <given-names>M.</given-names></name> <name><surname>Pang</surname> <given-names>R.</given-names></name> <name><surname>Le</surname> <given-names>Q. V.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;EfficientDet: scalable and efficient object detection,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Seattle, WA</publisher-loc>), <fpage>10781</fpage>&#x02013;<lpage>10790</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.01079</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tang</surname> <given-names>Q.</given-names></name> <name><surname>Liang</surname> <given-names>J.</given-names></name> <name><surname>Zhu</surname> <given-names>F.</given-names></name></person-group> (<year>2023</year>). <article-title>A comparative review on multi-modal sensors fusion based on deep learning</article-title>. <source>Signal Process</source>. <volume>213</volume>:<fpage>109165</fpage>. <pub-id pub-id-type="doi">10.1016/j.sigpro.2023.109165</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tao</surname> <given-names>W.</given-names></name> <name><surname>Leu</surname> <given-names>M. C.</given-names></name> <name><surname>Yin</surname> <given-names>Z.</given-names></name></person-group> (<year>2020</year>). <article-title>Multi-modal recognition of worker activity for human-centered intelligent manufacturing</article-title>. <source>Eng. Appl. Artif. Intell</source>. <volume>95</volume>:<fpage>103868</fpage>. <pub-id pub-id-type="doi">10.1016/j.engappai.2020.103868</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tevet</surname> <given-names>G.</given-names></name> <name><surname>Gordon</surname> <given-names>B.</given-names></name> <name><surname>Hertz</surname> <given-names>A.</given-names></name> <name><surname>Bermano</surname> <given-names>A. H.</given-names></name> <name><surname>Cohen-Or</surname> <given-names>D.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;MotionCLIP: exposing human motion generation to CLIP space,&#x0201D;</article-title> in <source>European Conference on Computer Vision</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>358</fpage>&#x02013;<lpage>374</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-031-20047-2_21</pub-id></citation></ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vajsbaher</surname> <given-names>T.</given-names></name> <name><surname>Ziemer</surname> <given-names>T.</given-names></name> <name><surname>Schultheis</surname> <given-names>H.</given-names></name></person-group> (<year>2020</year>). <article-title>A multi-modal approach to cognitive training and assistance in minimally invasive surgery</article-title>. <source>Cogn. Syst. Res</source>. <volume>64</volume>, <fpage>57</fpage>&#x02013;<lpage>72</lpage>. <pub-id pub-id-type="doi">10.1016/j.cogsys.2020.07.005</pub-id></citation></ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Victor</surname> <given-names>B.</given-names></name> <name><surname>Nibali</surname> <given-names>A.</given-names></name> <name><surname>He</surname> <given-names>Z.</given-names></name> <name><surname>Carey</surname> <given-names>D. L.</given-names></name></person-group> (<year>2021</year>). <article-title>Enhancing trajectory prediction using sparse outputs: application to team sports</article-title>. <source>Neural Comput. Appl</source>. <volume>33</volume>, <fpage>11951</fpage>&#x02013;<lpage>11962</lpage>. <pub-id pub-id-type="doi">10.1007/s00521-021-05888-w</pub-id></citation></ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>C.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Zhang</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>L.</given-names></name> <name><surname>Bai</surname> <given-names>X.</given-names></name> <name><surname>Ning</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>Uncertainty estimation for stereo matching based on evidential deep learning</article-title>. <source>Pattern Recogn</source>. <volume>124</volume>:<fpage>108498</fpage>. <pub-id pub-id-type="doi">10.1016/j.patcog.2021.108498</pub-id></citation></ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Yoon</surname> <given-names>K.-J.</given-names></name></person-group> (<year>2021</year>). <article-title>Knowledge distillation and student-teacher learning for visual intelligence: a review and new outlooks</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>44</volume>, <fpage>3048</fpage>&#x02013;<lpage>3068</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2021.3055564</pub-id><pub-id pub-id-type="pmid">33513099</pub-id></citation></ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Cao</surname> <given-names>Z.</given-names></name> <name><surname>Wang</surname> <given-names>R.</given-names></name> <name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Zhu</surname> <given-names>X.</given-names></name></person-group> (<year>2019</year>). <article-title>Improving human pose estimation with self-attention generative adversarial networks</article-title>. <source>IEEE Access</source> <volume>7</volume>, <fpage>119668</fpage>&#x02013;<lpage>119680</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2019.2936709</pub-id></citation></ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wenninger</surname> <given-names>S.</given-names></name> <name><surname>Link</surname> <given-names>D.</given-names></name> <name><surname>Lames</surname> <given-names>M.</given-names></name></person-group> (<year>2020</year>). <article-title>Performance of machine learning models in application to beach volleyball data</article-title>. <source>Int. J. Comput. Sci. Sport</source> <volume>19</volume>, <fpage>24</fpage>&#x02013;<lpage>36</lpage>. <pub-id pub-id-type="doi">10.2478/ijcss-2020-0002</pub-id></citation></ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>Z.</given-names></name> <name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Jin</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name></person-group> (<year>2020</year>). <article-title>Privacy-preserving deep action recognition: an adversarial learning framework and a new dataset</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>44</volume>, <fpage>2126</fpage>&#x02013;<lpage>2139</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2020.3026709</pub-id><pub-id pub-id-type="pmid">32986544</pub-id></citation></ref>
<ref id="B42">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>S.</given-names></name> <name><surname>Jung</surname> <given-names>S.</given-names></name> <name><surname>Kang</surname> <given-names>H.</given-names></name> <name><surname>Kim</surname> <given-names>C.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;The Korean sign language dataset for action recognition,&#x0201D;</article-title> in <source>International Conference on Multimedia Modeling</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>532</fpage>&#x02013;<lpage>542</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-37731-1_43</pub-id></citation></ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yao</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>B.</given-names></name> <name><surname>Li</surname> <given-names>C.</given-names></name> <name><surname>Hong</surname> <given-names>D.</given-names></name> <name><surname>Chanussot</surname> <given-names>J.</given-names></name></person-group> (<year>2023</year>). <article-title>Extended vision transformer (ExViT) for land use and land cover classification: a multimodal deep learning framework</article-title>. <source>IEEE Trans. Geosci. Remote Sens</source>. <volume>61</volume>, <fpage>1</fpage>&#x02013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.1109/TGRS.2023.3284671</pub-id></citation></ref>
<ref id="B44">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yuan</surname> <given-names>L.</given-names></name> <name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>T.</given-names></name> <name><surname>Yu</surname> <given-names>W.</given-names></name> <name><surname>Shi</surname> <given-names>Y.</given-names></name> <name><surname>Jiang</surname> <given-names>Z.-H.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>&#x0201C;Tokens-to-token ViT: training vision transformers from scratch on ImageNet,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Montreal, QC</publisher-loc>), <fpage>558</fpage>&#x02013;<lpage>567</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV48922.2021.00060</pub-id></citation></ref>
<ref id="B45">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Yun</surname> <given-names>S.</given-names></name> <name><surname>Jeong</surname> <given-names>M.</given-names></name> <name><surname>Kim</surname> <given-names>R.</given-names></name> <name><surname>Kang</surname> <given-names>J.</given-names></name> <name><surname>Kim</surname> <given-names>H. J.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Graph transformer networks,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems, Vol. 32</source> Curran Associates Inc. Available online at: <ext-link ext-link-type="uri" xlink:href="https://dl.acm.org/doi/abs/10.5555/3454287.3455360">https://dl.acm.org/doi/abs/10.5555/3454287.3455360</ext-link></citation></ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhen</surname> <given-names>L.</given-names></name> <name><surname>Hu</surname> <given-names>P.</given-names></name> <name><surname>Peng</surname> <given-names>X.</given-names></name> <name><surname>Goh</surname> <given-names>R. S. M.</given-names></name> <name><surname>Zhou</surname> <given-names>J. T.</given-names></name></person-group> (<year>2020</year>). <article-title>Deep multimodal transfer learning for cross-modal retrieval</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst</source>. <volume>33</volume>, <fpage>798</fpage>&#x02013;<lpage>810</lpage>. <pub-id pub-id-type="doi">10.1109/TNNLS.2020.3029181</pub-id><pub-id pub-id-type="pmid">33090960</pub-id></citation></ref>
</ref-list>
</back>
</article>