<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Comput. Sci.</journal-id>
<journal-title>Frontiers in Computer Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Comput. Sci.</abbrev-journal-title>
<issn pub-type="epub">2624-9898</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fcomp.2024.1242690</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Computer Science</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>A novel multi-scale violence and public gathering dataset for crowd behavior classification</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Elzein</surname> <given-names>Almiqdad</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/2346199/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Basaran</surname> <given-names>Emrah</given-names></name>
</contrib>
<contrib contrib-type="author">
<name><surname>Yang</surname> <given-names>Yin David</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/1968406/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Qaraqe</surname> <given-names>Marwa</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/964957/overview"/>
</contrib>
</contrib-group>
<aff><institution>College of Science and Engineering, Hamad Bin Khalifa University, Qatar Foundation</institution>, <addr-line>Doha</addr-line>, <country>Qatar</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Pier Luigi Mazzeo, National Research Council (CNR), Italy</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Yunxue Shao, Nanjing Tech University, China</p>
<p>Rui-Yang Ju, National Taiwan University, Taiwan</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Marwa Qaraqe <email>mqaraqe&#x00040;hbku.edu.qa</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>10</day>
<month>05</month>
<year>2024</year>
</pub-date>
<pub-date pub-type="collection">
<year>2024</year>
</pub-date>
<volume>6</volume>
<elocation-id>1242690</elocation-id>
<history>
<date date-type="received">
<day>19</day>
<month>06</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>29</day>
<month>02</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2024 Elzein, Basaran, Yang and Qaraqe.</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Elzein, Basaran, Yang and Qaraqe</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Dependable utilization of computer vision applications, such as smart surveillance, requires training deep learning networks on datasets that sufficiently represent the classes of interest. However, the bottleneck in many computer vision applications lies in the limited availability of adequate datasets. One particular application that is of great importance for the safety of cities and crowded areas is smart surveillance. Conventional surveillance methods are reactive and often ineffective in enabling real-time action. However, smart surveillance is a key component of proactive security in a smart city. Motivated by a smart city application which aims at the automatic identification of concerning events for alerting law-enforcement and governmental agencies, we craft a large video dataset that focuses on the distinction between small-scale violence, large-scale violence, peaceful gatherings, and natural events. This dataset classifies public events along two axes: the size of the crowd observed and the level of perceived violence in the crowd. We name this newly-built dataset the Multi-Scale Violence and Public Gathering (<bold>MSV-PG</bold>) dataset. The videos in the dataset go through several pre-processing steps to prepare them to be fed into a deep learning architecture. We conduct several experiments on the <bold>MSV-PG</bold> dataset using a ResNet3D, a Swin Transformer, and an R(2 &#x0002B; 1)D architecture. The accuracies achieved by these models when trained on the <bold>MSV-PG</bold> dataset, 88.37%, 89.76%, and 89.34%, respectively, indicate that the dataset is well-labeled and is rich enough to train deep learning models for automatic smart surveillance for diverse scenarios.</p></abstract>
<kwd-group>
<kwd>crowd analysis</kwd>
<kwd>smart surveillance</kwd>
<kwd>violence detection</kwd>
<kwd>human action recognition</kwd>
<kwd>computer vision</kwd>
</kwd-group>
<contract-sponsor id="cn001">Qatar National Research Fund<named-content content-type="fundref-id">10.13039/100008982</named-content></contract-sponsor>
<counts>
<fig-count count="7"/>
<table-count count="5"/>
<equation-count count="0"/>
<ref-count count="69"/>
<page-count count="13"/>
<word-count count="9794"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Computer Vision</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1 Introduction</title>
<p>Early identification of violent events and potential security risks is of crucial importance to governmental and law enforcement agencies. City-wide surveillance systems are set up in many countries for this purpose. For example, China, the United States of America, and the United Kingdom deploy around 15 million, 112 thousand, and 628 thousand Closed Circuit Television (CCTV) cameras, respectively (Global, <xref ref-type="bibr" rid="B20">2022</xref>). These cameras are often deployed outdoors, and their real-time feed is often monitored by humans to detect crime and other concerning events. Effective use of security cameras allows for the early detection of such events and the deployment of adequate responses.</p>
<p>Video footage from security cameras often requires real-time continuous monitoring by humans, which poses several limitations and challenges. First, a significant amount of human capital is required whenever thousands or hundreds of thousands of cameras are deployed in a country or city. If an insufficient number of individuals is allocated to monitoring CCTV cameras, many concerning events caught by these cameras can go undetected. In addition, having a human inspect surveillance footage can be inefficient and prone to errors. Missing certain events, such as a protest or a large fight, or delaying their detection may have serious negative consequences for public peace. Finally, traditional surveillance methods are reactive, requiring events to occur and be manually detected by inspectors before action is taken.</p>
<p>Intelligently automating the detection of concerning events, such as fights and unusually large gatherings, captured by surveillance cameras is critical and has two main advantages. First, it moves surveillance from the traditional reactive approach to a proactive approach, since it alerts authorities regarding the potential for violence. Second, it significantly reduces the number of human operators needed for surveillance.</p>
<p>There is no doubt that Deep Learning (DL) has transformed many aspects of society. For instance, the literature has demonstrated that DL models have the capability to detect certain human behaviors (action recognition tasks) (Dhiman and Vishwakarma, <xref ref-type="bibr" rid="B14">2019</xref>). Therefore, in order to streamline the identification of potential security risks in surveillance footage, we propose a computer vision-based approach for the automatic identification of various human behaviors caught on CCTV footage. To this end, the contribution of this paper is twofold. First, we embark on a novel data collection effort to build a video dataset that collectively represents the human behavior classes of interest. Second, the developed dataset is used to train a human behavior prediction model that automatically detects the human behavior classes of interest, possibly in real-time.</p>
<p>Four classes of human behavior have been identified and selected, namely <italic>Large Peaceful Gathering</italic> (<bold>LPG</bold>), <italic>Large Violent Gathering</italic> (<bold>LVG</bold>), <italic>small-scale fighting</italic> (<bold>F</bold>), and <italic>Natural</italic> (<bold>N</bold>) events. Note that these classes exist along two axes, where one axis identifies the size of the crowd and the other identifies whether or not the crowd detected is violent. In addition to conveying the behavior of the crowd to law enforcement, this multi-scale distinction, as opposed to the binary distinction often discussed in the literature (between &#x0201C;fighting&#x0201D; and &#x0201C;no fighting&#x0201D; or &#x0201C;violence&#x0201D; and &#x0201C;no violence&#x0201D;), can provide information on the scale of the appropriate law enforcement response. For instance, prior works did not distinguish between a small fight between two individuals and a violent gathering of hundreds of people; both scenarios would be classified as &#x0201C;violent&#x0201D; in those works. The two scenarios, however, clearly require radically different responses from law enforcement, and thus should be seen as two different classes of events. Dedicating a class to each of those two scenarios, &#x0201C;<bold>F</bold>&#x0201D; for the small-scale fight and &#x0201C;<bold>LVG</bold>&#x0201D; for the large violent crowd, makes for a smart surveillance system with greater utility to law enforcement. Such a system detects violent action and informs law enforcement of the nature of the required response. In addition, &#x0201C;non-violent&#x0201D; crowds may sometimes require law enforcement attention, especially when the crowd is relatively large. Large crowds hold the potential for a security hazard (i.e., the breaking out of violence within a peaceful crowd); thus, the developed dataset also distinguishes between a small peaceful crowd and a large peaceful crowd.</p>
<p>A surveillance system with the capability of automatically classifying video footage into one of the aforementioned classes would provide immediate information about any concerning or potentially concerning events to governmental and law-enforcement agencies for immediate response. The benefit of such a system is that it scales to large-scale surveillance, whereas human supervision of a large geographical area through CCTV is unrealistic.</p>
<p>Motivated by the detection task outlined above, we have developed a novel video dataset, called the <bold>Multi-Scale Violence and Public Gathering (MSV-PG)</bold> dataset, that comprehensively covers the aforementioned classes of behavior. To the best of our knowledge, a similar dataset does not exist or is not available to the public. Additionally, this paper trains, tests, and assesses several DL architectures on the automatic recognition of the relevant human behaviors using the developed and diverse dataset. The aim of employing DL on the MSV-PG dataset is to showcase the robustness of the developed dataset in training various DL algorithms for behavior recognition applications. Such an application has not been investigated in the literature. <bold>The corresponding author of this paper will make the dataset available upon request</bold>. The remainder of this paper is organized as follows: Section 2 discusses the pre-processing and labeling of the dataset and the training of selected DL models. Section 3 outlines the results achieved by the selected models. Finally, Section 4 details previous DL-based approaches for video analysis, describes previous Human Action Recognition datasets, provides commentary on possible causes for the misclassification of samples in the proposed dataset, and outlines the main conclusions of the paper.</p></sec>
<sec sec-type="materials and methods" id="s2">
<title>2 Materials and methods</title>
<p>Next, we outline the process of video collection and pre-processing, illustrated in <xref ref-type="fig" rid="F1">Figure 1</xref>, that we used to build the <bold>MSV-PG</bold> dataset. Videos with at least one instance of one of the relevant classes are identified and obtained. Then, the starting and ending time stamps of each occurrence of an event belonging to one of the relevant classes are recorded, alongside the class of that event. In this paper, an <bold><italic>instance</italic></bold> is defined as an occurrence of one of the classes whose starting and ending time stamps are identified and recorded. In the training and validation phases, we feed equal-sized image sequences to a DL network. We refer to these image sequences as <bold><italic>samples</italic></bold>.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Flowchart illustrating the steps taken to build the <bold>MSV-PG</bold> dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1242690-g0001.tif"/>
</fig>
<sec>
<title>2.1 Dataset collection, labeling, and pre-processing</title>
<p>In order to identify instances of <bold>LPG</bold>, <bold>LVG</bold>, <bold>F</bold>, and <bold>N</bold> events, we define the criteria that differentiate each of the four classes as follows:</p>
<list list-type="bullet">
<list-item><p><bold>LPG</bold>: A large number of individuals who are gathered for a singular purpose. Examples of this class are peaceful protests and gatherings of sports fans.</p></list-item>
<list-item><p><bold>LVG</bold>: A cluster of individuals of whom a &#x0201C;significant&#x0201D; number are engaged in violent action. Examples of violent action include clashes with police, property destruction, and fighting between members of the crowd.</p></list-item>
<list-item><p><bold>F</bold>: A &#x0201C;small&#x0201D; group of individuals fighting one another.</p></list-item>
<list-item><p><bold>N</bold>: Footage that shows no concerning behavior. This is a class of footage that one expects to see during regular, everyday life.</p></list-item>
</list>
<p>It&#x00027;s of crucial importance to recognize that the above definitions are not objective; they are general guidelines that were used to inform the manual video-labeling process. Specifically, during labeling, the determination of whether a group of people is large or small is left to the judgment of the person labeling the videos. We elected to do this because, in reality, there does not exist an objective threshold for the number of individuals that would make a group of people a &#x0201C;large&#x0201D; group as opposed to a &#x0201C;small&#x0201D; group. However, to limit subjectivity in the labeling process, each member of our team (a total of four researchers) labeled the data and majority voting was conducted to select the final label. The first step taken to develop the <bold>MSV-PG</bold> dataset was to identify sources from which to obtain relevant videos. To this end, we obtained relevant videos from YouTube and from relevant video datasets which were readily available online. Relevant YouTube videos are those that include at least one instance of at least one of the relevant classes. Keywords such as &#x0201C;demonstration,&#x0201D; &#x0201C;violence,&#x0201D; and &#x0201C;clash&#x0201D; were used during the crawling process of YouTube videos. In addition, relevant current and historic events (i.e., the George Floyd protests, the Hong Kong protests, the Capitol riot, etc.) were searched and some of the resultant videos were included in the dataset. The aim was to gather a large and diverse set of videos to enable the model to generalize to a variety of concerning scenarios.</p>
<p>Similarly, relevant datasets in the literature which include videos that contain instances of one or more of the four classes of interest in this paper were collected. We merged subsets of the UBI-Fights (Degardin and Proen&#x000E7;a, <xref ref-type="bibr" rid="B12">2020</xref>) and the dataset introduced in Akt&#x00131; et al. (<xref ref-type="bibr" rid="B1">2019</xref>) into the <bold>MSV-PG</bold> dataset. Since these two datasets are only divided into fighting and non-fighting videos, we re-labeled their videos according to our set of classes. <xref ref-type="fig" rid="F2">Figure 2</xref> illustrates examples of samples of the four considered classes. It is important to note that data from pre-existing datasets only make up &#x0007E;16% of the total dataset. The remainder of the <bold>MSV-PG</bold> dataset was collected by crawling relevant YouTube videos.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Examples of instances of each of the four classes in MSV-PG. <bold>(A)</bold> N. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=7hFbmAZois4">Protests continue for sixth day in Seattle</ext-link>, uploaded by Kiro 7 News via YouTube, licensed under <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/t/terms">YouTube Standard License</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=EgbF_lFV0ns">Ghjkmnfm</ext-link>, uploaded by Ganesh Sardar via YouTube, licensed under <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/t/terms">YouTube Standard License</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=wcEE7iH3hOE">Inside Apple&#x00027;s store at World Trade Center Mall. Westfield. New York</ext-link>, uploaded by Another World via YouTube, licensed under <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/t/terms">YouTube Standard License</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=quONYcG2iyY">Unpermitted Vendors Defy Police</ext-link>, uploaded by Santa Monica via YouTube, licensed under <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/t/terms">YouTube Standard License</ext-link>. <bold>(B)</bold> LPG. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=UQFQ9Q6GT00">How George Floyd&#x00027;s killing has inspired a diverse range of protesters</ext-link>, uploaded by PBS NewsHour via YouTube, licensed under <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/t/terms">YouTube Standard License</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=YgYuRGre6AA">China&#x00027;s Rebel City: The Hong Kong Protests</ext-link>, uploaded by South China Morning Post via YouTube, licensed under <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/t/terms">YouTube Standard License</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=3ueIS4hCe1k">Death of George Floyd drives protests across the U.S. -and beyond</ext-link>, uploaded by PBS NewsHour via YouTube, licensed under <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/t/terms">YouTube Standard License</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=T3F2KaFyumU">Demonstrators march through downtown Seattle streets on Election Night</ext-link>, uploaded by KING 5 Seattle via YouTube, licensed under <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/t/terms">YouTube Standard License</ext-link>. <bold>(C)</bold> LVG. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?app=desktop&#x00026;v=Kr-R7d40_s0">Raw Video: Egypt Protesters Clash with Police</ext-link>, uploaded by Associated Press via YouTube, licensed under <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/t/terms">YouTube Standard License</ext-link>. Reproduced with permission from Hassner et al. (<xref ref-type="bibr" rid="B26">2012</xref>), via <ext-link ext-link-type="uri" xlink:href="https://www.openu.ac.il/home/hassner/data/violentflows/">Violent Flows - Crowd Violence Database</ext-link>. <bold>(D)</bold> F. 
&#x0201C;<ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=93FMl6nujr4">Antifa Tries To Beat Man With Metal Baton and Gets Knocked Out In One Shot</ext-link>,&#x0201D; uploaded by American Dream via YouTube, licensed under <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/t/terms">YouTube Standard License</ext-link>. Reproduced with permission from Soliman et al. (<xref ref-type="bibr" rid="B56">2019</xref>) via &#x0201C;<ext-link ext-link-type="uri" xlink:href="https://www.kaggle.com/datasets/mohamedmustafa/real-life-violence-situations-dataset">Real Life Violence Dataset</ext-link>&#x0201D;. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=SA4iUvYlb4w">Boxing Random Strangers At A Gas Station In The Hood! <sup>&#x0002A;</sup>Gone Wrong<sup>&#x0002A;</sup></ext-link>, uploaded by Kvng Reke via YouTube, licensed under <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/t/terms">YouTube Standard License</ext-link>. Reproduced with permission from Soliman et al. (<xref ref-type="bibr" rid="B56">2019</xref>) via &#x0201C;<ext-link ext-link-type="uri" xlink:href="https://www.kaggle.com/datasets/mohamedmustafa/real-life-violence-situations-dataset">Real Life Violence Dataset</ext-link>.&#x0201D;</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1242690-g0002.tif"/>
</fig>
<p>In order to ensure uniformity among the collected videos, the frame-rate of all the videos is unified according to a frame-rate <italic>R</italic> of choice. Furthermore, a single video may contain <italic>instances</italic> of different classes at different time periods; thus, an entire video cannot be given a single label. Instead, we opted to identify the portions of each video where one of the classes occurs. Namely, we identify the instances of each class in every video collected and record these instances in an <italic>annotation table</italic>. Each row entry of this table defines a single instance of one of the relevant classes: the numeric ID of the video wherein the instance was found, the starting and ending time stamps of the instance, and the class to which the instance belongs. <xref ref-type="table" rid="T1">Table 1</xref> shows an example of an annotation table.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>An example annotation table describing five instances of the relevant classes occurring in three separate videos.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Video ID</bold></th>
<th valign="top" align="left"><bold>Starting time</bold></th>
<th valign="top" align="left"><bold>Ending time</bold></th>
<th valign="top" align="left"><bold>Class</bold></th>
</tr></thead>
<tbody>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="left">00:00:30</td>
<td valign="top" align="left">00:01:30</td>
<td valign="top" align="left"><bold>LVG</bold></td>
</tr> <tr>
<td valign="top" align="left">1</td>
<td valign="top" align="left">00:02:03</td>
<td valign="top" align="left">00:02:21</td>
<td valign="top" align="left"><bold>N</bold></td>
</tr> <tr>
<td valign="top" align="left">2</td>
<td valign="top" align="left">00:00:35</td>
<td valign="top" align="left">00:00:36</td>
<td valign="top" align="left"><bold>LPG</bold></td>
</tr> <tr>
<td valign="top" align="left">2</td>
<td valign="top" align="left">00:01:25</td>
<td valign="top" align="left">00:01:29</td>
<td valign="top" align="left"><bold>F</bold></td>
</tr> <tr>
<td valign="top" align="left">3</td>
<td valign="top" align="left">00:00:00</td>
<td valign="top" align="left">00:00:03</td>
<td valign="top" align="left"><bold>N</bold></td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Bold values indicate the class type.</p>
</table-wrap-foot>
</table-wrap>
<p>To facilitate model training on the <bold>MSV-PG</bold> dataset, each labeled sample should be of equal length. Thus, a length of <italic>N</italic> seconds is chosen for each training/validation sample. Each sample is a sequence of frames extracted from one of the videos. Assuming that an instance of class <italic>C</italic><sub><italic>i</italic></sub> occurs in the timespan from (<italic>h</italic><sub><italic>i</italic></sub>:<italic>m</italic><sub><italic>i</italic></sub>:<italic>s</italic><sub><italic>i</italic></sub>) to (<italic>h</italic><sub><italic>f</italic></sub>:<italic>m</italic><sub><italic>f</italic></sub>:<italic>s</italic><sub><italic>f</italic></sub>) of video <italic>V</italic><sub><italic>i</italic></sub>, the frames of that time range are extracted. Subsequently, a sliding window of length <italic>R</italic><sub><italic>N</italic></sub>, the number of frames per sample, is moved through the frames of the timespan from (<italic>h</italic><sub><italic>i</italic></sub>:<italic>m</italic><sub><italic>i</italic></sub>:<italic>s</italic><sub><italic>i</italic></sub>) to (<italic>h</italic><sub><italic>f</italic></sub>:<italic>m</italic><sub><italic>f</italic></sub>:<italic>s</italic><sub><italic>f</italic></sub>). Note that the number of frames per sample is equal to the length of the sample, in seconds, multiplied by the frame-rate of the video; thus, <italic>R</italic><sub><italic>N</italic></sub> &#x0003D; <italic>R</italic>&#x000D7;<italic>N</italic>. Since consecutive samples are almost identical (they share <italic>R</italic><sub><italic>N</italic></sub>&#x02212;1 frames), adding all consecutive samples inside an instance to the dataset would inflate the size of the dataset while providing minimal additional information. Instead, we define a <italic>stride</italic> parameter <italic>S</italic> that specifies the number of frames that the sliding window skips after extracting each sample from a given instance. Additionally, long instances may bias the dataset in both the training and validation phases. To prevent this, we define a parameter <italic>E</italic><sub><italic>max</italic></sub> that represents the maximum number of samples to be extracted from a single instance. If the number of samples that can be extracted from an instance exceeds <italic>E</italic><sub><italic>max</italic></sub>, we extract <italic>E</italic><sub><italic>max</italic></sub> samples with an equal number of frames between consecutive samples.</p>
<p>In the final stage, we label each sample with its class label <italic>C</italic><sub><italic>i</italic></sub>, the class of the instance from which that sample was extracted. The procedure of building the dataset is described in <xref ref-type="table" rid="T6">Algorithm 1</xref>.</p>
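<p>The following minimal Python sketch illustrates this windowed extraction for a single instance (complementing Algorithm 1 below). It is a simplified sketch, not our exact implementation: it assumes the frames of the instance have already been decoded into a list, and the function name and signature are ours.</p>
<preformat>
# Minimal sketch of the sample-extraction step (cf. Algorithm 1).
# frames: list of decoded frames of one instance
# R_N:    frames per sample; S: stride; E_max: cap on samples per instance
def extract_samples(frames, R_N, S, E_max):
    # Candidate window starts, spaced by the stride S.
    starts = list(range(0, len(frames) - R_N + 1, S))
    if len(starts) &gt; E_max:
        # Long instance: spread E_max windows evenly across it instead,
        # keeping an equal number of frames between consecutive samples.
        step = (len(frames) - R_N) / (E_max - 1)
        starts = [round(i * step) for i in range(E_max)]
    return [frames[s:s + R_N] for s in starts]
</preformat>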
<table-wrap position="float" id="T6">
<label>Algorithm 1</label>
<caption><p>The Procedure for building the <bold>MSV-PG</bold> dataset given a set of videos and the annotation table that records when instances of the relevant classes occur in those videos.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1242690-i0001.tif"/>
</table-wrap></sec>
<sec>
<title>2.2 Dataset summary</title>
<p>The frame-rate of the videos collected is set to 10 frames-per-second (FPS), which is a reasonable frame-rate that allows us to analyze videos in sufficient detail without requiring excess storage space. The length <italic>N</italic> of each sample is chosen to be 2 s, which is the minimum duration of any instance that can be recorded given the format of our annotation table, since the shortest instance that can be recorded is a 2-s instance that starts at timestamp <italic>h</italic>:<italic>m</italic>:<italic>s</italic> and ends at timestamp <italic>h</italic>:<italic>m</italic>:(<italic>s</italic>&#x0002B;1). Given the 10 FPS frame-rate and the 2-s length of each sample, <italic>R</italic><sub><italic>N</italic></sub>, the number of frames per sample, is 10 &#x000D7; 2 = 20 frames. In our experiments, we feed a DL architecture with 10 of the 20 frames of each sample, skipping every second frame. The stride parameter <italic>S</italic> is set to 10 frames, meaning that consecutive samples from the same instance share 10 frames, or 1 s. Setting <italic>S</italic> to 10 frames allows the trained model to thoroughly learn the actions in the videos without requiring too much storage space. Finally, the maximum number of samples to be extracted from any instance, <italic>E</italic><sub><italic>max</italic></sub>, was set to 200 samples. The 200-sample limit was chosen to achieve a good balance between limiting storage space and producing a sufficiently rich dataset.</p>
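<p>As a worked example, these choices translate into the following concrete parameter values (the variable names are ours):</p>
<preformat>
R = 10                                # unified frame-rate (FPS)
N = 2                                 # sample length in seconds
R_N = R * N                           # frames per sample: 20
S = 10                                # stride: consecutive samples share 1 s
E_max = 200                           # cap on samples per instance
# Only every second frame of a sample is fed to the network,
# i.e., 10 of the 20 frames:
fed_indices = list(range(0, R_N, 2))  # [0, 2, 4, ..., 18]
</preformat>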
<p>The <bold>MSV-PG</bold> dataset consists of <bold>1,400 videos</bold>. The total duration of the instances in the dataset is &#x0007E;<bold>30 h</bold>. The length distribution (in seconds) of the instances is shown in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>The length distribution of the instances in the <bold>MSV-PG</bold> dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1242690-g0003.tif"/>
</fig></sec>
<sec>
<title>2.3 Model training and testing</title>
<p>In this section, we discuss the models adopted to validate the <bold>MSV-PG</bold> dataset. In particular, we adopt three different DL models for the validation: (i) an 18-layer ResNet3D model (Hara et al., <xref ref-type="bibr" rid="B25">2017</xref>), (ii) a Tiny Swin Transformer model (Liu et al., <xref ref-type="bibr" rid="B39">2021</xref>), and (iii) an R(2 &#x0002B; 1)D model (Tran et al., <xref ref-type="bibr" rid="B63">2018</xref>). These models have produced state-of-the-art results in vision tasks and were thus chosen for the task at hand. The training and testing details are outlined below.</p>
<sec>
<title>2.3.1 Deep learning models</title>
<sec>
<title>2.3.1.1 ResNet3D model</title>
<p>Residual networks (ResNets) were first introduced for image classification (He et al., <xref ref-type="bibr" rid="B28">2016</xref>). The architecture introduces residual connections, called short-cut connections, between non-consecutive convolutional layers. The purpose of these short-cut connections is to overcome the degradation of training accuracy that occurs when layers are added to a DL model (He and Sun, <xref ref-type="bibr" rid="B27">2015</xref>; Srivastava et al., <xref ref-type="bibr" rid="B58">2015</xref>). ResNet3D (Hara et al., <xref ref-type="bibr" rid="B25">2017</xref>) extends this design to 3D convolutions so that spatio-temporal features can be learned from videos.</p>
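<p>A minimal PyTorch sketch of a 3D residual block with a short-cut connection is given below; it illustrates the idea rather than reproducing the exact ResNet3D implementation of Hara et al. (<xref ref-type="bibr" rid="B25">2017</xref>).</p>
<preformat>
import torch.nn as nn

# A basic 3D residual block: two convolutions plus a short-cut
# connection that adds the block input to its output.
class BasicBlock3D(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                         # short-cut connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)     # residual addition
</preformat>
</sec>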
<sec>
<title>2.3.1.2 R(2 &#x0002B; 1)D model</title>
<p>An R(2 &#x0002B; 1)D model (Tran et al., <xref ref-type="bibr" rid="B63">2018</xref>) utilizes (2 &#x0002B; 1)D convolutions to approximate conventional 3D convolutions. The (2 &#x0002B; 1)D convolutions split the computation into a spatial 2D convolution followed by a temporal 1D convolution. Splitting the computation into two steps increases the complexity of the functions that can be represented, owing to the additional ReLU between the 2D and 1D convolutions, and renders optimization easier. The (2 &#x0002B; 1)D convolutions are also computationally cheaper than 3D convolutions.</p>
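<p>The sketch below shows one way to express a (2 &#x0002B; 1)D convolution in PyTorch. The intermediate channel width is illustrative; Tran et al. (<xref ref-type="bibr" rid="B63">2018</xref>) choose it so that the parameter count of the factorized block matches that of the corresponding full 3D convolution.</p>
<preformat>
import torch.nn as nn

# Sketch of a (2 + 1)D convolution: a spatial 2D convolution (1 x k x k)
# followed by a ReLU and a temporal 1D convolution (t x 1 x 1).
class Conv2Plus1D(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch, k=3, t=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2), bias=False)
        # The extra ReLU is the additional non-linearity mentioned above.
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0), bias=False)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.temporal(self.relu(self.spatial(x)))
</preformat>
</sec>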
<sec>
<title>2.3.1.3 The Swin Vision Transformer</title>
<p>Transformer-based DL models have provided state-of-the-art performance for many computer vision problems in recent years (Chromiak, <xref ref-type="bibr" rid="B10">2021</xref>). The Swin Transformer is one of the transformer architectures used in many computer vision works as a general backbone for both image- and video-based problems. In this paper, the Video Swin architecture (Liu et al., <xref ref-type="bibr" rid="B40">2022</xref>), which was proposed for video recognition, is used.</p>
<p>The most important feature that distinguishes the Swin Transformer from other transformer-based models is that its computational complexity increases linearly with image resolution. In other models, the computational complexity is quadratic in image resolution, since the attention matrix is computed among all the tokens of the image. Generating pixel-level features is critical in vision problems such as image segmentation and object detection. However, the quadratic computational complexity of the attention matrix prevents the use of patches that would enable the extraction of features at the pixel level in high-resolution images. In the Swin Transformer architecture, attention matrices are computed locally in non-overlapping windows. Since the number of patches in each window is fixed, the computational complexity grows linearly with the image resolution. In addition, the Swin Transformer generates features in a hierarchical manner: in the first layers, small patches are used, while in subsequent layers, neighboring patches are gradually merged.</p>
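<p>A short sketch of the non-overlapping window partition, written after the general Swin design rather than copied from any particular implementation, is shown below; self-attention is subsequently computed within each window, which is what keeps the complexity linear in the number of patches.</p>
<preformat>
import torch

# Partition patch features into non-overlapping (ws x ws) windows.
# x: (B, H, W, C) patch features, with H and W divisible by ws.
def window_partition(x, ws):
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    # Each row of the result is one window of ws*ws patch tokens;
    # attention matrices are then computed per window.
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
</preformat>
</sec></sec></sec>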
<sec>
<title>2.4 Training setup</title>
<p>For training and validation, we aimed to use 80% of the samples of each class in the dataset for training and 20% for validation. However, we also require that the samples extracted from a given video be used exclusively for training or exclusively for validation. The purpose of this requirement is to make the training and validation sets totally independent to avoid biasing the DL network. In order to achieve a split that approximates the desired 80&#x02013;20 split for each class while satisfying the requirement that the sets of videos used in the training and validation phases be disjoint, we used a simple random search method. At each iteration, a randomly-sized set of random videos from our video set is assigned for training and the rest of the videos are assigned for validation. The per-class training/validation split is then calculated. After 2 h of searching, the video split with the best per-class ratios (closest to 80:20 for each class) is used for training and validation. In our experiment, we used 1,121 videos for training the Swin Transformer model and 279 videos for validation. Furthermore, the number of instances and samples of each class used for training and validation is provided in <xref ref-type="table" rid="T2">Table 2</xref>. From <xref ref-type="table" rid="T2">Table 2</xref>, we note that the training/validation splits for each class are as follows: (1) <bold>N</bold>&#x02014;79.72%/20.28%, (2) <bold>LPG</bold>&#x02014;79.03%/20.97%, (3) <bold>LVG</bold>&#x02014;80.01%/19.99%, and (4) <bold>F</bold>&#x02014;79.80%/20.20%. Also note that there exists a significant degree of imbalance in the number of samples per class; the dataset consists of 36% <bold>N</bold> samples, 44% <bold>LPG</bold> samples, 10% <bold>LVG</bold> samples, and 10% <bold>F</bold> samples. This is because the duration of violence is usually brief compared to the duration of peaceful events, given that violence is an anomalous human behavior. Despite this, we observe that the three chosen DL models are able to recognize the general form of the four classes of interest by learning from the <bold>MSV-PG</bold> dataset. <bold>The full dataset can be made available upon contacting the corresponding author of this work</bold>.</p>
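<p>A minimal sketch of this random search is given below. It assumes a mapping from each video ID to its per-class sample counts (the names and the fixed iteration budget, in place of our wall-clock budget, are illustrative), and it returns the video-level split whose per-class training ratios are closest to the 80:20 target.</p>
<preformat>
import random

def search_split(video_ids, counts, target=0.8, iters=100000):
    # counts: dict mapping each video id to a dict of per-class sample counts
    totals = {c: sum(counts[v].get(c, 0) for v in video_ids)
              for c in ('N', 'LPG', 'LVG', 'F')}
    best, best_err = None, float('inf')
    for _ in range(iters):
        # Randomly sized random subset of videos assigned for training.
        k = random.randint(1, len(video_ids) - 1)
        train = set(random.sample(video_ids, k))
        # Worst per-class deviation from the desired training ratio.
        err = max(abs(sum(counts[v].get(c, 0) for v in train)
                      / totals[c] - target) for c in totals)
        if err &lt; best_err:
            best, best_err = train, err
    return best
</preformat>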
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Number of instances and samples per class used for training and validation.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Class</bold></th>
<th valign="top" align="left"><bold>Training samples</bold></th>
<th valign="top" align="left"><bold>Validation samples</bold></th>
<th valign="top" align="left"><bold>Training instances</bold></th>
<th valign="top" align="left"><bold>Validation instances</bold></th>
</tr></thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>N</bold></td>
<td valign="top" align="left">23,152</td>
<td valign="top" align="left">5,889</td>
<td valign="top" align="left">816</td>
<td valign="top" align="left">103</td>
</tr> <tr>
<td valign="top" align="left"><bold>LPG</bold></td>
<td valign="top" align="left">27,952</td>
<td valign="top" align="left">7,418</td>
<td valign="top" align="left">1,240</td>
<td valign="top" align="left">223</td>
</tr> <tr>
<td valign="top" align="left"><bold>LVG</bold></td>
<td valign="top" align="left">6,478</td>
<td valign="top" align="left">1,618</td>
<td valign="top" align="left">865</td>
<td valign="top" align="left">222</td>
</tr> <tr>
<td valign="top" align="left"><bold>F</bold></td>
<td valign="top" align="left">6,584</td>
<td valign="top" align="left">1,667</td>
<td valign="top" align="left">1,194</td>
<td valign="top" align="left">344</td>
</tr> <tr>
<td valign="top" align="left">Total</td>
<td valign="top" align="left">64,166</td>
<td valign="top" align="left">16,592</td>
<td valign="top" align="left">4,115</td>
<td valign="top" align="left">892</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Bold values indicate the class type.</p>
</table-wrap-foot>
</table-wrap></sec></sec>
<sec sec-type="results" id="s3">
<title>3 Results</title>
<p>Models pre-trained on the Kinetics-400 dataset are used in all the experiments. We train these models for five epochs and report the best result. In all the experiments performed, we set the learning rate to 0.0001 for the pre-trained layers and 0.001 for the randomly initialized classification layer. We decay both learning rates three times per epoch by a factor of 0.9. A standard SGD optimizer is used, with momentum and weight decay set to 0.9 and 0.0001, respectively. In all the experiments conducted, the input images are resized to 224 &#x000D7; 224 via bi-cubic interpolation and the batch size is set to 16. We applied random horizontal flipping as an augmentation technique during training.</p>
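<p>In PyTorch terms, this setup corresponds roughly to the following sketch. The torchvision 18-layer ResNet3D serves as a stand-in model, and the details are illustrative rather than our exact training code; the scheduler is stepped three times per epoch.</p>
<preformat>
import torch
from torchvision import transforms
from torchvision.models.video import r3d_18

model = r3d_18(pretrained=True)                      # Kinetics-400 weights
model.fc = torch.nn.Linear(model.fc.in_features, 4)  # 4 MSV-PG classes

# Separate learning rates for the pre-trained backbone and the new head.
head_ids = {id(p) for p in model.fc.parameters()}
backbone = [p for p in model.parameters() if id(p) not in head_ids]
optimizer = torch.optim.SGD(
    [{'params': backbone, 'lr': 1e-4},               # pre-trained layers
     {'params': model.fc.parameters(), 'lr': 1e-3}], # classification layer
    momentum=0.9, weight_decay=1e-4)

# Multiplies both learning rates by 0.9 on each call to step(),
# which is invoked three times per epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

train_transform = transforms.Compose([
    transforms.Resize((224, 224),
                      interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.RandomHorizontalFlip(),
])
</preformat>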
<p>In our experiments, two types of accuracy scores were recorded, a &#x0201C;sample accuracy&#x0201D; and an &#x0201C;instance accuracy.&#x0201D; The sample accuracy of a model is obtained by performing inference on all samples in the validation set, then dividing the number of correctly-classified validation samples by the total number of validation samples. On the other hand, the instance accuracy is recorded by first performing inference on all samples inside an instance. Then, if the class to which most samples inside the instance are classified matches the label of the instance, the number of correctly-classified instances is incremented by one. The number of correctly-classified instances is divided by the total number of instances in the validation set to obtain the instance accuracy of the model.</p>
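<p>The two scores can be computed as in the following sketch, which assumes a mapping from each validation instance to the list of predicted labels of its samples and a mapping to the ground-truth label of each instance (the function and variable names are ours):</p>
<preformat>
from collections import Counter

# predictions: dict mapping each instance id to the list of predicted
#              labels of its samples
# labels:      dict mapping each instance id to its ground-truth class
def sample_accuracy(predictions, labels):
    correct = sum(sum(p == labels[i] for p in preds)
                  for i, preds in predictions.items())
    total = sum(len(preds) for preds in predictions.values())
    return correct / total

def instance_accuracy(predictions, labels):
    # An instance counts as correct if the majority vote over its
    # samples matches the instance label.
    correct = sum(Counter(preds).most_common(1)[0][0] == labels[i]
                  for i, preds in predictions.items())
    return correct / len(predictions)
</preformat>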
<sec>
<title>3.1 Performance analysis</title>
<p>In this section, we present the performance of the three adopted deep learning networks on the developed <bold>MSV-PG</bold> dataset. Then, in Section 4, we examine some validation samples whose assigned label does not match the output of our trained Swin Transformer model and show that the appropriate labels of some samples are indeed ambiguous.</p>
<sec>
<title>3.1.1 Performance evaluation results</title>
<p>The sample accuracy and instance accuracy scores for the validation set of each class, using the three different architectures adopted, are shown in <xref ref-type="table" rid="T3">Table 3</xref>. The results demonstrate that the three architectures were able to adequately learn the <bold>MSV-PG</bold> dataset. The performance indicates that the dataset is well-labeled and can be effectively used in real-world applications. In addition, the sample and instance confusion matrices for the Swin Transformer model, the best-performing model out of the three used, are shown in <xref ref-type="table" rid="T4">Tables 4</xref>, <xref ref-type="table" rid="T5">5</xref>, respectively.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Performance (accuracy) on the <bold>MSV-PG</bold> dataset using the R(2 &#x0002B; 1)D, ResNet3D, and Swin Transformer.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th/>
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="left"><bold>N</bold></th>
<th valign="top" align="left"><bold>LPG</bold></th>
<th valign="top" align="left"><bold>LVG</bold></th>
<th valign="top" align="left"><bold>F</bold></th>
<th valign="top" align="left"><bold>Overall</bold></th>
</tr></thead>
<tbody>
<tr>
<td/>
<td valign="top" align="left">R(2 &#x0002B; 1)D</td>
<td valign="top" align="left"><bold>96.32</bold></td>
<td valign="top" align="left">85.74</td>
<td valign="top" align="left"><bold>83.00</bold></td>
<td valign="top" align="left">86.86</td>
<td valign="top" align="left">89.34</td>
</tr>
 <tr>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">ResNet3D</td>
<td valign="top" align="left">94.43</td>
<td valign="top" align="left">88.34</td>
<td valign="top" align="left">68.67</td>
<td valign="top" align="left">86.26</td>
<td valign="top" align="left">88.37</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Swin</td>
<td valign="top" align="left">93.62</td>
<td valign="top" align="left"><bold>89.04</bold></td>
<td valign="top" align="left">77.94</td>
<td valign="top" align="left"><bold>90.82</bold></td>
<td valign="top" align="left"><bold>89.76</bold></td>
</tr> <tr>
<td/>
<td valign="top" align="left">R(2 &#x0002B; 1)D</td>
<td valign="top" align="left"><bold>94.17</bold></td>
<td valign="top" align="left">77.13</td>
<td valign="top" align="left"><bold>85.59</bold></td>
<td valign="top" align="left">88.08</td>
<td valign="top" align="left">85.43</td>
</tr>
 <tr>
<td valign="top" align="left">Instance</td>
<td valign="top" align="left">ResNet3D</td>
<td valign="top" align="left">91.26</td>
<td valign="top" align="left">77.13</td>
<td valign="top" align="left">75.68</td>
<td valign="top" align="left">88.37</td>
<td valign="top" align="left">82.74</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Swin</td>
<td valign="top" align="left">90.29</td>
<td valign="top" align="left"><bold>82.06</bold></td>
<td valign="top" align="left">81.53</td>
<td valign="top" align="left"><bold>92.44</bold></td>
<td valign="top" align="left"><bold>86.88</bold></td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Bold values indicate highest performance achieved per class among the tested models.</p>
</table-wrap-foot>
</table-wrap><table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Sample confusion matrix of the Swin Transformer.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th/>
<th valign="top" align="left"><bold>N</bold></th>
<th valign="top" align="left"><bold>LPG</bold></th>
<th valign="top" align="left"><bold>LVG</bold></th>
<th valign="top" align="left"><bold>F</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>N</bold></td>
<td valign="top" align="left">93.62</td>
<td valign="top" align="left">02.29</td>
<td valign="top" align="left">00.22</td>
<td valign="top" align="left">03.87</td>
</tr> <tr>
<td valign="top" align="left"><bold>LPG</bold></td>
<td valign="top" align="left">06.27</td>
<td valign="top" align="left">89.04</td>
<td valign="top" align="left">04.48</td>
<td valign="top" align="left">00.22</td>
</tr> <tr>
<td valign="top" align="left"><bold>LVG</bold></td>
<td valign="top" align="left">02.66</td>
<td valign="top" align="left">10.57</td>
<td valign="top" align="left">77.94</td>
<td valign="top" align="left">08.84</td>
</tr> <tr>
<td valign="top" align="left"><bold>F</bold></td>
<td valign="top" align="left">05.40</td>
<td valign="top" align="left">00.24</td>
<td valign="top" align="left">03.54</td>
<td valign="top" align="left">90.82</td>
</tr></tbody>
</table>
</table-wrap><table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Instance confusion matrix of the Swin Transformer.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th/>
<th valign="top" align="left"><bold>N</bold></th>
<th valign="top" align="left"><bold>LPG</bold></th>
<th valign="top" align="left"><bold>LVG</bold></th>
<th valign="top" align="left"><bold>F</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>N</bold></td>
<td valign="top" align="left">90.29</td>
<td valign="top" align="left">00.97</td>
<td valign="top" align="left">00.00</td>
<td valign="top" align="left">08.74</td>
</tr> <tr>
<td valign="top" align="left"><bold>LPG</bold></td>
<td valign="top" align="left">09.87</td>
<td valign="top" align="left">82.06</td>
<td valign="top" align="left">08.07</td>
<td valign="top" align="left">00.00</td>
</tr> <tr>
<td valign="top" align="left"><bold>LVG</bold></td>
<td valign="top" align="left">01.80</td>
<td valign="top" align="left">10.36</td>
<td valign="top" align="left">81.53</td>
<td valign="top" align="left">06.31</td>
</tr> <tr>
<td valign="top" align="left"><bold>F</bold></td>
<td valign="top" align="left">03.49</td>
<td valign="top" align="left">00.29</td>
<td valign="top" align="left">03.78</td>
<td valign="top" align="left">92.44</td>
</tr></tbody>
</table>
</table-wrap></sec></sec></sec>
<sec sec-type="discussion" id="s4">
<title>4 Discussion</title>
<p>Over the last several years, significant advances have been made in the domain of video analysis using DL (Sharma et al., <xref ref-type="bibr" rid="B52">2021</xref>). DL-based video processing techniques usually focus on human action recognition (Huang et al., <xref ref-type="bibr" rid="B30">2015</xref>; Sudhakaran and Lanz, <xref ref-type="bibr" rid="B59">2017</xref>; Arif et al., <xref ref-type="bibr" rid="B2">2019</xref>; Dhiman and Vishwakarma, <xref ref-type="bibr" rid="B14">2019</xref>; Mazzia et al., <xref ref-type="bibr" rid="B42">2022</xref>), anomaly detection (Sabokrou et al., <xref ref-type="bibr" rid="B48">2018</xref>; Nayak et al., <xref ref-type="bibr" rid="B44">2021</xref>), and behavior analysis (G&#x000F3;mez A et al., <xref ref-type="bibr" rid="B21">2015</xref>; Marsden et al., <xref ref-type="bibr" rid="B41">2017</xref>; S&#x000E1;nchez et al., <xref ref-type="bibr" rid="B50">2020</xref>). These techniques often utilize convolutional neural networks (CNNs) (Ji et al., <xref ref-type="bibr" rid="B32">2012</xref>; Karpathy et al., <xref ref-type="bibr" rid="B33">2014</xref>; Simonyan and Zisserman, <xref ref-type="bibr" rid="B55">2014</xref>; Xu et al., <xref ref-type="bibr" rid="B67">2015</xref>; Feichtenhofer et al., <xref ref-type="bibr" rid="B18">2016</xref>; Sahoo et al., <xref ref-type="bibr" rid="B49">2019</xref>; Elboushaki et al., <xref ref-type="bibr" rid="B17">2020</xref>). Tran et al. (<xref ref-type="bibr" rid="B62">2015</xref>) proposed deep 3D CNNs to allow for the extraction of spatio-temporal features for human action recognition tasks. Carreira and Zisserman (<xref ref-type="bibr" rid="B9">2017</xref>) then introduced the Two-Stream Inflated 3D (I3D) ConvNet, which inflates the 2D ConvNets of CNN-based image classification models into 3D ConvNets for video analysis, and tested these architectures on the Kinetics (Kay et al., <xref ref-type="bibr" rid="B34">2017</xref>) video dataset. However, 3D CNNs suffer from short-term memory and are often only capable of learning human actions that occur within 1&#x02013;16 frames (Varol et al., <xref ref-type="bibr" rid="B64">2017</xref>). To counter this limitation, Shi et al. (<xref ref-type="bibr" rid="B53">2015</xref>) propose convolutional LSTMs, which replace the fully-connected input-to-state and state-to-state transitions of conventional LSTMs, a variant of RNNs, with convolutional transitions that allow for the encoding of spatial features. Furthermore, the literature includes other works where RNNs were used for a wide variety of applications including group activity recognition (Ibrahim et al., <xref ref-type="bibr" rid="B31">2016</xref>), facial expression recognition (Guo et al., <xref ref-type="bibr" rid="B24">2019</xref>), video segmentation (Siam et al., <xref ref-type="bibr" rid="B54">2017</xref>), anomaly detection (Murugesan and Thilagamani, <xref ref-type="bibr" rid="B43">2020</xref>), target tracking (Gao et al., <xref ref-type="bibr" rid="B19">2019</xref>), face recognition (Gong et al., <xref ref-type="bibr" rid="B22">2019</xref>), and background estimation (Savakis and Shringarpure, <xref ref-type="bibr" rid="B51">2018</xref>). Many hybrids of these two types of DL architectures have also been proposed in the literature. For instance, Arif et al. (<xref ref-type="bibr" rid="B2">2019</xref>) combine 3D CNNs and LSTMs for different action recognition tasks, while Yadav et al. (<xref ref-type="bibr" rid="B68">2019</xref>) use 2D CNNs in combination with LSTMs to recognize different yoga postures. Finally, Wang et al. (<xref ref-type="bibr" rid="B66">2019</xref>) combine the I3D network with LSTMs by extracting low-level features of video frames from the I3D network and feeding them into LSTMs to achieve human action recognition.</p>
<p>Recently, transformer-based architectures have attracted significant attention. Transformers use self-attention to learn relationships between the elements of a sequence, which allows them to attend to long-term dependencies, unlike Recurrent Neural Networks (RNNs), which process elements iteratively. Furthermore, transformers are more scalable to very large capacity models (Lepikhin et al., <xref ref-type="bibr" rid="B38">2020</xref>). Finally, transformers assume less prior knowledge about the structure of the problem than CNNs and RNNs do (Hochreiter and Schmidhuber, <xref ref-type="bibr" rid="B29">1997</xref>; LeCun et al., <xref ref-type="bibr" rid="B37">2015</xref>; Goodfellow et al., <xref ref-type="bibr" rid="B23">2016</xref>). These advantages have led to their success in many computer vision tasks such as image recognition (Dosovitskiy et al., <xref ref-type="bibr" rid="B16">2021</xref>; Touvron et al., <xref ref-type="bibr" rid="B61">2021</xref>) and object detection (Carion et al., <xref ref-type="bibr" rid="B6">2020</xref>; Zhu et al., <xref ref-type="bibr" rid="B69">2020</xref>).</p>
<p>Dosovitskiy et al. (<xref ref-type="bibr" rid="B15">2020</xref>) proposed ViT, which achieved promising results in image classification tasks by modeling the relationship (attention) between the spatial patches of an image using the standard transformer encoder (Vaswani et al., <xref ref-type="bibr" rid="B65">2017</xref>). After ViT, many transformer-based video recognition methods (Arnab et al., <xref ref-type="bibr" rid="B3">2021</xref>; Bertasius et al., <xref ref-type="bibr" rid="B5">2021</xref>; Liu et al., <xref ref-type="bibr" rid="B39">2021</xref>; Neimark et al., <xref ref-type="bibr" rid="B45">2021</xref>) have been proposed. In these works, different techniques have been developed for temporal attention as well as spatial attention.</p>
<p>Early video datasets for action recognition include the Hollywood (Laptev et al., <xref ref-type="bibr" rid="B36">2008</xref>), UCF50 (Reddy and Shah, <xref ref-type="bibr" rid="B47">2013</xref>), UCF101 (Soomro et al., <xref ref-type="bibr" rid="B57">2012</xref>), and HMDB-51 (Kuehne et al., <xref ref-type="bibr" rid="B35">2011</xref>) datasets. The Hollywood dataset provides annotated movie clips. Each clip in the HMDB-51 dataset belongs to one of 51 classes, including &#x0201C;push,&#x0201D; &#x0201C;sit,&#x0201D; &#x0201C;clap,&#x0201D; &#x0201C;eat,&#x0201D; and &#x0201C;walk,&#x0201D; while the UCF50 and UCF101 datasets consist of YouTube clips grouped into one of 50 and 101 action categories, respectively. Examples of action classes in the UCF50 dataset include &#x0201C;Basketball Shooting&#x0201D; and &#x0201C;Pull Ups,&#x0201D; while the action classes in UCF101 cover a wider spectrum subdivided into five different categories, namely body motion, human-human interactions, human-object interactions, playing musical instruments, and sports. The Kinetics datasets (Kay et al., <xref ref-type="bibr" rid="B34">2017</xref>; Carreira et al., <xref ref-type="bibr" rid="B7">2018</xref>, <xref ref-type="bibr" rid="B8">2019</xref>), more recent benchmarks, significantly increase the number of classes over prior action classification datasets, to 400, 600, and 700 action classes, respectively. The aforementioned pre-existing datasets are useful for testing different DL architectures but are not necessarily useful for specific practical tasks, such as surveillance, which likely require the distinction between a limited number of specific action classes.</p>
<p>In terms of public datasets that encompass violent scenery, a dataset focused on violence detection in movies is proposed by Demarty et al. (<xref ref-type="bibr" rid="B13">2014</xref>). Movie clips in this dataset are annotated as violent or non-violent scenes. Bermejo Nievas et al. (<xref ref-type="bibr" rid="B4">2011</xref>) introduce a database of 1,000 videos divided into two groups, namely fights and non-fights. Hassner et al. (<xref ref-type="bibr" rid="B26">2012</xref>) propose the Violent Flows dataset, which focuses on crowd violence and contains two classes: violence and non-violence. Sultani et al. (<xref ref-type="bibr" rid="B60">2018</xref>) collected the UCF-Crime dataset, which includes clips of fighting among other crime classes (e.g., road accident, burglary, robbery, etc.).</p>
<p>Perez et al. (<xref ref-type="bibr" rid="B46">2019</xref>) proposed CCTV-fights, a dataset of 1,000 videos whose cumulative length exceeds 8 h of real fights caught by CCTV cameras. Akt&#x00131; et al. (<xref ref-type="bibr" rid="B1">2019</xref>) put forward a dataset of 300 videos divided equally into two classes: fight and non-fight. UBI-fights (Degardin and Proen&#x000E7;a, <xref ref-type="bibr" rid="B12">2020</xref>) is another dataset which distinguishes between fighting and non-fighting videos.</p>
<p>We note that none of the aforementioned datasets is independently usable for our application. The Hollywood (Laptev et al., <xref ref-type="bibr" rid="B36">2008</xref>) dataset does not include classes relevant to our desired application. On the other hand, the UCF50 (Reddy and Shah, <xref ref-type="bibr" rid="B47">2013</xref>), UCF101 (Soomro et al., <xref ref-type="bibr" rid="B57">2012</xref>), HMDB-51 (Kuehne et al., <xref ref-type="bibr" rid="B35">2011</xref>), and Kinetics (Kay et al., <xref ref-type="bibr" rid="B34">2017</xref>) datasets are not sufficiently focused on the task of violence detection, as they also include a vast range of actions that are irrelevant to violence-detection applications. Training a DL model on a dataset that covers a vast number of actions, while generally useful, is potentially detrimental when the desired application is only interested in a small subset of the actions included in that dataset. Instead, it&#x00027;s preferable to limit the number of classes in a dataset to ensure that the trained DL model is highly specialized in recognizing certain behaviors with high accuracy. Examples of datasets that are exclusively focused on violence detection are the Hockey (Bermejo Nievas et al., <xref ref-type="bibr" rid="B4">2011</xref>), Violent Flows (Hassner et al., <xref ref-type="bibr" rid="B26">2012</xref>), CCTV-fights (Perez et al., <xref ref-type="bibr" rid="B46">2019</xref>), SC Fight (Akt&#x00131; et al., <xref ref-type="bibr" rid="B1">2019</xref>), and UBI-fights (Degardin and Proen&#x000E7;a, <xref ref-type="bibr" rid="B12">2020</xref>) datasets. These datasets are specifically constructed for violence-detection tasks and are useful for an application such as ours. However, they classify human behavior along a single dimension (whether or not the behavior is violent), whereas our application seeks to recognize both the size and the violent nature of a crowd. Due to these limitations, we conclude that the field of smart surveillance requires a new dataset which classifies human behavior according to its extent as well as its violent nature.</p>
<p>No video dataset in the literature, to the best of our knowledge, contains large gatherings, such as protests, as an action class. Protest datasets in the literature are instead limited to image datasets (Clark and Regan, <xref ref-type="bibr" rid="B11">2016</xref>), which document protester demands, government responses, protest locations, and protester identities. Thus, the novelty of our developed video dataset is that it is specifically aimed toward the identification of scenarios of public unrest (violent protests, fights, etc.) or scenarios that have the potential to develop into public unrest (large gatherings, peaceful protests, etc.). Large gatherings are particularly important to monitor carefully, as they can lead to unruly events. Specifically, a large gathering that appears peaceful can evolve into a violent scenario involving fighting, destruction of property, and the like. In addition, the scale of the violence captured can inform the scale of the response from law enforcement. Thus, for the current task, we divide violence into small-scale violence (i.e., <bold>F</bold>) and large-scale violence (i.e., <bold>LVG</bold>). To our knowledge, these aspects have been largely neglected in existing datasets, which motivates this work.</p>
<p>From the confusion matrices in <xref ref-type="table" rid="T4">Tables 4</xref>, <xref ref-type="table" rid="T5">5</xref>, we note that ambiguity in labeling occurs mostly in samples whose appropriate label lies between <bold>LPG</bold>/<bold>N</bold> and <bold>F</bold>/<bold>LVG</bold>. Furthermore, we explain why a DL model could reasonably be expected to misclassify some <bold>LVG</bold> samples as <bold>LPG</bold> and some <bold>F</bold> samples as <bold>N</bold>. The confusion between these pairs of classes, in both the labeling and inference processes, accounts for the vast majority of the error illustrated in the confusion matrices in <xref ref-type="table" rid="T4">Tables 4</xref>, <xref ref-type="table" rid="T5">5</xref>.</p>
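<p>To make this error analysis concrete, the following minimal sketch shows how such a four-class confusion matrix can be computed and how the share of errors attributable to the ambiguous class pairs can be quantified. The sketch is purely illustrative: the label arrays are hypothetical stand-ins for the per-clip predictions behind <xref ref-type="table" rid="T4">Tables 4</xref>, <xref ref-type="table" rid="T5">5</xref>, and scikit-learn is an assumed dependency rather than part of our pipeline.</p>
<preformat>
# Illustrative sketch: quantifying pairwise confusion among the four classes.
# The label arrays below are hypothetical; they stand in for the per-clip
# ground-truth and predicted labels behind Tables 4, 5.
import numpy as np
from sklearn.metrics import confusion_matrix

CLASSES = ["N", "F", "LPG", "LVG"]           # indices 0..3

y_true = np.array([0, 1, 2, 3, 2, 1, 3, 0])  # ground-truth class indices
y_pred = np.array([0, 0, 2, 2, 3, 1, 3, 0])  # model predictions

cm = confusion_matrix(y_true, y_pred, labels=list(range(len(CLASSES))))

# Errors on the ambiguous pairs discussed in the text: LPG predicted as N
# (and vice versa), F predicted as LVG (and vice versa), plus the
# LVG-to-LPG and F-to-N directions analyzed in Sections 4.1-4.4.
pair_errors = cm[2, 0] + cm[0, 2] + cm[1, 3] + cm[3, 1] + cm[3, 2] + cm[1, 0]
total_errors = cm.sum() - np.trace(cm)

print(cm)
print("share of errors on the ambiguous pairs:", pair_errors / total_errors)
</preformat>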
<sec>
<title>4.1 LPG-N misclassification</title>
<p>We define the class of <bold>LPG</bold>s, according to the criteria outlined in Section 2.1, as events consisting of a &#x0201C;large&#x0201D; congregation of individuals who are gathered for a singular purpose. First, the threshold for the number of individuals required for a gathering to be considered a &#x0201C;large&#x0201D; one is inherently subjective.</p>
<p>Additionally, even if we decide that some video footage contains a &#x0201C;large&#x0201D; number of people, we must establish that the individuals in the footage are gathered for a singular purpose, as opposed to happening to be in one place by mere chance, before we classify said footage as <bold>LPG</bold>. This is because a large group of individuals who are gathered by chance, such as people in a public park on a holiday, is an example of a Natural (<bold>N</bold>) event. Given that our DL model classifies 2 s of footage at a time, we would expect the classifier to sometimes fail to account for the larger context of a gathering and thus misclassify some <bold>LPG</bold> samples as <bold>N</bold>, or vice versa.</p>
<p>Consider the sample in <xref ref-type="fig" rid="F4">Figure 4B</xref>. This sample consists of a group of individuals crossing the road at a traffic stop. It is intuitive that this sample constitutes an <bold>N</bold> sample, since a group of people crossing at a traffic light should not raise concern. However, because the sample is only 2 s long and the model receives no information about the context of the scene, a DL model might (and in this case did) misclassify this sample as an <bold>LPG</bold>, despite the fact that its appropriate label is clearly <bold>N</bold>.</p>
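<p>A minimal sketch of this clip-level inference regime, under the assumption of a generic PyTorch video classifier, illustrates why context is lost: the model only ever sees one 2-s window of frames at a time. The function and variable names below are ours for illustration and do not correspond to released code.</p>
<preformat>
# Illustrative sketch of context-free 2-s clip classification. `model` is
# any video classifier mapping a (B, C, T, H, W) tensor to (B, 4) logits;
# each non-overlapping 2-s window is classified in isolation, so scene
# context outside the window (e.g., why a crowd has formed) is invisible.
import torch

CLASSES = ["N", "F", "LPG", "LVG"]

def classify_stream(model, frames, fps=30, clip_seconds=2):
    """frames: float tensor of shape (C, T, H, W) for a decoded video."""
    clip_len = fps * clip_seconds
    labels = []
    model.eval()
    with torch.no_grad():
        for start in range(0, frames.shape[1] - clip_len + 1, clip_len):
            clip = frames[:, start:start + clip_len].unsqueeze(0)
            logits = model(clip)
            labels.append(CLASSES[logits.argmax(dim=1).item()])
    return labels  # one label per non-overlapping 2-s window
</preformat>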
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p><bold>(A)</bold> A sample whose appropriate label lies in the blurry area between <bold>LPG</bold> and <bold>N</bold> because of the subjective nature of the term &#x0201C;large.&#x0201D; <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=Ow3m0tfjzNo&#x00026;t=28s">Pop-up protest against the permanent pandemic legislation on Tuesday - 09.11.21</ext-link>, uploaded by Real Press Media via YouTube, licensed under <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/t/terms">YouTube Standard License</ext-link>. <bold>(B)</bold> A sample depicting a group of individuals who, by chance, are gathered near a traffic light. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=b8QZJ5ZodTs">Crowd walking on street</ext-link>, uploaded by LionReputationMarketingCoach via YouTube, licensed under <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/t/terms">YouTube Standard License</ext-link>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1242690-g0004.tif"/>
</fig></sec>
<sec>
<title>4.2 F-LVG misclassification</title>
<p>We define an <bold>LVG</bold> as a gathering of a &#x0201C;large&#x0201D; number of individuals engaged in violent actions; such violent actions include fighting that might occur among a subset of the individuals in the gathering. As with the <bold>N</bold>-<bold>LPG</bold> distinction, there is a subjective threshold on the number of people in a violent scene for it to be labeled as <bold>LVG</bold> instead of <bold>F</bold>. A scene where it is unclear whether the number of individuals depicted satisfies this subjective threshold is shown in <xref ref-type="fig" rid="F5">Figure 5A</xref>. The scene illustrates a group of individuals fighting one another; however, it is not clear whether the number of people in the scene is large enough to constitute a large violent gathering (<bold>LVG</bold>) as opposed to a small-scale fight (<bold>F</bold>).</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>An illustration of two scenes demonstrating the ambiguity between <bold>LVG</bold> and <bold>F</bold> events. <bold>(A)</bold> A scene with a violent gathering that may or may not be deemed sufficiently large to warrant an <bold>LVG</bold> label instead of an <bold>F</bold> label. Reproduced with permission from Soliman et al. (<xref ref-type="bibr" rid="B56">2019</xref>) via <ext-link ext-link-type="uri" xlink:href="https://www.kaggle.com/datasets/mohamedmustafa/real-life-violence-situations-dataset">Real Life Violence Dataset</ext-link>. <bold>(B)</bold> A scene with a large gathering in which only a few people are fighting, making it unclear whether the label of the scene should be <bold>LVG</bold> or <bold>F</bold>. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?app=desktop&#x00026;v=XC0_-GsiMOU">Vancouver Canucks Game 5 Stanley Cup Finals- Rogers Arena - Granville Street Party</ext-link>, uploaded by Marlo via YouTube, licensed under <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/t/terms">YouTube Standard License</ext-link>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1242690-g0005.tif"/>
</fig>
<p>Even if a violent scene is determined to contain a &#x0201C;large&#x0201D; number of people, it is unclear how many individuals from the observed group must participate in the violent action for the appropriate label to be <bold>LVG</bold>. Intuitively, if the number of individuals involved in violence is small, the label given to such a scene should be <bold>F</bold>. The line between <bold>F</bold> and <bold>LVG</bold> is blurry for samples where only a subset of the individuals in a gathering are engaged in violence, such as the sample in <xref ref-type="fig" rid="F5">Figure 5B</xref>. Additionally, a trained model might incorrectly label a scene containing a large number of individuals, only a few of whom are fighting, as <bold>LVG</bold> instead of <bold>F</bold>, since it may treat the large crowd, coupled with the violent actions of a few of its members, as cues that the appropriate classification of the scene is <bold>LVG</bold>. A purely illustrative labeling rule that makes these two subjective thresholds explicit is sketched below.</p>
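<p>Both numeric cut-offs in the following sketch are hypothetical; the notion of a &#x0201C;large&#x0201D; gathering is intentionally left subjective in our criteria, which is precisely why annotators, and models, can disagree near the boundary.</p>
<preformat>
# Purely illustrative rule making the two subjective thresholds of the
# F-vs-LVG decision explicit. Both cut-offs are hypothetical numbers,
# not values used in constructing the dataset.
def label_violent_scene(num_people: int, num_fighting: int,
                        large_crowd: int = 15, min_fighters: int = 5) -> str:
    """Return "F" or "LVG" for a scene known to contain violence."""
    if num_people >= large_crowd and num_fighting >= min_fighters:
        return "LVG"
    return "F"

# Near either threshold, a difference of one person flips the label:
print(label_violent_scene(num_people=40, num_fighting=4))  # prints F
print(label_violent_scene(num_people=40, num_fighting=5))  # prints LVG
</preformat>
</sec>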
<sec>
<title>4.3 LPG-LVG misclassification</title>
<p>In instances where a violent crowd has gathered, the violent action may not be central to the footage being analyzed. Namely, the violence in a scene may occur in the background or in a corner of the frame, such that it is not clearly evident. One such instance is shown in <xref ref-type="fig" rid="F6">Figure 6A</xref>. In this figure, the large crowd is mostly peaceful, except for violence that occurs at the back of the crowd, which is difficult to notice without observing the scene carefully. As a result, it is natural for this scene to be misclassified as an <bold>LPG</bold> instead of an <bold>LVG</bold>.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p><bold>(A)</bold> A scene showing a large crowd where violence only occurs in the periphery of the scene, making it hard to notice. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=HMiegFH-qaU">Serbian protesters clash with police over government handling of coronavirus</ext-link>, uploaded by Guardian News via YouTube, licensed under <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/t/terms">YouTube Standard License</ext-link>. <bold>(B)</bold> A scene of a large crowd where violence occurs only at the end of the scene, which may lead to a DL model failing to catch the late-occurring violence in this scene. <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=mNJzZlOlZ6M">Green Pass, tafferugli a Milano: polizia affronta i No Green Pass</ext-link>, uploaded by Local Team via YouTube, licensed under <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/t/terms">YouTube Standard License</ext-link>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1242690-g0006.tif"/>
</fig>
<p>In an otherwise peaceful crowd, brief moments (e.g., 0.5 s or less) of violent action might occur. In that case, given that our model examines 2-s samples of the incoming footage, the information about the occurrence of violence might be drowned out by information about a peaceful gathering. An example is shown in <xref ref-type="fig" rid="F6">Figure 6B</xref>. As we will see next, this phenomenon also occurs in fighting scenes, where the fight is ignored by a DL model in favor of a largely empty background depicting uninteresting, or <italic>natural</italic>, events.</p>
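<p>One possible mitigation, sketched below purely for illustration (it is not part of the presented pipeline), is to slide the 2-s window with a stride shorter than the window itself and retain the most alarming prediction observed anywhere, so that a brief burst of violence is not averaged away inside a single non-overlapping clip.</p>
<preformat>
# Illustrative mitigation: overlapping 2-s windows with a 0.5-s stride,
# keeping the most "alarming" class seen in any window. The severity
# ranking and the assumption that model outputs follow this class order
# are ours, not part of the presented pipeline.
import torch

ALARM_ORDER = ["N", "LPG", "F", "LVG"]  # illustrative severity ranking

def most_alarming_label(model, frames, fps=30, clip_seconds=2,
                        stride_seconds=0.5):
    """frames: float tensor of shape (C, T, H, W) for a decoded video."""
    clip_len = int(fps * clip_seconds)
    stride = int(fps * stride_seconds)
    worst = 0  # index into ALARM_ORDER
    model.eval()
    with torch.no_grad():
        for start in range(0, frames.shape[1] - clip_len + 1, stride):
            clip = frames[:, start:start + clip_len].unsqueeze(0)
            pred = model(clip).argmax(dim=1).item()
            worst = max(worst, pred)
    return ALARM_ORDER[worst]
</preformat>
</sec>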
<sec>
<title>4.4 F-N misclassification</title>
<p>A fight scene that occurs in an open or largely empty outdoor area, such as the one in <xref ref-type="fig" rid="F7">Figure 7A</xref>, may be misclassified by a DL model as <bold>N</bold>. This is because the fight occupies only a small region of the video scene while the rest of the scene appears &#x0201C;natural.&#x0201D; Such misclassifications are particularly prevalent in footage coming from CCTV cameras with wide fields of view.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p><bold>(A)</bold> A fight scene in a largely empty area which may be misclassified as <bold>N</bold>. Reproduced with permission from Soliman et al. (<xref ref-type="bibr" rid="B56">2019</xref>) via <ext-link ext-link-type="uri" xlink:href="https://www.kaggle.com/datasets/mohamedmustafa/real-life-violence-situations-dataset">Real Life Violence Dataset</ext-link>. <bold>(B)</bold> A mostly natural scene where a fight occurs at the end, possibly leading a DL model to classify it as <bold>N</bold> instead of <bold>F</bold>. Reproduced with permission from Degardin and Proen&#x000E7;a (<xref ref-type="bibr" rid="B12">2020</xref>) via <ext-link ext-link-type="uri" xlink:href="http://socia-lab.di.ubi.pt/EventDetection/">UBI-fights dataset</ext-link>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1242690-g0007.tif"/>
</fig>
<p>As is the case with <bold>LVG</bold>-labeled instances, a fight, like the one in <xref ref-type="fig" rid="F7">Figure 7B</xref>, might occupy only a small fraction of a 2-s sample. It is therefore unsurprising that such a sample may be classified as <bold>N</bold> instead of <bold>F</bold>.</p>
<sec>
<title>4.5 Conclusions</title>
<p>This work was motivated by the task of surveilling an outdoor area and automatically identifying noteworthy events for law enforcement. This paper presented a new dataset that divides video footage into peaceful gatherings, violent gatherings, small-scale fighting, and natural events. Based on the classification of the captured video, security agencies can be notified and respond appropriately to the class of event identified. The dataset presented in this work was validated by using it to train three architectures with different characteristics, namely, ResNet3D, R(2 &#x0002B; 1)D, and the Swin Transformer. The validation results show that the dataset is sufficiently general and can be used to train models that can be deployed for real-world surveillance. <bold>The dataset described in this paper can be obtained by contacting the corresponding author of this paper</bold>.</p>
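<p>For readers who wish to reproduce a comparable setup, the following sketch shows how the three validated architectures could be instantiated for this four-class task using torchvision&#x00027;s video model zoo (an assumed dependency); it is not the exact training code used in our experiments.</p>
<preformat>
# Minimal sketch: instantiating the three validated architectures for the
# four-class task via torchvision (assumed dependency; swin3d_t requires
# torchvision 0.14 or newer). Not the exact code used in our experiments.
import torch.nn as nn
from torchvision.models.video import r3d_18, r2plus1d_18, swin3d_t

NUM_CLASSES = 4  # N, F, LPG, LVG

def build(name: str) -> nn.Module:
    if name == "resnet3d":
        model = r3d_18(weights=None)
        model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
    elif name == "r2plus1d":
        model = r2plus1d_18(weights=None)
        model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
    elif name == "swin":
        model = swin3d_t(weights=None)
        model.head = nn.Linear(model.head.in_features, NUM_CLASSES)
    else:
        raise ValueError(name)
    return model  # expects clips of shape (B, C, T, H, W)
</preformat>
</sec></sec>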
<sec sec-type="data-availability" id="s5">
<title>Data availability statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p></sec>
<sec sec-type="ethics-statement" id="s6">
<title>Ethics statement</title>
<p>Written informed consent was not obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article because the videos were collected from publicly available sites (YouTube) and publicly available datasets.</p>
<sec sec-type="author-contributions" id="s7">
<title>Author contributions</title>
<p>The ideas presented in this paper were conceptualized by MQ and were discussed with YY, AE, and EB. AE collected the presented dataset and EB ran the experiments on the collected dataset, which are presented in this paper. Finally, the text of this paper was written by AE and EB and was reviewed by MQ and YY. All authors contributed to the article and approved the submitted version.</p></sec>
</body>
<back>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>This publication was made possible by AICC03-0324-200005 from the Qatar National Research Fund (a member of Qatar Foundation). The findings herein reflect the work, and are solely the responsibility of the authors.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Akt&#x00131;</surname> <given-names>&#x0015E;.</given-names></name> <name><surname>Tataro&#x0011F;lu</surname> <given-names>G. A.</given-names></name> <name><surname>Ekenel</surname> <given-names>H. K.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Vision-based fight detection from surveillance cameras,&#x0201D;</article-title> in <source>2019 Ninth International Conference on Image Processing Theory, Tools and Applications (IPTA)</source> (<publisher-loc>Istanbul</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>6</lpage>. <pub-id pub-id-type="doi">10.1109/IPTA.2019.8936070</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Arif</surname> <given-names>S.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Ul Hassan</surname> <given-names>T.</given-names></name> <name><surname>Fei</surname> <given-names>Z.</given-names></name></person-group> (<year>2019</year>). <article-title>3d-cnn-based fused feature maps with LSTM applied to action recognition</article-title>. <source>Future Internet</source> <volume>11</volume>:<fpage>42</fpage>. <pub-id pub-id-type="doi">10.3390/fi11020042</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Arnab</surname> <given-names>A.</given-names></name> <name><surname>Dehghani</surname> <given-names>M.</given-names></name> <name><surname>Heigold</surname> <given-names>G.</given-names></name> <name><surname>Sun</surname> <given-names>C.</given-names></name> <name><surname>Lu&#x0010D;i&#x00107;</surname> <given-names>M.</given-names></name> <name><surname>Schmid</surname> <given-names>C.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Vivit: a video vision transformer,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>6836</fpage>&#x02013;<lpage>6846</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV48922.2021.00676</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bermejo Nievas</surname> <given-names>E.</given-names></name> <name><surname>Deniz Suarez</surname> <given-names>O.</given-names></name> <name><surname>Bueno Garc&#x000ED;a</surname> <given-names>G.</given-names></name> <name><surname>Sukthankar</surname> <given-names>R.</given-names></name></person-group> (<year>2011</year>). <article-title>&#x0201C;Violence detection in video using computer vision techniques,&#x0201D;</article-title> in <source>International Conference on Computer Analysis of Images and Patterns</source> Berlin: (Springer), <fpage>332</fpage>&#x02013;<lpage>339</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-642-23678-5_39</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bertasius</surname> <given-names>G.</given-names></name> <name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>Torresani</surname> <given-names>L.</given-names></name></person-group> (<year>2021</year>). &#x0201C;Is space-time attention all you need for video understanding?&#x0201D; <italic>in ICML</italic>, volume 2, 4.</citation>
</ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Carion</surname> <given-names>N.</given-names></name> <name><surname>Massa</surname> <given-names>F.</given-names></name> <name><surname>Synnaeve</surname> <given-names>G.</given-names></name> <name><surname>Usunier</surname> <given-names>N.</given-names></name> <name><surname>Kirillov</surname> <given-names>A.</given-names></name> <name><surname>Zagoruyko</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>&#x0201C;End-to-end object detection with transformers,&#x0201D;</article-title> in <source>Computer Vision - ECCV 2020</source> (<publisher-loc>Springer International Publishing</publisher-loc>), <fpage>213</fpage>&#x02013;<lpage>229</lpage>.</citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carreira</surname> <given-names>J.</given-names></name> <name><surname>Noland</surname> <given-names>E.</given-names></name> <name><surname>Banki-Horvath</surname> <given-names>A.</given-names></name> <name><surname>Hillier</surname> <given-names>C.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>A short note about kinetics-600</article-title>. <source>arXiv</source>. [Preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.1808.01340</pub-id></citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carreira</surname> <given-names>J.</given-names></name> <name><surname>Noland</surname> <given-names>E.</given-names></name> <name><surname>Hillier</surname> <given-names>C.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>A short note on the kinetics-700 human action dataset</article-title>. <source>arXiv</source>. [Preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.1907.06987</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Carreira</surname> <given-names>J.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Quo vadis, action recognition? A new model and the kinetics dataset,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu, HI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>6299</fpage>&#x02013;<lpage>6308</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2017.502</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chromiak</surname> <given-names>M. P.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Exploring recent advancements of transformer based architectures in computer vision,&#x0201D;</article-title> in <source>Selected Topics in Applied Computer Science</source> (<publisher-loc>Maria Curie-Sk&#x00142;odowska University Press</publisher-loc>), <fpage>59</fpage>&#x02013;<lpage>75</lpage>.</citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Clark</surname> <given-names>D.</given-names></name> <name><surname>Regan</surname> <given-names>P.</given-names></name></person-group> (<year>2016</year>). <source>Mass Mobilization Protest Data</source>. Harvard Dataverse.</citation>
</ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Degardin</surname> <given-names>B.</given-names></name> <name><surname>Proen&#x000E7;a</surname> <given-names>H.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Human activity analysis: iterative weak/self-supervised learning frameworks for detecting abnormal events,&#x0201D;</article-title> in <source>2020 IEEE International Joint Conference on Biometrics (IJCB)</source> (<publisher-loc>Houston, TX</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>7</lpage>. <pub-id pub-id-type="doi">10.1109/IJCB48548.2020.9304905</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Demarty</surname> <given-names>C.-H.</given-names></name> <name><surname>Ionescu</surname> <given-names>B.</given-names></name> <name><surname>Jiang</surname> <given-names>Y.-G.</given-names></name> <name><surname>Quang</surname> <given-names>V. L.</given-names></name> <name><surname>Schedl</surname> <given-names>M.</given-names></name> <name><surname>Penet</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>&#x0201C;Benchmarking violent scenes detection in movies,&#x0201D;</article-title> in <source>2014 12th International Workshop on Content-Based Multimedia Indexing (CBMI)</source> (<publisher-loc>Klagenfurt</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>6</lpage>. <pub-id pub-id-type="doi">10.1109/CBMI.2014.6849827</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dhiman</surname> <given-names>C.</given-names></name> <name><surname>Vishwakarma</surname> <given-names>D. K.</given-names></name></person-group> (<year>2019</year>). <article-title>A review of state-of-the-art techniques for abnormal human activity recognition</article-title>. <source>Eng. Appl. Arti. Intell</source>. <volume>77</volume>, <fpage>21</fpage>&#x02013;<lpage>45</lpage>. <pub-id pub-id-type="doi">10.1016/j.engappai.2018.08.014</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dosovitskiy</surname> <given-names>A.</given-names></name> <name><surname>Beyer</surname> <given-names>L.</given-names></name> <name><surname>Kolesnikov</surname> <given-names>A.</given-names></name> <name><surname>Weissenborn</surname> <given-names>D.</given-names></name> <name><surname>Zhai</surname> <given-names>X.</given-names></name> <name><surname>Unterthiner</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>An image is worth 16x16 words: transformers for image recognition at scale</article-title>. <source>arXiv</source> [preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.2010.11929</pub-id></citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dosovitskiy</surname> <given-names>A.</given-names></name> <name><surname>Beyer</surname> <given-names>L.</given-names></name> <name><surname>Kolesnikov</surname> <given-names>A.</given-names></name> <name><surname>Weissenborn</surname> <given-names>D.</given-names></name> <name><surname>Zhai</surname> <given-names>X.</given-names></name> <name><surname>Unterthiner</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>&#x0201C;An image is worth 16x16 words: transformers for image recognition at scale,&#x0201D;</article-title> in <source>International Conference on Learning Representations</source>.</citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Elboushaki</surname> <given-names>A.</given-names></name> <name><surname>Hannane</surname> <given-names>R.</given-names></name> <name><surname>Afdel</surname> <given-names>K.</given-names></name> <name><surname>Koutti</surname> <given-names>L.</given-names></name></person-group> (<year>2020</year>). <article-title>MULTID-CNN: a multi-dimensional feature learning approach based on deep convolutional networks for gesture recognition in rgb-d image sequences</article-title>. <source>Expert Syst. Appl</source>. <volume>139</volume>:<fpage>112829</fpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2019.112829</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Feichtenhofer</surname> <given-names>C.</given-names></name> <name><surname>Pinz</surname> <given-names>A.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Convolutional two-stream network fusion for video action recognition,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1933</fpage>&#x02013;<lpage>1941</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2016.213</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gao</surname> <given-names>C.</given-names></name> <name><surname>Yan</surname> <given-names>J.</given-names></name> <name><surname>Zhou</surname> <given-names>S.</given-names></name> <name><surname>Varshney</surname> <given-names>P. K.</given-names></name> <name><surname>Liu</surname> <given-names>H.</given-names></name></person-group> (<year>2019</year>). <article-title>Long short-term memory-based deep recurrent neural networks for target tracking</article-title>. <source>Inf. Sci</source>. <volume>502</volume>, <fpage>279</fpage>&#x02013;<lpage>296</lpage>. <pub-id pub-id-type="doi">10.1016/j.ins.2019.06.039</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Global</surname> <given-names>I.</given-names></name></person-group> (<year>2022</year>). <source>Role of CCTV Cameras: Public, Privacy and Protection</source>.</citation>
</ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>G&#x000F3;mez A</surname> <given-names>H. F.</given-names></name> <name><surname>Tom&#x000E1;s</surname> <given-names>R. M.</given-names></name> <name><surname>Tapia</surname> <given-names>S. A.</given-names></name> <name><surname>Caballero</surname> <given-names>A. F.</given-names></name> <name><surname>Ratt&#x000E9;</surname> <given-names>S.</given-names></name> <name><surname>Eras</surname> <given-names>A. G.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>&#x0201C;Identification of loitering human behaviour in video surveillance environments,&#x0201D;</article-title> in <source>International Work-Conference on the Interplay Between Natural and Artificial Computation</source> (<publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>516</fpage>&#x02013;<lpage>525</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-18914-7_54</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gong</surname> <given-names>S.</given-names></name> <name><surname>Shi</surname> <given-names>Y.</given-names></name> <name><surname>Jain</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Low quality video face recognition: multi-mode aggregation recurrent network (MARN),&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops</source> (<publisher-loc>Seoul</publisher-loc>: <publisher-name>IEEE</publisher-name>). <pub-id pub-id-type="doi">10.1109/ICCVW.2019.00132</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Goodfellow</surname> <given-names>I.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Courville</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <source>Deep Learning</source>. MIT Press. Availabl online at: <ext-link ext-link-type="uri" xlink:href="http://www.deeplearningbook.org">http://www.deeplearningbook.org</ext-link> (accessed June, 2023).</citation>
</ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Guo</surname> <given-names>J.-M.</given-names></name> <name><surname>Huang</surname> <given-names>P.-C.</given-names></name> <name><surname>Chang</surname> <given-names>L.-Y.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;A hybrid facial expression recognition system based on recurrent neural network,&#x0201D;</article-title> in <source>2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)</source> (<publisher-loc>Taipei</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1109/AVSS.2019.8909888</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hara</surname> <given-names>K.</given-names></name> <name><surname>Kataoka</surname> <given-names>H.</given-names></name> <name><surname>Satoh</surname> <given-names>Y.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Learning spatio-temporal features with 3d residual networks for action recognition,&#x0201D;</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision Workshops</source> (<publisher-loc>Venice</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>3154</fpage>&#x02013;<lpage>3160</lpage>. <pub-id pub-id-type="doi">10.1109/ICCVW.2017.373</pub-id></citation>
</ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hassner</surname> <given-names>T.</given-names></name> <name><surname>Itcher</surname> <given-names>Y.</given-names></name> <name><surname>Kliper-Gross</surname> <given-names>O.</given-names></name></person-group> (<year>2012</year>). <article-title>&#x0201C;Violent flows: real-time detection of violent crowd behavior,&#x0201D;</article-title> in <source>2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops</source> (<publisher-loc>Providence, RI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>6</lpage>. <pub-id pub-id-type="doi">10.1109/CVPRW.2012.6239348</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Convolutional neural networks at constrained time cost,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Boston, MA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>5353</fpage>&#x02013;<lpage>5360</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2015.7299173</pub-id></citation>
</ref>
<ref id="B28">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Deep residual learning for image recognition,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>770</fpage>&#x02013;<lpage>778</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2016.90</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hochreiter</surname> <given-names>S.</given-names></name> <name><surname>Schmidhuber</surname> <given-names>J.</given-names></name></person-group> (<year>1997</year>). <article-title>Long short-term memory</article-title>. <source>Neural Comput</source>. <volume>9</volume>, <fpage>1735</fpage>&#x02013;<lpage>1780</lpage>. <pub-id pub-id-type="doi">10.1162/neco.1997.9.8.1735</pub-id></citation>
</ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>C.-D.</given-names></name> <name><surname>Wang</surname> <given-names>C.-Y.</given-names></name> <name><surname>Wang</surname> <given-names>J.-C.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Human action recognition system for elderly and children care using three stream convnet,&#x0201D;</article-title> in <source>2015 International Conference on Orange Technologies (ICOT)</source> (<publisher-loc>Hong Kong</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>5</fpage>&#x02013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1109/ICOT.2015.7498476</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ibrahim</surname> <given-names>M. S.</given-names></name> <name><surname>Muralidharan</surname> <given-names>S.</given-names></name> <name><surname>Deng</surname> <given-names>Z.</given-names></name> <name><surname>Vahdat</surname> <given-names>A.</given-names></name> <name><surname>Mori</surname> <given-names>G.</given-names></name></person-group> (<year>2016</year>). <article-title>A hierarchical deep temporal model for group activity recognition</article-title>. <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages</source> 1971-1980. <pub-id pub-id-type="doi">10.1109/CVPR.2016.217</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ji</surname> <given-names>S.</given-names></name> <name><surname>Xu</surname> <given-names>W.</given-names></name> <name><surname>Yang</surname> <given-names>M.</given-names></name> <name><surname>Yu</surname> <given-names>K.</given-names></name></person-group> (<year>2012</year>). <article-title>3d convolutional neural networks for human action recognition</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>35</volume>, <fpage>221</fpage>&#x02013;<lpage>231</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2012.59</pub-id><pub-id pub-id-type="pmid">22392705</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Karpathy</surname> <given-names>A.</given-names></name> <name><surname>Toderici</surname> <given-names>G.</given-names></name> <name><surname>Shetty</surname> <given-names>S.</given-names></name> <name><surname>Leung</surname> <given-names>T.</given-names></name> <name><surname>Sukthankar</surname> <given-names>R.</given-names></name> <name><surname>Fei-Fei</surname> <given-names>L.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>&#x0201C;Large-scale video classification with convolutional neural networks,&#x0201D;</article-title> in <source>2014 IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Columbus, OH</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1725</fpage>&#x02013;<lpage>1732</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2014.223</pub-id></citation>
</ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kay</surname> <given-names>W.</given-names></name> <name><surname>Carreira</surname> <given-names>J.</given-names></name> <name><surname>Simonyan</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>B.</given-names></name> <name><surname>Hillier</surname> <given-names>C.</given-names></name> <name><surname>Vijayanarasimhan</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>The kinetics human action video dataset</article-title>. <source>arXiv</source>. [Preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.1705.06950</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kuehne</surname> <given-names>H.</given-names></name> <name><surname>Jhuang</surname> <given-names>H.</given-names></name> <name><surname>Garrote</surname> <given-names>E.</given-names></name> <name><surname>Poggio</surname> <given-names>T.</given-names></name> <name><surname>Serre</surname> <given-names>T.</given-names></name></person-group> (<year>2011</year>). <article-title>&#x0201C;HMDB51: a large video database for human motion recognition,&#x0201D;</article-title> in <source>2011 International Conference on Computer Vision</source> (<publisher-loc>Barcelona</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2556</fpage>&#x02013;<lpage>2563</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2011.6126543</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Laptev</surname> <given-names>I.</given-names></name> <name><surname>Marszalek</surname> <given-names>M.</given-names></name> <name><surname>Schmid</surname> <given-names>C.</given-names></name> <name><surname>Rozenfeld</surname> <given-names>B.</given-names></name></person-group> (<year>2008</year>). <article-title>&#x0201C;Learning realistic human actions from movies,&#x0201D;</article-title> in <source>2008 IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Anchorage, AK</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2008.4587756</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>LeCun</surname> <given-names>Y.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep learning</article-title>. <source>Nature</source> <volume>521</volume>, <fpage>436</fpage>&#x02013;<lpage>44</lpage>. <pub-id pub-id-type="doi">10.1038/nature14539</pub-id></citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lepikhin</surname> <given-names>D.</given-names></name> <name><surname>Lee</surname> <given-names>H.</given-names></name> <name><surname>Xu</surname> <given-names>Y.</given-names></name> <name><surname>Chen</surname> <given-names>D.</given-names></name> <name><surname>Firat</surname> <given-names>O.</given-names></name> <name><surname>Huang</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Gshard: scaling giant models with conditional computation and automatic sharding</article-title>. <source>arXiv</source>. [Preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.2006.16668</pub-id></citation>
</ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Lin</surname> <given-names>Y.</given-names></name> <name><surname>Cao</surname> <given-names>Y.</given-names></name> <name><surname>Hu</surname> <given-names>H.</given-names></name> <name><surname>Wei</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>&#x0201C;Swin transformer: hierarchical vision transformer using shifted windows,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>10012</fpage>&#x02013;<lpage>10022</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV48922.2021.00986</pub-id></citation>
</ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Ning</surname> <given-names>J.</given-names></name> <name><surname>Cao</surname> <given-names>Y.</given-names></name> <name><surname>Wei</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Lin</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>&#x0201C;Video swin transformer,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>New Orleans, LA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>3202</fpage>&#x02013;<lpage>3211</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR52688.2022.00320</pub-id></citation>
</ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Marsden</surname> <given-names>M.</given-names></name> <name><surname>McGuinness</surname> <given-names>K.</given-names></name> <name><surname>Little</surname> <given-names>S.</given-names></name> <name><surname>O&#x00027;Connor</surname> <given-names>N. E.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Resnetcrowd: a residual deep learning architecture for crowd counting, violent behaviour detection and crowd density level classification,&#x0201D;</article-title> in <source>2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)</source> (<publisher-loc>Lecce</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>7</lpage>. <pub-id pub-id-type="doi">10.1109/AVSS.2017.8078482</pub-id></citation>
</ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mazzia</surname> <given-names>V.</given-names></name> <name><surname>Angarano</surname> <given-names>S.</given-names></name> <name><surname>Salvetti</surname> <given-names>F.</given-names></name> <name><surname>Angelini</surname> <given-names>F.</given-names></name> <name><surname>Chiaberge</surname> <given-names>M.</given-names></name></person-group> (<year>2022</year>). <article-title>Action transformer: a 520 self-attention model for short-time pose-based human action recognition</article-title>. <source>Pattern Recognit.</source> <volume>124</volume>:<fpage>108487</fpage>. <pub-id pub-id-type="doi">10.1016/j.patcog.2021.108487</pub-id></citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Murugesan</surname> <given-names>M.</given-names></name> <name><surname>Thilagamani</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <article-title>Efficient anomaly detection in surveillance videos based on multi layer perception recurrent neural network</article-title>. <source>Microprocess. Microsyst</source>. <volume>79</volume>:<fpage>103303</fpage>. <pub-id pub-id-type="doi">10.1016/j.micpro.2020.103303</pub-id></citation>
</ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nayak</surname> <given-names>R.</given-names></name> <name><surname>Pati</surname> <given-names>U. C.</given-names></name> <name><surname>Das</surname> <given-names>S. K.</given-names></name></person-group> (<year>2021</year>). <article-title>A comprehensive review on deep learning-based methods for video anomaly detection</article-title>. <source>Image Vis. Comput</source>. <volume>106</volume>:<fpage>104078</fpage>. <pub-id pub-id-type="doi">10.1016/j.imavis.2020.104078</pub-id></citation>
</ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Neimark</surname> <given-names>D.</given-names></name> <name><surname>Bar</surname> <given-names>O.</given-names></name> <name><surname>Zohar</surname> <given-names>M.</given-names></name> <name><surname>Asselmann</surname> <given-names>D.</given-names></name></person-group> (<year>2021</year>). &#x0201C;Video transformer network,&#x0201D; <italic>in Proceedings of the IEEE/CVF International Conference on Computer Vision</italic> (Montreal, BC: IEEE), <fpage>3163</fpage>&#x02013;<lpage>3172</lpage>. <pub-id pub-id-type="doi">10.1109/ICCVW54120.2021.00355</pub-id></citation>
</ref>
<ref id="B46">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Perez</surname> <given-names>M.</given-names></name> <name><surname>Kot</surname> <given-names>A. C.</given-names></name> <name><surname>Rocha</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Detection of real-world fights in surveillance videos,&#x0201D;</article-title> in <source>ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source> (<publisher-loc>Brighton</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2662</fpage>&#x02013;<lpage>2666</lpage>. <pub-id pub-id-type="doi">10.1109/ICASSP.2019.8683676</pub-id></citation>
</ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Reddy</surname> <given-names>K.</given-names></name> <name><surname>Shah</surname> <given-names>M.</given-names></name></person-group> (<year>2013</year>). <article-title>Recognizing 50 human action categories of web videos</article-title>. <source>Mach. Vis. Appl</source>. <volume>24</volume>, <fpage>971</fpage>&#x02013;<lpage>981</lpage>. <pub-id pub-id-type="doi">10.1007/s00138-012-0450-4</pub-id></citation>
</ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sabokrou</surname> <given-names>M.</given-names></name> <name><surname>Fayyaz</surname> <given-names>M.</given-names></name> <name><surname>Fathy</surname> <given-names>M.</given-names></name> <name><surname>Moayed</surname> <given-names>Z.</given-names></name> <name><surname>Klette</surname> <given-names>R.</given-names></name></person-group> (<year>2018</year>). <article-title>Deep-anomaly: fully convolutional neural network for fast anomaly detection in crowded scenes</article-title>. <source>Comput. Vis. Image Underst</source>. <volume>172</volume>, <fpage>88</fpage>&#x02013;<lpage>97</lpage>. <pub-id pub-id-type="doi">10.1016/j.cviu.2018.02.006</pub-id></citation>
</ref>
<ref id="B49">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sahoo</surname> <given-names>S. R.</given-names></name> <name><surname>Dash</surname> <given-names>R.</given-names></name> <name><surname>Mahapatra</surname> <given-names>R. K.</given-names></name> <name><surname>Sahu</surname> <given-names>B.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Unusual event detection in surveillance video using transfer learning,&#x0201D;</article-title> in <source>2019 International Conference on Information Technology (ICIT)</source> (<publisher-loc>Bhubaneswar</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>319</fpage>&#x02013;<lpage>324</lpage>. <pub-id pub-id-type="doi">10.1109/ICIT48102.2019.00063</pub-id></citation>
</ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>S&#x000E1;nchez</surname> <given-names>F. L.</given-names></name> <name><surname>Hupont</surname> <given-names>I.</given-names></name> <name><surname>Tabik</surname> <given-names>S.</given-names></name> <name><surname>Herrera</surname> <given-names>F.</given-names></name></person-group> (<year>2020</year>). <article-title>Revisiting crowd behaviour analysis through deep learning: taxonomy, anomaly detection, crowd emotions, datasets, opportunities and prospects</article-title>. <source>Inf. Fusion</source> <volume>64</volume>, <fpage>318</fpage>&#x02013;<lpage>335</lpage>. <pub-id pub-id-type="doi">10.1016/j.inffus.2020.07.008</pub-id><pub-id pub-id-type="pmid">32834797</pub-id></citation></ref>
<ref id="B51">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Savakis</surname> <given-names>A.</given-names></name> <name><surname>Shringarpure</surname> <given-names>A. M.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Semantic background estimation in video sequences,&#x0201D;</article-title> in <source>2018 5th International Conference on Signal Processing and Integrated Networks (SPIN)</source> (<publisher-loc>Noida</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>597</fpage>&#x02013;<lpage>601</lpage>. <pub-id pub-id-type="doi">10.1109/SPIN.2018.8474279</pub-id></citation>
</ref>
<ref id="B52">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sharma</surname> <given-names>V.</given-names></name> <name><surname>Gupta</surname> <given-names>M.</given-names></name> <name><surname>Kumar</surname> <given-names>A.</given-names></name> <name><surname>Mishra</surname> <given-names>D.</given-names></name></person-group> (<year>2021</year>). <article-title>Video processing using deep learning techniques: a systematic literature review</article-title>. <source>IEEE Access</source> <volume>9</volume>, <fpage>139489</fpage>&#x02013;<lpage>139507</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2021.3118541</pub-id></citation>
</ref>
<ref id="B53">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shi</surname> <given-names>X.</given-names></name> <name><surname>Chen</surname> <given-names>Z.</given-names></name> <name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>Yeung</surname> <given-names>D.-Y.</given-names></name> <name><surname>Wong</surname> <given-names>W.-K.</given-names></name> <name><surname>Woo</surname> <given-names>W.-c.</given-names></name></person-group> (<year>2015</year>). <article-title>Convolutional LSTM network: a machine learning approach for precipitation nowcasting</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>28</volume>, <fpage>28</fpage>&#x02013;<lpage>37</lpage>.</citation>
</ref>
<ref id="B54">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Siam</surname> <given-names>M.</given-names></name> <name><surname>Valipour</surname> <given-names>S.</given-names></name> <name><surname>Jagersand</surname> <given-names>M.</given-names></name> <name><surname>Ray</surname> <given-names>N.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Convolutional gated recurrent networks for video segmentation,&#x0201D;</article-title> in <source>2017 IEEE International Conference on Image Processing (ICIP)</source> (<publisher-loc>Beijing</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>3090</fpage>&#x02013;<lpage>3094</lpage>. <pub-id pub-id-type="doi">10.1109/ICIP.2017.8296851</pub-id></citation>
</ref>
<ref id="B55">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Simonyan</surname> <given-names>K.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2014</year>). <article-title>Two-stream convolutional networks for action recognition in videos</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>1</volume>, <fpage>27</fpage>&#x02013;<lpage>36</lpage>.</citation>
</ref>
<ref id="B56">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Soliman</surname> <given-names>M.</given-names></name> <name><surname>Kamal</surname> <given-names>M.</given-names></name> <name><surname>Nashed</surname> <given-names>M.</given-names></name> <name><surname>Mostafa</surname> <given-names>Y.</given-names></name> <name><surname>Chawky</surname> <given-names>B.</given-names></name> <name><surname>Khattab</surname> <given-names>D.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Violence recognition from videos using deep learning techniques&#x0201D;</article-title>, in <source>Proceeding of 9th International Conference on Intelligent Computing and Information Systems (ICICIS&#x00027;19)</source> (Cairo), <fpage>79</fpage>&#x02013;<lpage>84</lpage>.</citation>
</ref>
<ref id="B57">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Soomro</surname> <given-names>K.</given-names></name> <name><surname>Zamir</surname> <given-names>A. R.</given-names></name> <name><surname>Shah</surname> <given-names>M.</given-names></name></person-group> (<year>2012</year>). <article-title>Ucf101: a dataset of 101 human actions classes from videos in the wild</article-title>. <source>arXiv</source>. [Preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.1212.0402</pub-id></citation>
</ref>
<ref id="B58">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Srivastava</surname> <given-names>R. K.</given-names></name> <name><surname>Greff</surname> <given-names>K.</given-names></name> <name><surname>Schmidhuber</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>Highway networks</article-title>. <source>arXiv</source>. [Preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.1505.00387</pub-id></citation>
</ref>
<ref id="B59">
<citation citation-type="journal"><person-group person-group-type="author"><collab>Sudhakaran S. and Lanz, O..</collab></person-group> (<year>2017</year>). <source>Learning to Detect Violent Videos Using Convolutional Long Short-Term Memory</source>.</citation>
</ref>
<ref id="B60">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sultani</surname> <given-names>W.</given-names></name> <name><surname>Chen</surname> <given-names>C.</given-names></name> <name><surname>Shah</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). &#x0201C;Real-world anomaly detection in surveillance videos,&#x0201D; <italic>in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</italic> (Salt Lake City, UT: IEEE), <fpage>6479</fpage>&#x02013;<lpage>6488</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00678</pub-id></citation>
</ref>
<ref id="B61">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Touvron</surname> <given-names>H.</given-names></name> <name><surname>Cord</surname> <given-names>M.</given-names></name> <name><surname>Douze</surname> <given-names>M.</given-names></name> <name><surname>Massa</surname> <given-names>F.</given-names></name> <name><surname>Sablayrolles</surname> <given-names>A.</given-names></name> <name><surname>J&#x000E9;gou</surname> <given-names>H.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Training data-efficient image transformers and distillation through attention,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source>. PMLR.</citation>
</ref>
<ref id="B62">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tran</surname> <given-names>D.</given-names></name> <name><surname>Bourdev</surname> <given-names>L.</given-names></name> <name><surname>Fergus</surname> <given-names>R.</given-names></name> <name><surname>Torresani</surname> <given-names>L.</given-names></name> <name><surname>Paluri</surname> <given-names>M.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Learning spatiotemporal features with 3d convolutional networks,&#x0201D;</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision</source> (<publisher-loc>Santiago</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4489</fpage>&#x02013;<lpage>4497</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2015.510</pub-id></citation>
</ref>
<ref id="B63">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tran</surname> <given-names>D.</given-names></name> <name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>Torresani</surname> <given-names>L.</given-names></name> <name><surname>Ray</surname> <given-names>J.</given-names></name> <name><surname>LeCun</surname> <given-names>Y.</given-names></name> <name><surname>Paluri</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;A closer look at spatiotemporal convolutions for action recognition,&#x0201D;</article-title> in <source>Proceedings of the IEEE conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>6450</fpage>&#x02013;<lpage>6459</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00675</pub-id></citation>
</ref>
<ref id="B64">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Varol</surname> <given-names>G.</given-names></name> <name><surname>Laptev</surname> <given-names>I.</given-names></name> <name><surname>Schmid</surname> <given-names>C.</given-names></name></person-group> (<year>2017</year>). <article-title>Long-term temporal convolutions for action recognition</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>40</volume>, <fpage>1510</fpage>&#x02013;<lpage>1517</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2017.2712608</pub-id><pub-id pub-id-type="pmid">28600238</pub-id></citation></ref>
<ref id="B65">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vaswani</surname> <given-names>A.</given-names></name> <name><surname>Shazeer</surname> <given-names>N.</given-names></name> <name><surname>Parmar</surname> <given-names>N.</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J.</given-names></name> <name><surname>Jones</surname> <given-names>L.</given-names></name> <name><surname>Gomez</surname> <given-names>A. N.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Attention is all you need</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>30</volume>, <fpage>30</fpage>&#x02013;<lpage>41</lpage>.</citation>
</ref>
<ref id="B66">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Miao</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>R.</given-names></name> <name><surname>Hao</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>I3d-LSTM: a new model for human action recognition</article-title>. <source>IOP Conf. Ser.: Mater. Sci. Eng</source>. <volume>569</volume>:<fpage>032035</fpage>. <pub-id pub-id-type="doi">10.1088/1757-899X/569/3/032035</pub-id></citation>
</ref>
<ref id="B67">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>Z.</given-names></name> <name><surname>Yang</surname> <given-names>Y.</given-names></name> <name><surname>Hauptmann</surname> <given-names>A. G.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;A discriminative cnn video representation for event detection,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>(Boston, MA: IEEE), <fpage>1798</fpage>&#x02013;<lpage>1807</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2015.7298789</pub-id></citation>
</ref>
<ref id="B68">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yadav</surname> <given-names>S. K.</given-names></name> <name><surname>Singh</surname> <given-names>A.</given-names></name> <name><surname>Gupta</surname> <given-names>A.</given-names></name> <name><surname>Raheja</surname> <given-names>J. L.</given-names></name></person-group> (<year>2019</year>). <article-title>Real-time yoga recognition using deep learning</article-title>. <source>Neural Comput. Appl</source>. <volume>31</volume>, <fpage>9349</fpage>&#x02013;<lpage>9361</lpage>. <pub-id pub-id-type="doi">10.1007/s00521-019-04232-7</pub-id></citation>
</ref>
<ref id="B69">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhu</surname> <given-names>X.</given-names></name> <name><surname>Su</surname> <given-names>W.</given-names></name> <name><surname>Lu</surname> <given-names>L.</given-names></name> <name><surname>Li</surname> <given-names>B.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Dai</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Deformable detr: Deformable transformers for end-to-end object detection</article-title>. <source>arXiv</source>. [Preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.2010.04159</pub-id></citation>
</ref>
</ref-list>
</back>
</article>