<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="review-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Sig. Proc.</journal-id>
<journal-title>Frontiers in Signal Processing</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Sig. Proc.</abbrev-journal-title>
<issn pub-type="epub">2673-8198</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">984169</article-id>
<article-id pub-id-type="doi">10.3389/frsip.2022.984169</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Signal Processing</subject>
<subj-group>
<subject>Review</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Video fingerprinting: Past, present, and future</article-title>
<alt-title alt-title-type="left-running-head">Allouche and Mitrea</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frsip.2022.984169">10.3389/frsip.2022.984169</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Allouche</surname>
<given-names>Mohamed</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1923466/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Mitrea</surname>
<given-names>Mihai</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1280248/overview"/>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>Telecom SudParis</institution>, <institution>ARTEMIS Department</institution>, <institution>SAMOVAR Laboratory</institution>, <addr-line>Evry</addr-line>, <country>France</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>VIDMIZER</institution>, <addr-line>Paris</addr-line>, <country>France</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1120289/overview">Frederic Dufaux</ext-link>, CNRS, France</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1224181/overview">Benedetta Tondi</ext-link>, University of Siena, Italy</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1730917/overview">William Puech</ext-link>, Universit&#xe9; de Montpellier, France</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Mihai Mitrea, <email>mihai.mitrea@telecom-sudparis.eu</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Image Processing, a section of the journal Frontiers in Signal Processing</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>02</day>
<month>09</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>2</volume>
<elocation-id>984169</elocation-id>
<history>
<date date-type="received">
<day>01</day>
<month>07</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>02</day>
<month>08</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 Allouche and Mitrea.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Allouche and Mitrea</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>The last decades have seen video production and consumption rise significantly: TV/cinematography, social networking, digital marketing, and video surveillance have incrementally and cumulatively turned video content into the preferred type of data to be exchanged, stored, and processed. Belonging to the video processing realm, <italic>video fingerprinting</italic> (also referred to as <italic>content-based copy detection</italic> or <italic>near duplicate detection</italic>) regroups research efforts devoted to identifying duplicated and/or replicated versions of a given video sequence (query) in a reference video dataset. The present paper reports on a state-of-the-art study of the past and present of video fingerprinting, while attempting to identify trends for its future development. First, the conceptual basis and evaluation frameworks are set. This way, the methodological approaches (situated at the crossroads of image processing, machine learning, and neural networks) can be structured and discussed. Finally, fingerprinting is confronted with the challenges raised by emerging video applications (<italic>e.g.</italic>, unmanned vehicles or fake news) and with the constraints they set in terms of content traceability and computational complexity. The relationship with other technologies for content tracking (<italic>e.g.,</italic> DLT - Distributed Ledger Technologies) is also presented and discussed.</p>
</abstract>
<kwd-group>
<kwd>Video fingerprinting</kwd>
<kwd>review</kwd>
<kwd>machine learning</kwd>
<kwd>neural network</kwd>
<kwd>visual feature</kwd>
<kwd>DLT (distributed ledger technologies)</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Nowadays, TV/cinematography, social networking, digital marketing, and video surveillance have incrementally and cumulatively turned video content into the preferred type of data to be exchanged, stored, and processed. As an illustration, according to <xref ref-type="bibr" rid="B107">Statista, 2022</xref>, TV over Internet traffic tripled between 2016 and 2021, reaching a monthly volume of 42,000 petabytes.</p>
<p>Such a tremendous quantity of information, coupled with a myriad of domestic/professional usages, should be underpinned by strong scientific and methodological video processing paradigms, and video fingerprinting is one of them. <italic>Video fingerprinting</italic> identifies duplicated, replicated and/or slightly modified versions of a given video sequence (query) in a reference video dataset <xref ref-type="bibr" rid="B19">Douze et al., 2008</xref>, <xref ref-type="bibr" rid="B63">Lee and Yoo, 2008</xref>, <xref ref-type="bibr" rid="B108">Su et al., 2009</xref>, <xref ref-type="bibr" rid="B120">Wary and Neelima, 2019</xref>. It is also referred to as <italic>near duplicate detection</italic>, or <italic>content-based copy detection</italic> <xref ref-type="bibr" rid="B61">Law-To et al., 2007a</xref>. The term <italic>video hashing</italic>
<xref ref-type="fn" rid="fn1">
<sup>1</sup>
</xref> (or <italic>perceptual video hashing</italic>) is also in use for fingerprinting applications applied to very large video database search <xref ref-type="bibr" rid="B84">Nie et al., 2015</xref>, <xref ref-type="bibr" rid="B76">Liu, 2019</xref>, <xref ref-type="bibr" rid="B4">Anuranji and Srimathi, 2020</xref>.</p>
<p>
<italic>Video fingerprint principle</italic> can be illustrated in relation to the human fingerprints <xref ref-type="bibr" rid="B88">Oostveen et al., 2002</xref>, <xref ref-type="fig" rid="F1">Figure 1</xref>. The patterns of dermal ridges on human fingertips are natural identifiers for humans, as disclosed by Sir Francis Galton in 1893. Although they are tiny when compared to the entire human body, human fingerprints can uniquely identify a person regardless of their physiognomy changes and potential disguises. Analogously, video fingerprints are meant to be video identifiers that shall uniquely identify videos even if their contents undergo a predefined, application dependent set of transformations.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Human fingerprinting versus video fingerprinting.</p>
</caption>
<graphic xlink:href="frsip-02-984169-g001.tif"/>
</fig>
<p>The conceptual premise being generic, the underlying research studies are very different, from both methodological and applicative perspectives. The present paper reports on a state-of-the-art study on the past and present of video fingerprinting while trying to identify trends for its future development. It solely considers the video component and leaves the multimodal approaches (video/audio, video/annotations, video/depth, <italic>etc</italic>.) outside its scope.</p>
<p>The paper is structured as follows. First, <xref ref-type="sec" rid="s2">Section 2</xref> identifies the <italic>fingerprinting</italic> scope with respect to two related yet complementary applicative frameworks, namely <italic>video indexing</italic> and <italic>video watermarking</italic>. The fingerprinting evaluation framework is set in <xref ref-type="sec" rid="s3">Section 3</xref>. This way, the methodological approaches (situated at the cross-roads of image processing, ML&#x2014;machine learning and NN&#x2014;neural networks) can be objectively structured and presented in <xref ref-type="sec" rid="s4">Section 4</xref>. Finally, fingerprinting is confronted with the challenges raised by emerging video processing paradigms in <xref ref-type="sec" rid="s5">Section 5</xref>. Conclusions are drawn in <xref ref-type="sec" rid="s6">Section 6</xref>. A list of acronyms (except those that are commonly known and/or unambiguous) is included after References.</p>
</sec>
<sec id="s2">
<title>2 Applicative scope</title>
<p>The applicative scope of video fingerprinting can be identified through synergies and complementarities with <italic>video indexing</italic> <xref ref-type="bibr" rid="B41">Idris and Panchanathan, 1997</xref> and <italic>video watermarking</italic> <xref ref-type="bibr" rid="B13">Cox et al., 2007</xref>. To this end, this section will incrementally illustrate the principles of these three paradigms and will identify their relationships.</p>
<p>
<italic>Video indexing</italic> might be considered as the first framework for content-based video searching and retrieval <xref ref-type="bibr" rid="B41">Idris and Panchanathan, 1997</xref>, <xref ref-type="bibr" rid="B12">Coudert et al., 1999</xref>. Assuming a video repository, the objective of video indexing is to find all the video sequences that are visually related to a query. For instance, assuming the query is a video showing some Panda bears and the repository consists of wild animal sequences, a video indexing solution searches for all sequences in the repository that contain Panda bears, as well as sequences containing the same type of background, as illustrated in <xref ref-type="fig" rid="F2">Figure 2</xref>. To this end, salient information (referred to as <italic>descriptor</italic>) is extracted from the query and compared to the <italic>descriptors</italic> of all the sequences in that repository (that were <italic>a priori</italic> computed and stored). Such a comparison implicitly assumes that a similarity measure for the visual proximity between two video sequences is defined and that a threshold according to which two descriptors can be matched is set.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Video indexing principle: a binary descriptor is extracted from a query video to retrieve any other related visual content in the dataset.</p>
</caption>
<graphic xlink:href="frsip-02-984169-g002.tif"/>
</fig>
<p>
<italic>Digital watermarking</italic> <xref ref-type="bibr" rid="B13">Cox et al., 2007</xref> deals with the identification of any modified version of video content, <xref ref-type="fig" rid="F3">Figure 3</xref>. For instance, assuming again a video sequence representing some Panda bears is displayed on a screen and that the screen content is recorded by an external camera, the original content should be identifiable from the camcordered version. To this end, according to the digital watermarking framework, extra information (referred to as <italic>mark</italic> or <italic>watermark</italic>) is imperceptibly <italic>inserted</italic> (or, as a synonym, <italic>embedded</italic>) into the video content prior to its release (distribution, storage, display, &#x2026; ). By detecting the watermark in a potentially modified version of the watermarked video content, the original content shall be unambiguously identified. Of course, the watermark shall not be recovered from any unmarked content (be it visually related to the original content or not).</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Video watermarking principle: a binary watermark is imperceptibly inserted (embedded) in the video sequence; this way, the watermarked sequence can be subsequently identified even when its content is modified (maliciously or not).</p>
</caption>
<graphic xlink:href="frsip-02-984169-g003.tif"/>
</fig>
<p>
<italic>Video fingerprinting</italic> also deals with identifying slightly modified (replicated, or near-duplicated) content, yet its approach differs from both indexing and watermarking, as illustrated in <xref ref-type="fig" rid="F4">Figure 4</xref>. Coming back to the previous two examples, video fingerprinting shall also track a near-duplicated video sequence (<italic>e.g</italic>., a screen-recorded Panda sequence) back to its original (<italic>e.g</italic>., the Panda original sequence) that is stored in a video repository. Yet, unlike indexing, any other sequence, even one visually related to it (<italic>e.g.</italic>, the same Panda bear at a different time of the day and/or in different postures) shall not be detected as identical. To this end, some salient information (referred to as <italic>fingerprint</italic> or <italic>perceptual hash</italic>) is extracted from the query video sequence (note that this information is not previously inserted in the content, as in the case of watermarking). By comparing (according to a similarity measure and a preestablished threshold) the query fingerprint to the reference sequence fingerprints, a decision on the visual identity between the video sequences shall be made.
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Video fingerprinting principle: a binary descriptor extracted from a query video (<italic>fingerprint</italic>) can unambiguously identify all the near-duplicated versions of that content.</p>
</caption>
<graphic xlink:href="frsip-02-984169-g004.tif"/>
</fig>
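<p>To make the matching step above concrete, the following minimal sketch compares binary fingerprints through a normalized Hamming distance and a preestablished threshold; the fingerprint length, the bit patterns, and the 0.25 threshold are hypothetical illustration choices, not values advanced in the surveyed literature.</p>

```python
# Minimal sketch of fingerprint matching: a query fingerprint is declared
# a near duplicate of a reference one when their normalized Hamming
# distance stays under a preestablished threshold (hypothetical values).

def hamming_distance(fp_a, fp_b):
    """Count the bit positions where two equal-length fingerprints differ."""
    assert len(fp_a) == len(fp_b)
    return sum(1 for a, b in zip(fp_a, fp_b) if a != b)

def is_near_duplicate(query_fp, reference_fp, threshold=0.25):
    """Decide visual identity: normalized distance at most `threshold`."""
    dist = hamming_distance(query_fp, reference_fp) / len(query_fp)
    return threshold >= dist

# A replica (e.g., a camcorded copy) typically flips only a few bits:
original = "1011001110001111" * 4    # 64-bit reference fingerprint
replica = "1011001110001011" * 4     # 4 bits flipped out of 64
print(is_near_duplicate(replica, original))    # True: identified as a copy
print(is_near_duplicate("0" * 64, original))   # False: different content
```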
<p>Three main properties are generally considered for fingerprinting.</p>
<p>First, the <italic>unicity</italic> (or <italic>uniqueness</italic>) property assumes that different contents (<italic>i.e.</italic>, content that is neither the query nor one of its near-duplicated versions) result in different fingerprints (in the sense of the similarity measure and of its related threshold).</p>
<p>Secondly, the <italic>robustness</italic> property relates to the possibility of identifying as similar sequences that are near-duplicated. The transformations a video can undergo will be further referred to as <italic>modifications</italic>, <italic>distortions</italic>, or <italic>attacks</italic>, be they malicious or mundane. The video that is obtained through transformations, modifications, distortions, or attacks will be denoted as a <italic>copy</italic>, a <italic>replica video</italic>, a <italic>near duplicated video</italic> or an <italic>attacked video</italic>. While these terms are conceptually similar, fine distinctions among them can be made for some specific applicative fields. For instance, <xref ref-type="bibr" rid="B74">Liu et al., 2013</xref> mention at least four different definitions related to near duplicated video content, ranging from &#x201c;<italic>Identical or approximately identical videos close to the exact duplicate of each other, but different in file formats, encoding parameters, photometric variations (color, lighting changes), editing operations (caption, logo and border insertion), different lengths, and certain modifications (frames add/remove)</italic>&#x201d; <xref ref-type="bibr" rid="B124">Wu et al., 2007a</xref>, <xref ref-type="bibr" rid="B122">Wu et al., 2007b</xref> to &#x201c;<italic>Videos of the same scene (e.g., a person riding a bike) varying viewpoints, sizes, appearances, bicycle type, and camera motions. The same semantic concept can occur under different illumination, appearance, and scene settings, just to name a few.</italic>&#x201d; <xref ref-type="bibr" rid="B6">Basharat et al., 2008</xref>. Our study will stay at a generic level and will use these terms as referred to in the cited studies.</p>
<p>Finally, a fingerprinting method is said to feature <italic>dataset search efficiency</italic> if the computation of the fingerprints and the matching procedure ensure a low, application-dependent computation time. The dataset search efficiency is assessed by the average computation time needed to identify a query in the context of a considered video fingerprinting use case (that is, the execution time on a given processing environment and on a given repository).</p>
<p>By comparing these three methodological frameworks, it can be noted that:<list list-type="simple">
<list-item>
<p>&#x2022; <italic>Indexing</italic> and <italic>fingerprinting</italic> share the concept of tracking content thanks to information directly extracted from that content (that is, both <italic>indexing</italic> and <italic>fingerprinting</italic> are <italic>passive</italic> tracking techniques); yet, while fingerprinting tracks the content <italic>per se</italic>, indexing rather tracks a whole semantic family related to that content. From the applicative point of view, indexing and fingerprinting differ in the unicity property.</p>
</list-item>
<list-item>
<p>&#x2022; <italic>Watermarking</italic> and <italic>fingerprinting</italic> share the possibility of tracking both an original content and its replicas modified under a given level of accepted distortion; yet watermarking requires the insertion of additional information (that is, watermarking is an <italic>active</italic> tracking technique) while fingerprinting solely exploits information extracted from the very content to be tracked.</p>
</list-item>
</list>
</p>
<p>Moreover, note that <italic>video fingerprinting</italic> is also sometimes referred to as (<italic>perceptual</italic>) <italic>video hashing</italic> <xref ref-type="bibr" rid="B84">Nie et al., 2015</xref>, <xref ref-type="bibr" rid="B76">Liu, 2019</xref>, <xref ref-type="bibr" rid="B4">Anuranji and Srimathi, 2020</xref>. Yet, a distinction should be made with respect to <italic>robust video hashing</italic> <xref ref-type="bibr" rid="B26">Fridrich and Goljan, 2000</xref>, <xref ref-type="bibr" rid="B133">Zhao et al., 2013</xref>, <xref ref-type="bibr" rid="B91">Ouyang et al., 2015</xref>, which belongs to the security and/or forensics applicative areas and generally refers to applications where a distinction between content-preserving and content-manipulation attacks should be made. Robust video hashing is out of the scope of the present study.</p>
<p>These properties turn fingerprinting into a paradigm with potential impact in a large variety of applicative fields. The ability to identify and retrieve videos even under distortions is a powerful tool for automatic video filtering and retrieval, copyright infringement prevention, media content broadcast monitoring over multi-broadcast channels, contextual advertising, or business analytics, to mention but a few <xref ref-type="bibr" rid="B65">Lefebvre et al., 2009</xref>, <xref ref-type="bibr" rid="B78">Lu, 2009</xref>, <xref ref-type="bibr" rid="B98">Seidel, 2009</xref>, <xref ref-type="bibr" rid="B129">Yuan et al., 2016</xref>, <xref ref-type="bibr" rid="B120">Wary and Neelima, 2019</xref>, <xref ref-type="bibr" rid="B87">Nie et al., 2021</xref>.</p>
<p>The analogy between the human and video fingerprints brings to light two key aspects. First, from the conceptual point of view, it implicitly assumes that video fingerprinting exists, that is, that a reduced set of information extracted from the video content makes it possible for the content to be tracked. As this concept cannot be <italic>a priori</italic> proved, it requires comprehensive <italic>a posteriori</italic> validation in a consensual evaluation framework, as discussed in <xref ref-type="sec" rid="s3">Section 3</xref>. Secondly, from the methodological point of view, any video fingerprinting processing pipeline is composed of two main components: the <italic>fingerprint extractor</italic> (that is, the method for computing the fingerprint) and the <italic>fingerprint detector</italic> (that is, the method for searching similar content based on that fingerprint). Consequently, the state-of-the-art studies in <xref ref-type="sec" rid="s4">Section 4</xref> will be presented according to these two items.</p>
</sec>
<sec id="s3">
<title>3 Evaluation framework</title>
<p>In a nutshell, the performance of a video fingerprinting system can be objectively assessed by <italic>evaluating its properties (uniqueness, robustness, and dataset search efficiency) on a consensual, statistically relevant dataset</italic>, and this section is structured accordingly. <xref ref-type="sec" rid="s3-1">Section 3.1</xref> presents the quantitative measures that are most often considered in state-of-the-art studies, along with their statistical grounds. <xref ref-type="sec" rid="s3-2">Section 3.2</xref> deals with the datasets to be processed in video fingerprinting experiments and presents the principles for their specification as well as some key examples that will be further referred to in <xref ref-type="sec" rid="s4">Section 4</xref>.</p>
<sec id="s3-1">
<title>3.1 Property evaluation</title>
<p>The evaluation of the uniqueness and the robustness properties can be achieved by considering fingerprinting as a statistical binary decision problem. Consider a query sequence whose identity is looked up in a reference dataset with the help of a video fingerprinting system.</p>
<p>According to the binary decision principle, when comparing a query to a given sequence in the dataset, two hypotheses can be stated:<list list-type="simple">
<list-item>
<p>
<monospace>o</monospace>
<italic>H0: the query is a replica of a video sequence identified through the tested fingerprint.</italic>
</p>
</list-item>
<list-item>
<p>
<monospace>o</monospace>
<italic>H1: the query is not a replica of the video sequence identified through the tested fingerprint</italic>.</p>
</list-item>
</list>
</p>
<p>The output of the system can be of two types: <italic>positive</italic>, when the query is identified as a replica of a video sequence, and <italic>negative</italic> otherwise.</p>
<p>When confronted with the ground truth, the statistical decisions can be labeled as <italic>true</italic>, when the result provided by the test is correct, and as <italic>false</italic> otherwise.</p>
<p>Consequently, four types of decisions are made:<list list-type="simple">
<list-item>
<p>
<monospace>o</monospace>
<italic>False positive</italic> (or <italic>false alarm</italic>, denoted by <inline-formula id="inf1">
<mml:math id="m1">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>): the system erroneously accepts the query as a copy of a reference video sequence.</p>
</list-item>
<list-item>
<p>
<monospace>o</monospace>
<italic>False negative</italic> (or <italic>missed detection</italic>, denoted by <inline-formula id="inf2">
<mml:math id="m2">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>): the system erroneously rejects a query as a copy of a reference video sequence.</p>
</list-item>
<list-item>
<p>
<monospace>o</monospace>
<italic>True positive</italic> (denoted by <inline-formula id="inf3">
<mml:math id="m3">
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>): the system correctly accepts a query as a copy of a reference video sequence.</p>
</list-item>
<list-item>
<p>
<monospace>o</monospace>
<italic>True negative</italic> (denoted by <inline-formula id="inf4">
<mml:math id="m4">
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>): the system correctly rejects a query as a copy of a reference video sequence.</p>
</list-item>
</list>
</p>
<p>The objective evaluation of a video fingerprinting system is achieved by deriving performance indicators from the four measures above.</p>
<p>To evaluate the uniqueness property, two measures are generally considered: the <inline-formula id="inf5">
<mml:math id="m5">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>a</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#xa0;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> (Probability of False Alarm), and the <inline-formula id="inf6">
<mml:math id="m6">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> (Precision) rate, <xref ref-type="bibr" rid="B108">Su et al., 2009</xref>, <xref ref-type="bibr" rid="B63">Lee and Yoo, 2008</xref>:<disp-formula id="equ1">
<mml:math id="m7">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>a</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>p</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>t</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mtext>&#x2003;</mml:mtext>
<mml:mi>P</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>p</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>
<inline-formula id="inf7">
<mml:math id="m8">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>a</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <italic>Prec</italic> are also referred to as <italic>FPR</italic> (False Positive Rate) and <italic>PPV</italic> (Positive Predictive Value), respectively; the <italic>TPR</italic> (True Positive Rate) corresponds to the <italic>Rec</italic> (Recall) rate defined below.</p>
<p>To evaluate the robustness property, the <inline-formula id="inf8">
<mml:math id="m9">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (Probability of Missed Detection), and the <inline-formula id="inf9">
<mml:math id="m10">
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> (Recall) rate are generally considered:<disp-formula id="equ2">
<mml:math id="m11">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>p</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mtext>&#x2003;</mml:mtext>
<mml:mi>R</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>p</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>An efficient fingerprinting method (featuring both unicity and robustness) should jointly ensure low values for <inline-formula id="inf10">
<mml:math id="m12">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>a</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#xa0;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf11">
<mml:math id="m13">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> while having <inline-formula id="inf12">
<mml:math id="m14">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf13">
<mml:math id="m15">
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> values close to 1. The actual thresholds for these entities depend on the specific use case.</p>
<p>Although <inline-formula id="inf14">
<mml:math id="m16">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf15">
<mml:math id="m17">
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> are two measures commonly used in the evaluation of any information retrieval system, they are not statistical measures as they do not consider the true negative results. Hence, to comprehensively present the properties of a system, <inline-formula id="inf16">
<mml:math id="m18">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>a</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#xa0;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf17">
<mml:math id="m19">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> should also be considered.</p>
<p>In practice, several other derived and/or complementary performance indicators can be considered, such as the <italic>F1 score</italic>, the <italic>ROC</italic> (Receiver Operating Characteristic), the <italic>AUC</italic> (Area Under the Curve), or the <italic>mAP</italic> (mean Average Precision).</p>
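<p>For illustration, the four decision counts and the measures derived from them can be condensed into a short numerical sketch; the counts below are hypothetical and serve only to exemplify the formulas.</p>

```python
# Illustrative computation of the fingerprinting evaluation measures
# from the four decision counts (hypothetical counts, for illustration).

def evaluation_measures(tp, fp, tn, fn):
    p_fa = fp / (fp + tn)               # Probability of False Alarm (FPR)
    p_md = fn / (tp + fn)               # Probability of Missed Detection
    prec = tp / (tp + fp)               # Precision
    rec = tp / (tp + fn)                # Recall
    f1 = 2 * prec * rec / (prec + rec)  # derived F1 score
    return {"P_fa": p_fa, "P_md": p_md, "Prec": prec, "Rec": rec, "F1": f1}

# Hypothetical outcome of 1,000 query/reference comparisons:
measures = evaluation_measures(tp=180, fp=20, tn=780, fn=20)
print(measures)  # P_fa = 0.025, P_md = 0.1, Prec = 0.9, Rec = 0.9, F1 = 0.9
```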
<p>From a theoretical point of view, the dataset search efficiency can be expressed by the computational complexity, which gives the number of elementary operations required for computing and matching fingerprints as a function of the video sequence parameters (frame size, frame rate) and the repository size. As such an approach is of limited use for NN-based algorithms, the dataset search efficiency property is commonly assessed by the average processing time required by the video fingerprinting system to identify a query within the reference dataset and to output the result. The average processing time can be obtained by averaging the processing times measured for the considered collection of queries. Of course, such an evaluation implicitly assumes that a detailed description is available about the performance of the computing configuration (CPU, GPU) as well as about the size of the dataset.</p>
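<p>In the same spirit, the average processing time can be estimated by placing a wall-clock timer around the query identification calls; the lookup function, the repository, and the query collection below are mere placeholders standing in for an actual fingerprint extraction and matching pipeline.</p>

```python
# Sketch of the dataset search efficiency assessment: average per-query
# processing time over a collection of queries. `identify` is a
# placeholder standing in for a real fingerprint matching procedure.
import time

def identify(query_fp, repository):
    # Placeholder lookup standing in for actual fingerprint matching.
    return query_fp in repository

repository = {f"fp_{i:06d}" for i in range(100_000)}  # reference fingerprints
queries = [f"fp_{i * 7:06d}" for i in range(1_000)]   # query collection

start = time.perf_counter()
hits = sum(identify(q, repository) for q in queries)
elapsed = time.perf_counter() - start
print(f"{hits} hits, average query time: {elapsed / len(queries):.2e} s")
```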
</sec>
<sec id="s3-2">
<title>3.2 Evaluation dataset</title>
<p>Regardless of the evaluated property, the dataset plays a central role, and its design is expected to observe three constraints: statistical relevance, application completeness, and consensual usage.</p>
<p>The statistical relevance (and implicitly, the reproducibility of the results) mainly relates to the size of the dataset that should ensure the statistical error control (<italic>e.g.</italic>, the sizes of <inline-formula id="inf18">
<mml:math id="m20">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>a</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf19">
<mml:math id="m21">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, the related relative errors, &#x2026; ) during the algorithmic evaluation and comparison. From this point of view, fingerprinting properties are expected to be reported with statistical precision (<italic>e.g.</italic>, confidence limits for the abovementioned entities).</p>
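The statistical-precision requirement above can be illustrated with a textbook normal-approximation confidence interval for an error probability estimated on <italic>n</italic> independent trials (the counts below are made up for illustration):

```python
# Illustrative 95% normal-approximation (Wald) confidence interval for an
# error probability (e.g., an estimated P_fa) measured as errors/n trials.
import math

def confidence_interval(errors, n, z=1.96):
    p = errors / n
    half = z * math.sqrt(p * (1 - p) / n)   # half-width of the interval
    return max(0.0, p - half), min(1.0, p + half)

low, high = confidence_interval(errors=12, n=10_000)  # estimated rate 0.0012
```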
<p>The application completeness mainly relates to the type of content included in the dataset, which is expected to cover the applicative scope of the developed method.</p>
<p>The consensual usage relates to the acceptance of the dataset by the research community: this item relates to the possibility of objectively comparing results reported in different studies.</p>
<p>Of course, each dataset and each application evaluated on a specific dataset reach a different trade-off among these three desiderata. <xref ref-type="table" rid="T1">Table 1</xref> provides a comparative view of some of the most often considered datasets (see <xref ref-type="sec" rid="s4">Section 4</xref>); some of these corpora are introduced hereafter.</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Examples of datasets processed for fingerprinting evaluation. The lower part (last 5 rows) corresponds to corpora processed by fingerprinting methods exploiting NNs.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Dataset</th>
<th align="left">No. of video clips</th>
<th align="left">Total duration</th>
<th align="left">Average clip duration</th>
<th align="left">Attacks</th>
<th align="left">Additional info</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td rowspan="6" align="left">
<italic>Muscle-VCD</italic> <bold>
<italic>2007</italic>
</bold>
</td>
<td rowspan="6" align="left">
<bold>101</bold> (15 originals)</td>
<td rowspan="6" align="left">
<bold>80&#xa0;h</bold> (2.5&#xa0;h originals)</td>
<td rowspan="6" align="left">47&#xa0;min 30 s</td>
<td align="left">change of color/brightness</td>
<td rowspan="22" align="left">100,000 additional videos in option (to serve as background distraction)</td>
</tr>
<tr>
<td align="left">blur</td>
</tr>
<tr>
<td align="left">recording with an angle</td>
</tr>
<tr>
<td align="left">logos/subtitles insertion</td>
</tr>
<tr>
<td align="left">vertical shift</td>
</tr>
<tr>
<td align="left">flipping</td>
</tr>
<tr>
<td rowspan="5" align="left">
<italic>CC_WEB_VIDEO</italic> <bold>
<italic>2007</italic>
</bold>
</td>
<td rowspan="5" align="left">
<bold>13,129</bold> (9,300 originals)</td>
<td rowspan="5" align="left">
<bold>551&#xa0;h</bold> (387.5&#xa0;h originals)</td>
<td rowspan="5" align="left">2&#xa0;min 30 s</td>
<td align="left">compression</td>
</tr>
<tr>
<td align="left">photometric variations</td>
</tr>
<tr>
<td align="left">postproduction</td>
</tr>
<tr>
<td align="left">content modification (frame add/remove)</td>
</tr>
<tr>
<td align="left">frame rate modification</td>
</tr>
<tr>
<td rowspan="7" align="left">
<italic>TRECVID</italic> <bold>
<italic>2011</italic>
</bold>
</td>
<td rowspan="7" align="left">
<bold>11,256</bold> (201 originals)</td>
<td rowspan="7" align="left">
<bold>400&#xa0;h</bold> (6.7&#xa0;h originals)</td>
<td rowspan="7" align="left">2&#xa0;min</td>
<td align="left">camcording</td>
</tr>
<tr>
<td align="left">picture in picture</td>
</tr>
<tr>
<td align="left">insertions of patterns</td>
</tr>
<tr>
<td align="left">compression</td>
</tr>
<tr>
<td align="left">change of gamma</td>
</tr>
<tr>
<td align="left">decrease in quality</td>
</tr>
<tr>
<td align="left">postproduction</td>
</tr>
<tr>
<td rowspan="4" align="left">
<italic>VCDB</italic> <bold>
<italic>2014</italic>
</bold>
</td>
<td rowspan="4" align="left">
<bold>9,236</bold> (528 originals)</td>
<td rowspan="4" align="left">
<bold>2,030&#xa0;h</bold> (27&#xa0;h originals)</td>
<td rowspan="4" align="left">73&#xa0;s</td>
<td align="left">insertion of patterns</td>
</tr>
<tr>
<td align="left">camcording</td>
</tr>
<tr>
<td align="left">scale changes</td>
</tr>
<tr>
<td align="left">picture in picture</td>
</tr>
<tr>
<td align="left">
<italic>CCV</italic> <bold>
<italic>2011</italic>
</bold>
</td>
<td align="left">
<bold>9,317</bold>
</td>
<td align="left">
<bold>210&#xa0;h</bold>
</td>
<td align="left">80&#xa0;s</td>
<td rowspan="6" align="left"/>
<td rowspan="2" align="left">20 semantic labels (bird, soccer, baseball, &#x2026; )</td>
</tr>
<tr>
<td rowspan="2" align="left">
<italic>UCF101</italic> <bold>
<italic>2012</italic>
</bold>
</td>
<td rowspan="2" align="left">
<bold>13,320</bold>
</td>
<td rowspan="2" align="left">
<bold>27&#xa0;h</bold>
</td>
<td rowspan="2" align="left">7&#xa0;s</td>
</tr>
<tr>
<td align="left">action recognition data set of realistic action videos</td>
</tr>
<tr>
<td align="left">
<italic>ActivityNet</italic> <bold>
<italic>2015</italic>
</bold>
</td>
<td align="left">
<bold>19,994</bold>
</td>
<td align="left">
<bold>849&#xa0;h</bold>
</td>
<td align="left">2&#xa0;min 30 s</td>
<td align="left">specialized for human activity understanding</td>
</tr>
<tr>
<td align="left">
<italic>YLI-MED</italic> <bold>
<italic>2015</italic>
</bold>
</td>
<td align="left">
<bold>50,000</bold>
</td>
<td align="left">
<bold>625&#xa0;h</bold>
</td>
<td align="left">45&#xa0;s</td>
<td align="left">specialized for research in multimedia event detection</td>
</tr>
<tr>
<td align="left">
<italic>Youtube-8M</italic>
<bold>
<italic>2018</italic>
</bold>
</td>
<td align="left">
<bold>6,100,000</bold>
</td>
<td align="left">
<bold>350,000&#xa0;h</bold>
</td>
<td align="left">3&#xa0;min 30 s</td>
<td align="left"/>
</tr>
</tbody>
</table>
</table-wrap>
<p>TRECVID (TREC Video Retrieval Evaluation) framework <xref ref-type="bibr" rid="B115">Trecvid, 2022</xref>, <xref ref-type="bibr" rid="B19">Douze et al., 2008</xref> is a key example in this respect, as it provides substantial benchmarking datasets. Sponsored by the NIST (National Institute of Standards and Technology) with additional support from other US governmental agencies, TRECVID is structured around different &#x201c;tasks&#x201d;, each focused on a particular aspect of the multimedia retrieval problem, such as <italic>ad-hoc</italic> video search, instance search, and event detection, to mention but a few. TRECVID datasets consider video copies that are generated under video transformations, such as blurring, cropping, shifting, brightness changing, noise addition, picture-in-picture, frame removing, or text inserting <xref ref-type="bibr" rid="B119">Wang et al., 2016</xref>, <xref ref-type="bibr" rid="B81">Mansencal et al., 2018</xref>.</p>
<p>Related efforts are also carried out under different frameworks, such as Muscle-VCD (or simply Muscle) <xref ref-type="bibr" rid="B62">Law-To et al., 2007b</xref> or VCDB (Large-Scale Video Copy Detection Database) <xref ref-type="bibr" rid="B45">Jiang and Wang, 2016</xref>. Research institutes active in the field, like INRIA in France, have also created datasets, such as the INRIA Copy Days dataset <xref ref-type="bibr" rid="B42">Jegou et al., 2008</xref>. National and/or international research projects are also prone to generate datasets <xref ref-type="bibr" rid="B89">Open Video, 2022</xref>, <xref ref-type="bibr" rid="B27">Garboan and Mitrea, 2016</xref>.</p>
<p>With the advent of NN approaches, research groups affiliated with popular multimedia platform operators organized and made available large datasets, as presented in the last 5 rows of <xref ref-type="table" rid="T1">Table 1</xref>. For instance, the YouTube-8M Segments dataset <xref ref-type="bibr" rid="B1">Abu-El-Haija et al., 2016</xref> includes human-verified labels on about 237K segments and 1,000 classes, summing up to more than 6 million video IDs or more than 350,000&#xa0;h of video. The dataset is organized in about 3,800 classes with an average of 3 labels per video. Of course, several other AI datasets coexist. For instance, <xref ref-type="bibr" rid="B134">Zhixiang et al., 2018</xref> points to three of them: CCV (Columbia Consumer Video) <xref ref-type="bibr" rid="B47">Jiang et al., 2011</xref>, YLI-MED (YLI Multimedia Event Detection) <xref ref-type="bibr" rid="B8">Bend, 2015</xref>, <xref ref-type="bibr" rid="B114">Thomee, 2016</xref>, and ActivityNet <xref ref-type="bibr" rid="B36">Heilbron et al., 2015</xref>. Note that, unlike the TRECVID datasets, the datasets mentioned in this paragraph are not specifically designed for fingerprinting applications but for general video tracking applications (including indexing): hence, the near-duplicated content is expected to be created by the experimenter, according to the application requirements and the principles above.</p>
</sec>
</sec>
<sec id="s4">
<title>4 Methodological frameworks</title>
<p>While today no fingerprinting state-of-the-art study can be either exhaustive or detailed, this section focusses on illustrating the main trends rather than on the impressive variety of studies. It is structured according to the two main steps in a generic fingerprinting computing pipeline: fingerprint extraction (that is, spatio-temporal salient information extraction) and fingerprint matching (that is, comparing salient information extracted from two different video sequences). These two basic steps are, in their turn, composed of several sub-steps <xref ref-type="bibr" rid="B19">Douze et al., 2008</xref>, <xref ref-type="bibr" rid="B63">Lee and Yoo, 2008</xref>, <xref ref-type="bibr" rid="B108">Su et al., 2009</xref>. On the one hand, the fingerprint computation generally includes video pre-processing (<italic>e.g.</italic>, letterboxing removal, frame resizing, frame dropping and/or key-frame detection), local feature extraction, global feature extraction, local/global feature description, temporal information retrieval, and the means for accelerating the search in the dataset (inverted file, <italic>etc</italic>.). On the other hand, the detection procedure generally includes some time-alignment operations (time origin synchronization, jitter cancelation, &#x2026; ), followed by information matching.</p>
<p>Significant differences occur in the ways these steps are implemented. Hence, this section is structured into two categories, further referred to as <italic>conventional</italic> (<xref ref-type="sec" rid="s4-1">Section 4.1</xref>) and <italic>NN-based fingerprinting</italic> (<xref ref-type="sec" rid="s4-2">Section 4.2</xref>) methods. The former category relates to the earliest fingerprinting methods (e.g., 2009&#x2013;2019) and stems from image processing and machine learning, underpinned by information theory concepts. The latter category is incremental with respect to the former one, as it (partially) considers concepts and tools belonging to the NN realm for achieving fingerprint extraction and matching. Of course, studies combining conventional and NN tools also exist <xref ref-type="bibr" rid="B84">Nie et al., 2015</xref>, <xref ref-type="bibr" rid="B85">Nie X. et al., 2017</xref>, <xref ref-type="bibr" rid="B21">Duan et al., 2019</xref>, <xref ref-type="bibr" rid="B136">Zhou et al., 2019</xref>; they will be discussed in <xref ref-type="sec" rid="s4-2">Section 4.2</xref>.</p>
<sec id="s4-1">
<title>4.1 Conventional methods</title>
<sec id="s4-1-1">
<title>4.1.1 Main directions</title>
<p>As a common ground, these methods stem from image processing, machine learning, and information theory concepts and leverage the fingerprinting extraction on three incremental levels <xref ref-type="bibr" rid="B27">Garboan and Mitrea, 2016</xref>.</p>
<p>First, in an attempt to achieve frame aspect distortion invariance, the fingerprint is extracted from derived representations such as 2D-DWT (2D Discrete Wavelet Transform) coefficients <xref ref-type="bibr" rid="B27">Garboan and Mitrea, 2016</xref>, 3D-DCT (3D Discrete Cosine Transform) coefficients <xref ref-type="bibr" rid="B11">Coskun et al., 2006</xref>, pixel differences between consecutive frames, the temporal ordinal measure of average intensity blocks in successive frames <xref ref-type="bibr" rid="B32">Hampapur and Bolle, 2001</xref>, visual attention regions <xref ref-type="bibr" rid="B108">Su et al., 2009</xref>, quantized block motion vectors, the ordinal ranking of the average gray level of frame blocks, quantized compact Fourier&#x2013;Mellin transform coefficients, ordinal histograms of frames <xref ref-type="bibr" rid="B54">Kim and Vasudev, 2005</xref>, <xref ref-type="bibr" rid="B96">Sarkar et al., 2008</xref>, the color layout descriptor, ...</p>
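The ordinal-measure idea mentioned above can be sketched as follows (our toy version, not a reviewed algorithm; the block grid and the synthetic frame are illustrative). Because only the rank order of the block means is kept, the signature survives global brightness shifts:

```python
# Toy ordinal signature: rank the average intensities of frame blocks.
# Global photometric changes that preserve the ordering leave it intact.
import numpy as np

def ordinal_signature(frame, blocks=(2, 2)):
    h, w = frame.shape
    bh, bw = h // blocks[0], w // blocks[1]
    means = [frame[i*bh:(i+1)*bh, j*bw:(j+1)*bw].mean()
             for i in range(blocks[0]) for j in range(blocks[1])]
    return np.argsort(np.argsort(means))  # rank of each block's mean

frame = np.arange(16, dtype=float).reshape(4, 4)
sig = ordinal_signature(frame)                 # ranks of the 4 block means
sig_bright = ordinal_signature(frame + 50.0)   # unchanged by a brightness shift
```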
<p>Secondly, frame content distortion invariance can be achieved by the complementarity between global features incorporating geometric information (<italic>e.g.</italic>, the centroid of gradient orientations of keyframes <xref ref-type="bibr" rid="B63">Lee and Yoo, 2008</xref> or invariant moments of the frame edge representation) and local features based on interest points (corner features, Hessian-Affine, Harris points, SIFT (Scale-Invariant Feature Transform), SURF (Speeded Up Robust Features)), generally described under the BoVW (Bag of Visual Words) framework <xref ref-type="bibr" rid="B19">Douze et al., 2008</xref>, <xref ref-type="bibr" rid="B47">Jiang et al., 2011</xref>.</p>
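As a minimal illustration of the BoVW description mentioned above (our sketch; the two-word codebook and 2-D descriptors are toy assumptions, whereas real systems learn codebooks of thousands of words from SIFT/SURF descriptors):

```python
# Toy bag-of-visual-words: assign each local descriptor to its nearest
# codeword (squared Euclidean distance) and histogram the assignments.
import numpy as np

def bovw_histogram(descriptors, codebook):
    # pairwise squared distances: (n_desc, n_words)
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                     # nearest codeword per descriptor
    return np.bincount(words, minlength=len(codebook))

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])   # 2-word toy codebook
hist = bovw_histogram(np.array([[1.0, 1.0], [9.0, 9.0], [0.5, 0.0]]), codebook)
```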
<p>Thirdly, video format distortion invariance is generally handled by a large variety of additional synchronization mechanisms, designed in pair with the feature selection, ranging from synchronization blocks based on wavelet coefficients to K-Nearest Neighbors matching <xref ref-type="bibr" rid="B61">Law-To et al., 2007a</xref> of interest points or Viterbi-like algorithms <xref ref-type="bibr" rid="B100">Shikui et al., 2011</xref>.</p>
<p>These main directions as well as their mutual combinations will be considered in the next section as structuring elements. They will be illustrated by a selection of 15 studies, published between 2009 and 2019, that will be presented in chronological order. The functional synergies among and between these studies are synoptically presented in <xref ref-type="fig" rid="F5">Figure 5</xref> that is structured in three layers, shaped as hemicycles:<list list-type="simple">
<list-item>
<p>&#x2022; the outer blue layer relates to local feature description, exemplified through MPEG-CDVS (Compact Descriptors for Visual Search), ORB (Oriented Fast and Rotated BRIEF), SURF, Transformed domains, SIFT, CS-LBP (Center-symmetric Local Binary Patterns), and HOG (Histogram of Oriented Gradient).</p>
</list-item>
<list-item>
<p>&#x2022; the middle gray layer relates to global features, exemplified through luminance component, color histograms, and BoVW.</p>
</list-item>
<list-item>
<p>&#x2022; the inner blue layer relates to the temporal features, exemplified through luminance spectrogram, motion vectors, histogram correlation, optical flow, and TIRI (Temporal Informative Representative Image).</p>
</list-item>
</list>
</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Conventional fingerprinting method synopsis: the hemicycles (areas) related to the local, global, and temporal features are located at the outer, middle, and inner parts of the figure, respectively. Inside each hemicycle, examples of state-of-the-art solutions are presented. Conventional methods are presented in gray-shadowed rectangles while NN-based methods that also include conventional modules are represented in white rectangles.</p>
</caption>
<graphic xlink:href="frsip-02-984169-g005.tif"/>
</fig>
<p>The order of the classes in each hemicycle is chosen to allow for a better visual representation of the synergies among them. The studies represented in gray-shadowed rectangles correspond to conventional methods while the studies represented in white rectangles correspond to NN-based methods that also include conventional modules.</p>
</sec>
<sec id="s4-1-2">
<title>4.1.2 Methods overview</title>
<p>The complementarity between visual similarity and temporal consistency is exploited in <xref ref-type="bibr" rid="B111">Tan et al. (2009)</xref> to achieve scalability during the detection and localization of video content replicas. The video content synchronization is modeled as a network flow problem. Specifically, the chronological matching of the frames between the two video sequences is replaced by the search for a maximal path that carries the maximum capacity in a transmission network, under constraints of type <italic>must-link</italic> and <italic>cannot-link</italic>. As the theoretical solution thus obtained can feature a large complexity, the study also suggests an <italic>a posteriori</italic> simplification based on 7 heuristic constraints. The study exploits the idea that temporal alignment alleviates the constraints on visual feature effectiveness; to prove this, a Hessian-Affine detector and PCA-SIFT (Principal Component Analysis SIFT) features are considered in the experiments. On the detection side, key-point matching is considered. The experiments are structured at four levels: matching partial segments of full-length movies to videos crawled from YouTube, detection of near-duplicates in a dataset of more than 500&#xa0;h, and near-duplicate shot detection and copy detection on the TRECVID and Muscle-VCD-2007 datasets, respectively.</p>
<p>A fingerprinting method optimized for the search of strongly modified sequences in reduced-size video datasets is presented in <xref ref-type="bibr" rid="B20">Douze et al. (2010)</xref>. The fingerprints are computed from a subset of frames, either periodically sampled from the video sequence or chosen according to a visual content rule (<italic>key frames</italic>). The local visual information is extracted through Hessian-Affine detectors followed by SIFT and CS-LBP descriptors <xref ref-type="bibr" rid="B35">Heikkila et al., 2009</xref>. The descriptors are subsequently clustered by a bag-of-words approach combined with a Hamming Embedding procedure. To improve the search efficiency, an inverted file structure is finally considered. For the fingerprint retrieval, a spatio-temporal verification is performed to reduce the number of potential candidates. The experiments are carried out on the TRECVID 2008 dataset and show how the method parameters can be adjusted to reach a trade-off between accuracy and efficiency.</p>
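The inverted file idea used for search acceleration can be sketched as follows (our bare-bones version, far simpler than the Hamming-embedding index of the cited work; all identifiers and word ids are made up):

```python
# Toy inverted file: visual-word id -> list of video ids that contain it.
# A query is answered by voting on shared words and ranking by vote count.
from collections import defaultdict, Counter

class InvertedFile:
    def __init__(self):
        self.index = defaultdict(list)

    def add(self, video_id, words):
        for w in set(words):            # index each distinct word once
            self.index[w].append(video_id)

    def query(self, words):
        votes = Counter()
        for w in set(words):
            votes.update(self.index[w])  # one vote per shared word
        return votes.most_common()

ivf = InvertedFile()
ivf.add("ref-1", [3, 7, 9])
ivf.add("ref-2", [7, 11])
best = ivf.query([7, 9, 20])  # ref-1 shares two words, ref-2 shares one
```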
<p>The study reported in <xref ref-type="bibr" rid="B127">Yang et al. (2012)</xref> is based on SURF points <xref ref-type="bibr" rid="B7">Bay et al., 2008</xref> that are first extracted at the frame level. After dividing the frame into 16 even square blocks, the number of SURF points in each quadrant is traversed along a third-order Hilbert curve passing through each quadrant, resulting in an adjacent grid that keeps the same neighborhood as the original image. Finally, the hash bits are computed from the differences between the numbers of SURF points. To match two fingerprints, the CSR (Clip Similarity Rate) or the SSR (Sequence Similarity Rate) is calculated when the query and the reference videos have the same length or different lengths, respectively. The former (CSR) relates to the mean of the matching distances between the two hashes while the latter (SSR) represents a weighted average of matched, mismatched, and re-matched frames in the query video. The experiments select 40 source videos from the TRECVID 2011 framework and 60 of their replicas (logo insertion, picture in picture, video flipping, Gaussian noise). Three metrics are used to evaluate the method: <italic>Prec</italic>, <italic>Rec</italic>, and <italic>ROC</italic>. For a preestablished <italic>Prec</italic> value (set at 0.8 in the experiments), the advanced algorithm has the best <italic>Rec</italic> value (0.92) compared to the solutions advanced in <xref ref-type="bibr" rid="B132">Zhao et al. (2008)</xref> (<italic>Rec</italic> &#x3d; 0.78) and in <xref ref-type="bibr" rid="B54">Kim and Vasudev (2005)</xref> (<italic>Rec</italic> &#x3d; 0.57).</p>
<p>Aiming at obtaining fingerprint invariance against rotations, <xref ref-type="bibr" rid="B43">Jiang et al., 2012</xref> suggests the joint use of HOG and RMI (Relative Mean Intensity) features to express the visual characteristics of the frames. The fingerprint matching is based on the Chi-square statistics. The experimental results are obtained by processing the Muscle corpus and are expressed in terms of <italic>matching quality</italic>, computed as the ratio of correct answers to the total number of queries.</p>
<p>An early work presented in <xref ref-type="bibr" rid="B83">Ngo et al. (2005)</xref> considers an approach to video summarization that models the video as a temporal graph, detecting its highlights by analyzing motion vectors. That work is the backbone of the fingerprinting technique presented in <xref ref-type="bibr" rid="B66">Li and Vishal (2013)</xref>, with a focus on the compactness of the fingerprint. The key steps of the algorithm are preprocessing and segment extraction, computing the SGM (Structural Graphical Model), graph partitioning using the normalized graph cuts method, fingerprint extraction, and fingerprint quantization by applying a RAQ (Randomized Adaptive Quantizer). For the fingerprint extraction step, the authors selected the TIRI method, based on frame averaging followed by a 2-D DCT <xref ref-type="bibr" rid="B23">Esmaeili et al., 2011</xref>. The Hamming distance is used during the matching stage. To test the proposed method, 600 different videos were collected from YouTube; copy videos were then created using 8 different attacks of three major types: signal processing attacks, frame geometric attacks, and temporal attacks. The results show better accuracy, particularly for restricted fingerprint lengths, compared to TIRI <xref ref-type="bibr" rid="B23">Esmaeili et al., 2011</xref>, CGO (Centroids of Gradient Orientations) <xref ref-type="bibr" rid="B64">Lee and Yoo, 2006</xref>, and RASH (Radial hASHing) <xref ref-type="bibr" rid="B94">Roover et al., 2005</xref>.</p>
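The frame-averaging core of the TIRI idea can be sketched as follows (our simplified rendition; the window length and the exponential weighting factor are illustrative choices, and the subsequent 2-D DCT step is omitted):

```python
# Toy TIRI (Temporal Informative Representative Image): a weighted average
# of a short window of frames, collapsing temporal information into one image.
import numpy as np

def tiri(frames, gamma=0.65):
    # exponentially decaying weights over the window, normalized to sum to 1
    weights = np.array([gamma ** k for k in range(len(frames))])
    weights /= weights.sum()
    # contract the weight vector against the stacked frames -> one image
    return np.tensordot(weights, np.asarray(frames, dtype=float), axes=1)

frames = [np.full((2, 2), v) for v in (10.0, 20.0, 30.0)]
rep = tiri(frames)  # a single 2x2 representative image
```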
<p>The method in <xref ref-type="bibr" rid="B113">Thomas and Sumesh (2015)</xref> is a simple yet robust color-based video copy detection technique. The first step summarizes the video by extracting the key frames and then generating the TIRIs, thus including temporal information in the fingerprint. The second step extracts the color correlation group of each pixel of the TIRIs. The color correlation is clustered into 6 groups by comparing the intensities of the components in the RGB color space (<italic>e.g.</italic>, group 1 corresponds to <inline-formula id="inf20">
<mml:math id="m22">
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2265;</mml:mo>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2265;</mml:mo>
<mml:msub>
<mml:mi>B</mml:mi>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>). Finally, the histogram of the color correlation values is considered as the fingerprint representation of the video. The matching is done by calculating the normalized distance between the histograms representing the source and the query video clips. The experiments are run on a dataset of 22 source videos and of some of their basic near-duplicated versions (letterboxing, pillarboxing, rotations, &#x2026; ). Compared to other state-of-the-art techniques like SIFT or the basic color histogram, the advanced color correlation histogram system shows better performances (for the considered modifications) but remains sensitive to color changes (such as grayscale conversion or contrast changes).</p>
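The 6-group color correlation clustering can be sketched as follows (our illustration; the mapping from R/G/B orderings to group indices is an assumption, as the paper's exact group numbering is not reproduced here):

```python
# Toy color-correlation fingerprint: each pixel falls into one of the 6
# orderings of its R, G, B intensities; the normalized histogram of group
# labels is the fingerprint.
import numpy as np
from itertools import permutations

ORDERINGS = list(permutations("RGB"))  # the 6 possible channel orderings

def color_group(r, g, b):
    vals = {"R": r, "G": g, "B": b}
    for idx, (a, c, d) in enumerate(ORDERINGS):
        if vals[a] >= vals[c] >= vals[d]:
            return idx  # first matching ordering (ties fall in earliest group)
    return -1           # unreachable: some ordering always holds

def fingerprint(pixels):
    hist = np.zeros(len(ORDERINGS))
    for r, g, b in pixels:
        hist[color_group(r, g, b)] += 1
    return hist / hist.sum()

fp = fingerprint([(200, 100, 50), (10, 20, 30)])  # one pixel per extreme ordering
```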
<p>A multifeatured video fingerprinting system, designed to jointly improve the accuracy and the robustness, is advanced in <xref ref-type="bibr" rid="B39">Hou et al. (2015)</xref>. The fingerprint computation starts by extracting spatial features from the key frames that have been preprocessed (size and frame rate uniformization), then partitioned into <inline-formula id="inf21">
<mml:math id="m23">
<mml:mrow>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mi>x</mml:mi>
</mml:msub>
<mml:mo>&#xd7;</mml:mo>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mi>y</mml:mi>
</mml:msub>
<mml:mo>&#xa0;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> non-overlapping blocks. Global and local features are extracted as the color histograms of all sub-images and as SURF points, respectively. Additionally, an optical flow feature is extracted as a temporal domain feature: it is represented as a two-dimensional vector that reflects the motion among successive frames. The fingerprint detection is based on a multiple-feature matching method that combines the local color histogram feature and the optical flow of SURF points. For the experiments, 30 videos from the TRECVID 2010 dataset are processed. Compared to other video fingerprinting algorithms based on local descriptors, such as CGO <xref ref-type="bibr" rid="B38">Hong et al., 2010</xref> and Harris <xref ref-type="bibr" rid="B63">Lee and Yoo, 2008</xref>, the video detection <italic>Prec</italic> and <italic>Rec</italic> are slightly improved.</p>
<p>Belonging to the DWT-based fingerprinting family, the study presented in <xref ref-type="bibr" rid="B84">Nie et al. (2015)</xref> also focusses on the fingerprint dimensionality. The advanced fingerprinting scheme consists of two types of coefficients, intra-cluster and inter-cluster, thus preserving both global and local information. After normalizing the video clips (to 300 &#xd7; 240 pixels, 500 frames), the first step is to cluster the frames according to a graph model based on the K-means algorithm, whose parameters are estimated from the relationships among frames. To select the feature that represents a frame, the fourth-order cumulant of the luminance component is computed, thus ensuring invariance to different types of distortions (Gaussian noise addition, scaling, lossy compression, and low-pass filtering). The next step reduces the dimensionality while preserving the local and global structures thanks to an algorithm referred to as DOP (Double Optimal Projection): the dimensionality reduction is obtained by multiplying the cumulant coefficient matrix by a mapping matrix. The distance vector thus obtained yields two types of fingerprints: the statistical fingerprint, represented by the kurtosis coefficient of the distance vector, and the geometrical fingerprint, represented by the binarization of the distance vector. The matching procedure is performed in two steps: first, according to the distance between the statistical fingerprints and then according to the Hamming distance between the geometric fingerprints (an empirical threshold of 0.18 is considered for the binary decision). The experiments are performed on a dataset of 300 original video clips and of some of their replicas (MPEG compression, letterboxing, frame change, blur, shifting, rotations) and result in both <italic>Prec</italic> and <italic>Rec</italic> larger than 0.95.</p>
<p>The technique presented in <xref ref-type="bibr" rid="B82">Mao et al. (2016)</xref> assumes that the probability that five identical successive scene frames occur in two different videos is very low. The fingerprint computation starts by frame resizing (down to <italic>108 &#xd7; 132</italic>) and division into sub-regions of <italic>9 &#xd7; 11</italic> pixels. For each sub-region, two types of information are extracted from the luminance component: the mean value of the sub-region and 4 differential elements of the sub-region sub-blocks. This process generates 720 elements in total, comprising 144 mean values and 576 differential values. The fingerprint is subsequently quantized and clustered. A matching technique based on a binary search of an inverted file is implemented. A test dataset was created by collecting 510 Hollywood film clips and 756 of their replicas (re-encoding, logo addition, noise addition, picture in picture, &#x2026; ). An average detection rate of 0.98 is obtained.</p>
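The per-sub-region features can be sketched as follows (our stripped-down illustration with made-up geometry; the exact definition of the 4 differential elements in the cited work may differ from the quadrant differences used here):

```python
# Toy sub-region features: one mean value per block plus 4 differential
# elements, computed here as differences between the block's quadrant means.
import numpy as np

def block_features(block):
    h, w = block.shape
    q = [block[:h//2, :w//2].mean(), block[:h//2, w//2:].mean(),
         block[h//2:, :w//2].mean(), block[h//2:, w//2:].mean()]
    diffs = [q[0] - q[1], q[1] - q[2], q[2] - q[3], q[3] - q[0]]
    return [block.mean()] + diffs  # 1 mean + 4 differential elements

feats = block_features(np.arange(16, dtype=float).reshape(4, 4))
```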
<p>The Shearlet transform is a multi-scale and multi-dimensional transform that is specifically designed to address anisotropic and directional information at different scales. This property can be exploited in fingerprinting applications, as demonstrated in <xref ref-type="bibr" rid="B129">Yuan et al. (2016)</xref>, where a 4-scale Shearlet transform with 6 directions is considered. The fingerprint considers both low and high frequency coefficients and is defined under the form of the normalized sum of the SSCA (sub-band coefficient amplitudes). The low frequency coefficients are supposed to feature invariance with respect to common distortions, hence to ensure the fingerprint robustness. The high frequency coefficients, on their side, are supposed to keep the inner information of the visual content, hence to contribute to the method uniqueness. Such a frame-level fingerprint is coupled with a TIRI of the video. The search efficiency is based on the use of an IIF (Inverted Index File) mechanism. The experiments are carried out on visual content sampled from TRECVID 2010 and from the INRIA Copy Days dataset <xref ref-type="bibr" rid="B42">Jegou et al., 2008</xref>. The replicas are obtained through geometrical distortions (letterboxing, rotation), luminance distortions, noise addition (salt and pepper, Gaussian), text insertion, and JPEG compression. The quantitative results are expressed in terms of <italic>TPR</italic>, <italic>FPR</italic>, and <italic>F1</italic> score and consider as baselines two state-of-the-art methods based on the DCT and on the Ordinal Intensity Signature (OIS). The main advantage of the method is its resilience to geometric transformations (gains of about 0.3 in <italic>F1</italic> score).</p>
<p>The study <xref ref-type="bibr" rid="B119">Wang et al., 2016</xref> is centered on the use of the temporal dimension, expressed as the temporal correlation among successive frames in a video sequence. To this end, the video sequence is structured into groups of frames centered on some key frames (that is, the temporal context for a key frame is computed based on both preceding and succeeding frames). A fingerprint is subsequently extracted from each group of frames. From a conceptual standpoint, the fingerprint is based on the color correlation histogram computed on the frame sequence. Yet, to enhance the overall method speed, this visual information is processed through several types of operations. First, the dimensionality is reduced by projection on a random, bipolar (&#x2b;1/&#x2212;1) matrix. Secondly, a binary code is defined based on a weighted addition of the color correlation histogram elements. Finally, the search speed is accelerated by an LSH (Locality Sensitive Hashing) algorithm <xref ref-type="bibr" rid="B14">Datar et al., 2004</xref>. The matching is based on the LCS (Longest Common Subsequence) algorithm. The experiments consider 8 transformations included in the TRECVID 2009 dataset and report results (expressed in terms of <italic>Prec</italic> and <italic>Rec</italic>) that are compared against a solution relying on BoVW and SIFT <xref ref-type="bibr" rid="B131">Zhao et al., 2010</xref>: according to the type of attack, absolute gains between 5 and 14% in <italic>Prec</italic> and between 6 and 12% in <italic>Rec</italic> are shown. Although the method was optimized for reducing the search time, no experimental result is reported in this respect.</p>
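The random bipolar projection step can be sketched as follows (our illustration; the code length, seed, and histogram size are arbitrary, and the weighted-addition binarization of the cited work is replaced here by a plain sign rule):

```python
# Toy dimensionality reduction: project a histogram on a random +1/-1
# matrix and keep the sign bits as a short binary code.
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so reference and query codes agree

def binary_code(hist, n_bits=16):
    proj = rng.choice([-1.0, 1.0], size=(n_bits, len(hist)))  # bipolar matrix
    return (proj @ hist >= 0).astype(np.uint8)                # sign bits

code = binary_code(np.ones(64))  # a 16-bit code from a 64-bin histogram
```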
<p>A fingerprinting system based on the contourlet HMT (Hidden Markov Tree) model is designed in <xref ref-type="bibr" rid="B109">Sun et al. (2017)</xref>. The contourlet is a multidirectional and multiscale transform that is expected to handle directional plane information <xref ref-type="bibr" rid="B18">Do and Vetterli, 2004</xref> better than the well-known wavelet transform. The HMT generates links between the hidden states of the coefficients and their respective children. Before the extraction of the fingerprint, a normalization phase takes place: it unifies the frame rate, the width, and the height, and converts the frames to grayscale. Once normalized, each frame is partitioned into equal blocks, thus preserving the local features. The contourlet transform is then applied to each block to obtain the contourlet coefficients, which are fed to the HMT model to generate the standard deviation matrices. Finally, the SVD (Singular Value Decomposition) is used to reduce the dimension of the resulting standard deviation matrices. The video fingerprint is created by concatenating the fingerprints extracted from all the frames. This study adopts a 2-step matching algorithm. In the first step, the fingerprint of a random frame is used to compute its distance to all the fingerprints present in the dataset. The <italic>N</italic> best matches are further investigated in the second step, where the squared Euclidean distance between all the frames of the query clip and those of a reference clip is calculated. The reference video with the minimum distance is identified as the matching result. Compared to the CGO-based method <xref ref-type="bibr" rid="B63">Lee and Yoo, 2008</xref>, the <xref ref-type="bibr" rid="B109">Sun et al., 2017</xref> method achieves better performances in terms of the probability of false alarm and the probability of true detection.</p>
<p>
<xref ref-type="bibr" rid="B90">Ouali et al., 2017</xref> extends some basic concepts from audio to video fingerprinting. To this end, the video sequence is considered as a sequence of frames that are first resized. The fingerprint encodes the positions of several <italic>salient</italic> regions in some binary images generated from the luminance spectrogram; in this study, the term <italic>salient</italic> designates the regions featuring the highest spectral values. The selection of the salient areas can be done at the level of an individual frame or at the level of successive frames. The former considers a window of spectrogram coefficients centered on the related median, while the latter considers the regions that have the highest variations compared to the same regions in the previous frame. The experiments are carried out on the TRECVID 2009 and 2010 datasets and show that the fingerprint extracted on sequences of frames outperforms the fingerprint extracted at the level of individual frames.</p>
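A much simplified sketch of the frame-level idea is given below: the fingerprint retains the positions of the highest spectral values. The selection rule here (a plain top-n) is an assumption standing in for the median-centered windowing used in the paper, and the sizes are toy values.

```python
import numpy as np

def salient_positions(spectro_frame, n=5):
    """Return the (row, col) positions of the n highest spectral values
    in one spectrogram frame -- the 'salient' regions whose positions
    the fingerprint encodes (toy stand-in for the frame-level variant)."""
    flat = np.argsort(spectro_frame, axis=None)[-n:][::-1]
    return np.column_stack(np.unravel_index(flat, spectro_frame.shape))

frame = np.random.default_rng(2).random((16, 16))  # toy spectrogram frame
pos = salient_positions(frame)
```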
<p>The study presented in <xref ref-type="bibr" rid="B76">Liu (2019)</xref> addresses the issue of reducing the complexity and the execution time of the fingerprint matching in large datasets. The method to extract the fingerprint is referred to as <italic>rHash</italic> and it is derived from the <italic>aHash</italic> method <xref ref-type="bibr" rid="B126">Yang et al., 2006</xref>. First, a pre-processing step reduces the frame rate to 10, uniformizes the resolution to <italic>144x176</italic>, and generates the TIRIs <xref ref-type="bibr" rid="B24">Esmaeili and Ward, 2010</xref>. Secondly, the <italic>rHash</italic> involves 4 steps: image resizing, division into blocks, block-wise local mean computation, and the binarization of each pixel based on the correspondent block mean value. The <italic>rHash</italic> outputs a fingerprint composed of 12 words of 9 bits each. For the matching process, an algorithm based on a look-up table, word counting, and ordering operations is advanced. The TRECVID 2011 and the VCDB <xref ref-type="bibr" rid="B45">Jiang and Wang, 2016</xref> datasets are processed when benchmarking the advanced method against <italic>aHash</italic> and <italic>DCT-2ac hash</italic> <xref ref-type="bibr" rid="B23">Esmaeili et al., 2011</xref> methods: higher accuracy as well as increased searching speed are thus brought to light.</p>
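The block-wise binarization at the heart of <italic>rHash</italic> can be sketched as follows. This is a simplified, assumed version: the block grid is illustrative, and the paper's further packing of the bits into 12 words of 9 bits each is not reproduced here.

```python
import numpy as np

def block_binarize(gray, blocks=(3, 4)):
    """rHash-like step (sketch): split a frame into equal blocks, compute
    each block's local mean, and set one bit per pixel by comparison with
    the mean of its own block. Block grid and frame size are illustrative."""
    h, w = gray.shape
    bh, bw = h // blocks[0], w // blocks[1]
    bits = np.zeros_like(gray, dtype=np.uint8)
    for i in range(blocks[0]):
        for j in range(blocks[1]):
            blk = gray[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            bits[i*bh:(i+1)*bh, j*bw:(j+1)*bw] = blk >= blk.mean()
    return bits

img = np.arange(144 * 176, dtype=float).reshape(144, 176)  # toy 144x176 frame
fp_bits = block_binarize(img)
```

Comparing each pixel to its local block mean (rather than a global mean) keeps the bits robust to global brightness changes.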
<p>As video content is preponderantly recorded, stored, and transmitted in compressed formats, fingerprints extracted directly from the compressed stream beneficially eliminate the need for decoding operations. While early studies <xref ref-type="bibr" rid="B83">Ngo et al., 2005</xref>, <xref ref-type="bibr" rid="B66">Li and Vishal, 2013</xref> already considered MPEG motion vectors as partial information in fingerprinting applications, <xref ref-type="bibr" rid="B93">Ren et al., 2016</xref> can be considered as an incremental step: the fingerprint computation combines information extracted from the decompressed (pixel) domain with information extracted at the MPEG-2 stream level. First, from the decompressed <italic>I</italic> frames, key frames are selected according to their visual saliency. To this end, a histogram-based contrast is computed for each <italic>I</italic> frame, along with the underlying image entropy. Then, key frames are selected according to Pearson&#x2019;s coefficient. For any selected key frame, both global and local features are extracted, as color histograms and ORB descriptors, respectively. Finally, motion vectors directly extracted from the MPEG-2 stream serve as local temporal information: specifically, motion vector angle histograms are computed. Hence, the key frame fingerprint is a combination of the color histograms, the ORB descriptors, and the normalized motion vector histogram. The video fingerprint is computed as the set of key frame fingerprints. The matching procedure is individually performed at the level of the three components (<italic>i.e.</italic>, based on their individual appropriate matching criteria) and the overall decision is achieved by fusing the decisions made on the multiple features through a weighted additive voting model. In the experiments, the weights of the color histograms, ORB descriptors, and motion vector histograms are set to 0.2, 0.4, and 0.4, respectively. 
The experimental results are obtained by processing the TRECVID 2009 dataset and consider one state-of-the-art method based on SIFT. The gains of the advanced algorithm are evaluated in terms of NDCR (Normalized Detection Cost Rate), <italic>F1</italic> score, and copy detection processing time.</p>
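The weighted additive voting fusion can be sketched with the (0.2, 0.4, 0.4) weights reported above; the per-feature match scores and the decision threshold are illustrative assumptions.

```python
def fuse_decisions(scores, weights=(0.2, 0.4, 0.4), threshold=0.5):
    """Weighted additive voting over per-feature match scores, using the
    (0.2, 0.4, 0.4) weights reported for color histograms, ORB, and
    motion-vector histograms; the 0.5 threshold is illustrative."""
    fused = sum(w * s for w, s in zip(weights, scores))
    return fused, fused >= threshold

# Per-feature votes in [0, 1]: color histogram and motion vectors agree,
# ORB does not.
score, is_copy = fuse_decisions((1.0, 0.0, 1.0))
```

Because each component keeps its own matching criterion, the fusion only has to combine comparable scores, not raw heterogeneous distances.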
</sec>
<sec id="s4-1-3">
<title>4.1.3 Discussion</title>
<p>The previous section brings to light that the conventional fingerprinting methods form a fragmented landscape. While the general methodological framework is unitary (<italic>cf</italic>. <xref ref-type="sec" rid="s3">Section 3</xref>), each study takes on a different applicative challenge, from the search for strongly modified sequences in reduced-size video datasets to the reduction of the complexity and the execution time of the fingerprint matching. The evaluation criteria are different, with a preponderance of <italic>Prec</italic>, <italic>Rec</italic> and <italic>F1</italic> that are generally computed on datasets sampled from the corpora presented in <xref ref-type="table" rid="T1">Table 1</xref>; yet, the criteria for sampling the reference datasets are not always specified. In this context, no general and/or precise conclusion about the pros and the cons of the state-of-the-art methods can be drawn.</p>
<p>However, the value of these research efforts can be collectively judged by analyzing their steady evolution, as illustrated in <xref ref-type="fig" rid="F6">Figure 6</xref>. This figure covers the 2009&#x2014;2019 time span and presents, for each analyzed year, the key conceptual ideas (the dark-blue, left block) as well as the methodological enablers in fingerprinting extraction (the blue, right-upper block) and matching (the light-blue, right-lower block)<xref ref-type="fn" rid="fn2">
<sup>2</sup>
</xref>.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Incremental evolution of the conventional methods.</p>
</caption>
<graphic xlink:href="frsip-02-984169-g006.tif"/>
</fig>
<p>
<xref ref-type="fig" rid="F6">Figure 6</xref> and <xref ref-type="sec" rid="s4-1-2">Section 4.1.2</xref> show that the state of the art is versatile enough to pragmatically offer solutions to specific applicative fields, without being able to provide the ultimate fingerprinting method. As an attempt to reach such a solution, NN&#x2014;based solutions are considered for some or all of the blocks in the fingerprinting scheme, as explained in <xref ref-type="sec" rid="s4-2">Section 4.2</xref>.</p>
</sec>
</sec>
<sec id="s4-2">
<title>4.2 NN-based methods</title>
<sec id="s4-2-1">
<title>4.2.1 Main directions</title>
<p>The class of NN-based video fingerprinting methods can be considered as an additional direction with respect to the conventional fingerprinting methods presented in <xref ref-type="sec" rid="s4-1">Section 4.1</xref>. These methods inherit the basic conceptual workflow: pre-processing the video sequence, extracting spatial and temporal information, possibly aggregating them into various derived representations (be they binary or not), and matching.</p>
<p>However, NN-based video fingerprinting methods rely (at least partially) on various types of NN, from AlexNet <xref ref-type="bibr" rid="B58">Krizhevsky et al., 2012</xref> and ResNet (Residual neural network) (<xref ref-type="bibr" rid="B34">He et al., 2016</xref>) to CapsNet (Capsule Neural Network) <xref ref-type="bibr" rid="B95">Sabour et al., 2017</xref> and LSTM <xref ref-type="bibr" rid="B37">Hochreiter and Schmidhuber, 1997</xref>, sometimes requiring specifically designed architectures <xref ref-type="bibr" rid="B134">Zhixiang et al., 2018</xref>. Yet, such an approach does not exclude the usage of partial conventional solutions in conjunction with NN, <italic>e.g.</italic>, BoVW can be considered as an aggregation tool for visual features extracted by a CNN (Convolutional Neural Network) <xref ref-type="bibr" rid="B130">Zhang et al., 2019</xref>. Moreover, the matching algorithm is generally coupled with the NN considered in the extraction phase.</p>
<p>These main directions will be illustrated by a selection of 20 studies, published since 2016, that will be presented in chronological order. The relationships among them are depicted in <xref ref-type="fig" rid="F7">Figure 7</xref>, which is also structured in three hemicycles (as in <xref ref-type="fig" rid="F5">Figure 5</xref>), although their meanings are slightly different:<list list-type="simple">
<list-item>
<p>&#x2022; the outer blue layer corresponds to the spatial features, exemplified through: CRBM (Conditional Restricted Boltzmann Machine), ResNet, NIP (Nested Invariance Pooling), VGGNet, AlexNet, GoogleNet, new structures designed for the fingerprinting purpose, RetinaNet, and Tracked HetConv-MK (heterogeneous convolutional multi-kernel).</p>
</list-item>
<list-item>
<p>&#x2022; the middle gray layer corresponds to temporal features, exemplified through: weight correlation, LSTM (Long-Short Term Memory), SiameseLSTM (Siamese LSTM), Deep Metric Learning, and BiLSTM (bidirectional LSTM).</p>
</list-item>
<list-item>
<p>&#x2022; the inner blue layer corresponds to spatial-temporal features, exemplified through: 3D-ResNet50 and CapsNet structures.</p>
</list-item>
</list>
</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>NN-based fingerprinting method synopsis: the hemicycles (areas) related to the spatial, temporal, and spatial-temporal features are located at the outer, middle, and inner parts of the figure, respectively. Inside each hemicycle, examples of state-of-the-art solutions are presented.</p>
</caption>
<graphic xlink:href="frsip-02-984169-g007.tif"/>
</fig>
<p>The order of the classes in each hemicycle is again chosen to allow for a better visual representation of the synergies among them.</p>
</sec>
<sec id="s4-2-2">
<title>4.2.2 Methods overview</title>
<p>The work presented in <xref ref-type="bibr" rid="B45">Jiang and Wang (2016)</xref> is twofold. First, the VCDB is organized and presented by a comparison to other existing datasets (<italic>e.g.</italic>, Muscle-VCD) used to evaluate video copy detection algorithms. In parallel, a fingerprinting method referred to as SCNN (Siamese Convolutional Neural Network) is advanced. The SCNN is composed of two identical AlexNet <xref ref-type="bibr" rid="B58">Krizhevsky et al., 2012</xref> networks followed by a connection function layer that computes the Euclidean distance between the two AlexNet outputs, and finally a contrastive loss layer <xref ref-type="bibr" rid="B31">Hadsell et al., 2006</xref>. The information thus obtained is structured by BoVW. The experiments focus on the relationship between the dataset and the efficiency of the system. The rule of thumb that is thus stated is &#x201c;<italic>the bigger and the more heterogeneous the dataset, the harder for the systems to accurately detect copy videos</italic>&#x201d;. Specifically, the SCNN achieves <italic>F1</italic> &#x3d; 0.69 on VCDB.</p>
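The contrastive loss layer driving the Siamese training follows Hadsell et al. (2006): matching pairs are pulled together and non-matching pairs are pushed beyond a margin. A minimal sketch, with an illustrative margin value:

```python
def contrastive_loss(d, y, margin=1.0):
    """Contrastive loss (Hadsell et al., 2006) over the Euclidean
    distance d between the two branch outputs: matching pairs (y=1)
    are pulled together, non-matching pairs (y=0) are pushed beyond
    a margin (margin value illustrative)."""
    return y * d**2 / 2 + (1 - y) * max(margin - d, 0.0)**2 / 2

loss_match = contrastive_loss(0.3, 1)     # small: close matching pair is fine
loss_nonmatch = contrastive_loss(0.3, 0)  # large: close non-matching pair is penalized
```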
<p>A two-level fingerprint approach is presented in <xref ref-type="bibr" rid="B85">Nie X. et al. (2017)</xref>. First, the LRF (Low-level Representation Fingerprint) is computed as a tensor-based model that fuses different visual features such as SURF and color histograms. Then, the DRF (Deep Representation Fingerprint) extracts the deep semantic features by using a pretrained VGGNet <xref ref-type="bibr" rid="B50">Karen and Andrew, 2014</xref> containing five convolutional layers and three fully connected layers. The DRF takes 224 &#xd7; 224 RGB images as input and outputs a 4096-dimension vector. The matching solution is also structured at two levels: the LRF component identifies a candidate set, while the DRF further identifies the source video from the candidate set. The experiments consider both the CC_WEB_VIDEO <xref ref-type="bibr" rid="B123">Wu et al., 2009</xref> and Open Video <xref ref-type="bibr" rid="B89">Open Video, 2022</xref> datasets, thus processing about 20,000 source clips. The method is benchmarked against four methods, namely LRTA <xref ref-type="bibr" rid="B67">Li and Vishal, 2012</xref>, 3D DCT <xref ref-type="bibr" rid="B5">Baris et al., 2006</xref>, CGO <xref ref-type="bibr" rid="B63">Lee and Yoo, 2008</xref>, and CMF <xref ref-type="bibr" rid="B86">Nie et al., 2017b</xref>, that it outperforms in terms of ROC curve.</p>
<p>The study in <xref ref-type="bibr" rid="B97">Schuster et al. (2017)</xref> discloses a video stream fingerprinting method. The method takes advantage of a loophole in the MPEG-DASH standard <xref ref-type="bibr" rid="B104">Sodagar, 2011</xref> that induces content-dependent packet bursts, despite the stream encryption. The video is represented as information bursts that are sent to the end user by the streaming services. The data traffic features are captured via a script on the client device or via intrusion detectors in the network. To this end, a CNN composed of 3 convolution layers, max pooling, and 2 dense layers is designed. To train the model, an Adam optimizer <xref ref-type="bibr" rid="B55">Kingma and Jimmy, 2014</xref> is used, as well as a categorical cross-entropy error function. The dataset is extracted from 100 Netflix titles, 3,558 YouTube videos, 10 Vimeo and 10 Amazon titles. A different model is trained for each streaming platform. The classifier achieves 92% accuracy. Inspired by these results, the study in <xref ref-type="bibr" rid="B69">Li (2018)</xref> further investigates the aspects specifically related to the network, by extracting the information from Wi-Fi traffic, where the transport and MAC (Media Access Control) layers are encrypted via TLS (Transport Layer Security) and WPA-2 (Wi-Fi Protected Access 2), respectively. The Multi-Layer Perceptron (MLP) model achieves 97% accuracy in identifying videos from a small 10-video dataset.</p>
<p>Instead of using the output of the CNN as visual features, the study in <xref ref-type="bibr" rid="B56">Kordopatis-Zilos et al. (2017a)</xref> advances a method that extracts the image features starting from the activation values in the convolutional layers. The extracted information forms a frame-level histogram. A video-level histogram is then generated by summing all the frame-level histograms. For fast video retrieval, TF-IDF weighting is coupled to an inverted file indexing structure <xref ref-type="bibr" rid="B103">Sivic and Zisserman, 2003</xref>. To evaluate the proposed method, the CC_WEB_VIDEO <xref ref-type="bibr" rid="B123">Wu et al., 2009</xref> dataset is used, as well as 3 pre-trained CNNs, namely AlexNet <xref ref-type="bibr" rid="B58">Krizhevsky et al., 2012</xref>, GoogleNet <xref ref-type="bibr" rid="B110">Szegedy, 2015</xref>, and VGGNet <xref ref-type="bibr" rid="B101">Simonyan and Zisserman, 2014</xref>. GoogleNet performed the best (<italic>mAP</italic> &#x3d; 0.958), followed by AlexNet (<italic>mAP</italic> &#x3d; 0.951), then VGGNet (<italic>mAP</italic> &#x3d; 0.937).</p>
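The video-level aggregation and the TF-IDF weighting can be sketched as below. This is a minimal, assumed formulation (the standard tf-idf scheme over "visual words"); the actual inverted-file structure of the paper is not reproduced.

```python
import numpy as np

def video_histogram(frame_hists):
    """Video-level histogram: the sum of per-frame histograms."""
    return np.sum(frame_hists, axis=0)

def tf_idf(video_hists):
    """TF-IDF weighting over a corpus of video-level histograms
    (minimal sketch; the paper couples this with an inverted file)."""
    tf = video_hists / np.maximum(video_hists.sum(axis=1, keepdims=True), 1)
    df = np.count_nonzero(video_hists > 0, axis=0)       # document frequency
    idf = np.log(len(video_hists) / np.maximum(df, 1))   # rarity of each word
    return tf * idf

corpus = np.array([[3., 0., 1.],    # 2 videos x 3 "visual words"
                   [0., 2., 1.]])
weighted = tf_idf(corpus)
```

A visual word present in every video gets zero idf weight and thus no discriminative contribution, which is exactly why TF-IDF helps retrieval.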
<p>A deep learning architecture with a focus on DML (Deep Metric Learning) is presented in <xref ref-type="bibr" rid="B57">Kordopatis-Zilos et al. (2017b)</xref>. For feature extraction, the video is sampled at 1 frame per second, then fed to a pre-trained CNN model (AlexNet <xref ref-type="bibr" rid="B58">Krizhevsky et al., 2012</xref> and GoogleNet <xref ref-type="bibr" rid="B110">Szegedy, 2015</xref> are considered). For the DML architecture, a triplet-based network is proposed where an anchor, a positive and a negative video are used to optimize the loss function. The first layer of the DML is composed of 3 parallel Siamese DNN (Deep Neural Network). In turn, each Siamese DNN is composed of 3 dense fully connected layers followed by a normalization layer, where the sizes of the layers and their outputs depend on the input size. The VCDB dataset <xref ref-type="bibr" rid="B45">Jiang and Wang, 2016</xref> is used to train the DML. To evaluate the proposed system, the CC_WEB_VIDEO <xref ref-type="bibr" rid="B123">Wu et al., 2009</xref> dataset is used. The system scores <italic>mAP</italic> &#x3d; 0.969 when using GoogleNet and <italic>mAP</italic> &#x3d; 0.964 when using AlexNet, thus increasing the performances presented in <xref ref-type="bibr" rid="B56">Kordopatis-Zilos et al. (2017a)</xref>.</p>
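The anchor/positive/negative objective is the standard triplet loss; a minimal sketch on toy embeddings follows, with an illustrative margin value.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss sketch: the anchor embedding should be closer to the
    positive (related) video than to the negative one, by at least a
    margin (margin value illustrative)."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to negative
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # near the anchor
n = np.array([1.0, 0.0])   # far from the anchor
loss = triplet_loss(a, p, n)
```

A well-separated triplet (as above) incurs zero loss; swapping the positive and negative roles makes the loss strictly positive, which is what drives the embedding updates.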
<p>The study presented in <xref ref-type="bibr" rid="B70">Li and Chen (2017)</xref> develops a deep learning model capable of extracting spatio-temporal correlations among video frames, based on a CRBM <xref ref-type="bibr" rid="B112">Taylor et al., 2007</xref> that can simultaneously model the spatial and temporal correlations of a video. The spatial correlations are modeled by the connections between the visible and hidden layers at a given moment. The temporal correlations are modeled by the connections among the layers at different timestamps. The CRBM is paired with a denoising auto-encoder <xref ref-type="bibr" rid="B116">Vincent et al., 2010</xref> module that reduces the dimension of the CRBM output by reducing the redundancies and discovering the invariants to distortions. This process can be applied recursively. A so-called post-processing module takes as input 2 video fingerprints and decides whether they are similar or not. The TRECVID 2011 dataset is used for benchmarking. The advanced method reaches <italic>F1</italic> &#x3d; 0.98, thus outperforming four state-of-the-art techniques: SGM <xref ref-type="bibr" rid="B66">Li and Vishal, 2013</xref> (<italic>F1</italic> &#x3d; 0.91), 3D-DCT <xref ref-type="bibr" rid="B23">Esmaeili et al., 2011</xref> (<italic>F1</italic> &#x3d; 0.89), CGO <xref ref-type="bibr" rid="B63">Lee and Yoo, 2008</xref> (<italic>F1</italic> &#x3d; 0.78), and RASH <xref ref-type="bibr" rid="B94">Roover et al., 2005</xref> (<italic>F1</italic> &#x3d; 0.79).</p>
<p>The study in <xref ref-type="bibr" rid="B118">Wang et al. (2017)</xref> investigates the influence of the frame sampling that is usually applied at the beginning of fingerprint extraction and sets its goal on computing a compact fingerprint without decreasing the frame rate (that is, without frame dropping). Three main steps are designed: frame feature extraction, video feature encoding, and video segment matching. The frame feature extraction is realized by means of a VGGNet-16 <xref ref-type="bibr" rid="B101">Simonyan and Zisserman, 2014</xref> composed of 13 convolutional layers, 3 fully connected layers, and 5 max-pooling layers inserted after the convolutional layers. This step is followed by PCA whitening on the CNN output to reduce its dimensionality. The feature compression and aggregation are realized via the sparse coding technique, and the timeline alignment by pooling the frame features into 1-s intervals (max-pooling is chosen). Fast retrieval of the matching features is ensured by using a KD-tree to store the fingerprints, with temporal alignment implemented according to the temporal network described in <xref ref-type="bibr" rid="B44">Jiang et al. (2014)</xref>. To run the tests, the VCDB dataset <xref ref-type="bibr" rid="B45">Jiang and Wang, 2016</xref> is used. The advanced method performs better than two baseline fingerprinting methods: CNN with AlexNet <xref ref-type="bibr" rid="B45">Jiang and Wang, 2016</xref>, and Fusion with SCNN <xref ref-type="bibr" rid="B45">Jiang and Wang 2016</xref>. The experiments also study the impact of frame sampling: <italic>F1</italic> &#x3d; 0.7 when processing all the frames, dropping to <italic>F1</italic> &#x3d; 0.66 when processing 1 frame per second.</p>
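The max-pooling of frame features into 1-s intervals can be sketched as follows; the feature dimensions are toy values, and the simplifying assumptions (integer fps, frame count divisible by fps) are mine, not the paper's.

```python
import numpy as np

def pool_per_second(frame_features, fps):
    """Max-pool frame-level feature vectors over 1-s windows, as in the
    timeline-aligning step (sketch; assumes an integer fps and a frame
    count that is a multiple of fps)."""
    n, d = frame_features.shape
    return frame_features.reshape(n // fps, fps, d).max(axis=1)

feats = np.arange(12, dtype=float).reshape(6, 2)  # 6 frames, 2-D features
pooled = pool_per_second(feats, fps=3)            # 2 one-second descriptors
```

This keeps every frame's contribution (no frame dropping) while still yielding one compact descriptor per second of video.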
<p>The study in <xref ref-type="bibr" rid="B40">Hu and Lu (2018)</xref> combines CNN and RNN (Recurrent Neural Network) architectures for video copy detection purposes. The method is divided into 2 main steps. First, a CNN architecture extracts content features from each frame: through a ResNet model <xref ref-type="bibr" rid="B34">He et al., 2016</xref>, each frame is represented by a 2048-component vector. Secondly, spatio-temporal representations are generated on top of the frame-level vectors. Thus, a Long-Short Term Memory unit based Siamese Recurrent Neural Network (SiameseLSTM) is trained. The training is achieved by selecting clips with the same length (20 frames) from CC_WEB_VIDEO <xref ref-type="bibr" rid="B123">Wu et al., 2009</xref>. For video searching/matching purposes, the video is cut into 20-frame clips, before their respective spatial-temporal representations are generated. To identify the copied segments, a graph-based temporal network algorithm is used <xref ref-type="bibr" rid="B111">Tan et al., 2009</xref>. This algorithm is tested using the VCDB dataset <xref ref-type="bibr" rid="B45">Jiang and Wang, 2016</xref> and yields <italic>Prec</italic> &#x3d; 0.9, <italic>Rec</italic> &#x3d; 0.58, and <italic>F1</italic> &#x3d; 0.7233.</p>
<p>
<xref ref-type="bibr" rid="B75">Liu, 2018</xref> represents an example of spatial fingerprinting relying on CNN. The principle is to represent the video sequence as a collection of conceptual objects (in the computer vision sense) that are subsequently binarized. To compute the fingerprint, the video sequence is first space/time down-sampled. For each down-sampled frame, visual objects are computed using the RetinaNet structure <xref ref-type="bibr" rid="B71">Lin et al., 2017</xref>. The binarization of the detected objects is achieved recursively, block-wise: each object is divided into a group of non-overlapping blocks and each block into several non-overlapping subblocks. The fingerprint bits are assigned according to a thresholding operation: the subblock pixel value average is compared to the average of all the pixels in the corresponding block. The matching technique considers an IIF structure and a weighted Hamming distance. The experimental results concern the values of <italic>Prec</italic>, <italic>Rec</italic> and <italic>F1</italic>, computed on the VCDB dataset, and show that a 10% higher recall rate can be achieved at the cost of a 1% decrease in precision [the comparison is made against an ML method based on SIFT descriptors as well as against the CNN method presented in <xref ref-type="bibr" rid="B118">Wang et al. (2017)</xref>].</p>
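The weighted Hamming distance used for matching can be sketched as below; the source does not detail the weighting scheme, so the weights here are purely illustrative.

```python
import numpy as np

def weighted_hamming(a, b, weights):
    """Weighted Hamming distance between two binary fingerprints:
    each differing bit position contributes its own weight (sketch;
    the actual weighting scheme is not detailed in the source)."""
    a, b, weights = map(np.asarray, (a, b, weights))
    return float(np.sum(weights * (a != b)))

# Bits 1 and 3 differ, so the distance is 0.4 + 0.4.
d = weighted_hamming([1, 0, 1, 1], [1, 1, 1, 0], [0.1, 0.4, 0.1, 0.4])
```

Weighting lets reliable bit positions count more than noisy ones, while identical fingerprints still yield a distance of zero.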
<p>The method presented by <xref ref-type="bibr" rid="B134">Zhixiang et al. (2018)</xref> proposes a nonlinear structural video hashing approach to retrieve videos in large datasets thanks to binary representations. To this purpose, a multi-layer neural network is designed to generate a compact <italic>L</italic>-bit binary representation for each frame of the video. To optimize the matching process, a subspace grouping method is applied to each video, thus decomposing the nonlinear representation into a set of linear subspaces. To compute the distance between 2 video clips, the distances between the underlying subspaces are integrated, where the Hamming distance is used to compute the distance between a pair of subspaces. The CCV <xref ref-type="bibr" rid="B47">Jiang et al., 2011</xref>, YLI-MED <xref ref-type="bibr" rid="B8">Bend, 2015</xref> and ActivityNet <xref ref-type="bibr" rid="B36">Heilbron et al., 2015</xref> datasets are selected to test the performance of the algorithm, which is benchmarked against DeepH (Deep Hashing) <xref ref-type="bibr" rid="B73">Liong et al., 2015</xref>, SDH (Supervised Discrete Hashing) <xref ref-type="bibr" rid="B99">Shen et al., 2015</xref>, and KSH (Kernel-Based Supervised Hashing) <xref ref-type="bibr" rid="B77">Liu et al., 2012</xref>. The experimental results show that the advanced method outperforms the state-of-the-art solutions as the code length increases.</p>
<p>An unsupervised learning video hashing technique is advanced in <xref ref-type="bibr" rid="B79">Ma et al. (2018)</xref>. The first step is to extract the spatial features of the video frames using AlexNet <xref ref-type="bibr" rid="B58">Krizhevsky et al., 2012</xref>. The output of the CNN is fed to a single-layer LSTM network. Next, a time series pooling is applied. This step combines all the frame-level features to form a single video-level feature. Finally, an unsupervised hashing network extracts a compact binary representation of the video. To test its effectiveness, the UCF-101 <xref ref-type="bibr" rid="B106">Soomro et al., 2012</xref> dataset and 100&#xa0;h worth of videos downloaded from YouTube are used as datasets. A few unsupervised hashing networks were evaluated, and the ITQ-ST (Iterative Quantizing&#x2014;Spatio-Temporal) and BA-ST (Binary Autoencoder&#x2014;Spatio-Temporal) <xref ref-type="bibr" rid="B10">Carreira-Perpin&#xe1;n and Raziperchikolaei, 2015</xref> methods represented the videos best, resulting in <inline-formula id="inf22">
<mml:math id="m24">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2265;</mml:mo>
<mml:mn>0.65</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
<p>The joint use of CNN (ResNet <xref ref-type="bibr" rid="B34">He et al., 2016</xref>) and RNN (SiameseLSTM) is studied in <xref ref-type="bibr" rid="B128">Yaocong and Xiaobo (2018)</xref>. The selected CNN is ResNet50, which takes 224 &#xd7; 224 RGB frames as input and outputs a 2048-dimension vector per frame. The RNN achieves the spatio-temporal fusion and the sequence matching. To further optimize the spatio-temporal feature extraction, positive pairs (similar video content) and negative pairs (dissimilar video content) are fed to the SiameseLSTM. The resulting feature vectors are considered as the video fingerprint. For the matching process, a graph-based temporal network <xref ref-type="bibr" rid="B111">Tan et al., 2009</xref> is used. For training, the CC_WEB_VIDEO <xref ref-type="bibr" rid="B123">Wu et al., 2009</xref> dataset is used, and the video clips are normalized to 20 frames. For evaluation, the VCDB dataset <xref ref-type="bibr" rid="B45">Jiang and Wang, 2016</xref> is used. The method yields <italic>Prec</italic> &#x3d; 90% and <italic>Rec</italic> &#x3d; 58%, which is slightly better than the solutions advanced in <xref ref-type="bibr" rid="B118">Wang et al. (2017)</xref> and <xref ref-type="bibr" rid="B45">Jiang and Wang (2016)</xref>.</p>
<p>The challenge of retrieving the top-k video clips from a single frame is taken up in <xref ref-type="bibr" rid="B130">Zhang et al. (2019)</xref>, where visual features are extracted by using CNN and BoVW. The first step is to extract representative frames at fixed-time intervals and to resize them to 256 &#xd7; 256 pixels. The second step is to feed all those images to a CNN feature extractor, implemented by the AlexNet architecture <xref ref-type="bibr" rid="B58">Krizhevsky et al., 2012</xref>. For each frame, a 4096-dimension feature vector is generated. This vector is the input for the BoVW module, which aims to create a visual dictionary for the reference video dataset via a feature matrix. The extraction of visual words from visual features is done via the K-means clustering method. To optimize the retrieval time, frame pre-clustering is done, also based on K-means. A VWII (Visual Word Inverted Index) is deployed to improve the search efficiency. The performance of the algorithm is benchmarked against the SIFT <xref ref-type="bibr" rid="B131">Zhao et al., 2010</xref> and BF-PI <xref ref-type="bibr" rid="B15">de Ara&#xfa;jo and Girod, 2018</xref> methods, on two datasets, namely Youtube-8M <xref ref-type="bibr" rid="B1">Abu-El-Haija et al., 2016</xref> and Sports-1M <xref ref-type="bibr" rid="B51">Karpathy et al., 2014</xref>. The experimental results consider 4 criteria, namely the precision as a function of the dataset size, the precision as a function of the number of visual words, the execution time as a function of the number k of retrieved results, and the execution time as a function of the dataset size.</p>
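The two retrieval building blocks, quantizing features to their nearest K-means centroid (the "visual word") and indexing videos in a VWII-style structure, can be sketched as follows; centroids and feature values are toy data, and the real system operates on 4096-dimension CNN vectors.

```python
import numpy as np

def assign_visual_words(features, centroids):
    """Quantize feature vectors to their nearest K-means centroid index
    (the 'visual word'); a minimal sketch of the BoVW step."""
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def inverted_index(video_words):
    """VWII-style structure (sketch): visual word -> list of video ids,
    so a query word directly yields its candidate videos."""
    index = {}
    for vid, words in enumerate(video_words):
        for w in set(words):
            index.setdefault(w, []).append(vid)
    return index

centroids = np.array([[0.0, 0.0], [1.0, 1.0]])   # toy 2-word dictionary
words_v0 = assign_visual_words(np.array([[0.1, 0.0], [0.9, 1.1]]), centroids)
idx = inverted_index([words_v0.tolist()])
```

At query time, only the videos listed under the query frame's visual words need to be scored, which is what makes the inverted index fast at scale.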
<p>
<xref ref-type="bibr" rid="B21">Duan et al., 2019</xref> presents an overview of CDVA (Compact Descriptors for Video Analysis), the standard promoted by ISO/IEC JTC 1 SC 29, a.k.a. MPEG. The CDVA framework is specified incrementally with respect to MPEG-CDVS. To extract the video features, the key frames and the inter feature prediction are determined before being fed to a deep learning model based on CNN. The proposed CNN model is derived from the NIP feature descriptors, which add robustness to the system. To make the system lightweight and multiplatform, the NN is compressed by using the Lloyd-Max algorithm. To reduce the video retrieval time, the output of the CNN is binarized via a one-bit scalar quantizer. A Hamming distance is used for the fingerprint matching. For testing, a dataset with source and attacked clips is created, gathering 4,693 matching pairs as well as 46,930 non-matching pairs. By coupling the deep learning extracted features with the handcrafted features proposed in CDVS, the system gains further precision.</p>
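The one-bit scalar quantization followed by Hamming matching can be sketched as below; the zero threshold is an illustrative assumption, not a value specified by the standard.

```python
import numpy as np

def one_bit_quantize(descriptor, threshold=0.0):
    """One-bit scalar quantizer: each real-valued CNN output component
    becomes a single bit (threshold illustrative), so descriptor
    matching reduces to a Hamming distance on the resulting codes."""
    return (np.asarray(descriptor) > threshold).astype(np.uint8)

def hamming(a, b):
    """Hamming distance between the binarized versions of two descriptors."""
    return int(np.count_nonzero(one_bit_quantize(a) != one_bit_quantize(b)))

dist = hamming([0.3, -0.2, 1.1], [0.4, 0.5, -0.9])
```

Binarization trades some descriptor precision for codes that can be compared with fast bitwise operations, which is the point of the retrieval-time optimization.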
<p>The study in <xref ref-type="bibr" rid="B136">Zhou et al. (2019)</xref> presents a video copy detection method establishing synergies between CNN and conventional computer vision tools. The first step consists in dividing the video into equal-length video sequences, from which frames are sampled with a fixed period, thus allowing the computation of the TIRI for each sequence. The second step consists in extracting the spatial features using a pre-trained AlexNet <xref ref-type="bibr" rid="B58">Krizhevsky et al., 2012</xref> model, followed by a sum-pooling layer to reduce the matrix dimension. The model takes as input the TIRI and outputs a 256-dimension vector. The third step extracts the temporal features. In this respect, it starts by feeding all the video sequence frames to the AlexNet and follows by averaging all the frame matrices and by computing their centroids. Two matrices representing the distance in cylindrical coordinates (distance and angle) between the centroids are subsequently computed. The fourth step first creates a BoVW by clustering the extracted spatial features through a K-means algorithm and then structures the BoVW in an inverted index file. During the copy detection step, for each query-reference pair, three individual distances are computed: between spatial representations, between temporal distance representations, and between temporal angle representations. These three distances are fused to compute a decision score that is compared to a pre-defined threshold, thus ascertaining whether the query is a copy version or not. Evaluated under the TRECVID 2008 framework, the method achieves <italic>mAP</italic> &#x3d; 0.65.</p>
<p>A supervised stacked HetConv-MK <xref ref-type="bibr" rid="B102">Singh et al., 2019</xref> and BiLSTM hashing model is designed in <xref ref-type="bibr" rid="B4">Anuranji and Srimathi (2020)</xref>. The model integrates two main blocks devoted to spatial and temporal feature extraction, respectively. First, the convolutional block computes the spatial features by passing the frames through a stacked convolutional filter and a max-pooling layer. Secondly, the BiLSTM model processes the stream forward and backward. Finally, a fully connected layer generates a binary fingerprint that integrates the output of the previous units. The experimental results are obtained by processing 3 datasets: CCV <xref ref-type="bibr" rid="B47">Jiang et al., 2011</xref>, ActivityNet <xref ref-type="bibr" rid="B36">Heilbron et al., 2015</xref>, and HMDB (Human Motion DataBase) <xref ref-type="bibr" rid="B59">Kuehne et al., 2011</xref>, with a total of almost 30,000 clips. To determine the effectiveness of the algorithm, Hamming ranking and Hamming lookup are used in conjunction with <italic>mAP</italic> and <italic>Prec</italic>. The advanced method is compared to existing methods such as SDH (Supervised Discrete Hashing) <xref ref-type="bibr" rid="B99">Shen et al., 2015</xref>, supervised deep learning <xref ref-type="bibr" rid="B73">Liong et al., 2015</xref>, Deep Hashing <xref ref-type="bibr" rid="B73">Liong et al., 2015</xref>, and ITQ (Iterative Quantization) <xref ref-type="bibr" rid="B29">Gong et al., 2013</xref>. The results show an improvement in accuracy with large-scale datasets.</p>
<p>A video hashing framework, referred to as CEDH (Classification-Enhancement Deep Hashing), is conceived in <xref ref-type="bibr" rid="B87">Nie et al. (2021)</xref>. CEDH is a deep learning model composed of 3 main layers. First, a VGGNet-19 <xref ref-type="bibr" rid="B101">Simonyan and Zisserman, 2014</xref> layer extracts frame-level features. Then, an LSTM <xref ref-type="bibr" rid="B37">Hochreiter and Schmidhuber, 1997</xref> network is adopted to capture temporal features. Finally, a classification module is implemented to enhance the label information. To train the model, each layer is assigned a loss term matched to its peculiarities: triplet loss, classification loss, and code constraint terms, respectively. To evaluate its performance, 3 video datasets are processed: the FCVID (Fudan-Columbia VIDeo) dataset <xref ref-type="bibr" rid="B46">Jiang et al., 2018</xref>, HMDB <xref ref-type="bibr" rid="B59">Kuehne et al., 2011</xref>, and UCF-101 <xref ref-type="bibr" rid="B106">Soomro et al., 2012</xref>, resulting in a total of 7,070 video clips for training and 3,030 clips for testing. The CEDH is benchmarked against 8 state-of-the-art solutions, namely: locality sensitive hashing <xref ref-type="bibr" rid="B14">Datar et al., 2004</xref>, PCA hashing <xref ref-type="bibr" rid="B117">Wang et al., 2010</xref>, iterative quantization <xref ref-type="bibr" rid="B29">Gong et al., 2013</xref>, spectral hashing <xref ref-type="bibr" rid="B121">Weiss et al., 2009</xref>, density sensitive hashing <xref ref-type="bibr" rid="B49">Jin et al., 2014</xref>, shift-invariant kernel local sensitive hashing <xref ref-type="bibr" rid="B92">Raginsky and Lazebnik, 2009</xref>, self-supervised video hashing <xref ref-type="bibr" rid="B105">Song et al., 2018</xref>, and deep video hashing <xref ref-type="bibr" rid="B72">Liong et al., 2017</xref>. The evaluation criteria are <italic>mAP</italic>, <italic>Prec</italic> and <italic>Rec</italic>.</p>
<p>A hybrid method combining deep learning and hashing techniques for video fingerprinting is presented in <xref ref-type="bibr" rid="B125">Xinwei et al. (2021)</xref>. The method is based on a quadruplet fully connected CNN, centered around 4 <italic>3D ResNet-50</italic> networks that extract spatio-temporal features. The input is composed of 4 videos: the source clip, a copy of the clip (a modified version extracted from the original), and 2 clips that are not related to the original clip. The output consists of 2 elements: a 2048-dimension vector and a 16-bit binary code. For training and testing, three public datasets are considered: UCF-101 <xref ref-type="bibr" rid="B106">Soomro et al., 2012</xref>, HMDB <xref ref-type="bibr" rid="B59">Kuehne et al., 2011</xref> and FCVID <xref ref-type="bibr" rid="B46">Jiang et al., 2018</xref>. A normalization process of the 4,986 videos takes place before the training, where each video is downsized to <italic>320&#xd7;240</italic> and only the first 100 frames of each clip are used to identify the video. The proposed method is mainly compared to a similar deep learning method, called NL_Triplet, that shares global architectural similarities. The two methods exhibit similar performance and behavior in the various benchmarking setups.</p>
<p>The study in <xref ref-type="bibr" rid="B68">Li et al. (2021)</xref> presents a fingerprinting method that takes advantage of the capabilities of the CapsNet <xref ref-type="bibr" rid="B95">Sabour et al., 2017</xref> to model the relationships among compressed features. The architecture of the convolution layers is composed of two 3D-convolution modules extracting spatio-temporal features, followed by an average pooling module along the temporal dimension and finally by a 2D-convolution module. The role of the primary capsule layer is convolution computation and dimension transformation, while the advanced capsule is composed of 32 neurons and is responsible for matrix transformations and dynamic routing <xref ref-type="bibr" rid="B95">Sabour et al., 2017</xref>. The output of this architecture is a 32-dimension fingerprint. A triplet network is designed for the matching. During the training, the matching network requires three inputs: an anchor sample (the original video), a positive sample (a copied/modified version of the original video), and a negative sample (a non-related video). The dataset is composed of 4,000 videos randomly sampled from FCVID <xref ref-type="bibr" rid="B46">Jiang et al., 2018</xref>, TRECVID, and YouTube. The ROC and F1 scores are considered as evaluation criteria when comparing the proposed method to <italic>DML</italic> <xref ref-type="bibr" rid="B57">Kordopatis-Zilos et al., 2017b</xref>, <italic>CNN &#x2b; LSTM</italic> <xref ref-type="bibr" rid="B128">Yaocong and Xiaobo, 2018</xref>, and <italic>TIRI</italic> <xref ref-type="bibr" rid="B11">Coskun et al., 2006</xref>. The proposed method achieves <italic>F1</italic> &#x3d; 0.99, compared to <italic>F1</italic> &#x3d; 0.97 for DML, <italic>F1</italic> &#x3d; 0.94 for CNN &#x2b; LSTM, and <italic>F1</italic> &#x3d; 0.825 for TIRI.</p>
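The triplet training objective used by such matching networks pulls the anchor fingerprint towards the positive (copy) and pushes it away from the negative (unrelated video) by at least a margin. A minimal numeric sketch (the squared Euclidean distance and the margin value are illustrative assumptions, not the cited authors' exact choices):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss on three fingerprint vectors: zero once the negative
    is farther from the anchor than the positive by at least `margin`."""
    d_pos = float(np.sum((anchor - positive) ** 2))  # anchor-positive distance
    d_neg = float(np.sum((anchor - negative) ** 2))  # anchor-negative distance
    return max(0.0, d_pos - d_neg + margin)          # hinge on the gap
```

Once the loss reaches zero for a triplet, that triplet no longer contributes a gradient, so training focuses on the pairs that are still confusable.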
</sec>
<sec id="s4-2-3">
<title>4.2.3 Discussion</title>
<p>A global retrospective view on the investigated NN-based methods is presented in <xref ref-type="fig" rid="F8">Figure 8</xref>, which mirrors the design of <xref ref-type="fig" rid="F6">Figure 6</xref>. It starts in 2016 and presents, for each analyzed year, the key conceptual ideas (the dark-blue, left block) as well as the methodological enablers in fingerprinting (the blue, right block). Note that in this case the fingerprint extraction and matching are merged (as they are tightly coupled).</p>
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption>
<p>Evolution of the NN methods.</p>
</caption>
<graphic xlink:href="frsip-02-984169-g008.tif"/>
</fig>
<p>The previous section brings to light that NN-based fingerprinting is still an emerging research field. It inherits its methodological framework from conventional fingerprinting, while updating both the fingerprint extraction and matching.</p>
<p>Since 2016, fingerprint extraction has gradually shifted from considering NN solutions at an individual feature level (e.g., spatial or temporal features) to holistic 3D Nets able to simultaneously capture integrated spatio-temporal features. Intermediate solutions, combining NNs and conventional image processing tools (e.g., SURF, TIRI, or BoVW), are also encountered. Fingerprint matching generally evolves in step with fingerprint extraction.</p>
<p>The experimental testbed principles are also inherited from the case of conventional methods. Yet, the datasets differ in their size as well as in the fact that the experimenter generally creates the attacked versions of the video content (cf. the last 5 lines in <xref ref-type="table" rid="T1">Table 1</xref>). The evaluation criteria generally cover <italic>Prec</italic>, <italic>Rec</italic>, <italic>F1</italic> and <italic>mAP</italic>. This variety in experimental conditions makes an objective performance comparison impossible.</p>
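As a point of reference, the <italic>Prec</italic>, <italic>Rec</italic>, and <italic>F1</italic> criteria invoked throughout this section reduce, for a single query, to simple set operations over the retrieved and relevant items. A minimal illustration:

```python
def precision_recall_f1(retrieved, relevant):
    """Prec, Rec and F1 for one query: `retrieved` is the set of items
    the system returned, `relevant` the set of true copies in the dataset."""
    tp = len(retrieved & relevant)                       # true positives
    prec = tp / len(retrieved) if retrieved else 0.0     # fraction of hits that are correct
    rec = tp / len(relevant) if relevant else 0.0        # fraction of copies found
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0  # harmonic mean
    return prec, rec, f1
```

<italic>mAP</italic> then averages, over all queries, the precision computed at each rank where a relevant item appears, which is why it is sensitive to the ordering of the retrieved list and not only to its content.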
<p>
<xref ref-type="fig" rid="F8">Figure 8</xref> and <xref ref-type="sec" rid="s4-2-2">Section 4.2.2</xref> demonstrate that the exploratory work of using NNs in conjunction with conventional tools can be considered successful and that the way towards effective NN-only solutions is open <xref ref-type="bibr" rid="B68">Li et al., 2021</xref>.</p>
<p>However, when comparing current-day conventional and NN-based solutions, the quantitative results seem unbalanced in favor of conventional methods. Yet, quick conclusions should be avoided, as the datasets are of significantly different sizes and the task complexity differs accordingly. The generic evaluation criteria introduced in <xref ref-type="sec" rid="s3">Section 3</xref> are seldom jointly evaluated, with each study focusing on a specific metric and/or a pair of metrics. Moreover, note that computational complexity is seldom discussed as a true evaluation criterion, thus making a sharp decision even more complicated.</p>
</sec>
</sec>
</sec>
<sec id="s5">
<title>5 Challenges and perspectives</title>
<p>Fingerprint challenges and trends are structured according to the constraints set by current-day video production and distribution, and to the new applicative fields in which fingerprinting can help, as discussed in <xref ref-type="sec" rid="s5-1">Sections 5.1</xref> and <xref ref-type="sec" rid="s5-2">5.2</xref>, respectively.</p>
<sec id="s5-1">
<title>5.1 Stronger constraints on video fingerprinting properties</title>
<p>While neither exhaustive nor fully detailed, <xref ref-type="sec" rid="s4">Section 4</xref> is meant to shed light on the very complex, fragmented, yet well-structured landscape of video fingerprinting methods, as illustrated in <xref ref-type="fig" rid="F5">Figures 5</xref>&#x2013;<xref ref-type="fig" rid="F8">8</xref>.</p>
<p>Despite clear incremental progress, achieving the ultimate method for generic video content (TV/movies/social media) fingerprinting is still an open research topic that will continuously be faced with new challenges in terms of: 1) video content size and typology, 2) complexity of near-duplicated copies, 3) compressed stream extraction, and 4) energy consumption reduction.</p>
<p>First, the size of video content is expected to continuously increase. Social media, personalized video content, and business-oriented video content (e.g., videoconferencing) are expected to lead soon to an average of 38&#xa0;h a week of video consumption per person in the US <xref ref-type="bibr" rid="B16">Deloitte, 2022</xref>. Such a quantity of content is expected to be processed, stored, and retrieved without impairing the user experience; hence, new challenges in reducing the complexity of fingerprint matching are expected to be set.</p>
<p>Secondly, the image/video software editing solutions as well as professional video transmission technologies (such as broadcasting, encoding, or publishing) will increase the number, variety, and complexity of the near-duplicated copies to be dealt with. As, for the time being, these near-duplicated copies are considered one by one, with no attempt to exploit statistical models that would represent them in a unified way, this trend is expected to increase the constraints on fingerprinting robustness.</p>
<p>Thirdly, although video content is mainly generated in compressed format, only a few partial results related to fingerprint extraction directly from the stream syntax elements are reported <xref ref-type="bibr" rid="B83">Ngo et al., 2005</xref>, <xref ref-type="bibr" rid="B66">Li and Vishal, 2013</xref>, <xref ref-type="bibr" rid="B93">Ren et al., 2016</xref>, <xref ref-type="bibr" rid="B97">Schuster et al., 2017</xref>. This contrasts sharply with related applicative fields, like indexing and watermarking, where more advanced results have already been obtained <xref ref-type="bibr" rid="B80">Manerba et al., 2008</xref>; <xref ref-type="bibr" rid="B9">Benois-Pineau, 2010</xref>, <xref ref-type="bibr" rid="B33">Hasnaoui and Mitrea, 2014</xref>.</p>
<p>Finally, video fingerprinting is also expected to take up the challenge of reducing the computational complexity, following a green computing trend in video processing <xref ref-type="bibr" rid="B22">Ejembi and Bhatti, 2015</xref>, <xref ref-type="bibr" rid="B25">Fernandes et al., 2015</xref>, <xref ref-type="bibr" rid="B53">Katayama et al., 2016</xref>. This working direction is expected to be coupled with the previous one, namely designing green compressed video fingerprinting solutions.</p>
<p>With respect to the above-mentioned four items, short-term research efforts are expected to address several incremental aspects, from both methodological and applicative standpoints. The former encompasses aspects such as the explainability of NN-based results, the relationship between semantics, content, and the human visual system, and the questionable possibility of modeling the modifications induced in near-duplicated content. The latter is expected to investigate the actual applicative utility of conventional performance criteria, the balancing of computational complexity between extraction and detection in the context of NN-based methods and massive datasets, the possibility of identifying a unique structure or a set of structures per performance criterion to be optimized, etc.</p>
<p>As a final remark, note that no convergence towards a theoretical model able to accommodate the current-day efforts can be identified and, in this respect, information theory, statistics and/or signal/image processing are expected to still be at stake during long-term research efforts. Such a theoretical model is expected to have different beneficial effects, from allowing a comparison among existing methods to be carried out with rigor to identifying the tools for answering the applicative expectancies and/or the theoretical bounds.</p>
</sec>
<sec id="s5-2">
<title>5.2 Emerging applicative domains</title>
<p>Fingerprinting benefits are likely to become appealing for several new applicative domains, such as fake news identification and tracking, unmanned vehicle video processing, metaverse content tracking, or medical imaging, to mention but a few. This extension raises new challenges not only in terms of applicative integration between fingerprinting and other technologies but also in terms of content type and composition.</p>
<p>In the sequel, we shall detail the cases of fingerprinting for visual fake news and for the video captured by unmanned vehicles.</p>
<sec id="s5-2-1">
<title>5.2.1 Visual fake news</title>
<p>While the concept of <italic>fake news</italic> still does not have a sharp and consensual definition <xref ref-type="bibr" rid="B52">Katarya and Massoudi, 2020</xref>, it can be considered that, in the video context, it relates to the malicious creation of new video content whose semantics are not genuine and/or whose interpretation leads to false conclusions. Fake news creation generally starts from some original video content that is subsequently edited. Hence, such a problem is multifold and various types of solutions can be envisaged: detecting whether a content has been modified or not, detecting the original content that has been manipulated, detecting the last authorized modification of the original content, etc. Consequently, various video processing paradigms can contribute (individually and/or combined) to elucidating some of these aspects <xref ref-type="bibr" rid="B60">Lago et al., 2018</xref>, <xref ref-type="bibr" rid="B135">Zhou et al., 2020</xref>, <xref ref-type="bibr" rid="B2">Agrawal and Sharma, 2021</xref>, <xref ref-type="bibr" rid="B17">Devi et al., 2021</xref>.</p>
<p>For instance, video forensics is generally considered a tool to identify content modification, solely based on the analyzed content. For its part, watermarking provides effective solutions for identifying video content modifications and/or the last authorized user, but requires the possibility of modifying the original content prior to its distribution.</p>
<p>Video fingerprinting affords the detection of the original content that has been manipulated to create the fake news content. <xref ref-type="fig" rid="F9">Figure 9</xref> illustrates the case<xref ref-type="fn" rid="fn3">
<sup>3</sup>
</xref> where two video contents, from two different repositories, are combined to create a fake content. In this respect, the challenge of designing fingerprinting methods robust to content cropping is expected to be taken up soon. This example shows that fingerprinting is complementary to forensics. With respect to watermarking, fingerprinting has as its main advantage its passive behavior (it does not require the original content to be modified).</p>
<fig id="F9" position="float">
<label>FIGURE 9</label>
<caption>
<p>Fake video content can be tracked to its original sources thanks to fingerprinting.</p>
</caption>
<graphic xlink:href="frsip-02-984169-g009.tif"/>
</fig>
<p>Moreover, video fingerprinting is still expected to be complemented with security mechanisms, and blockchain (also referred to as Distributed Ledger Technologies&#x2014;DLT) seems very promising in this respect.</p>
<p>In a nutshell, a blockchain is a distributed information storage technology, ensuring trust in the tracking and authentication of the binary data exchanged in a decentralized, peer-to-peer network: even the smallest (1-bit) modification in a message can be identified. As such a bit-sensitivity property is incompatible with digital document tracking (where multiple digital representations can be associated with the same semantics), the blockchain principles should be coupled with visual fingerprints that ensure robustness to modifications <xref ref-type="bibr" rid="B3">Allouche et al., 2021</xref>.</p>
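The bit-sensitivity property invoked above can be observed directly with a standard cryptographic hash. A small illustration using Python's hashlib (the payload bytes are, of course, arbitrary):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex digest of the SHA256 hash, as natively used by many DLTs."""
    return hashlib.sha256(data).hexdigest()

# Flipping a single bit of the message yields a completely different
# digest (avalanche effect), which is why a cryptographic hash alone
# cannot track near-duplicated video content.
original = b"video payload"
tampered = bytes([original[0] ^ 0x01]) + original[1:]  # flip 1 bit
```

Two perceptually identical videos (e.g., the same clip re-encoded) would likewise produce unrelated digests, whereas a robust fingerprint would map them to nearby codes; hence the coupling advocated in the text.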
<p>From a methodological standpoint, various potential solutions can be conceived, according to the targeted applicative trade-off. In this respect, the na&#xef;ve solution would be to replace the DLT native hash function (<italic>e.g.</italic>, SHA256) with the fingerprint extraction; yet, such a solution is likely to induce pitfalls in the system security. Alternatively, the specification of management layers over existing DLT solutions is also possible: while such an approach would not impact the DLT security, it is likely to drastically increase the system complexity (smart contract definition and deployment, complex operation execution, on-chain/off-chain load balancing). Intermediate solutions can also be conceived.</p>
</sec>
<sec id="s5-2-2">
<title>5.2.2 Unmanned vehicles</title>
<p>Drones, robots, and autonomous cars are steadily increasing their applicative scope, thus raising new challenges in a large variety of research fields, including video processing. For instance, large video repositories with data produced by unmanned vehicles are expected to be organized soon, serving different applicative scopes: a posteriori analysis/disambiguation in case accidents occur, real-time assistance in case of partial failures (e.g., on-board cameras partially out of order), distributed cloud-to-edge computing, etc. <xref ref-type="fig" rid="F10">Figure 10</xref> illustrates the case in which two out of the three cameras available on a delivery drone are out of order. As the delivery trajectory can never be 100% reproducible, video fingerprinting can be an effective tool for searching for near-duplicated video content in the archive, starting from the camera video stream still in use.</p>
<fig id="F10" position="float">
<label>FIGURE 10</label>
<caption>
<p>Usage of archive video for drone navigation in case of camera failure.</p>
</caption>
<graphic xlink:href="frsip-02-984169-g010.tif"/>
</fig>
<p>In this respect, new challenges related to the video content type itself, to its composition, as well as to its security (content integrity) are expected to be taken up.</p>
<p>First, the video captured by unmanned vehicles is no longer expected to be optimized for the peculiarities of the human visual system and should instead answer a new set of requirements allowing for better and safer navigation. In a context in which the very concept of just noticeable difference should be extended <xref ref-type="bibr" rid="B48">Jin et al., 2022</xref>, new types of local/global features, matched to the navigation task specificities, are expected to be designed and evaluated.</p>
<p>Secondly, multiple cameras are generally positioned at fixed locations on an unmanned vehicle, thus producing a set of video streams (<italic>e.g.</italic>, 6 to 12 streams according to the type of vehicle). As these video streams are spatially correlated and aligned in time, they are expected to permit the development of new fingerprinting approaches based on global features, rather than on features extracted at the level of each stream. The main difficulty relates to the fact that the unmanned vehicle video streams are neither independent nor compliant with the multi-view paradigm.</p>
<p>Finally, the content integrity issue can be dealt with by considering DLT or watermarking solutions.</p>
</sec>
</sec>
</sec>
<sec id="s6">
<title>6 Conclusion</title>
<p>The present paper provides a generic view on video fingerprinting: conceptual basis, evaluation framework, and methodological approaches are first studied.</p>
<p>They show that the fingerprint landscape is complex yet well structured around the main steps in the fingerprinting workflow: pre-processing of the video sequence, extraction of spatio-temporal information, aggregation of basic features into various derived representations, and matching. While this generic framework was set some 20&#xa0;years ago, the advent of NNs positioned itself as a precious enabler in applicative-oriented optimizations. Moreover, both conventional and NN solutions can be integrated into global fingerprinting solutions that are able today to process datasets larger than 350,000&#xa0;h of video while featuring <italic>Prec</italic> and <italic>Rec</italic> values larger than 0.9. This opens the door for effective solutions based on 3D Nets able to simultaneously capture integrated spatio-temporal features.</p>
<p>Moreover, fingerprinting is still an open research topic. From both methodological and applicative standpoints, it is expected to encompass aspects such as the explainability of NN-based results, the relationship between semantics, content, and the human visual system, or the questionable possibility of modeling the modifications induced in near-duplicated content. Extracting the fingerprint directly from the compressed stream syntax elements and synergies with green encoding approaches are also to be dealt with in the near future.</p>
<p>New challenges in terms of applicative integration between fingerprinting and other technologies, as well as in terms of content type and composition, will be raised by emerging trends in video production and distribution, such as fake news content tracking, unmanned vehicle video processing, or metaverse content tracking.</p>
</sec>
</body>
<back>
<sec id="s7">
<title>Author contributions</title>
<p>MA collected the large majority of the references and is the main contributor to <xref ref-type="sec" rid="s4">Section 4</xref>. MM is the main contributor to <xref ref-type="sec" rid="s2">Sections 2</xref>, <xref ref-type="sec" rid="s3">3</xref>, and <xref ref-type="sec" rid="s5">5</xref>. Both MA and MM contributed to manuscript writing (all sections) and revision, and read and approved the submitted version.</p>
</sec>
<sec sec-type="COI-statement" id="s8">
<title>Conflict of interest</title>
<p>MA is financed by Vidmizer.</p>
<p>The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<fn-group>
<fn id="fn1">
<label>1</label>
<p>This term may be prone to confusion as <italic>robust video hashing</italic> generally denotes a related yet different research field, devoted to forensics applications <xref ref-type="bibr" rid="B26">Fridrich and Goljan, 2000</xref>, <xref ref-type="bibr" rid="B133">Zhao et al., 2013</xref>, <xref ref-type="bibr" rid="B91">Ouyang et al., 2015</xref>.</p>
</fn>
<fn id="fn2">
<label>2</label>
<p>For a specific year, the information presented in <xref ref-type="fig" rid="F6">Figure 6</xref> may correspond to several references.</p>
</fn>
<fn id="fn3">
<label>3</label>
<p>This example is not based on any real situation.</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Abu-El-Haija</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Kothari</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Natsev</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Toderici</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Varadarajan</surname>
<given-names>B.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). <source>YouTube-8M: A large-scale video classification benchmark</source>. <comment>ArXiv, abs/1609.08675</comment>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=YouTube-8M:+A+large-scale+video+classification+benchmark&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B2">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Agrawal</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Sharma</surname>
<given-names>D. K.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>A survey on video-based fake news detection techniques</article-title>,&#x201d; in <source>8th international conference on computing for sustainable global development (INDIACom)</source>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=A+survey+on+video-based+fake+news+detection+techniques&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Allouche</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Frikha</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Mitrea</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Memmi</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Chaabane</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Lightweight blockchain processing. Case study: Scanned document tracking on tezos blockchain</article-title>. <source>Appl. Sci. (Basel).</source> <volume>11</volume>, <fpage>7169</fpage>. <pub-id pub-id-type="doi">10.3390/app11157169</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3390/app11157169">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Lightweight+blockchain+processing.+Case+study:+Scanned+document+tracking+on+tezos+blockchain&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Anuranji</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Srimathi</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>A supervised deep convolutional based bidirectional long short term memory video hashing for large scale video retrieval applications</article-title>. <source>Digit. Signal Process.</source> <volume>102</volume>, <fpage>102729</fpage>. <pub-id pub-id-type="doi">10.1016/j.dsp.2020.102729</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.dsp.2020.102729">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=A+supervised+deep+convolutional+based+bidirectional+long+short+term+memory+video+hashing+for+large+scale+video+retrieval+applications&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Baris</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Bulent</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Nasir</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2006</year>). <article-title>Spatio-temporal transform based video hashing</article-title>. <source>IEEE Trans. Multimed.</source> <volume>8</volume> (<issue>6</issue>), <fpage>1190</fpage>&#x2013;<lpage>1208</lpage>. <pub-id pub-id-type="doi">10.1109/tmm.2006.884614</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tmm.2006.884614">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Spatio-temporal+transform+based+video+hashing&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Basharat</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Zhai</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>Content based video matching using spatiotemporal volumes</article-title>. <source>Comput. Vis. Image Underst.</source> <volume>110</volume> (<issue>3</issue>), <fpage>360</fpage>&#x2013;<lpage>377</lpage>. <pub-id pub-id-type="doi">10.1016/j.cviu.2007.09.016</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.cviu.2007.09.016">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Content+based+video+matching+using+spatiotemporal+volumes&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bay</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Tuytelaars</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Gool</surname>
<given-names>L. V.</given-names>
</name>
<name>
<surname>Van Gool</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>Surf: speeded up robust features</article-title>. <source>Comput. Vis. Image Underst.</source> <volume>110</volume> (<issue>3</issue>), <fpage>346</fpage>&#x2013;<lpage>359</lpage>. <pub-id pub-id-type="doi">10.1016/j.cviu.2007.09.014</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.cviu.2007.09.014">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Surf:+speeded+up+robust+features&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bend</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>The YLI-MED corpus: Characteristics, procedures, and plans</article-title>. <source>Comput. Res. Repos. ICSI Tech. Rep. TR-15-001</source>, <fpage>1</fpage>&#x2013;<lpage>46</lpage>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=The+YLI-MED+corpus:+Characteristics,+procedures,+and+plans&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B9">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Benois-Pineau</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2010</year>). &#x201c;<article-title>Indexing of compressed video: Methods, challenges, applications</article-title>,&#x201d; in <source>International conference on image processing theory</source>, <fpage>3</fpage>&#x2013;<lpage>4</lpage>. <comment>Tools and Applications</comment>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/ipta.2010.5586830">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Indexing+of+compressed+video:+Methods,+challenges,+applications&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B10">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Carreira-Perpin&#xe1;n</surname>
<given-names>M. A.</given-names>
</name>
<name>
<surname>Raziperchikolaei</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Hashing with binary autoencoders</article-title>,&#x201d; in <source>Proc. IEEE conf. Comput. Vis. Pattern recog.</source>, <fpage>557</fpage>&#x2013;<lpage>566</lpage>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Hashing+with+binary+autoencoders&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Coskun</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Sankur</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Memon</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2006</year>). <article-title>Spatio-temporal transform based video hashing</article-title>. <source>IEEE Trans. Multimed.</source> <volume>8</volume> (<issue>6</issue>), <fpage>1190</fpage>&#x2013;<lpage>1208</lpage>. <pub-id pub-id-type="doi">10.1109/tmm.2006.884614</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tmm.2006.884614">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Spatio-temporal+transform+based+video+hashing&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B12">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Coudert</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Benois-Pineau</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Le Lann</surname>
<given-names>P.-Y.</given-names>
</name>
<name>
<surname>Barba</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>1999</year>). &#x201c;<article-title>Binkey: a system for video content analysis on the fly</article-title>,&#x201d; in <source>Proceedings IEEE international conference on multimedia computing and systems</source>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Binkey:+a+system+for+video+content+analysis+on+the+fly&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B13">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Cox</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Bloom</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Fridrich</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Kalker</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2007</year>). <source>Digital watermarking and steganography</source>. <publisher-loc>Burlington, MA, US</publisher-loc>: <publisher-name>Morgan Kaufmann</publisher-name>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Digital+watermarking+and+steganography&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B14">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Datar</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Immorlica</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Indyk</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Mirrokni</surname>
<given-names>V. S.</given-names>
</name>
</person-group> (<year>2004</year>). &#x201c;<article-title>Locality-sensitive hashing scheme based on p-stable distributions</article-title>,&#x201d; in <source>Proceedings of the twentieth annual symposium on Computational geometry</source>, <fpage>253</fpage>&#x2013;<lpage>262</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/997817.997857">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Locality-sensitive+hashing+scheme+based+on+p-stable+distributions&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>de Ara&#xfa;jo</surname>
<given-names>A. F.</given-names>
</name>
<name>
<surname>Girod</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Large-scale video retrieval using image queries</article-title>. <source>IEEE Trans. Circuits Syst. Video Technol.</source> <volume>28</volume> (<issue>6</issue>), <fpage>1406</fpage>&#x2013;<lpage>1420</lpage>. <pub-id pub-id-type="doi">10.1109/tcsvt.2017.2667710</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tcsvt.2017.2667710">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Large-scale+video+retrieval+using+image+queries&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B16">
<citation citation-type="book">
<collab>Deloitte</collab> (<year>2022</year>). <source>The future of the TV and video landscape by 2030</source>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://www2.deloitte.com/content/dam/Deloitte/be/Documents/technology-media-telecommunications/201809%20Future%20of%20Video_DIGITAL_FINAL.pdf">https://www2.deloitte.com/content/dam/Deloitte/be/Documents/technology-media-telecommunications/201809%20Future%20of%20Video_DIGITAL_FINAL.pdf</ext-link>
</comment>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=The+future+of+the+TV+and+video+landscape+by+2030&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B17">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Devi</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Karthik</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Baga</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Bavatharani</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Indhumadhi</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Fake news and tampered image detection in social networks using machine learning</article-title>,&#x201d; in <source>2021 third international conference on inventive research in computing applications (ICIRCA)</source>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/icirca51532.2021.9544661">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Fake+news+and+tampered+image+detection+in+social+networks+using+machine+learning&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Do</surname>
<given-names>M. N.</given-names>
</name>
<name>
<surname>Vetterli</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2004</year>). <article-title>The contourlet transform: an efficient directional multiresolution image representation</article-title>. <source>IEEE Trans. Image Process.</source> <volume>14</volume> (<issue>12</issue>), <fpage>2091</fpage>&#x2013;<lpage>2106</lpage>. <pub-id pub-id-type="doi">10.1109/tip.2005.859376</pub-id> <ext-link ext-link-type="uri" xlink:href="https://pubmed.ncbi.nlm.nih.gov/16370462/">PubMed Abstract</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tip.2005.859376">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=The+contourlet+transform:+an+efficient+directional+multiresolution+image+representation&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Douze</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Gaidon</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Jegou</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Marsza&#x142;ek</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Schmid</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>INRIA-LEAR&#x2019;s video copy detection system</article-title>. <source>TRECVID</source>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=INRIA-LEAR&#x2019;s+video+copy+detection+system&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Douze</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>J&#xe9;gou</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Schmid</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>An image-based approach to video copy detection with spatio-temporal post-filtering</article-title>. <source>IEEE Trans. Multimed.</source> <volume>12</volume>, <fpage>257</fpage>&#x2013;<lpage>266</lpage>. <pub-id pub-id-type="doi">10.1109/tmm.2010.2046265</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tmm.2010.2046265">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=An+image-based+approach+to+video+copy+detection+with+spatio-temporal+post-filtering&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Duan</surname>
<given-names>L.-Y.</given-names>
</name>
<name>
<surname>Lou</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Yan</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Chandrasekhar</surname>
<given-names>V.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Compact descriptors for video analysis: The emerging MPEG standard</article-title>. <source>IEEE Multimed.</source> <volume>26</volume> (<issue>2</issue>), <fpage>44</fpage>&#x2013;<lpage>54</lpage>. <pub-id pub-id-type="doi">10.1109/mmul.2018.2873844</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/mmul.2018.2873844">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Compact+descriptors+for+video+analysis:+The+emerging+MPEG+standard&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B22">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ejembi</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Bhatti</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Go green with EnVI: the energy-video index</article-title>,&#x201d; in <source>2015 IEEE international symposium on multimedia (ISM)</source>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/ism.2015.50">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Go+green+with+EnVI:+the+energy-video+index&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Esmaeili</surname>
<given-names>M. M.</given-names>
</name>
<name>
<surname>Fatourechi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Ward</surname>
<given-names>R. K.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>A robust and fast video copy detection system using content-based fingerprinting</article-title>. <source>IEEE Trans. Inf. Forensic. Secur.</source> <volume>6</volume> (<issue>1</issue>), <fpage>213</fpage>&#x2013;<lpage>226</lpage>. <pub-id pub-id-type="doi">10.1109/tifs.2010.2097593</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tifs.2010.2097593">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=A+robust+and+fast+video+copy+detection+system+using+content-based+fingerprinting&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Esmaeili</surname>
<given-names>M. M.</given-names>
</name>
<name>
<surname>Ward</surname>
<given-names>R. K.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>Robust video hashing based on temporally informative representative images</article-title>. <source>Proc. IEEE ICCE</source>, <fpage>179</fpage>&#x2013;<lpage>180</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/icce.2010.5418777">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Robust+video+hashing+based+on+temporally+informative+representative+images&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fernandes</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Ducloux</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Faramarzi</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Gendron</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Wen</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>The green metadata standard for energy-efficient video consumption</article-title>. <source>IEEE Multimed.</source> <volume>22</volume> (<issue>1</issue>), <fpage>80</fpage>&#x2013;<lpage>87</lpage>. <pub-id pub-id-type="doi">10.1109/mmul.2015.18</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/mmul.2015.18">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=The+green+metadata+standard+for+energy-efficient+video+consumption&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fridrich</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Goljan</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2000</year>). <article-title>Robust hash functions for digital watermarking</article-title>. <source>Proc. Int. Conf. Inf. Technol. Coding Comput.</source>, <fpage>178</fpage>&#x2013;<lpage>183</lpage>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Robust+hash+functions+for+digital+watermarking&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Garboan</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Mitrea</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Live camera recording robust video fingerprinting</article-title>. <source>Multimed. Syst.</source> <volume>22</volume>, <fpage>229</fpage>&#x2013;<lpage>243</lpage>. <pub-id pub-id-type="doi">10.1007/s00530-014-0447-0</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s00530-014-0447-0">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Live+camera+recording+robust+video+fingerprinting&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gong</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Lazebnik</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Gordo</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Perronnin</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>35</volume> (<issue>12</issue>), <fpage>2916</fpage>&#x2013;<lpage>2929</lpage>. <pub-id pub-id-type="doi">10.1109/tpami.2012.193</pub-id> <ext-link ext-link-type="uri" xlink:href="https://pubmed.ncbi.nlm.nih.gov/24136430/">PubMed Abstract</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tpami.2012.193">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Iterative+quantization:+A+procrustean+approach+to+learning+binary+codes+for+large-scale+image+retrieval&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B31">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Hadsell</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Chopra</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>LeCun</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2006</year>). &#x201c;<article-title>Dimensionality reduction by learning an invariant mapping</article-title>,&#x201d; in <source>Proc. IEEE conf. Comput. Vis. Pattern recog.</source>, <fpage>1735</fpage>&#x2013;<lpage>1742</lpage>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Dimensionality+reduction+by+learning+an+invariant+mapping&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B32">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Hampapur</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Bolle</surname>
<given-names>R. M.</given-names>
</name>
</person-group> (<year>2001</year>). &#x201c;<article-title>Comparison of distance measures for video copy detection</article-title>,&#x201d; in <source>International conference on multimedia and expo</source>, <fpage>737</fpage>&#x2013;<lpage>740</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/icme.2001.1237827">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Comparison+of+distance+measures+for+video+copy+detection&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hasnaoui</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Mitrea</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Multi-symbol QIM video watermarking</article-title>. <source>Signal Process. Image Commun.</source> <volume>29</volume> (<issue>1</issue>), <fpage>107</fpage>&#x2013;<lpage>127</lpage>. <pub-id pub-id-type="doi">10.1016/j.image.2013.07.007</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.image.2013.07.007">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Multi-symbol+QIM+video+watermarking&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B34">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>He</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Deep residual learning for image recognition</article-title>,&#x201d; in <source>IEEE conference on computer vision and pattern recognition (CVPR)</source>, <fpage>770</fpage>&#x2013;<lpage>778</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr.2016.90">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Deep+residual+learning+for+image+recognition&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Heikkila</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Pietikainen</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Schmid</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>Description of interest regions with local binary patterns</article-title>. <source>Pattern Recognit.</source> <volume>42</volume> (<issue>3</issue>), <fpage>425</fpage>&#x2013;<lpage>436</lpage>. <pub-id pub-id-type="doi">10.1016/j.patcog.2008.08.014</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.patcog.2008.08.014">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Description+of+interest+regions+with+local+binary+patterns&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B36">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Heilbron</surname>
<given-names>F. C.</given-names>
</name>
<name>
<surname>Escorcia</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Ghanem</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Niebles</surname>
<given-names>J. C.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>ActivityNet: A large-scale video benchmark for human activity understanding</article-title>,&#x201d; in <source>Proc. IEEE conf. Comput. Vis. Pattern recognit. (CVPR)</source>, <fpage>961</fpage>&#x2013;<lpage>970</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr.2015.7298698">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=ActivityNet:+A+large-scale+video+benchmark+for+human+activity+understanding&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hochreiter</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Schmidhuber</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>1997</year>). <article-title>Long short-term memory</article-title>. <source>Neural Comput.</source> <volume>9</volume> (<issue>8</issue>), <fpage>1735</fpage>&#x2013;<lpage>1780</lpage>. <pub-id pub-id-type="doi">10.1162/neco.1997.9.8.1735</pub-id> <ext-link ext-link-type="uri" xlink:href="https://pubmed.ncbi.nlm.nih.gov/9377276/">PubMed Abstract</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1162/neco.1997.9.8.1735">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Long+short-term+memory&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B38">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Hong</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Xiangyang</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2010</year>). &#x201c;<article-title>SVD-SIFT for web near-duplicate image detection</article-title>,&#x201d; in <source>IEEE ICIP</source>, <fpage>1445</fpage>&#x2013;<lpage>1448</lpage>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=SVD-SIFT+for+web+nearduplicate+image+detection&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B39">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Hou</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Multiple features video fingerprint algorithm based on optical flow feature</article-title>,&#x201d; in <source>International conference on computers, communications, and systems (ICCCS)</source>, <fpage>159</fpage>&#x2013;<lpage>162</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/ccoms.2015.7562893">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Multiple+features+video+fingerprint+algorithm+based+on+optical+flow+feature&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Learning spatial-temporal features for video copy detection by the combination of CNN and RNN</article-title>. <source>J. Vis. Commun. Image Represent.</source> <volume>55</volume>, <fpage>21</fpage>&#x2013;<lpage>29</lpage>. <pub-id pub-id-type="doi">10.1016/j.jvcir.2018.05.013</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.jvcir.2018.05.013">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Learning+spatial-temporal+features+for+video+copy+detection+by+the+combination+of+CNN+and+RNN&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B41">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Idris</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Panchanathan</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>1997</year>). <article-title>Review of image and video indexing techniques</article-title>. <source>J. Vis. Commun. Image Represent.</source> <volume>8</volume> (<issue>2</issue>), <fpage>146</fpage>&#x2013;<lpage>166</lpage>. <pub-id pub-id-type="doi">10.1006/jvci.1997.0355</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1006/jvci.1997.0355">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Review+of+image+and+video+indexing+techniques&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B42">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Jegou</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Douze</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Schmid</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2008</year>). &#x201c;<article-title>Hamming Embedding and Weak geometry consistency for large scale image search</article-title>,&#x201d; in <source>Proceedings of the 10th European conference on Computer vision</source>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Hamming+Embedding+and+Weak+geometry+consistency+for+large+scale+image+search&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B43">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Su</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Cui</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2012</year>). &#x201c;<article-title>A rotation invariant descriptor for robust video copy detection</article-title>,&#x201d; in <source>The era of interactive media</source>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-1-4614-3501-3_46">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=A+rotation+invariant+descriptor+for+robust+video+copy+detection&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B44">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>Y. G.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>VCDB: A large-scale database for partial copy detection in videos</article-title>,&#x201d; in <source>European conference on computer vision (ECCV)</source>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-319-10593-2_24">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=VCDB:+A+large-scale+database+for+partial+copy+detection+in+videos&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>Y. G.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Partial copy detection in videos: a benchmark and an evaluation of popular methods</article-title>. <source>IEEE Trans. Big Data</source> <volume>2</volume> (<issue>1</issue>), <fpage>32</fpage>&#x2013;<lpage>42</lpage>. <pub-id pub-id-type="doi">10.1109/tbdata.2016.2530714</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tbdata.2016.2530714">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Partial+copy+detection+in+videos:+a+benchmark+and+an+evaluation+of+popular+methods&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B46">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>Y. G.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Xue</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>S. F.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Exploiting feature and class relationships in video categorization with regularized deep neural networks</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>40</volume> (<issue>2</issue>), <fpage>352</fpage>&#x2013;<lpage>364</lpage>. <pub-id pub-id-type="doi">10.1109/tpami.2017.2670560</pub-id> <ext-link ext-link-type="uri" xlink:href="https://pubmed.ncbi.nlm.nih.gov/28221992/">PubMed Abstract</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tpami.2017.2670560">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Exploiting+feature+and+class+relationships+in+video+categorization+with+regularized+deep+neural+networks&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B47">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>Y. G.</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>S.-F.</given-names>
</name>
<name>
<surname>Ellis</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Loui</surname>
<given-names>A. C.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Consumer video understanding: A benchmark database and an evaluation of human and machine performance</article-title>. <source>Proc. 1st ACM Int. Conf. Multimed. Retr.</source> <comment>Art. No. 29</comment>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Consumer+video+understanding:+A+benchmark+database+and+an+evaluation+of+human+and+machine+performance&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B48">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jin</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhan</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Lou</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>Just noticeable difference for deep machine vision</article-title>. <source>IEEE Trans. Circuits Syst. Video Technol.</source> <volume>32</volume> (<issue>6</issue>), <fpage>3452</fpage>&#x2013;<lpage>3461</lpage>. <pub-id pub-id-type="doi">10.1109/tcsvt.2021.3113572</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tcsvt.2021.3113572">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Just+noticeable+difference+for+deep+machine+vision&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B49">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jin</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Cai</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Density sensitive hashing</article-title>. <source>IEEE Trans. Cybern.</source> <volume>44</volume> (<issue>8</issue>), <fpage>1362</fpage>&#x2013;<lpage>1371</lpage>. <pub-id pub-id-type="doi">10.1109/tcyb.2013.2283497</pub-id> <ext-link ext-link-type="uri" xlink:href="https://pubmed.ncbi.nlm.nih.gov/24158526/">PubMed Abstract</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tcyb.2013.2283497">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Density+sensitive+hashing&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B50">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Simonyan</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zisserman</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2014</year>). <source>Very deep convolutional networks for large-scale image recognition</source>. <comment>arXiv preprint arXiv:1409.1556</comment>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Very+deep+convolutional+networks+for+large-scale+image+recognition&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B51">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Karpathy</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Toderici</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Shetty</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Leung</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Sukthankar</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Fei-Fei</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Large-scale video classification with convolutional neural networks</article-title>. <source>IEEE Conf. Comput. Vis. Pattern Recognit.</source>, <fpage>1725</fpage>&#x2013;<lpage>1732</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr.2014.223">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Large-scale+video+classification+with+convolutional+neural+networks&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B52">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Katarya</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Massoudi</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Recognizing fake news in social media with deep learning: A systematic review</article-title>,&#x201d; in <source>4th international conference on computer, communication and signal processing (ICCCSP)</source>, <fpage>1</fpage>&#x2013;<lpage>4</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/icccsp49186.2020.9315255">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Recognizing+fake+news+in+social+media+with+deep+learning:+A+systematic+review&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B53">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Katayama</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Shimamoto</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Leu</surname>
<given-names>J.-S.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>NearReference frame selection algorithm of HEVC encoder for low power video device</article-title>,&#x201d; in <source>2016 2nd international conference on intelligent green building and smart grid (IGBSG)</source>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=NearReference+frame+selection+algorithm+of+HEVC+encoder+for+low+power+video+device&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B54">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Vasudev</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2005</year>). <article-title>Spatiotemporal sequence matching for efficient video copy detection</article-title>. <source>IEEE Trans. Circuits Syst. Video Technol.</source> <volume>15</volume>, <fpage>127</fpage>&#x2013;<lpage>132</lpage>. <pub-id pub-id-type="doi">10.1109/tcsvt.2004.836751</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tcsvt.2004.836751">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Spatiotemporal+sequence+matching+for+efficient+video+copy+detection&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B55">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kingma</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Ba</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2014</year>). <source>Adam: A method for stochastic optimization</source>. <comment>arXiv preprint arXiv:1412.6980</comment>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Adam:+A+method+for+stochastic+optimization&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B56">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kordopatis-Zilos</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Papadopoulos</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Patras</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Kompatsiaris</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2017a</year>). &#x201c;<article-title>Near-duplicate video retrieval by aggregating intermediate cnn layers</article-title>,&#x201d; in <source>International conference on multimedia modeling</source>, <fpage>251</fpage>&#x2013;<lpage>263</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-319-51811-4_21">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Near-duplicate+video+retrieval+by+aggregating+intermediate+cnn+layers&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B57">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kordopatis-Zilos</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Papadopoulos</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Patras</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Kompatsiaris</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2017b</year>). &#x201c;<article-title>Near-duplicate video retrieval with deep metric learning</article-title>,&#x201d; in <source>IEEE international conference on computer vision workshops (ICCVW-2017)</source>, <fpage>347</fpage>&#x2013;<lpage>356</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/iccvw.2017.49">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Near-duplicate+video+retrieval+with+deep+metric+learning&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B58">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Krizhevsky</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Sutskever</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Hinton</surname>
<given-names>G. E.</given-names>
</name>
</person-group> (<year>2012</year>). &#x201c;<article-title>ImageNet classification with deep convolutional neural networks</article-title>,&#x201d; in <source>Advances in neural information processing systems 25: 26th annual conference on neural information processing systems</source>, <fpage>1106</fpage>&#x2013;<lpage>1114</lpage>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=ImageNet+classification+with+deep+con-+volutional+neural+networks&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B59">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kuehne</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Jhuang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Garrote</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Poggio</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Serre</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2011</year>). &#x201c;<article-title>HMDB: a large video database for human motion recognition</article-title>,&#x201d; in <source>International conference on computer vision</source> (<publisher-name>IEEE</publisher-name>), <fpage>2556</fpage>&#x2013;<lpage>2563</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/iccv.2011.6126543">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=HMDB:+a+large+video+database+for+human+motion+recognition&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B60">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lago</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Phan</surname>
<given-names>Q.-T.</given-names>
</name>
<name>
<surname>Boato</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Image forensics in online news</article-title>,&#x201d; in <source>2018 IEEE 20th international workshop on multimedia signal processing (MMSP)</source>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/mmsp.2018.8547083">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Image+forensics+in+online+news&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B61">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Law-To</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Buisson</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Gouet-Brunet</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Boujemaa</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2007a</year>). &#x201c;<article-title>Video copy detection on the Internet: the challenges of copyright and multiplicity</article-title>,&#x201d; in <source>IEEE int&#x2019;l conf multimed expo</source>, <fpage>2082</fpage>&#x2013;<lpage>2085</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/icme.2007.4285092">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Video+copy+detection+on+the+Internet:+the+challenges+of+copyright+and+multiplicity&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B62">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Law-To</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Joly</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Boujemaa</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2007b</year>). <comment>Muscle-VCD-2007: a live benchmark for video copy detection. Available at: <ext-link ext-link-type="uri" xlink:href="http://www-rocq.inria.fr/%20imedia/civr-bench/">http://www-rocq.inria.fr/imedia/civr-bench/</ext-link>
</comment>.</citation>
</ref>
<ref id="B63">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Yoo</surname>
<given-names>C. D.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>Robust video fingerprinting for content-based video identification</article-title>. <source>IEEE Trans. Circuits Syst. Video Technol.</source> <volume>18</volume> (<issue>7</issue>), <fpage>983</fpage>&#x2013;<lpage>988</lpage>. <pub-id pub-id-type="doi">10.1109/tcsvt.2008.920739</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tcsvt.2008.920739">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Robust+video+fingerprinting+for+content-based+video+identification&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B64">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Yoo</surname>
<given-names>C. D.</given-names>
</name>
</person-group> (<year>2006</year>). &#x201c;<article-title>Video fingerprinting based on centroids of gradient orientations</article-title>,&#x201d; in <source>Proc. IEEE int. Conf. Acoust., speech and signal process. (ICASSP)</source>, <volume>2</volume> (<issue>II</issue>). <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Video+fingerprinting+based+on+centroids+of+gradient+orientations&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B65">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lefebvre</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Chupeau</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Massoudi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Diehl</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2009</year>). &#x201c;<article-title>Image and video fingerprinting: Forensic applications</article-title>,&#x201d; in <source>Proc. SPIE</source> (<publisher-loc>San Jose, CA, US</publisher-loc>: <publisher-name>Media Forensics and Security</publisher-name>), <volume>7254</volume>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1117/12.806580">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Image+and+video+fingerprinting:+Forensic+applications&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B66">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Monga</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Compact video fingerprinting via structural graphical models</article-title>. <source>IEEE Trans. Inf. Forensic. Secur.</source> <volume>8</volume>, <fpage>1709</fpage>&#x2013;<lpage>1721</lpage>. <pub-id pub-id-type="doi">10.1109/tifs.2013.2278100</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tifs.2013.2278100">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Compact+video+fingerprinting+via+structural+graphical+models&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B67">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Monga</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Robust video hashing via multilinear subspace projections</article-title>. <source>IEEE Trans. Image Process.</source> <volume>21</volume> (<issue>10</issue>), <fpage>4397</fpage>&#x2013;<lpage>4409</lpage>. <pub-id pub-id-type="doi">10.1109/tip.2012.2206036</pub-id> <ext-link ext-link-type="uri" xlink:href="https://pubmed.ncbi.nlm.nih.gov/22752130/">PubMed Abstract</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tip.2012.2206036">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Robust+video+hashing+via+multilinear+subspace+projections&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B68">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Compact video fingerprinting via an improved capsule net</article-title>. <source>Syst. Sci. Control Eng.</source> <volume>9</volume> (<issue>1</issue>), <fpage>122</fpage>&#x2013;<lpage>130</lpage>. <pub-id pub-id-type="doi">10.1080/21642583.2020.1833782</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1080/21642583.2020.1833782">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Compact+video+fingerprinting+via+an+improved+capsule+net&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B69">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Deep content: Unveiling video streaming content from encrypted WiFi traffic</article-title>,&#x201d; in <source>IEEE 17th international symposium on network computing and application</source>, <fpage>1</fpage>&#x2013;<lpage>8</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/nca.2018.8548317">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Deep+content:+Unveiling+video+streaming+content+from+encrypted+WiFi+traffic&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B70">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>Y. N.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>X. P.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Robust and compact video descriptor learned by deep neural network</article-title>,&#x201d; in <source>IEEE international conference on acoustics, speech and signal processing (ICASSP)</source>, <fpage>2162</fpage>&#x2013;<lpage>2166</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/icassp.2017.7952539">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Robust+and+compact+video+descriptor+learned+by+deep+neural+network&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B71">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lin</surname>
<given-names>T. Y.</given-names>
</name>
<name>
<surname>Goyal</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Girshick</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Doll&#xe1;r</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2017</year>). <source>Focal loss for dense object detection</source>. <comment>arXiv:1708.02002</comment>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Focal+loss+for+dense+object+detection&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B72">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liong</surname>
<given-names>V. E.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>Y. P.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Deep video hashing</article-title>. <source>IEEE Trans. Multimed.</source> <volume>19</volume> (<issue>6</issue>), <fpage>1209</fpage>&#x2013;<lpage>1219</lpage>. <pub-id pub-id-type="doi">10.1109/tmm.2016.2645404</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tmm.2016.2645404">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Deep+video+hashing&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B73">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Liong</surname>
<given-names>V. E.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Moulin</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Deep hashing for compact binary codes learning</article-title>,&#x201d; in <source>Proc. IEEE conf. Comput. Vis. Pattern recognit. (CVPR)</source>, <fpage>2475</fpage>&#x2013;<lpage>2483</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr.2015.7298862">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Deep+hashing+for+compact+binary+codes+learning&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B74">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Cai</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>H. T.</given-names>
</name>
<name>
<surname>Ngo</surname>
<given-names>C. W.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>W.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Near-duplicate video retrieval: Current research and future trends</article-title>. <source>ACM Comput. Surv.</source> <volume>45</volume> (<issue>4</issue>), <fpage>1</fpage>&#x2013;<lpage>23</lpage>. <comment>Art. No. 44</comment>. <pub-id pub-id-type="doi">10.1145/2501654.2501658</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/2501654.2501658">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Near-duplicate+video+retrieval:+Current+research+and+future+trends&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B75">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Content-based video copy detection using binary object fingerprints</article-title>,&#x201d; in <source>IEEE international conference on signal processing, communications and computing (ICSPCC)</source>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/icspcc.2018.8567827">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Content-based+video+copy+detection+using+binary+object+fingerprints&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B76">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Po</surname>
<given-names>L. M.</given-names>
</name>
<name>
<surname>Ur Rehman</surname>
<given-names>Y. A.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Video copy detection by conducting fast searching of inverted files</article-title>. <source>Multimed. Tools Appl.</source> <volume>78</volume>, <fpage>10601</fpage>&#x2013;<lpage>10624</lpage>. <pub-id pub-id-type="doi">10.1007/s11042-018-6639-4</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s11042-018-6639-4">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Video+copy+detection+by+conducting+fast+searching+of+inverted+files&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B77">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Ji</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>Y. G.</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>S. F.</given-names>
</name>
</person-group> (<year>2012</year>). &#x201c;<article-title>Supervised hashing with kernels</article-title>,&#x201d; in <source>Proc. IEEE conf. Comput. Vis. Pattern recognit. (CVPR)</source>, <fpage>2074</fpage>&#x2013;<lpage>2081</lpage>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Supervised+hashing+with+kernels&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B78">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lu</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>Video fingerprinting for copy identification: from research to industry applications</article-title>. <source>Proc. SPIE - Media Forensics Secur. XI</source> <volume>7254</volume>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1117/12.805709">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Video+fingerprinting+for+copy+identification:+from+research+to+industry+applications&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B79">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ma</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Gu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Gong</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Unsupervised video hashing via deep neural network</article-title>. <source>Neural process. Lett.</source> <volume>47</volume> (<issue>3</issue>), <fpage>877</fpage>&#x2013;<lpage>890</lpage>. <pub-id pub-id-type="doi">10.1007/s11063-018-9812-x</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s11063-018-9812-x">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Unsupervised+video+hashing+via+deep+neural+network&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B80">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Manerba</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Benois-Pineau</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Leonardi</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Mansencal</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>Multiple moving object detection for fast video content description in compressed domain</article-title>. <source>EURASIP J. Adv. Signal Process.</source>, <fpage>231930</fpage>&#x2013;<lpage>232015</lpage>. <pub-id pub-id-type="doi">10.1155/2008/231930</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1155/2008/231930">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Multiple+moving+object+detection+for+fast+video+content+description+in+compressed+domain&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B81">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Mansencal</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Benois-Pineau</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bredin</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Quenot</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2018</year>). <source>IRIM at TRECVID 2018: Instance search</source>. <comment>TRECVID</comment>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=IRIM+at+TRECVID+2018:+Instance+search&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B82">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mao</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Gang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Weiguo</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Yahong</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhiguo</surname>
<given-names>Q.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>A method for video authenticity based on the fingerprint of scene frame</article-title>. <source>Neurocomputing</source> <volume>173</volume>, <fpage>2022</fpage>&#x2013;<lpage>2032</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2015.09.001</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.neucom.2015.09.001">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=A+method+for+video+authenticity+based+on+the+fingerprint+of+scene+frame&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B83">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ngo</surname>
<given-names>C. W.</given-names>
</name>
<name>
<surname>Yu-Fei</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Hong</surname>
<given-names>J. Z.</given-names>
</name>
</person-group> (<year>2005</year>). <article-title>Video summarization and scene detection by graph modeling</article-title>. <source>IEEE Trans. Circuits Syst. Video Technol.</source> <volume>15</volume>, <fpage>296</fpage>&#x2013;<lpage>305</lpage>. <pub-id pub-id-type="doi">10.1109/tcsvt.2004.841694</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tcsvt.2004.841694">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Video+summarization+and+scene+detection+by+graph+modeling&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B84">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nie</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Zeng</surname>
<given-names>W. K.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Graph-based video fingerprinting using double optimal projection</article-title>. <source>J. Vis. Commun. Image Represent.</source> <volume>32</volume>, <fpage>120</fpage>&#x2013;<lpage>129</lpage>. <pub-id pub-id-type="doi">10.1016/j.jvcir.2015.08.001</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.jvcir.2015.08.001">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Graph-based+video+fingerprinting+using+double+optimal+projection&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B85">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Nie</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Jing</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>L. Y.</given-names>
</name>
<name>
<surname>Cui</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2017a</year>). &#x201c;<article-title>Two-layer video fingerprinting strategy for near-duplicate video detection</article-title>,&#x201d; in <source>IEEE international conference on multimedia &#x26; expo workshops (ICMEW)</source>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/icmew.2017.8026322">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Two-layer+video+fingerprinting+strategy+for+near-duplicate+video+detection&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B86">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nie</surname>
<given-names>X. S.</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>Y. L.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Cui</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2017b</year>). <article-title>Comprehensive feature-based robust video fingerprinting using tensor model</article-title>. <source>IEEE Trans. Multimed.</source> <volume>19</volume> (<issue>4</issue>), <fpage>785</fpage>&#x2013;<lpage>796</lpage>. <pub-id pub-id-type="doi">10.1109/tmm.2016.2629758</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tmm.2016.2629758">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Comprehensive+feature-based+robust+video+fingerprinting+using+tensor+model&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B87">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nie</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Classification-enhancement deep hashing for large-scale video retrieval</article-title>. <source>Appl. Soft Comput.</source> <volume>109</volume>, <fpage>107467</fpage>. <pub-id pub-id-type="doi">10.1016/j.asoc.2021.107467</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.asoc.2021.107467">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Classification-enhancement+deep+hashing+for+large-scale+video+retrieval&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B88">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Oostveen</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Kalker</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Haitsma</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2002</year>). &#x201c;<article-title>Feature extraction and a database strategy for video fingerprinting</article-title>,&#x201d; in <source>Proceedings of the 5th international conference on recent advances in visual information systems</source>, <volume>2314</volume>, <fpage>117</fpage>&#x2013;<lpage>128</lpage>. <comment>Lecture Notes In Computer Science</comment>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/3-540-45925-1_11">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Feature+extraction+and+a+database+strategy+for+video+fingerprinting&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B89">
<citation citation-type="web">
<collab>Open Video</collab> (<year>2022</year>). <article-title>Open Video dataset</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://www.open-video.org">www.open-video.org</ext-link>
</comment>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Open+Video+dataset&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B90">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ouali</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Dumouchel</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Gupta</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Robust video fingerprints using positions of salient regions</article-title>,&#x201d; in <source>IEEE international conference on acoustics, speech and signal processing (ICASSP)</source>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/icassp.2017.7952715">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Robust+video+fingerprints+using+positions+of+salient+regions&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B91">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ouyang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Coatrieux</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Shu</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Robust hashing for image authentication using quaternion discrete Fourier transform and log-polar transform</article-title>. <source>Digit. Signal Process.</source> <volume>41</volume>, <fpage>98</fpage>&#x2013;<lpage>109</lpage>. <pub-id pub-id-type="doi">10.1016/j.dsp.2015.03.006</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.dsp.2015.03.006">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Robust+hashing+for+image+authentication+using+quaternion+discrete+Fourier+transform+and+log-polar+transform&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B92">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Raginsky</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Lazebnik</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2009</year>). &#x201c;<article-title>Locality-sensitive binary codes from shift-invariant kernels</article-title>,&#x201d; in <source>Advances in neural information processing systems</source>, <fpage>1509</fpage>&#x2013;<lpage>1517</lpage>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Locality-sensitive+binary+codes+from+shift-invariant+kernels&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B93">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ren</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Zhuo</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Long</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Qu</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>MPEG-2 video copy detection method based on sparse representation of spatial and temporal features</article-title>,&#x201d; in <source>IEEE second international conference on multimedia big data</source>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/bigmm.2016.21">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=MPEG-2+video+copy+detection+method+based+on+sparse+representation+of+spatial+and+temporal+features&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B94">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Roover</surname>
<given-names>C. D.</given-names>
</name>
<name>
<surname>Vleeschouwer</surname>
<given-names>C. D.</given-names>
</name>
<name>
<surname>Lefebvre</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Macq</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2005</year>). <article-title>Robust video hashing based on radial projections of key frames</article-title>. <source>IEEE Trans. Signal Process.</source> <volume>53</volume> (<issue>10</issue>), <fpage>4020</fpage>&#x2013;<lpage>4037</lpage>. <pub-id pub-id-type="doi">10.1109/tsp.2005.855414</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tsp.2005.855414">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Robust+video+hashing+based+on+radial+projections+of+key+frames&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B95">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sabour</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Frosst</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Hinton</surname>
<given-names>G. E.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Dynamic routing between capsules</article-title>. <source>Adv. neural Inf. Process. Syst.</source> <volume>30</volume>. <ext-link ext-link-type="uri" xlink:href="https://pubmed.ncbi.nlm.nih.gov/29983538/">PubMed Abstract</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Dynamic+routing+between+capsules&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B96">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Sarkar</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Ghosh</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Moxley</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Manjunath</surname>
<given-names>B. S.</given-names>
</name>
</person-group> (<year>2008</year>). &#x201c;<article-title>Video fingerprinting: features for duplicate and similar video detection and query-based video retrieval</article-title>,&#x201d; in <source>Multimed content access algorithms syst II</source>, <fpage>68200E</fpage>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Video+fingerprinting:+features+for+duplicate+and+similar+video+detection+and+query-based+video+retrieval&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B97">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Schuster</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Shmatikov</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Tromer</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Beauty and the Burst: Remote identification of encrypted video streams</article-title>,&#x201d; in <source>26th USENIX security symposium</source>, <fpage>1357</fpage>&#x2013;<lpage>1374</lpage>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Beauty+and+the+Burst:+Remote+identification+of+encrypted+video+streams&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B98">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Seidel</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2009</year>). &#x201c;<article-title>Content fingerprinting from an industry perspective</article-title>,&#x201d; in <source>IEEE international conference on multimedia and expo</source>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/icme.2009.5202794">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Content+fingerprinting+from+an+industry+perspective&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B99">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Shen</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>H. T.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Supervised discrete hashing</article-title>,&#x201d; in <source>Proc. IEEE conf. Comput. Vis. Pattern recognit. (CVPR)</source>, <fpage>37</fpage>&#x2013;<lpage>45</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr.2015.7298598">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Supervised+discrete+hashing&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B100">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shikui</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Ce</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Changsheng</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhenfeng</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Frame fusion for video copy detection</article-title>. <source>IEEE Trans. Circuits Syst. Video Technol.</source> <volume>21</volume> (<issue>1</issue>), <fpage>15</fpage>&#x2013;<lpage>28</lpage>. <pub-id pub-id-type="doi">10.1109/tcsvt.2011.2105554</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tcsvt.2011.2105554">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Frame+fusion+for+video+copy+detection&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B101">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Simonyan</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zisserman</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Very deep convolutional networks for large-scale image recognition</article-title>&#x201d;, <comment>arXiv preprint arXiv:1409.1556</comment>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Very+deep+convolutional+networks+for+large-scale+image+recognition&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B102">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Singh</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Verma</surname>
<given-names>V. K.</given-names>
</name>
<name>
<surname>Rai</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Namboodiri</surname>
<given-names>V. P.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>HetConv: Heterogeneous kernel-based convolutions for deep CNNs</article-title>,&#x201d; in <source>IEEE/CVF conference on computer vision and pattern recognition (CVPR)</source>, <fpage>4830</fpage>&#x2013;<lpage>4839</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr.2019.00497">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=HetConv:+Heterogeneous+kernel-based+convolutions+for+deep+CNNs&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B103">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Sivic</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zisserman</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2003</year>). &#x201c;<article-title>Video google: A text retrieval approach to object matching in videos</article-title>,&#x201d; in <source>Computer vision, IEEE international conference</source>, <fpage>1470</fpage>&#x2013;<lpage>1477</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/iccv.2003.1238663">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Video+google:+A+text+retrieval+approach+to+object+matching+in+videos&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B104">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sodagar</surname>
<given-names>I.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>The MPEG-DASH standard for multimedia streaming over the Internet</article-title>. <source>IEEE Multimed.</source> <volume>18</volume> (<issue>4</issue>), <fpage>62</fpage>&#x2013;<lpage>67</lpage>. <pub-id pub-id-type="doi">10.1109/mmul.2011.71</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/mmul.2011.71">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=The+MPEG-DASH+standard+for+multimedia+streaming+over+the+Internet&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B105">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Song</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Hong</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Self-supervised video hashing with hierarchical binary auto-encoder</article-title>. <source>IEEE Trans. Image Process.</source> <volume>27</volume> (<issue>7</issue>), <fpage>3210</fpage>&#x2013;<lpage>3221</lpage>. <pub-id pub-id-type="doi">10.1109/tip.2018.2814344</pub-id> <ext-link ext-link-type="uri" xlink:href="https://pubmed.ncbi.nlm.nih.gov/29641401/">PubMed Abstract</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tip.2018.2814344">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Self-supervised+video+hashing+with+hierarchical+binary+auto-encoder&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B106">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Soomro</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zamir</surname>
<given-names>A. R.</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2012</year>). <source>UCF101 - a dataset of 101 human actions classes from videos in the wild</source>. <comment>arXiv preprint arXiv:1212.0402</comment>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=UCF101+-+a+dataset+of+101+human+actions+classes+from+videos+in+the+wild&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B107">
<citation citation-type="web">
<collab>Statista</collab> (<year>2022</year>). <article-title>Statista</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://www.statista.com/statistics/267222/global-data-volume-of-internet-video-to-tv-traffic/">https://www.statista.com/statistics/267222/global-data-volume-of-internet-video-to-tv-traffic/</ext-link>
</comment>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Statista&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B108">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Su</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>W.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>Robust video fingerprinting based on visual attention regions</article-title>. <source>IEEE Int&#x2019;l Conf. Acoust. Speech Signal Process</source> <volume>109</volume> (<issue>1</issue>), <fpage>1525</fpage>&#x2013;<lpage>1528</lpage>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Robust+video+fingerprinting+based+on+visual+attention+regions&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B109">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Xiaoxing</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Jun</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Robust video fingerprinting scheme based on contourlet hidden Markov tree model</article-title>. <source>Optik</source> <volume>128</volume>, <fpage>139</fpage>&#x2013;<lpage>147</lpage>. <pub-id pub-id-type="doi">10.1016/j.ijleo.2016.09.105</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.ijleo.2016.09.105">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Robust+video+fingerprinting+scheme+based+on+contourlet+hidden+Markov+tree+model&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B110">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Szegedy</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2015</year>). &#x201c;<article-title>Going deeper with convolutions</article-title>,&#x201d; in <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source>, <fpage>1</fpage>&#x2013;<lpage>9</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr.2015.7298594">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Going+deeper+with+convolutions&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B111">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Tan</surname>
<given-names>H. K.</given-names>
</name>
<name>
<surname>Ngo</surname>
<given-names>C. W.</given-names>
</name>
<name>
<surname>Hong</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Chua</surname>
<given-names>T. S.</given-names>
</name>
</person-group> (<year>2009</year>). <source>Scalable detection of partial near-duplicate videos by visual-temporal consistency</source>. <comment>MM&#x2019;09</comment>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Scalable+detection+of+partial+near-duplicate+videos+by+visual-temporal+consistency&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B112">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Taylor</surname>
<given-names>G. W.</given-names>
</name>
<name>
<surname>Hinton</surname>
<given-names>G. E.</given-names>
</name>
<name>
<surname>Roweis</surname>
<given-names>S. T.</given-names>
</name>
</person-group> (<year>2007</year>). &#x201c;<article-title>Modeling human motion using binary latent variables</article-title>,&#x201d; in <source>Proc. Advances in neural information processing systems</source>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Modeling+human+motion+using+binary+latent+variables&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B113">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Thomas</surname>
<given-names>R. M.</given-names>
</name>
<name>
<surname>Sumesh</surname>
<given-names>M. S.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>A simple and robust colour based video copy detection on summarized videos</article-title>. <source>Procedia Comput. Sci.</source> <volume>46</volume>, <fpage>1668</fpage>&#x2013;<lpage>1675</lpage>. <pub-id pub-id-type="doi">10.1016/j.procs.2015.02.106</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.procs.2015.02.106">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=A+simple+and+robust+colour+based+video+copy+detection+on+summarized+videos&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B114">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Thomee</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Shamma</surname>
<given-names>D. A.</given-names>
</name>
<name>
<surname>Friedland</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Elizalde</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Ni</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Poland</surname>
<given-names>D.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). <article-title>YFCC100M: The new data in multimedia research</article-title>. <source>Commun. ACM</source> <volume>59</volume> (<issue>2</issue>), <fpage>64</fpage>&#x2013;<lpage>73</lpage>. <pub-id pub-id-type="doi">10.1145/2812802</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/2812802">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=YFCC100M:+The+new+data+in+multimedia+research&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B115">
<citation citation-type="web">
<collab>TRECVID</collab> (<year>2022</year>). <article-title>TRECVID</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://trecvid.nist.gov">https://trecvid.nist.gov</ext-link>
</comment>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=trecvid&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B116">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vincent</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Larochelle</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Lajoie</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Bengio</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Manzagol</surname>
<given-names>P. A.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion</article-title>. <source>J. Mach. Learn. Res.</source> <volume>11</volume>, <fpage>3371</fpage>&#x2013;<lpage>3408</lpage>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Stacked+denoising+autoencoders:+Learning+useful+representations+in+a+deep+network+with+a+local+denoising+criterion&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B117">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Kumar</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2010</year>). &#x201c;<article-title>Semi-supervised hashing for scalable image retrieval</article-title>,&#x201d; in <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source>, <fpage>3424</fpage>&#x2013;<lpage>3431</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cvpr.2010.5539994">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Semi-supervised+hashing+for+scalable+image+Retrieval&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B118">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Bao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Fan</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Compact CNN based video representation for efficient video copy detection</article-title>,&#x201d; in <source>International conference on multimedia modeling</source>, <fpage>576</fpage>&#x2013;<lpage>587</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-319-51811-4_47">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Compact+CNN+based+video+representation+for+efficient+video+copy+detection&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B119">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>R. B.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>J. L.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>Y. T.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Video copy detection based on temporal contextual hashing</article-title>,&#x201d; in <source>IEEE second international conference on multimedia big data</source>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/bigmm.2016.12">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Video+copy+detection+based+on+temporal+contextual+hashing&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B120">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wary</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Neelima</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>A review on robust video copy detection</article-title>. <source>Int. J. Multimed. Inf. Retr.</source> <volume>8</volume>, <fpage>61</fpage>&#x2013;<lpage>78</lpage>. <pub-id pub-id-type="doi">10.1007/s13735-018-0159-x</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s13735-018-0159-x">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=A+review+on+robust+video+copy+detection&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B121">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Weiss</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Torralba</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Fergus</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2009</year>). &#x201c;<article-title>Spectral hashing</article-title>,&#x201d; in <source>Advances in neural information processing systems</source>, <fpage>1753</fpage>&#x2013;<lpage>1760</lpage>. <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Spectral+hashing&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B122">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Hauptmann</surname>
<given-names>A. G.</given-names>
</name>
<name>
<surname>Ngo</surname>
<given-names>C. W.</given-names>
</name>
</person-group> (<year>2007b</year>). &#x201c;<article-title>Practical elimination of near-duplicates from web video search</article-title>,&#x201d; in <source>Proceedings of the 15th ACM international conference on multimedia</source>, <fpage>218</fpage>&#x2013;<lpage>227</lpage>. <comment>MM &#x2019;07</comment>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/1291233.1291280">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Practical+elimination+of+near-duplicates+from+web+video+search&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B123">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ngo</surname>
<given-names>C. W.</given-names>
</name>
<name>
<surname>Hauptmann</surname>
<given-names>A. G.</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>H. K.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>Real-time near-duplicate elimination for web video search with content and context</article-title>. <source>IEEE Trans. Multimed.</source> <volume>11</volume> (<issue>2</issue>), <fpage>196</fpage>&#x2013;<lpage>207</lpage>. <pub-id pub-id-type="doi">10.1109/tmm.2008.2009673</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tmm.2008.2009673">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Real-time+near-duplicate+elimination+for+web+video+search+with+content+and+context&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B124">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Ngo</surname>
<given-names>C. W.</given-names>
</name>
</person-group> (<year>2007a</year>). &#x201c;<article-title>Near-duplicate keyframe retrieval with visual keywords and semantic context</article-title>,&#x201d; in <source>Proc. of the 6th ACM international conference on image and video retrieval (CIVR&#x2019;07)</source>, <fpage>162</fpage>&#x2013;<lpage>169</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/1282280.1282309">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Near-duplicate+keyframe+retrieval+with+visual+keywords+and+semantic+context&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B125">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xinwei</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Yi</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Lianghao</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Video fingerprinting based on quadruplet convolutional neural network</article-title>. <source>Syst. Sci. Control Eng.</source> <volume>9</volume> (<issue>1</issue>), <fpage>131</fpage>&#x2013;<lpage>141</lpage>. <pub-id pub-id-type="doi">10.1080/21642583.2020.1822946</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1080/21642583.2020.1822946">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Video+fingerprinting+based+on+quadruplet+convolutional+neural+network&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B126">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Gu</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Niu</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2006</year>). &#x201c;<article-title>Block mean value based image perceptual hashing</article-title>,&#x201d; in <source>IIH-MSP&#x2019;06 international conference on intelligent information hiding and multimedia signal processing</source> (<publisher-name>IEEE</publisher-name>), <fpage>167</fpage>&#x2013;<lpage>172</lpage>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/iih-msp.2006.265125">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Block+mean+value+based+image+perceptual+hashing&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B127">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>Q.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>A robust hashing algorithm based on SURF for video copy detection</article-title>. <source>Comput. Secur.</source> <volume>31</volume>, <fpage>33</fpage>&#x2013;<lpage>39</lpage>. <pub-id pub-id-type="doi">10.1016/j.cose.2011.11.004</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.cose.2011.11.004">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=A+robust+hashing+algorithm+based+on+SURF+for+video+copy+detection&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B128">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yaocong</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Xiaobo</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Learning spatial-temporal features for video copy detection by the combination of CNN and RNN</article-title>. <source>J. Vis. Commun. Image Represent.</source> <volume>55</volume>, <fpage>21</fpage>&#x2013;<lpage>29</lpage>. <pub-id pub-id-type="doi">10.1016/j.jvcir.2018.05.013</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.jvcir.2018.05.013">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Learning+spatial-temporal+features+for+video+copy+detection+by+the+combination+of+CNN+and+RNN&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B129">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yuan</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Po</surname>
<given-names>L. M.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>M. Y.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>X. Y.</given-names>
</name>
<name>
<surname>Jian</surname>
<given-names>W. H.</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>K.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). <article-title>Shearlet based video fingerprint for content-based copy detection</article-title>. <source>J. Signal Inf. Process.</source> <volume>7</volume>, <fpage>84</fpage>&#x2013;<lpage>97</lpage>. <pub-id pub-id-type="doi">10.4236/jsip.2016.72010</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.4236/jsip.2016.72010">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Shearlet+based+video+fingerprint+for+content-based+copy+detection&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B130">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>F.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>CNN-VWII: An efficient approach for large-scale video retrieval by image queries</article-title>. <source>Pattern Recognit. Lett.</source> <volume>123</volume>, <fpage>82</fpage>&#x2013;<lpage>88</lpage>. <pub-id pub-id-type="doi">10.1016/j.patrec.2019.03.015</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.patrec.2019.03.015">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=CNN-VWII:+An+efficient+approach+for+large-scale+video+retrieval+by+image+queries&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B131">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>W. L.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ngo</surname>
<given-names>C. W.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>On the annotation of web videos by efficient near duplicate search</article-title>. <source>IEEE Trans. Multimed.</source> <volume>12</volume>, <fpage>448</fpage>&#x2013;<lpage>461</lpage>. <pub-id pub-id-type="doi">10.1109/tmm.2010.2050651</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tmm.2010.2050651">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=On+the+annotation+of+web+videos+by+efficient+near+duplicate+search&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B132">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Dai</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>Robust hashing based on persistent points for video copy detection</article-title>. <source>Proc. Int. Conf. Comput. Intell. Secur. (CIS)</source> <volume>1</volume>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/cis.2008.175">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Robust+hashing+based+on+persistent+points+for+video+copy+detection&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B133">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Robust hashing for image authentication using zernike moments and local features</article-title>. <source>IEEE Trans. Inf. Forensic. Secur.</source> <volume>8</volume> (<issue>1</issue>), <fpage>55</fpage>&#x2013;<lpage>63</lpage>. <pub-id pub-id-type="doi">10.1109/tifs.2012.2223680</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tifs.2012.2223680">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Robust+hashing+for+image+authentication+using+zernike+moments+and+local+features&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B134">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhixiang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Nonlinear structural hashing for scalable video search</article-title>. <source>IEEE Trans. Circuits Syst. Video Technol.</source> <volume>28</volume>, <fpage>1421</fpage>&#x2013;<lpage>1433</lpage>. <pub-id pub-id-type="doi">10.1109/tcsvt.2017.2669095</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/tcsvt.2017.2669095">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Nonlinear+structural+hashing+for+scalable+video+search&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B135">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Pun</surname>
<given-names>C-M.</given-names>
</name>
<name>
<surname>Tong</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>News image steganography: A novel architecture facilitates the fake news identification</article-title>,&#x201d; in <source>IEEE international conference on visual communications and image processing (VCIP)</source>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/vcip49819.2020.9301846">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=News+image+steganography:+A+novel+architecture+facilitates+the+fake+news+identification&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
<ref id="B136">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>C. N.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Video copy detection using spatio-temporal CNN features</article-title>. <source>IEEE Access</source> <volume>7</volume>, <fpage>100658</fpage>&#x2013;<lpage>100665</lpage>. <pub-id pub-id-type="doi">10.1109/access.2019.2930173</pub-id> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/access.2019.2930173">CrossRef Full Text</ext-link> &#x7c; <ext-link ext-link-type="uri" xlink:href="https://scholar.google.com/scholar?hl=en&#x0026;as_sdt=0%2C5&#x0026;q=Video+copy+detection+using+spatio-temporal+CNN+features&#x0026;btnG=">Google Scholar</ext-link>
</citation>
</ref>
</ref-list>
<sec id="s10">
<title>Glossary</title>
<def-list>
<def-item>
<term id="G1-frsip.2022.984169">
<bold>AUC</bold>
</term>
<def>
<p>Area Under the Curve</p>
</def>
</def-item>
<def-item>
<term id="G2-frsip.2022.984169">
<bold>BoVW</bold>
</term>
<def>
<p>Bag of Visual Words</p>
</def>
</def-item>
<def-item>
<term id="G3-frsip.2022.984169">
<bold>BRIEF</bold>
</term>
<def>
<p>Binary Robust Independent Elementary Features</p>
</def>
</def-item>
<def-item>
<term id="G4-frsip.2022.984169">
<bold>CCV</bold>
</term>
<def>
<p>Columbia Consumer Video</p>
</def>
</def-item>
<def-item>
<term id="G5-frsip.2022.984169">
<bold>CDVA</bold>
</term>
<def>
<p>Compact Descriptors for Video Analysis</p>
</def>
</def-item>
<def-item>
<term id="G6-frsip.2022.984169">
<bold>CDVS</bold>
</term>
<def>
<p>Compact Descriptors for Visual Search</p>
</def>
</def-item>
<def-item>
<term id="G7-frsip.2022.984169">
<bold>CEDH</bold>
</term>
<def>
<p>Classification-Enhancement Deep Hashing</p>
</def>
</def-item>
<def-item>
<term id="G8-frsip.2022.984169">
<bold>CGO</bold>
</term>
<def>
<p>Centroids of Gradient Orientations</p>
</def>
</def-item>
<def-item>
<term id="G9-frsip.2022.984169">
<bold>CNN</bold>
</term>
<def>
<p>Convolutional Neural Network</p>
</def>
</def-item>
<def-item>
<term id="G10-frsip.2022.984169">
<bold>CPU</bold>
</term>
<def>
<p>Central Processing Unit</p>
</def>
</def-item>
<def-item>
<term id="G11-frsip.2022.984169">
<bold>CRBM</bold>
</term>
<def>
<p>Conditional Restricted Boltzmann Machine</p>
</def>
</def-item>
<def-item>
<term id="G12-frsip.2022.984169">
<bold>CS-LBP</bold>
</term>
<def>
<p>Center-Symmetric Local Binary Patterns</p>
</def>
</def-item>
<def-item>
<term id="G13-frsip.2022.984169">
<bold>DCT</bold>
</term>
<def>
<p>Discrete Cosine Transform</p>
</def>
</def-item>
<def-item>
<term id="G14-frsip.2022.984169">
<bold>DeepH</bold>
</term>
<def>
<p>Deep Hashing</p>
</def>
</def-item>
<def-item>
<term id="G15-frsip.2022.984169">
<bold>DML</bold>
</term>
<def>
<p>Deep Metric Learning</p>
</def>
</def-item>
<def-item>
<term id="G16-frsip.2022.984169">
<bold>DOP</bold>
</term>
<def>
<p>Double Optimal Projection</p>
</def>
</def-item>
<def-item>
<term id="G17-frsip.2022.984169">
<bold>DRF</bold>
</term>
<def>
<p>Deep Representation Fingerprint</p>
</def>
</def-item>
<def-item>
<term id="G18-frsip.2022.984169">
<bold>DWT</bold>
</term>
<def>
<p>Discrete Wavelet Transform</p>
</def>
</def-item>
<def-item>
<term id="G19-frsip.2022.984169">
<bold>FAST</bold>
</term>
<def>
<p>Features from Accelerated Segment Test</p>
</def>
</def-item>
<def-item>
<term id="G20-frsip.2022.984169">
<inline-formula id="inf23">
<mml:math id="m25">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</term>
<def>
<p>
<inline-formula id="inf24">
<mml:math id="m26">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> score</p>
</def>
</def-item>
<def-item>
<term id="G21-frsip.2022.984169">
<bold>FCVID</bold>
</term>
<def>
<p>Fudan-Columbia Video Dataset</p>
</def>
</def-item>
<def-item>
<term id="G22-frsip.2022.984169">
<bold>
<italic>FPR</italic>
</bold>
</term>
<def>
<p>False Positive Rate</p>
</def>
</def-item>
<def-item>
<term id="G23-frsip.2022.984169">
<bold>GPU</bold>
</term>
<def>
<p>Graphics Processing Unit</p>
</def>
</def-item>
<def-item>
<term id="G24-frsip.2022.984169">
<bold>HetConv-MK</bold>
</term>
<def>
<p>Heterogeneous Convolutional Multi-Kernel</p>
</def>
</def-item>
<def-item>
<term id="G25-frsip.2022.984169">
<bold>HMDB</bold>
</term>
<def>
<p>Human Motion Database</p>
</def>
</def-item>
<def-item>
<term id="G26-frsip.2022.984169">
<bold>HOG</bold>
</term>
<def>
<p>Histogram of Oriented Gradients</p>
</def>
</def-item>
<def-item>
<term id="G27-frsip.2022.984169">
<bold>LCS</bold>
</term>
<def>
<p>Longest Common Subsequence</p>
</def>
</def-item>
<def-item>
<term id="G28-frsip.2022.984169">
<bold>LRF</bold>
</term>
<def>
<p>Low-level Representation Fingerprint</p>
</def>
</def-item>
<def-item>
<term id="G29-frsip.2022.984169">
<bold>LSH</bold>
</term>
<def>
<p>Locality Sensitive Hashing</p>
</def>
</def-item>
<def-item>
<term id="G30-frsip.2022.984169">
<bold>LSTM</bold>
</term>
<def>
<p>Long Short-Term Memory</p>
</def>
</def-item>
<def-item>
<term id="G31-frsip.2022.984169">
<bold>mAP</bold>
</term>
<def>
<p>mean Average Precision</p>
</def>
</def-item>
<def-item>
<term id="G32-frsip.2022.984169">
<bold>ML</bold>
</term>
<def>
<p>Machine Learning</p>
</def>
</def-item>
<def-item>
<term id="G33-frsip.2022.984169">
<bold>MLP</bold>
</term>
<def>
<p>Multi-Layer Perceptron</p>
</def>
</def-item>
<def-item>
<term id="G34-frsip.2022.984169">
<bold>NDCR</bold>
</term>
<def>
<p>Normalized Detection Cost Rate</p>
</def>
</def-item>
<def-item>
<term id="G35-frsip.2022.984169">
<bold>NIST</bold>
</term>
<def>
<p>National Institute of Standards and Technology</p>
</def>
</def-item>
<def-item>
<term id="G36-frsip.2022.984169">
<bold>NIP</bold>
</term>
<def>
<p>Nested Invariance Pooling</p>
</def>
</def-item>
<def-item>
<term id="G37-frsip.2022.984169">
<bold>NN</bold>
</term>
<def>
<p>Neural Network</p>
</def>
</def-item>
<def-item>
<term id="G38-frsip.2022.984169">
<bold>ORB descriptor</bold>
</term>
<def>
<p>Oriented FAST and Rotated BRIEF descriptor</p>
</def>
</def-item>
<def-item>
<term id="G39-frsip.2022.984169">
<inline-formula id="inf25">
<mml:math id="m27">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>a</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</term>
<def>
<p>Probability of false alarm</p>
</def>
</def-item>
<def-item>
<term id="G40-frsip.2022.984169">
<inline-formula id="inf26">
<mml:math id="m28">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</term>
<def>
<p>Probability of missed detection</p>
</def>
</def-item>
<def-item>
<term id="G41-frsip.2022.984169">
<bold>PCA</bold>
</term>
<def>
<p>Principal Component Analysis</p>
</def>
</def-item>
<def-item>
<term id="G42-frsip.2022.984169">
<bold>
<italic>Prec</italic>
</bold>
</term>
<def>
<p>Precision</p>
</def>
</def-item>
<def-item>
<term id="G43-frsip.2022.984169">
<bold>RAQ</bold>
</term>
<def>
<p>Randomized Adaptive Quantizer</p>
</def>
</def-item>
<def-item>
<term id="G44-frsip.2022.984169">
<bold>
<italic>Rec</italic>
</bold>
</term>
<def>
<p>Recall</p>
</def>
</def-item>
<def-item>
<term id="G45-frsip.2022.984169">
<bold>RMI</bold>
</term>
<def>
<p>Relative Mean Intensity</p>
</def>
</def-item>
<def-item>
<term id="G46-frsip.2022.984169">
<bold>RNN</bold>
</term>
<def>
<p>Recurrent Neural Network</p>
</def>
</def-item>
<def-item>
<term id="G47-frsip.2022.984169">
<bold>ROC</bold>
</term>
<def>
<p>Receiver Operating Characteristic</p>
</def>
</def-item>
<def-item>
<term id="G48-frsip.2022.984169">
<bold>SCNN</bold>
</term>
<def>
<p>Siamese Convolutional Neural Network</p>
</def>
</def-item>
<def-item>
<term id="G49-frsip.2022.984169">
<bold>SDH</bold>
</term>
<def>
<p>Supervised Discrete Hashing</p>
</def>
</def-item>
<def-item>
<term id="G50-frsip.2022.984169">
<bold>SIFT</bold>
</term>
<def>
<p>Scale-Invariant Feature Transform</p>
</def>
</def-item>
<def-item>
<term id="G51-frsip.2022.984169">
<bold>SSCA</bold>
</term>
<def>
<p>Sub-Band Coefficient Amplitudes</p>
</def>
</def-item>
<def-item>
<term id="G52-frsip.2022.984169">
<bold>SURF</bold>
</term>
<def>
<p>Speeded Up Robust Features</p>
</def>
</def-item>
<def-item>
<term id="G53-frsip.2022.984169">
<bold>TF-IDF</bold>
</term>
<def>
<p>Term Frequency&#x2013;Inverse Document Frequency</p>
</def>
</def-item>
<def-item>
<term id="G54-frsip.2022.984169">
<bold>TLS</bold>
</term>
<def>
<p>Transport Layer Security</p>
</def>
</def-item>
<def-item>
<term id="G55-frsip.2022.984169">
<bold>TRECVID</bold>
</term>
<def>
<p>TREC Video Retrieval Evaluation</p>
</def>
</def-item>
<def-item>
<term id="G56-frsip.2022.984169">
<bold>VCDB</bold>
</term>
<def>
<p>Large-Scale Video Copy Detection Database</p>
</def>
</def-item>
<def-item>
<term id="G57-frsip.2022.984169">
<bold>VWII</bold>
</term>
<def>
<p>Visual Word Inverted Index</p>
</def>
</def-item>
<def-item>
<term id="G58-frsip.2022.984169">
<bold>WPA-2</bold>
</term>
<def>
<p>Wi-Fi Protected Access 2</p>
</def>
</def-item>
</def-list>
</sec>
</back>
</article>