<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="other" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">768516</article-id>
<article-id pub-id-type="doi">10.3389/frai.2021.768516</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Hypothesis and Theory</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Visual Features and Their Own Optical Flow</article-title>
<alt-title alt-title-type="left-running-head">Betti et&#x20;al.</alt-title>
<alt-title alt-title-type="right-running-head">Visual Features and Their Flow</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Betti</surname>
<given-names>Alessandro</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1515441/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Boccignone</surname>
<given-names>Giuseppe</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/66867/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Faggi</surname>
<given-names>Lapo</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1462353/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Gori</surname>
<given-names>Marco</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff4">
<sup>4</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1319638/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Melacci</surname>
<given-names>Stefano</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1110004/overview"/>
</contrib>
</contrib-group>
<aff id="aff1">
<label>
<sup>1</sup>
</label>Department of Information Engineering and Mathematics, Universit&#xe0; degli Studi di Siena, <addr-line>Siena</addr-line>, <country>Italy</country>
</aff>
<aff id="aff2">
<label>
<sup>2</sup>
</label>PHuSe Lab, Department of Computer Science, Universit&#xe0; degli Studi di Milano, <addr-line>Milan</addr-line>, <country>Italy</country>
</aff>
<aff id="aff3">
<label>
<sup>3</sup>
</label>Department of Information Engineering, Universit&#xe0; degli Studi di Firenze, <addr-line>Firenze</addr-line>, <country>Italy</country>
</aff>
<aff id="aff4">
<label>
<sup>4</sup>
</label>Universit&#xe9; C&#xf4;te d&#x2019;Azur, Inria, CNRS, I3S, Maasai, <addr-line>Sophia-Antipolis</addr-line>, <country>France</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/587453/overview">Andrea Tacchetti</ext-link>, DeepMind Technologies Limited, United&#x20;Kingdom</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/814426/overview">Raffaello Camoriano</ext-link>, Italian Institute of Technology (IIT), Italy</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1479855/overview">Chiyuan Zhang</ext-link>, Google, United&#x20;States</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Lapo Faggi, <email>lapo.faggi@unifi.it</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Machine Learning and Artificial Intelligence, a section of the journal Frontiers in Artificial Intelligence</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>01</day>
<month>12</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>4</volume>
<elocation-id>768516</elocation-id>
<history>
<date date-type="received">
<day>31</day>
<month>08</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>25</day>
<month>10</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2021 Betti, Boccignone, Faggi, Gori and Melacci.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Betti, Boccignone, Faggi, Gori and Melacci</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these&#x20;terms.</p>
</license>
</permissions>
<abstract>
<p>Symmetries, invariances and conservation equations have always been an invaluable guide in Science to model natural phenomena through simple yet effective relations. For instance, in computer vision, translation equivariance is typically a built-in property of neural architectures that are used to solve visual tasks; networks with computational layers implementing such a property are known as Convolutional Neural Networks (CNNs). This kind of mathematical symmetry, as well as many others that have been recently studied, are typically generated by some underlying group of transformations (translations in the case of CNNs, rotations, etc.) and are particularly suitable to process highly structured data such as molecules or chemical compounds which are known to possess those specific symmetries. When dealing with video streams, common built-in equivariances are able to handle only a small fraction of the broad spectrum of transformations encoded in the visual stimulus and, therefore, the corresponding neural architectures have to resort to a huge amount of supervision in order to achieve good generalization capabilities. In the paper we formulate a theory on the development of visual features that is based on the idea that movement itself provides trajectories on which to impose consistency. We introduce the principle of Material Point Invariance which states that each visual feature is invariant with respect to the associated optical flow, so that features and corresponding velocities are an indissoluble pair. Then, we discuss the interaction of features and velocities and show that certain motion invariance traits could be regarded as a generalization of the classical concept of affordance. 
These analyses of feature-velocity interactions and their invariance properties lead to a <italic>visual field theory</italic> which expresses the dynamical constraints of motion coherence and might allow us to discover the joint evolution of the visual features along with the associated optical&#x20;flows.</p>
</abstract>
<kwd-group>
<kwd>affordance</kwd>
<kwd>convolutional neural networks</kwd>
<kwd>feature flow</kwd>
<kwd>motion invariance</kwd>
<kwd>optical flow</kwd>
<kwd>transport equation</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Deep learning has revolutionized computer vision and visual perception. Amongst others, the great representational power of convolutional neural networks and the elegance and efficiency of Backpropagation have played a crucial role (<xref ref-type="bibr" rid="B30">Krizhevsky et&#x20;al., 2012</xref>). By and large, there is a strong scientific recognition of their capabilities, which is very well deserved. However, an important but often overlooked aspect is that natural images are swamped by nuisance factors such as lighting, viewpoint, part deformation and background, which make the overall recognition problem much more difficult (<xref ref-type="bibr" rid="B31">Lee and Soatto, 2011</xref>; <xref ref-type="bibr" rid="B2">Anselmi et&#x20;al., 2016</xref>). Typical CNN architectures, which do not structurally model these variations, require large amounts of highly variable data to achieve satisfactory generalization. Some recent works have addressed this aspect by focusing on the construction of features that are invariant (<xref ref-type="bibr" rid="B20">Gens and Domingos, 2014</xref>; <xref ref-type="bibr" rid="B2">Anselmi et&#x20;al., 2016</xref>) or equivariant (<xref ref-type="bibr" rid="B13">Cohen and Welling, 2016</xref>) with respect to a priori specified symmetry groups of transformations. We argue that, when relying on massively supervised learning, we have been working on a problem that is, from a computational point of view, remarkably different from, and likely more difficult than, the one offered by Nature, where motion is in fact in charge of generating visual information. Motion is what offers us an object in all its poses. Classic translation, scale, and rotation invariances can clearly be obtained by appropriate movements of a given object (<xref ref-type="bibr" rid="B8">Betti et&#x20;al., 2020</xref>). 
However, the visual interactions induced by motion go well beyond the need for these invariances: they include object deformation as well as occlusion. Could it be that motion is in fact nearly all we need for learning to see? Current deep learning approaches based on supervised images mostly neglect the crucial role of temporal coherence, ending up with problems where the extraction of visual concepts can only be based on spatial regularities. Temporal coherence plays a fundamental role in extracting meaningful visual features (<xref ref-type="bibr" rid="B34">Mobahi et&#x20;al., 2009</xref>; <xref ref-type="bibr" rid="B47">Zou et&#x20;al., 2011</xref>; <xref ref-type="bibr" rid="B43">Wang and Gupta, 2015</xref>; <xref ref-type="bibr" rid="B35">Pan et&#x20;al., 2016</xref>; <xref ref-type="bibr" rid="B38">Redondo-Cabrera and Lopez-Sastre, 2019</xref>) and, more specifically, in video-based tasks such as video compression (<xref ref-type="bibr" rid="B9">Bhaskaran and Konstantinides, 1997</xref>). Some of these video-oriented works specifically focus on disentangling content features (constant within the selected video clip) from pose and motion features (which encode information varying over time) (<xref ref-type="bibr" rid="B14">Denton and Birodkar, 2017</xref>; <xref ref-type="bibr" rid="B42">Villegas et&#x20;al., 2017</xref>; <xref ref-type="bibr" rid="B27">Hsieh et&#x20;al., 2018</xref>; <xref ref-type="bibr" rid="B39">Tulyakov et&#x20;al., 2018</xref>; <xref ref-type="bibr" rid="B44">Wang et&#x20;al., 2020</xref>). The problem of learning high-level features consistent with the way objects move was also addressed by <xref ref-type="bibr" rid="B36">Pathak et&#x20;al. (2017)</xref> in the context of unsupervised foreground versus background segmentation.</p>
<p>In this vein, we claim that feature learning arises mostly from motion invariance principles, which turn out to be fundamental for detecting object identity as well as for characterizing interactions between the features themselves. To understand this, let us start by considering a moving object in a given visual scene. The object can be thought of as made up of different material points, each with its own identity that does not change during the object&#x2019;s motion. Consequently, the identity of the corresponding pixels must also remain constant along their apparent motion on the retina. We will express this idea in terms of feature fields (i.e., functions of a given pixel and time instant) that are invariant along the trajectories defined by their <italic>conjugate</italic> velocity fields, extending, in turn, the classical brightness invariance principle used for optical flow estimation (<xref ref-type="bibr" rid="B26">Horn and Schunck, 1981</xref>). Visual features and the corresponding optical flow fields make up an indissoluble pair linked by the motion invariance condition that drives the entire learning process. Each change in the visual features affects the associated velocity fields and vice versa. From a biological standpoint, recent studies have suggested that the ventral and dorsal pathways may not be as independent as originally thought (<xref ref-type="bibr" rid="B33">Milner, 2017</xref>). Following this insight, we endorse the joint discovery of visual features and the related optical flows, pairing their learning through a motion invariance constraint. Motion information confers not only object identity but also affordance. 
As defined by Gibson in his seminal work (<xref ref-type="bibr" rid="B21">Gibson, 1966</xref>, <xref ref-type="bibr" rid="B22">1979</xref>), affordances essentially characterize the relation between an agent and its environment and, given a certain object, correspond to the possible actions that can be executed upon it. A chair, for example, offers the affordance of seating a human being, but it can have other potential uses. In other words, the way an agent interacts with a particular object is what defines its affordance, and this is strictly related to their relative motion. Extending and generalizing this classic notion of affordance to visual features, we will define the notion of affordance field, describing the interaction between pairs of visual features. Essentially, these interactions are defined by the relative motion of the features themselves so that the corresponding affordance fields will be required to be invariant with respect to such relative motion. Hence, in the rest of the paper, we will use the term affordance in this broader&#x20;sense.</p>
<p>This paper is organized as follows. <xref ref-type="sec" rid="s2">Section 2</xref> focuses on classical methods for optical flow estimation. In this case, the brightness is given by the input video and the goal is to determine the corresponding optical flow through the <italic>brightness invariance condition</italic>. Typical regularization issues, necessary to specify a unique velocity field, are also addressed. <xref ref-type="sec" rid="s3">Section 3</xref> is devoted to extending the previous approach to visual features. This time, features are not given in advance but are learnt jointly with the corresponding velocity fields. Features and velocities are tied by the motion invariance principle. After that, the classical notion of affordance by <xref ref-type="bibr" rid="B21">Gibson (1966)</xref>, <xref ref-type="bibr" rid="B22">Gibson (1979)</xref> is introduced and extended to the case of visual features. In this case too, motion invariance (with respect to relative velocities) plays a pivotal role in defining the corresponding affordance fields. At the end of <xref ref-type="sec" rid="s3">Section 3</xref>, regularization issues are also considered and a formulation of the learning of the visual fields is sketched out, together with a description of a possible practical implementation of the proposed ideas through deep neural networks. Finally, <xref ref-type="sec" rid="s4">Section 4</xref> draws some conclusions.</p>
</sec>
<sec id="s2">
<title>2 Optical Flow</title>
<p>The fundamental problem of optical flow estimation has received a lot of attention in computer vision. Despite growing evidence of performance improvements (<xref ref-type="bibr" rid="B18">Fischer et&#x20;al., 2015</xref>; <xref ref-type="bibr" rid="B28">Ilg et&#x20;al., 2017</xref>; <xref ref-type="bibr" rid="B46">Zhai et&#x20;al., 2021</xref>), the precise definition of the velocity to be attributed to each pixel remains debatable (<xref ref-type="bibr" rid="B41">Verri, 1987</xref>; <xref ref-type="bibr" rid="B40">Verri and Poggio, 1989</xref>; <xref ref-type="bibr" rid="B4">Aubert and Kornprobst, 2006</xref>). While simple visual inspection of recent top-level optical flow estimation systems clearly indicates remarkable performance, the definition of &#x201c;optical flow&#x201d; is difficult and quite slippery. Basically, we need to associate each pixel with its velocity. Given temporal quantization, any sound definition of such a velocity requires knowing where each pixel moves in the next frame. How can we track a single pixel? Clearly, each pixel corresponds to a &#x201c;point&#x201d; of an &#x201c;object&#x201d; in the visual environment, and the fundamental requirement is to track that point of the object.</p>
<p>An enlightening answer to this question was given by Horn and Schunck in a seminal paper published at the beginning of the eighties (<xref ref-type="bibr" rid="B26">Horn and Schunck, 1981</xref>). The basic assumption is the local-in-time constancy of the <italic>brightness</italic> intensity function <italic>I</italic>: &#x3a9; &#xd7; [0, <italic>T</italic>] &#x2192; [0, 1] where &#x3a9; is a subset of <inline-formula id="inf1">
<mml:math id="m1">
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>. In other words, &#x2200;<italic>t</italic>
<sub>0</sub> &#x3e; 0 there exists a <italic>&#x3c4;</italic> &#x3e; 0 such that for every <italic>x</italic>
<sub>0</sub> &#x2208; &#x3a9; we can define the trajectory <inline-formula id="inf2">
<mml:math id="m2">
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>:</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3c4;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mo>&#x2192;</mml:mo>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
</mml:math>
</inline-formula> that maps <inline-formula id="inf3">
<mml:math id="m3">
<mml:mi>t</mml:mi>
<mml:mo>&#x21a6;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
</mml:math>
</inline-formula> for which<disp-formula id="e1">
<mml:math id="m4">
<mml:mtable class="aligned">
<mml:mtr>
<mml:mtd columnalign="right"/>
<mml:mtd columnalign="left">
<mml:mi>I</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>I</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mspace width="2em"/>
<mml:mo>&#x2200;</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3c4;</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>;</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="right"/>
<mml:mtd columnalign="left">
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<label>(1)</label>
</disp-formula>
</p>
<p>Assuming smoothness, we can approximate this condition to first order, considering only infinitesimal temporal distances, and obtain at <italic>t</italic>&#x20;&#x3d; <italic>t</italic>
<sub>0</sub>:<disp-formula id="e2">
<mml:math id="m5">
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>I</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>u</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:mi>I</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
</mml:math>
<label>(2)</label>
</disp-formula>where <inline-formula id="inf4">
<mml:math id="m6">
<mml:mi>u</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2254;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>/</mml:mo>
<mml:mi>d</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> is the optical flow and &#x22c5; is the standard scalar product in&#x20;<inline-formula id="inf5">
<mml:math id="m7">
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>.</p>
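<p>To make the linearized invariance condition concrete, its residual can be checked numerically on a synthetic pattern translating with a known velocity; the following sketch (grid size, pattern, and finite-difference scheme are illustrative choices of ours, not part of the original formulation) shows that the residual vanishes up to discretization error:</p>

```python
import numpy as np

# Synthetic brightness I(x, t) = f(x - v t): a smooth pattern rigidly translating
# with constant velocity (vx, vy), so it satisfies brightness invariance exactly.
vx, vy = 1.0, 0.5
ys, xs = np.mgrid[0:64, 0:64].astype(float)

def frame(t):
    # Sinusoidal pattern advected by the flow (vx, vy).
    return np.sin(0.2 * (xs - vx * t)) * np.cos(0.3 * (ys - vy * t))

I0, I1 = frame(0.0), frame(1.0)

# np.gradient returns derivatives along axis 0 (y) and axis 1 (x), in that order.
Iy, Ix = np.gradient(I0)
It = I1 - I0  # forward temporal difference

# Residual of the linearized brightness-constancy condition: It + u . grad(I).
residual = It + vx * Ix + vy * Iy
print(np.abs(residual).max())  # small: only finite-difference error remains
```

<p>With the exact velocity plugged in, the residual is dominated by the truncation error of the finite differences; any other velocity field with a different component along the gradient would leave a residual of order one.</p>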
<p>Assumption <xref ref-type="disp-formula" rid="e1">Eq. 1</xref> is reasonable when there are no occlusions and changes of the light source are &#x201c;small.&#x201d; Of course, in real-world applications of computer vision these conditions are not always met. On the other hand, it is clear that the optical flow <italic>u</italic> could be derived from an invariance condition of the type <xref ref-type="disp-formula" rid="e1">Eq. 1</xref> applied to different, and possibly more &#x201c;stable,&#x201d; visual features rather than to the brightness itself. As shown in <xref ref-type="fig" rid="F1">Figure&#x20;1</xref>, this would yield an optical flow different from the one defined through the brightness invariance condition (<xref ref-type="fig" rid="F1">Figure&#x20;1C</xref>). For example, a feature responding to the entire barber&#x2019;s pole, which is standing still, would have an associated optical flow that is null everywhere (<xref ref-type="fig" rid="F1">Figure&#x20;1D</xref>). Still, we have to keep in mind that in both cases the resulting optical flow is different from the 2-D motion field (defined as the projection on the image plane of the 3-D velocity of the visual scene, see e.g. <xref ref-type="bibr" rid="B4">Aubert and Kornprobst (2006)</xref>) shown in <xref ref-type="fig" rid="F1">Figure&#x20;1B</xref>.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Barber&#x2019;s pole example. <bold>(A)</bold> The 3-D object spinning counterclockwise. <bold>(B)</bold> The 2-D projection of the pole and the projected velocity on the retina &#x3a9;. <bold>(C)</bold> The brightness of the image and its optical flow pointing downwards. <bold>(D)</bold> A feature map that responds to the object and its conjugate (zero) optical flow.</p>
</caption>
<graphic xlink:href="frai-04-768516-g001.tif"/>
</fig>
<p>This is indeed the main motivation to couple feature extraction with motion invariance constraints and with the derivation of robust, meaningful optical flows associated with those visual features.</p>
<sec id="s2-1">
<title>2.1 Regularization of the Optical Flow</title>
<p>Before laying out the theory for the extraction of motion-invariant visual features, we need to recall some facts about the optical flow condition <xref ref-type="disp-formula" rid="e2">Eq. 2</xref>. Given a video stream described by its brightness intensity <italic>I</italic>, defined as in <xref ref-type="sec" rid="s1">Section 1</xref>, the problem of finding, for each pixel of the spatial support of the frame and at each time instant, the velocity field <italic>u</italic>(<italic>x</italic>, <italic>t</italic>) satisfying<disp-formula id="e3">
<mml:math id="m8">
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>I</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>u</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:mi>I</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mspace width="1em"/>
<mml:mo>&#x2200;</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(3)</label>
</disp-formula>is clearly ill posed since a scalar equation is not sufficient to properly constrain the two components of <italic>u</italic>. Locally, we can unequivocally determine only the component of <italic>u</italic> along &#x2207;<italic>I</italic>.</p>
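<p>A minimal numerical illustration of this under-determination (the per-pixel derivative values below are illustrative, not taken from any experiment): only the component of the flow along the brightness gradient, the so-called normal flow, is fixed by the scalar constraint, while any tangential component leaves it satisfied.</p>

```python
import numpy as np

# At a single pixel, the scalar constraint It + u . grad(I) = 0 determines only
# the component of u along grad(I); the tangential component is unobservable.
Ix, Iy, It = 0.8, 0.6, -0.5              # illustrative per-pixel derivatives
g = np.array([Ix, Iy])                   # brightness gradient
n = g / np.linalg.norm(g)                # unit vector along the gradient

u_normal = (-It / np.linalg.norm(g)) * n # the uniquely determined component

# Every flow u = u_normal + s * t_hat, with t_hat orthogonal to the gradient,
# satisfies the very same constraint, for any scalar s:
t_hat = np.array([-n[1], n[0]])
for s in (0.0, 1.0, -3.7):
    u = u_normal + s * t_hat
    assert abs(It + u @ g) < 1e-12
print(u_normal)
```

<p>The one-parameter family of solutions is exactly the ambiguity that the regularization term below is meant to resolve.</p>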
<p>Although many methods have been proposed to overcome this issue, usually referred to as the <italic>aperture problem</italic> (see for example the work of <xref ref-type="bibr" rid="B4">Aubert and Kornprobst (2006)</xref>), here we are interested in the class of approaches that aim at regularizing the optical flow:<disp-formula id="e4">
<mml:math id="m9">
<mml:munder>
<mml:mrow>
<mml:mi>inf</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>I</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>S</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>I</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(4)</label>
</disp-formula>where <italic>A</italic>
<sub>
<italic>I</italic>
</sub> is a functional that enforces constraint <xref ref-type="disp-formula" rid="e3">Eq. 3</xref>; a standard choice is<disp-formula id="e5">
<mml:math id="m10">
<mml:msub>
<mml:mrow>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>I</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2254;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo>&#x222b;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>I</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:mi>I</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mspace width="0.17em"/>
<mml:mi>d</mml:mi>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
</mml:math>
<label>(5)</label>
</disp-formula>
</p>
<p>
<italic>S</italic> imposes smoothness, and <italic>H</italic><sub><italic>I</italic></sub> may be used to condition the extraction of the flow over spatially homogeneous regions. Depending on the regularity assumptions on <italic>I</italic> and the properties that we want to impose on the solution of this regularized problem (namely, whether or not we admit solutions that preserve discontinuities), the exact form of <italic>S</italic> and the form and presence of <italic>H</italic>
<sub>
<italic>I</italic>
</sub> may vary. For example, in the original approach proposed in <xref ref-type="bibr" rid="B26">Horn and Schunck (1981)</xref> we find:<disp-formula id="e6">
<mml:math id="m11">
<mml:mtext>Horn&#x2013;Schunck&#x2009;regularization</mml:mtext>
<mml:mo>:</mml:mo>
<mml:mi>S</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo>&#x222b;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2b;</mml:mo>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mi>d</mml:mi>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mspace width="1em"/>
<mml:msub>
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>I</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2261;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>.</mml:mo>
</mml:math>
<label>(6)</label>
</disp-formula>
</p>
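To make the regularized minimization concrete, the following is a minimal NumPy sketch of the classical Horn&#x2013;Schunck fixed-point iteration, combining the data term of Eq. 5 with the smoothness term of Eq. 6. This is an illustrative implementation with hypothetical parameter names (e.g., <monospace>alpha</monospace> weighting the smoothness term), not the authors' code.

```python
import numpy as np

def horn_schunck(I1, I2, alpha=1.0, n_iter=100):
    """Sketch of Horn-Schunck: minimize the Eq. 5 data term plus the
    Eq. 6 smoothness term via the classical fixed-point iteration."""
    # Spatio-temporal derivatives by finite differences.
    Ix = np.gradient(I1, axis=1)
    Iy = np.gradient(I1, axis=0)
    It = I2 - I1
    u = np.zeros_like(I1)
    v = np.zeros_like(I1)

    def avg(f):  # 4-neighbour average (periodic borders for brevity)
        return 0.25 * (np.roll(f, 1, 0) + np.roll(f, -1, 0)
                       + np.roll(f, 1, 1) + np.roll(f, -1, 1))

    for _ in range(n_iter):
        u_bar, v_bar = avg(u), avg(v)
        # Update derived from the Euler-Lagrange equations of the
        # regularized functional.
        num = Ix * u_bar + Iy * v_bar + It
        den = alpha + Ix ** 2 + Iy ** 2
        u = u_bar - Ix * num / den
        v = v_bar - Iy * num / den
    return u, v

# A bright square translating one pixel to the right.
I1 = np.zeros((32, 32))
I1[12:20, 10:18] = 1.0
I2 = np.roll(I1, 1, axis=1)
u, v = horn_schunck(I1, I2)
```

The smoothness term diffuses the flow estimated at the square's edges into its interior, so the recovered horizontal component is positive over the moving region.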
<p>Note that since we are interested in extracting the optical flow for any frame of a video, namely the field <italic>u</italic>(<italic>x</italic>, <italic>t</italic>), it is useful for any function <inline-formula id="inf6">
<mml:math id="m12">
<mml:mi>f</mml:mi>
<mml:mo>:</mml:mo>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mo>&#x2192;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> to define <inline-formula id="inf7">
<mml:math id="m13">
<mml:msup>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>:</mml:mo>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
<mml:mo>&#x2192;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> as <italic>f</italic> <sup>
<italic>t</italic>
</sup>(<italic>x</italic>)&#x2254;<italic>f</italic>(<italic>x</italic>, <italic>t</italic>). With this notation, when the infimum in <xref ref-type="disp-formula" rid="e4">Eq. 4</xref> is attained, we can write<disp-formula id="e7">
<mml:math id="m14">
<mml:mi>u</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mi>arg</mml:mi>
<mml:mspace width="0.17em"/>
<mml:mi>min</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>I</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>S</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>I</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<label>(7)</label>
</disp-formula>
</p>
<p>Notice that the brightness might not necessarily be the ideal signal to track. Since the brightness can be expressed as a weighted average of the red <italic>R</italic>, green <italic>G</italic>, and blue <italic>B</italic> components, one could think of tracking each color component of the video signal by using the same invariance principle stated by <xref ref-type="disp-formula" rid="e3">Eq. 3</xref>. It could in fact be the case that one or more of the components <italic>R</italic>, <italic>G</italic>, <italic>B</italic> are more invariant in the sense of <xref ref-type="disp-formula" rid="e3">Eq. 3</xref> during the motion of the corresponding material point, see <xref ref-type="fig" rid="F2">Figure&#x20;2</xref>. In that case, in general, each color can be associated with corresponding optical flows <italic>v</italic>
<sub>
<italic>R</italic>
</sub>, <italic>v</italic>
<sub>
<italic>G</italic>
</sub>, <italic>v</italic>
<sub>
<italic>B</italic>
</sub> that might differ. In doing so, one tracks the individual color components instead of the brightness. If, instead, we assume that each color component has the same optical flow <italic>v</italic>&#x20;&#x3d; <italic>v</italic>
<sub>
<italic>R</italic>
</sub> &#x3d; <italic>v</italic>
<sub>
<italic>G</italic>
</sub> &#x3d; <italic>v</italic>
<sub>
<italic>B</italic>
</sub>, we have<disp-formula id="e8">
<mml:math id="m15">
<mml:mfrac>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mtable class="matrix">
<mml:mtr>
<mml:mtd columnalign="center">
<mml:mi>R</mml:mi>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="center">
<mml:mi>G</mml:mi>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="center">
<mml:mi>B</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mtable class="matrix">
<mml:mtr>
<mml:mtd columnalign="center">
<mml:mi>R</mml:mi>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="center">
<mml:mi>G</mml:mi>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="center">
<mml:mi>B</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
</mml:math>
<label>(8)</label>
</disp-formula>where <italic>v</italic>&#x20;&#x22c5;&#x2207;(<italic>R</italic>,<italic>G</italic>,<italic>B</italic>)&#x2032;&#x2254;(<italic>v</italic>&#x20;&#x22c5;&#x2207;<italic>R</italic>,<italic>v</italic>&#x20;&#x22c5;&#x2207;<italic>G</italic>,<italic>v</italic>&#x20;&#x22c5;&#x2207;<italic>B</italic>)&#x2032;. It is worth mentioning that the simultaneous tracking of different channels might contribute to a better posedness of the problem since, in general, rank&#x2207;(<italic>R</italic>, <italic>G</italic>, <italic>B</italic>)&#x2032; &#x3d; 2 and the system <xref ref-type="disp-formula" rid="e8">Eq. 8</xref> admits a unique solution.<xref ref-type="fn" rid="fn1">
<sup>1</sup>
</xref> One can think of the color components as features that, unlike classical convolutional spatial features, are temporal features.</p>
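The pointwise system of Eq. 8, with three equations and two unknowns, can be sketched as a per-pixel least-squares problem. The function and frame construction below are illustrative assumptions, not part of the original work.

```python
import numpy as np

def joint_rgb_flow(frame0, frame1):
    """Solve Eq. 8 pointwise in the least-squares sense: all three
    colour channels are assumed to share a single flow v in R^2."""
    H, W, _ = frame0.shape
    flow = np.zeros((H, W, 2))
    # Per-channel spatial and temporal derivatives.
    Gx = np.stack([np.gradient(frame0[..., c], axis=1) for c in range(3)], -1)
    Gy = np.stack([np.gradient(frame0[..., c], axis=0) for c in range(3)], -1)
    Gt = frame1 - frame0
    for y in range(H):
        for x in range(W):
            A = np.stack([Gx[y, x], Gy[y, x]], axis=1)  # 3x2 gradient matrix
            b = -Gt[y, x]
            # lstsq returns the minimum-norm solution when rank(A) < 2,
            # i.e. when the aperture problem persists across channels.
            flow[y, x] = np.linalg.lstsq(A, b, rcond=None)[0]
    return flow

# Channels with gradients in different directions, shifted by (0.3, 0.1).
ys, xs = np.mgrid[0:8, 0:8]
dx, dy = 0.3, 0.1
frame0 = np.stack([np.sin(0.5 * xs), np.sin(0.5 * ys),
                   np.sin(0.3 * xs + 0.4 * ys)], -1)
frame1 = np.stack([np.sin(0.5 * (xs - dx)), np.sin(0.5 * (ys - dy)),
                   np.sin(0.3 * (xs - dx) + 0.4 * (ys - dy))], -1)
flow = joint_rgb_flow(frame0, frame1)
```

Because the channel gradients are not collinear here, the stacked gradient matrix has rank 2 at most pixels and the estimated flow approximates the true shift, illustrating how multiple channels can remove the pointwise ambiguity.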
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Tracking of different color components in a synthetic example. In this case, each color component is associated with a specific velocity&#x20;field.</p>
</caption>
<graphic xlink:href="frai-04-768516-g002.tif"/>
</fig>
<p>Before proceeding further, let us underline that some other optical flow methods try to directly solve the brightness invariance condition <xref ref-type="disp-formula" rid="e1">Eq. 1</xref> without differentiating it. This is the case, for example, of the Gunnar Farneb&#xe4;ck&#x2019;s algorithm (<xref ref-type="bibr" rid="B16">Farneb&#xe4;ck, 2003</xref>): the basic idea here is to approximate the brightness of the input images through polynomial expansions with variable coefficients, and the brightness invariance condition <xref ref-type="disp-formula" rid="e1">Eq. 1</xref> is then solved under this assumption. <xref ref-type="fig" rid="F3">Figure&#x20;3</xref> shows the optical flows extracted by the Horn&#x2013;Schunck and Farneb&#xe4;ck methods in the barber&#x2019;s pole&#x20;case.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Barber&#x2019;s pole optical flow (sub-sampled). <bold>(A)</bold> Horn&#x2013;Schunck method (<xref ref-type="bibr" rid="B26">Horn and Schunck, 1981</xref>) with smoothing factor coefficient &#x3d; 1&#x20;<bold>(B)</bold> Gunnar Farneb&#xe4;ck&#x2019;s algorithm (<xref ref-type="bibr" rid="B16">Farneb&#xe4;ck, 2003</xref>) in the quadratic expansion&#x20;case.</p>
</caption>
<graphic xlink:href="frai-04-768516-g003.tif"/>
</fig>
<p>In the next section, we will discuss how to use a very similar approach, based on the consistency of features along apparent motion trajectories on the frame spatial support, to derive visual features <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> along with the corresponding <italic>conjugate</italic> optical flows <inline-formula id="inf8">
<mml:math id="m16">
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>. We anticipate that this motion consistency condition will also play a prominent role in defining affordance features, as described in <xref ref-type="sec" rid="s3-2">Section&#x20;3.2</xref>.</p>
</sec>
</sec>
<sec id="s3">
<title>3 Feature Extraction and Conjugate Velocities</title>
<p>As we have already anticipated in the previous sections, the optical flow extracted by imposing an invariance condition like the one in <xref ref-type="disp-formula" rid="e3">Eq. 3</xref> strongly depends on the features on which we are imposing that invariance; hence it should not be surprising that different sets of features could give rise to different optical flows. This can be easily understood by considering the barber&#x2019;s pole example in <xref ref-type="fig" rid="F1">Figure&#x20;1</xref>. The related classical optical flow is depicted in <xref ref-type="fig" rid="F1">Figure&#x20;1C</xref>, see also <xref ref-type="fig" rid="F3">Figure&#x20;3</xref>, and it is different from the projection of the 3-D velocities on the frame spatial support <xref ref-type="fig" rid="F1">Figure&#x20;1B</xref> (the resulting optical flow is indeed an optical illusion). Let us now assume the existence of a visual feature <italic>&#x3c6;</italic>
<sub>
<italic>r</italic>
</sub> characterizing the red stripes, that is <italic>&#x3c6;</italic>
<sub>
<italic>r</italic>
</sub>(<italic>x</italic>, <italic>t</italic>) &#x3d; 1 iff (<italic>x</italic>, <italic>t</italic>) is inside a stripe. As the barber&#x2019;s pole rotates, the conjugate velocity <inline-formula id="inf9">
<mml:math id="m17">
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> is, in this simplified case, the same as the one that would have been obtained from the brightness invariance condition <xref ref-type="disp-formula" rid="e3">Eq. 3</xref>. An additional level of abstraction can be gained when looking at the whole object. Again, we are assuming the existence of a higher-level visual feature <italic>&#x3c6;</italic>
<sub>
<italic>object</italic>
</sub> characterizing it. Then, considering that the barber&#x2019;s pole is standing still, the velocity field associated with that feature is everywhere null, as shown in <xref ref-type="fig" rid="F1">Figure&#x20;1D</xref>.</p>
<p>This example clearly explains how different velocity fields can be associated with different visual features, but we still have to go one step further. Until now, mimicking the case of the classical optical flow estimation given the corresponding input brightness, we have described the construction of velocity fields starting from visual features whose existence was a priori assumed. Recent studies have suggested that the ventral and dorsal pathways may not be as independent as originally thought. Evidence for contributions from ventral stream systems to the dorsal stream indicates a crucial role in mediating complex and flexible visuomotor skills. Meanwhile, complementary evidence points to a role for posterior dorsal-stream visual analysis in certain aspects of 3-D perceptual function in the ventral stream (but see <xref ref-type="bibr" rid="B33">Milner, 2017</xref> for a review). As pointed out by <xref ref-type="bibr" rid="B33">Milner (2017)</xref>, potential cross-stream interactions might take three forms:<list list-type="simple">
<list-item>
<p>1) Independent processing: computations along the separate pathways proceed independently and in parallel and reintegrate at some final stage of processing within a shared target brain region; this might be achieved via common projections to the lateral prefrontal cortex or superior temporal sulcus (STS);</p>
</list-item>
<list-item>
<p>2) Feedback: processing along the two pathways is modulated by the existence of feedback loops which transmit information from downstream brain regions, including information processed along the complementary stream; feedback is likely to involve projections to early retinotopic cortical&#x20;areas.</p>
</list-item>
<list-item>
<p>3) Continuous cross-talk: information is transferred at multiple stages and locations along the two pathways.</p>
</list-item>
</list>
</p>
<p>The three forms need not be mutually exclusive and a resolution of the problems of visual integration might involve a combination of such possibilities (<xref ref-type="bibr" rid="B33">Milner, 2017</xref>).</p>
<p>Yet, from a learning standpoint, the cross-talk mode is intriguing for setting some minimal conditions for an agent (either biological or artificial) in order to develop visual capabilities. Following this biological insight, we endorse the indissoluble conjunction of features and velocities and, consequently, their joint discovery based on the following motion invariance condition:<disp-formula id="e9">
<mml:math id="m18">
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>&#x2200;</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mo>&#x2200;</mml:mo>
<mml:mspace width="0.17em"/>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>d</mml:mi>
<mml:mo>,</mml:mo>
</mml:math>
<label>(9)</label>
</disp-formula>where we are considering <italic>d</italic> different visual features. Locally, this equation means that, at each pixel <italic>x</italic> of the frame spatial support and specific time instant <italic>t</italic>, features <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> are preserved along the trajectories defined by the corresponding velocity fields <inline-formula id="inf10">
<mml:math id="m19">
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> and starting at (<italic>x</italic>, <italic>t</italic>). An object clearly does not change its identity while it is moving. Consequently, the identity of the corresponding pixels on the frame spatial support has to remain invariant along the apparent motion defined by the associated optical flows. Thinking of the brightness as the simplest visual feature based on single pixels, <xref ref-type="disp-formula" rid="e9">Eq. 9</xref> correctly reduces to the brightness invariance condition <xref ref-type="disp-formula" rid="e3">Eq. 3</xref>. Notice that if there is no optical flow for a given pixel <inline-formula id="inf11">
<mml:math id="m20">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>, that is, if <inline-formula id="inf12">
<mml:math id="m21">
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:math>
</inline-formula> for all <italic>t</italic>&#x20;&#x2208; [0, <italic>T</italic>], then <inline-formula id="inf13">
<mml:math id="m22">
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:math>
</inline-formula>. This means that the absence of the optical flow in <inline-formula id="inf14">
<mml:math id="m23">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> results in <inline-formula id="inf15">
<mml:math id="m24">
<mml:mi>&#x3c6;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> for all <italic>t</italic>&#x20;&#x2208; [0, <italic>T</italic>], which is the obvious consistency condition that one expects in this case. Likewise, a constant field <italic>&#x3c6;</italic>(<italic>x</italic>, <italic>t</italic>) in a subregion <inline-formula id="inf16">
<mml:math id="m25">
<mml:mi>C</mml:mi>
<mml:mo>&#x2282;</mml:mo>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mspace width="0.17em"/>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula> makes <xref ref-type="disp-formula" rid="e9">Eq. 9</xref> satisfied on <italic>C</italic> independently of <italic>v</italic>
<sub>
<italic>&#x3c6;</italic>
</sub>.</p>
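The local reading of Eq. 9 can be sketched as a discrete residual for a single feature map, the quantity that a learning process would drive to zero; the function name and data below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def invariance_residual(phi_prev, phi_next, flow, dt=1.0):
    """Discrete residual of Eq. 9 for one feature map phi:
    r = d(phi)/dt + v . grad(phi); it vanishes where the feature is
    transported consistently along its conjugate flow."""
    dphi_dt = (phi_next - phi_prev) / dt
    gy, gx = np.gradient(phi_prev)  # spatial gradient (rows, cols)
    return dphi_dt + flow[..., 0] * gx + flow[..., 1] * gy

# Sanity check: a constant feature map satisfies Eq. 9 for any flow,
# mirroring the remark about constant fields on a subregion C.
phi = np.ones((16, 16))
flow = np.random.randn(16, 16, 2)
r = invariance_residual(phi, phi, flow)
```

This also makes the degenerate case explicit: a constant feature yields a zero residual whatever the flow, which is why motion invariance alone cannot pin down the features.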
<p>As for the brightness, in general, the invariance condition <xref ref-type="disp-formula" rid="e9">Eq. 9</xref> generates an ill-posed problem. In particular, when the moving object has a uniform color, brightness invariance holds along virtually infinitely many trajectories. Likewise, any of the features <italic>&#x3c6;</italic> is expected to be spatially smooth and nearly constant in small portions of the frame spatial support, which reintroduces the ill-posedness of the classical optical flow problem addressed in the previous section. Unlike brightness invariance, in the case of visual features the ill-posedness of the problem has a double face. Just like in the classic case of estimating the optical flow, <italic>v</italic>
<sub>
<italic>&#x3c6;</italic>
</sub> is not uniquely defined (the aperture problem). On top of that, the corresponding feature <italic>&#x3c6;</italic> is now not uniquely defined either. We will address regularization issues in <xref ref-type="sec" rid="s3-3">Section 3.3</xref> where, by including additional information beyond coherence along motion trajectories, we will make the learning process well-posed. Of course, the regularization process will also involve a term similar to the one invoked for the optical flow <italic>v</italic>, see <xref ref-type="disp-formula" rid="e6">Eq. 6</xref>, that will be imposed on <italic>v</italic>
<sub>
<italic>&#x3c6;</italic>
</sub>. Given what we have discussed so far, we can also expect the presence of some regularization term concerning the features themselves and their regularity. Finally, these terms will be complemented with an additional &#x201c;prediction&#x201d; index necessary to avoid trivial feature solutions (we postpone its description to <xref ref-type="sec" rid="s3-3">Section&#x20;3.3</xref>).</p>
<p>The basic notion at the core of this section is that <inline-formula id="inf17">
<mml:math id="m26">
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> can be treated as indissoluble pairs bound by the motion invariance condition that steers the entire learning process. The structure of each <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> affects the associated velocity <inline-formula id="inf18">
<mml:math id="m27">
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> and vice versa; it is therefore natural to pair their learning. Leaving aside for the moment the regularization issues of <xref ref-type="sec" rid="s3-3">Section 3.3</xref>, learning is based on a functional generalizing <xref ref-type="disp-formula" rid="e5">Eq. 5</xref>, that is<disp-formula id="e10">
<mml:math id="m28">
<mml:mi>A</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mspace width="0.17em"/>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2254;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:munder>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mo>&#x222b;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x393;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mi>d</mml:mi>
<mml:mi>&#x3bc;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:math>
<label>(10)</label>
</disp-formula>where &#x393; &#x3d; &#x3a9; &#xd7; [0, <italic>T</italic>] and <italic>&#x3bc;</italic> is an appropriately weighted Lebesgue measure on <inline-formula id="inf19">
<mml:math id="m29">
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#xd7;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>; its exact form defines the dynamics of the learning process itself. The minimization of such a functional (plus the additional regularization terms) is expected to return the pairs <inline-formula id="inf20">
<mml:math id="m30">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> satisfying the motion consistency condition <xref ref-type="disp-formula" rid="e9">Eq. 9</xref>. Sometimes, in what follows and when the notation is clear from the context, we will drop the subscript <italic>&#x3c6;</italic> of <italic>v</italic>
<sub>
<italic>&#x3c6;</italic>
</sub> so that <inline-formula id="inf21">
<mml:math id="m31">
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> will be denoted as&#x20;<italic>v</italic>
<sub>
<italic>i</italic>
</sub>.</p>
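A discretization of the functional in Eq. 10 can be sketched by taking &#x3bc; as a uniform measure over the pixel grid and frame pairs; the code below is an illustrative sketch under that assumption, with hypothetical names.

```python
import numpy as np

def motion_invariance_loss(phis_t0, phis_t1, flows, dt=1.0):
    """Discretization of Eq. 10 with a uniform measure mu: one half of
    the summed, squared Eq. 9 residuals over all features and pixels.
    phis_*: lists of (H, W) feature maps; flows: list of (H, W, 2)."""
    total = 0.0
    for phi0, phi1, v in zip(phis_t0, phis_t1, flows):
        gy, gx = np.gradient(phi0)            # spatial gradient of phi_i
        resid = (phi1 - phi0) / dt + v[..., 0] * gx + v[..., 1] * gy
        total += 0.5 * np.sum(resid ** 2)     # cell area absorbed into mu
    return total

# A feature that does not change under zero flow gives zero loss ...
phi = np.outer(np.ones(16), np.sin(np.linspace(0, np.pi, 16)))
zero_flow = np.zeros((16, 16, 2))
loss_consistent = motion_invariance_loss([phi], [phi], [zero_flow])
# ... while a feature that changes under zero flow is penalized.
loss_violating = motion_invariance_loss([phi], [phi + 1.0], [zero_flow])
```

In a learning setting one would minimize this quantity jointly over parametrized features and flows, together with the regularization terms discussed above.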
<sec id="s3-1">
<title>3.1 Feature Grouping</title>
<p>As already noticed, when we consider color images, what is done in the case of brightness invariance can be applied to the separate components <italic>R</italic>, <italic>G</italic>, <italic>B</italic>. Interestingly, for a material point of a certain color, given by a mixture of the three components, we can establish the same brightness invariance principle, since those components move with the same velocity. In other words, there could be groups of different visual features <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub>, <italic>i</italic>&#x20;&#x3d; 1, &#x2026;, <italic>m</italic> that share the same velocity (<italic>v</italic>
<sub>
<italic>i</italic>
</sub> &#x3d; <italic>v</italic>) and are consistent with it, that is <italic>&#x2202;</italic>
<sub>
<italic>t</italic>
</sub>
<italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> &#x2b; <italic>v</italic>&#x20;&#x22c5;&#x2207;<italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> &#x3d; 0&#x20;&#x2200;<italic>i</italic>&#x20;&#x3d; 1, &#x2026;, <italic>m</italic>. Thus, we can promptly see that any feature <italic>&#x3c6;</italic> in span(<italic>&#x3c6;</italic>
<sub>1</sub>, <italic>&#x2026;</italic>, <italic>&#x3c6;</italic>
<sub>
<italic>m</italic>
</sub>) is still conjugated with <italic>v</italic>; we can think of span(<italic>&#x3c6;</italic>
<sub>1</sub>, <italic>&#x2026;</italic>, <italic>&#x3c6;</italic>
<sub>
<italic>m</italic>
</sub>) as a functional space conjugated with&#x20;<italic>v</italic>.</p>
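The span property can be checked numerically: for a fixed flow, the discrete Eq. 9 residual is linear in the feature, so the residual of a linear combination is the same combination of residuals. The following is a small sketch with random data and illustrative names.

```python
import numpy as np

def residual(phi0, phi1, flow):
    """Discrete Eq. 9 residual of a feature map for a given flow."""
    gy, gx = np.gradient(phi0)
    return (phi1 - phi0) + flow[..., 0] * gx + flow[..., 1] * gy

rng = np.random.default_rng(0)
flow = rng.standard_normal((8, 8, 2))
phi0, phi1 = rng.standard_normal((2, 8, 8))
psi0, psi1 = rng.standard_normal((2, 8, 8))
a, b = 2.0, -3.0
# Linearity: the residual of a*phi + b*psi equals a*r(phi) + b*r(psi),
# so if phi and psi are conjugate with v, so is any feature in their span.
lhs = residual(a * phi0 + b * psi0, a * phi1 + b * psi1, flow)
rhs = a * residual(phi0, phi1, flow) + b * residual(psi0, psi1, flow)
```

In particular, if both residuals vanish for the same flow, the combined residual vanishes too, which is exactly the statement that span(&#x3c6;&#x2081;, &#x2026;, &#x3c6;&#x2098;) is conjugated with <italic>v</italic>.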
<p>Let us now consider the feature group <inline-formula id="inf22">
<mml:math id="m32">
<mml:mi>&#x3d5;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> and the corresponding invariance condition<disp-formula id="e11">
<mml:math id="m33">
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>&#x3d5;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:mi>&#x3d5;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
</mml:math>
<label>(11)</label>
</disp-formula>where <inline-formula id="inf23">
<mml:math id="m34">
<mml:mo>&#x2207;</mml:mo>
<mml:mi>&#x3d5;</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> is the matrix with elements <inline-formula id="inf24">
<mml:math id="m35">
<mml:msub>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mo>&#x2207;</mml:mo>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mo>&#x2207;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> and <inline-formula id="inf25">
<mml:math id="m36">
<mml:mi>v</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:mi>&#x3d5;</mml:mi>
<mml:mo>&#x2254;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>v</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mspace width="0.17em"/>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mspace width="0.17em"/>
<mml:mi>v</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>. An important observation, closely related to the discussion about color tracking of <xref ref-type="disp-formula" rid="e8">Eq. 8</xref>, is the following. If the only scalar feature we are dealing with is the brightness, then <xref ref-type="disp-formula" rid="e11">Eq. 11</xref> boils down to a single equation with two unknowns (the velocity components). In the case of the feature group <italic>&#x3d5;</italic>, instead, we have <italic>m</italic> equations and still two unknowns. The dimension <italic>m</italic> of the matrix &#x2207;<italic>&#x3d5;</italic> can increase its rank, which makes the problem of estimating the optical flow <italic>v</italic> better posed. Because of the two-dimensional structure of the frame spatial support, which leads to <inline-formula id="inf26">
<mml:math id="m37">
<mml:mi>v</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>, and since <inline-formula id="inf27">
<mml:math id="m38">
<mml:mo>&#x2207;</mml:mo>
<mml:mi>&#x3d5;</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>, with <italic>m</italic>&#x20;&#x2265; 2, feature grouping regularizes the velocity discovery. To understand the effect of feature grouping, we can in fact simply notice that, under the assumption rank&#x2207;<italic>&#x3d5;</italic> &#x3d; rank(&#x2207;<italic>&#x3d5;</italic>&#x2223; &#x2212; <italic>&#x3d5;</italic>
<sub>
<italic>t</italic>
</sub>), a random choice of the features yields rank&#x2207;<italic>&#x3d5;</italic> &#x3d; 2. As a consequence, by the Rouch&#xe9;-Capelli theorem, the linear system <xref ref-type="disp-formula" rid="e11">Eq. 11</xref> admits a unique solution in <italic>v</italic>. However, this regularization effect of feature grouping does not prevent ill-posedness, since <italic>&#x3d5;</italic> is far from being a random map. On the contrary, it is supposed to extract a uniform value in portions of the frame spatial support that are characterized by the same feature. Hence, rank&#x2207;<italic>&#x3d5;</italic> &#x3d; 1 is still possible whenever the features of the group are somewhat dependent.</p>
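<p>To make the regularization effect concrete, here is a minimal numerical sketch (not part of the paper; the sinusoidal feature channels, their wave vectors, and the use of a least-squares solver are all illustrative assumptions). It builds <italic>m</italic> feature channels conjugated with a common velocity and recovers that velocity from the overdetermined linear system of <xref ref-type="disp-formula" rid="e11">Eq. 11</xref>:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sketch: m sinusoidal feature channels
# phi_j(x, t) = sin(a_j . x - (a_j . v) t + c_j),
# each conjugated with the common velocity v, so that every channel
# satisfies d_t phi_j + v . grad phi_j = 0 exactly.
m = 8
v_true = np.array([1.5, -0.75])
A = rng.normal(size=(m, 2))            # wave vectors a_j (toy values)
c = rng.uniform(0.0, 2.0 * np.pi, m)   # phases

x, t = np.array([0.3, -1.2]), 0.7
phase = A @ x - (A @ v_true) * t + c

grad_phi = A * np.cos(phase)[:, None]    # the (m, 2) matrix (grad phi)_ij
dphi_dt = -(A @ v_true) * np.cos(phase)  # temporal derivatives d_t phi_j

# Eq. 11, one row per channel: d_t phi_j + grad phi_j . v = 0.
# With m >= 2 generic channels, rank(grad phi) = 2 and v is recovered.
v_est, *_ = np.linalg.lstsq(grad_phi, -dphi_dt, rcond=None)
print(np.allclose(v_est, v_true))  # True
```

With a single channel (<italic>m</italic> = 1) the same system is underdetermined, mirroring the classic aperture problem faced when brightness is the only feature.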
<p>Feature groups, which are characterized by their common velocity, can give rise to more structured features belonging to the same group. This can promptly be understood when we go beyond linear spaces and consider, for a set of indices <inline-formula id="inf28">
<mml:math id="m39">
<mml:mi mathvariant="script">F</mml:mi>
</mml:math>
</inline-formula>, the following construction:<disp-formula id="e12">
<mml:math id="m40">
<mml:mfenced open="{" close="">
<mml:mrow>
<mml:mtable class="aligned">
<mml:mtr>
<mml:mtd columnalign="right"/>
<mml:mtd columnalign="left">
<mml:mi>&#x3b1;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">F</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="right"/>
<mml:mtd columnalign="left">
<mml:mi>&#x3b7;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3c3;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(12)</label>
</disp-formula>Evaluating <italic>&#x2202;</italic>
<sub>
<italic>t</italic>
</sub>
<italic>&#x3b7;</italic> &#x2b; <italic>v</italic>&#x20;&#x22c5;&#x2207;<italic>&#x3b7;</italic> we indeed obtain<disp-formula id="e13">
<mml:math id="m41">
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>&#x3b7;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:mi>&#x3b7;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3c3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>&#x3b1;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3c3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:munder>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">F</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<label>(13)</label>
</disp-formula>We conclude that if <inline-formula id="inf29">
<mml:math id="m42">
<mml:mo>&#x2200;</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">F</mml:mi>
</mml:math>
</inline-formula> we have <italic>&#x2202;</italic>
<sub>
<italic>t</italic>
</sub>
<italic>&#x3c6;</italic>
<sub>
<italic>j</italic>
</sub> &#x2b; <italic>v</italic>&#x20;&#x22c5;&#x2207;<italic>&#x3c6;</italic>
<sub>
<italic>j</italic>
</sub> &#x3d; 0, then the feature <italic>&#x3b7;</italic> defined by <xref ref-type="disp-formula" rid="e12">Eq. 12</xref> is also conjugated with <italic>v</italic>, that is <italic>&#x2202;</italic>
<sub>
<italic>t</italic>
</sub>
<italic>&#x3b7;</italic> &#x2b; <italic>v</italic>&#x20;&#x22c5;&#x2207;<italic>&#x3b7;</italic> &#x3d; 0. However, the converse does not hold. Basically, the inheritance of conjugation with <italic>v</italic> holds in the direction of more abstract features. Of course, the feedforward-like recursive application of the construction stated by <xref ref-type="disp-formula" rid="e12">Eq. 12</xref> yields a feature that is still conjugated with&#x20;<italic>v</italic>.</p>
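<p>The inheritance of conjugation can be verified numerically. The following sketch (a toy setup of ours: the two features, the weights, and the choice of tanh for <italic>&#x3c3;</italic> are illustrative assumptions) checks by central finite differences that the composed feature of <xref ref-type="disp-formula" rid="e12">Eq. 12</xref> still satisfies the transport equation:</p>

```python
import numpy as np

# Two features conjugated with a common velocity v: both are functions of x - v t.
v = np.array([0.8, -0.3])
def phi1(x, t): return np.sin(x[0] - v[0] * t) * np.cos(x[1] - v[1] * t)
def phi2(x, t): return np.exp(-((x[0] - v[0] * t) ** 2 + (x[1] - v[1] * t) ** 2))

# Eq. 12 with F = {1, 2}, toy weights w_j, and sigma = tanh.
def eta(x, t): return np.tanh(0.7 * phi1(x, t) - 1.3 * phi2(x, t))

# Central finite differences for d_t eta + v . grad eta at a sample point.
h, x0, t0 = 1e-5, np.array([0.4, -0.9]), 0.25
d_t = (eta(x0, t0 + h) - eta(x0, t0 - h)) / (2 * h)
d_x = (eta(x0 + [h, 0], t0) - eta(x0 - [h, 0], t0)) / (2 * h)
d_y = (eta(x0 + [0, h], t0) - eta(x0 - [0, h], t0)) / (2 * h)

residual = d_t + v[0] * d_x + v[1] * d_y
print(abs(residual) < 1e-6)  # True: eta is conjugated with v as well
```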
</sec>
<sec id="s3-2">
<title>3.2&#x20;Affordance-Related Features</title>
<p>Any learning process that relies on the motion of a given object can only aspire to discover the identity of that object, along with its characterizing visual features such as pose and shape. The motion invariance process is in fact centered around the object itself and, as such, it reveals the object&#x2019;s features in all the possible appearances gained during motion. Humans, and likely most animals, also achieve a truly different understanding of visual scenes, one that goes beyond conceptualization in terms of single object identities. In the Sixties, James J.&#x20;Gibson coined the notion of affordance in (<xref ref-type="bibr" rid="B21">Gibson, 1966</xref>), even though a more refined analysis came later in (<xref ref-type="bibr" rid="B22">Gibson, 1979</xref>). In his own words: <italic>&#x201c;The affordances of the environment are what it offers the animal, what it provides or furnishes, either for good or ill. The verb to afford is found in the dictionary, the noun affordance is not. I have made it up. I mean by it something that refers to both the environment and the animal in a way that no existing term does. It implies the complementarity of the animal and the environment.&#x201d;</italic> From this animal-centric view, we understand that affordance can be interpreted as what characterizes the &#x201c;interaction&#x201d; between animals and their surrounding environment. In more general terms, the way an agent interacts with a particular object is what defines its affordance, and this is strictly related to their relative motion. In the last decades, computer scientists have also been working on this general idea, trying to implement it quantitatively in the fields of computer vision and robotics (<xref ref-type="bibr" rid="B3">Ard&#xf3;n et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B25">Hassanin et&#x20;al., 2021</xref>).
As far as visual affordance is concerned, that is, extracting affordance information from still images and videos, different cognitive tasks have been considered so far, such as affordance recognition and affordance segmentation; see (<xref ref-type="bibr" rid="B25">Hassanin et&#x20;al., 2021</xref>) for a recent review.</p>
<p>In the spirit of the previous section, we will consider a more abstract notion of affordance, characterizing the interaction between different visual features along with their corresponding conjugate velocity fields. We will focus our attention on actions that are perceivable from single pictures<xref ref-type="fn" rid="fn2">
<sup>2</sup>
</xref> and on the related local notion of affordance, which will be defined by some function characterizing the interaction between feature <italic>&#x3c6;</italic>
<sub>
<italic>j</italic>
</sub> and feature <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> when considering the pixel <italic>x</italic> at the specific time instant <italic>t</italic>. As we will see, the principle of motion invariance can be extended to naturally define (explicitly or implicitly) this generalized notion of affordance. A natural choice is to consider what we will denote as the <italic>affordance field &#x3c8;</italic>
<sub>
<italic>ij</italic>
</sub> as a function of space and time. To implicitly codify the interaction between features <italic>i</italic> and <italic>j</italic>, <italic>&#x3c8;</italic>
<sub>
<italic>ij</italic>
</sub>(<italic>x</italic>, <italic>t</italic>) has to be constrained by some relation of the form <inline-formula id="inf30">
<mml:math id="m43">
<mml:mi>g</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:math>
</inline-formula>, where we consider only first-order derivatives of the affordance field and <italic>g</italic> is a scalar function. In the lowest-order approximation (we also need quadratic terms to build scalars from vectors):<disp-formula id="e14">
<mml:math id="m44">
<mml:mi>g</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>7</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:math>
<label>(14)</label>
</disp-formula>where <italic>a</italic>
<sub>1</sub>, &#x2026;, <italic>a</italic>
<sub>7</sub> are scalars. Considering the case in which the motion field associated with feature <italic>&#x3c6;</italic>
<sub>
<italic>j</italic>
</sub> is everywhere null, the affordance field <italic>&#x3c8;</italic>
<sub>
<italic>ij</italic>
</sub> will codify a property only related to <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> itself and strictly related to its identity, which has to be invariant with respect to <italic>v</italic>
<sub>
<italic>i</italic>
</sub>. Thus, from this observation, we can infer <italic>a</italic>
<sub>1</sub> &#x3d; <italic>a</italic>
<sub>2</sub> &#x3d; <italic>a</italic>
<sub>3</sub> &#x3d; <italic>a</italic>
<sub>4</sub> &#x3d; 0 and <italic>a</italic>
<sub>5</sub> &#x3d; <italic>a</italic>
<sub>6</sub> so that the above constraint becomes:<disp-formula id="e15">
<mml:math id="m45">
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:math>
<label>(15)</label>
</disp-formula>where <italic>b</italic>
<sub>
<italic>j</italic>
</sub> &#x3d; <italic>a</italic>
<sub>7</sub>/<italic>a</italic>
<sub>5</sub>. Requiring <italic>b</italic>
<sub>
<italic>j</italic>
</sub> &#x3d; &#x2212;1, this constraint assumes a very reasonable physical meaning, namely the motion invariance of the affordance field <italic>&#x3c8;</italic>
<sub>
<italic>ij</italic>
</sub> in the reference of feature <italic>&#x3c6;</italic>
<sub>
<italic>j</italic>
</sub>. With this choice, the affordance field is indeed conjugated with the velocity <italic>v</italic>
<sub>
<italic>i</italic>
</sub> &#x2212; <italic>v</italic>
<sub>
<italic>j</italic>
</sub>, which is the relative velocity of feature <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> in the reference of feature <italic>&#x3c6;</italic>
<sub>
<italic>j</italic>
</sub>. Considering points at the border of <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub>, this can slightly expand <italic>&#x3c8;</italic>
<sub>
<italic>ij</italic>
</sub> outside the region defined by <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> itself, as shown in <xref ref-type="fig" rid="F4">Figure&#x20;4</xref>. In the case <italic>v</italic>
<sub>
<italic>i</italic>
</sub> &#x3d; 0, the motion consistency is forced &#x201c;backward&#x201d; along the pixels&#x2019; trajectories defined by &#x2212; <italic>v</italic>
<sub>
<italic>j</italic>
</sub>. In the case <italic>b</italic>
<sub>
<italic>j</italic>
</sub> &#x3d; 1, <xref ref-type="disp-formula" rid="e15">Eq. 15</xref> instead becomes symmetric under permutations, so that <italic>&#x3c8;</italic>
<sub>
<italic>ij</italic>
</sub> and <italic>&#x3c8;</italic>
<sub>
<italic>ji</italic>
</sub> will be developed exploiting the same constraint. This will likely result in the same affordance feature unless some other factor (think, for example, of different initializations in neural architectures) breaks that symmetry. From the classic affordance perspective this is not a desirable property, as we can easily understand by considering, for example, a knife that is used to slice bread: the affordance transmitted by the knife to the bread would be strictly related to the possibility of being cut or sliced, which is clearly a property that could not be attached to the&#x20;knife.</p>
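<p>A quick numerical check of <xref ref-type="disp-formula" rid="e15">Eq. 15</xref> with <italic>b</italic><sub><italic>j</italic></sub> &#x3d; &#x2212;1 (a sketch under illustrative assumptions; the Gaussian-sine profile and the velocities are ours): any field of the form <italic>f</italic>(<italic>x</italic> &#x2212; (<italic>v</italic><sub><italic>i</italic></sub> &#x2212; <italic>v</italic><sub><italic>j</italic></sub>)<italic>t</italic>) is transported with the relative velocity and annihilates the left-hand side:</p>

```python
import numpy as np

# Toy affordance field transported with the relative velocity v_i - v_j,
# i.e. psi_ij(x, t) = f(x - (v_i - v_j) t) for some profile f (ours).
v_i, v_j = np.array([1.0, 0.5]), np.array([0.2, -0.4])
v_rel = v_i - v_j

def psi(x, t):
    z = x - v_rel * t
    return np.exp(-z @ z) * np.sin(z[0])

# Finite-difference evaluation of d_t psi + (v_i - v_j) . grad psi.
h, x0, t0 = 1e-5, np.array([0.3, 0.1]), 0.6
d_t = (psi(x0, t0 + h) - psi(x0, t0 - h)) / (2 * h)
g_x = (psi(x0 + [h, 0], t0) - psi(x0 - [h, 0], t0)) / (2 * h)
g_y = (psi(x0 + [0, h], t0) - psi(x0 - [0, h], t0)) / (2 * h)

residual = d_t + v_rel[0] * g_x + v_rel[1] * g_y
print(abs(residual) < 1e-6)  # True: Eq. 15 holds with b_j = -1
```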
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Illustration of <xref ref-type="disp-formula" rid="e15">Eq. 15</xref> with <italic>b</italic>
<sub>
<italic>j</italic>
</sub> &#x3d; &#x2212; 1. The two considered features <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> (diagonal lines), <italic>&#x3c6;</italic>
<sub>
<italic>j</italic>
</sub> (wavy lines) translate over the frame spatial support with uniform velocities <italic>v</italic>
<sub>
<italic>i</italic>
</sub>, <italic>v</italic>
<sub>
<italic>j</italic>
</sub>. The green area represents where <italic>&#x3c8;</italic>
<sub>
<italic>ij</italic>
</sub> is on, while the red border identifies the region where <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> and <italic>&#x3c6;</italic>
<sub>
<italic>j</italic>
</sub> overlap. On the overlapping region the velocity fields of the two features are both present and here the affordance field <italic>&#x3c8;</italic>
<sub>
<italic>ij</italic>
</sub>(<italic>x</italic>, <italic>t</italic>) is constrained to be consistent along the direction <italic>v</italic>
<sub>
<italic>i</italic>
</sub> &#x2212; <italic>v</italic>
<sub>
<italic>j</italic>
</sub> (red arrows). Outside and on the left of the red border, the consistency term in <xref ref-type="disp-formula" rid="e15">Eq. 15</xref> essentially collapses to the feature identity constraint <xref ref-type="disp-formula" rid="e9">Eq. 9</xref> defined by the invariance motion property with respect to <italic>v</italic>
<sub>
<italic>i</italic>
</sub> (blue arrow). Finally, in those regions where <italic>v</italic>
<sub>
<italic>i</italic>
</sub> &#x3d; 0, motion consistency of <italic>&#x3c8;</italic>
<sub>
<italic>ij</italic>
</sub> is required along &#x2212; <italic>v</italic>
<sub>
<italic>j</italic>
</sub> (orange arrow).</p>
</caption>
<graphic xlink:href="frai-04-768516-g004.tif"/>
</fig>
<p>Another viable alternative to codify the interaction between features is to directly evaluate the affordance as a function of the feature fields and their respective velocities: <inline-formula id="inf31">
<mml:math id="m46">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mo>&#x303;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula>. Here, we are giving up the previous field-theory approach, since we explicitly codify the interaction between features in the computational scheme of the <inline-formula id="inf32">
<mml:math id="m47">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mo>&#x303;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> function. On the other hand, since we have already distinguished <inline-formula id="inf33">
<mml:math id="m48">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mo>&#x303;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> and <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> by the different computational structure, requiring the same motion invariance property of <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> with respect to <italic>v</italic>
<sub>
<italic>i</italic>
</sub> for the affordance function <inline-formula id="inf34">
<mml:math id="m49">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mo>&#x303;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> appears to be a very natural choice; see also <xref ref-type="fig" rid="F5">Figure&#x20;5</xref>:<disp-formula id="e16">
<mml:math id="m50">
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mo>&#x303;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mo>&#x303;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>.</mml:mo>
</mml:math>
<label>(16)</label>
</disp-formula>
</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Illustration of <xref ref-type="disp-formula" rid="e16">Eq. 16</xref>. The two considered features <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> (diagonal lines), <italic>&#x3c6;</italic>
<sub>
<italic>j</italic>
</sub> (wavy lines) translate over the frame spatial support with uniform velocities <italic>v</italic>
<sub>
<italic>i</italic>
</sub>, <italic>v</italic>
<sub>
<italic>j</italic>
</sub>. The green area represents where <inline-formula id="inf35">
<mml:math id="m51">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mo>&#x303;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> is on, while the red border identifies the region where <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> and <italic>&#x3c6;</italic>
<sub>
<italic>j</italic>
</sub> overlap. In this case, the motion invariance property of the affordance feature <inline-formula id="inf36">
<mml:math id="m52">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mo>&#x303;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> is the same as that of the original feature field <inline-formula id="inf37">
<mml:math id="m53">
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula>. Blue arrows identify the direction along which motion coherence of <inline-formula id="inf38">
<mml:math id="m54">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mo>&#x303;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> is required.</p>
</caption>
<graphic xlink:href="frai-04-768516-g005.tif"/>
</fig>
<p>Given the potentially great variability of velocity fields in a visual scene, let us underline that, within this second approach, some problems in the learning of the affordance function <inline-formula id="inf39">
<mml:math id="m55">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mo>&#x303;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula> may emerge. Moreover, to pursue the fascinating idea of describing all the visual processes entirely through visual fields defined on the frame spatial support, in the following we will only consider the affordance field <italic>&#x3c8;</italic>
<sub>
<italic>ij</italic>
</sub>(<italic>x</italic>, <italic>t</italic>) and the related motion invariance property <xref ref-type="disp-formula" rid="e15">Eq. 15</xref> with <italic>b</italic>
<sub>
<italic>j</italic>
</sub> &#x3d; &#x2212;1.</p>
<p>Given a certain visual environment, we can easily realize that, as time goes by, object interactions begin to obey statistical regularities and the interactions of feature <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> with the others become very well defined. Hence, the notion of <italic>&#x3c8;</italic>
<sub>
<italic>ij</italic>
</sub> can be evolved towards the <italic>inherent affordance &#x3c8;</italic>
<sub>
<italic>i</italic>
</sub> of feature <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub>, which is in fact a property associated with <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> while living in a certain visual environment. For example, thinking in terms of the classic notion of affordance, the inherent affordance property of a knife is gained by its being manipulated, in a certain way, by a virtually unbounded number of different people. Based on <xref ref-type="disp-formula" rid="e15">Eq. 15</xref> (<italic>b</italic>
<sub>
<italic>j</italic>
</sub> &#x3d; &#x2212;1) we define the inherent feature affordance as the function <italic>&#x3c8;</italic>
<sub>
<italic>i</italic>
</sub>(<italic>x</italic>, <italic>t</italic>) which satisfies<disp-formula id="e17">
<mml:math id="m56">
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mspace width="2em"/>
<mml:mspace width="2em"/>
<mml:mn>1</mml:mn>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>.</mml:mo>
</mml:math>
<label>(17)</label>
</disp-formula>
</p>
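<p>As a sanity check, any field that is rigidly transported at the relative velocity <italic>v</italic><sub><italic>i</italic></sub> &#x2212; <italic>v</italic><sub><italic>j</italic></sub> satisfies Eq. 17. The following one-dimensional numerical sketch (velocities and sample points are purely illustrative, not taken from the text) verifies this with finite differences:</p>

```python
import numpy as np

def transported_field(f, v_rel):
    """psi(x, t) = f(x - v_rel * t): the profile f rigidly transported
    at the relative velocity v_rel = v_i - v_j (1-D sketch)."""
    return lambda x, t: f(x - v_rel * t)

def transport_residual(psi, x, t, v_rel, h=1e-5):
    """Finite-difference estimate of d_t psi + v_rel * d_x psi at (x, t)."""
    d_t = (psi(x, t + h) - psi(x, t - h)) / (2 * h)
    d_x = (psi(x + h, t) - psi(x - h, t)) / (2 * h)
    return d_t + v_rel * d_x

v_rel = 0.7                               # hypothetical v_i - v_j
psi = transported_field(np.sin, v_rel)
res = transport_residual(psi, x=0.3, t=1.2, v_rel=v_rel)
print(abs(res))                           # ~0: Eq. 17 holds
```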
<p>Let us note that the above formula can also be interpreted as the motion invariance property of <italic>&#x3c8;</italic>
<sub>
<italic>i</italic>
</sub> with respect to the velocity <inline-formula id="inf40">
<mml:math id="m57">
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mo movablelimits="false" form="prefix">&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>/</mml:mo>
<mml:mi>n</mml:mi>
</mml:math>
</inline-formula>. The identification feature <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> pairs with the corresponding affordance feature <italic>&#x3c8;</italic>
<sub>
<italic>i</italic>
</sub>, and the visual scene turns out to be effectively described by the collection of visual fields <inline-formula id="inf41">
<mml:math id="m58">
<mml:mi mathvariant="script">V</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>. In a sense, <italic>&#x3c8;</italic>
<sub>
<italic>i</italic>
</sub> can be thought of as the abstraction of <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub>, as it arises from its environmental interactions. A few comments are in order concerning these visual fields.<list list-type="simple">
<list-item>
<p>&#x2022; The pairing of <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> and <italic>&#x3c8;</italic>
<sub>
<italic>i</italic>
</sub> relies on the same optical flow which comes from <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub>
<italic>.</italic> This makes sense, since the inherent affordance is a feature that is expected to gain abstraction from the interactions with other features, whereas the actual optical flow can only come from identifiable entities that are naturally defined by <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub>
<italic>.</italic>
</p>
</list-item>
<list-item>
<p>&#x2022; The inherent affordance features still carry a significant amount of redundant information. This can be appreciated especially when considering high-level features that closely resemble objects. For example, while we may have many different chairs in a certain environment, one would expect a single concept of chair. On the contrary, <italic>&#x3c8;</italic> assigns many different affordance variables, each somewhat stimulated by a specific identifiable feature. This corresponds to thinking of these affordance features as entities that are generated by a corresponding identity feature.</p>
</list-item>
<list-item>
<p>&#x2022; The collection of visual fields <inline-formula id="inf42">
<mml:math id="m59">
<mml:mi mathvariant="script">V</mml:mi>
</mml:math>
</inline-formula> is the support for high-level decisions. Of course, the recognition of specific objects only involves the field <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub>, whereas the abstract affordance semantic labeling is supported by features <italic>&#x3c8;</italic>
<sub>
<italic>i</italic>
</sub>
<italic>.</italic>
</p>
</list-item>
</list>
</p>
<p>In order to abstract the notion of affordance even further we can, for instance, proceed as follows: for each <italic>&#x3ba;</italic> &#x3d; 1, <italic>&#x2026;</italic>, <italic>n</italic> we consider another set of fields <inline-formula id="inf43">
<mml:math id="m60">
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c7;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3ba;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>:</mml:mo>
<mml:mi mathvariant="normal">&#x393;</mml:mi>
<mml:mo>&#x2192;</mml:mo>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:math>
</inline-formula> each of which satisfies the following condition<disp-formula id="e18">
<mml:math id="m61">
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c7;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3ba;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c7;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3ba;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mspace width="1em"/>
<mml:mi>j</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>.</mml:mo>
</mml:math>
<label>(18)</label>
</disp-formula>
</p>
<p>In this way the variables <italic>&#x3c7;</italic>
<sub>
<italic>&#x3ba;</italic>
</sub> do not depend, unlike <italic>&#x3c8;</italic>, on a particular <italic>v</italic>
<sub>
<italic>i</italic>
</sub>, which contributes to losing the link with their firing feature. Moreover, during their development they need to take into account multiple motion fields, which results in a motion invariance property with respect to the average velocity <inline-formula id="inf44">
<mml:math id="m62">
<mml:msubsup>
<mml:mrow>
<mml:mo movablelimits="false" form="prefix">&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>/</mml:mo>
<mml:mi>n</mml:mi>
</mml:math>
</inline-formula> and in a greater level of abstraction.</p>
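<p>The invariance with respect to the average velocity follows from linearity: averaging the <italic>n</italic> transport equations of Eq. 18 over <italic>j</italic> yields the transport equation driven by &#x2211;<sub><italic>j</italic></sub><italic>v</italic><sub><italic>j</italic></sub>/<italic>n</italic>. A minimal numerical sketch of this step (one spatial dimension, arbitrary illustrative values):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=5)       # hypothetical per-feature velocities v_j (1-D)
d_t, d_x = 0.4, -1.1         # arbitrary local derivatives of chi_kappa

# Averaging the n transport-equation residuals of Eq. 18 over j gives
# exactly the residual with respect to the average velocity sum_j v_j / n:
residuals = d_t + v * d_x
avg_residual = d_t + v.mean() * d_x
print(np.isclose(residuals.mean(), avg_residual))   # True
```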
<p>Once the set of the <italic>&#x3c7;</italic>
<sub>
<italic>&#x3ba;</italic>
</sub> is given, the most relevant affordances can be selected simply through linear combinations. In other words, a subselection of <italic>&#x3c7;</italic>
<sub>1</sub>, <italic>&#x2026;</italic>, <italic>&#x3c7;</italic>
<sub>
<italic>n</italic>
</sub> can be performed by considering for each <italic>l</italic>&#x20;&#x3d; 1, <italic>&#x2026;</italic>, <italic>n</italic>
<sub>
<italic>&#x3c7;</italic>
</sub> &#x3c; <italic>n</italic> the linear combinations<disp-formula id="e19">
<mml:math id="m63">
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2254;</mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3ba;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mi>&#x3ba;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c7;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3ba;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:math>
<label>(19)</label>
</disp-formula>where <inline-formula id="inf45">
<mml:math id="m64">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mi>&#x3ba;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c7;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> is a matrix of learnable parameters. Notice that since <italic>X</italic>
<sub>
<italic>l</italic>
</sub> &#x2208; span(<italic>&#x3c7;</italic>
<sub>1</sub>, <italic>&#x2026;</italic>, <italic>&#x3c7;</italic>
<sub>
<italic>n</italic>
</sub>), as we remarked in <xref ref-type="sec" rid="s3-4">Section 3.4</xref>, then <italic>&#x2202;</italic>
<sub>
<italic>t</italic>
</sub>
<italic>X</italic>
<sub>
<italic>l</italic>
</sub> &#x2b; <italic>v</italic>
<sub>
<italic>j</italic>
</sub> &#x22c5;&#x2207;<italic>X</italic>
<sub>
<italic>l</italic>
</sub> &#x3d; 0 for all <italic>j</italic>&#x20;&#x3d; 1, <italic>&#x2026;</italic>, <italic>n</italic>. It is worth mentioning that the learning of coefficients <italic>a</italic>
<sub>
<italic>l&#x3ba;</italic>
</sub> does not involve motion invariance principles. Interestingly, these coefficients can be exploited for additional developmental steps, such as object recognition. For example, they can be learned in the classic supervised framework along with the corresponding regularization.</p>
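<p>Since the transport operator is linear, every combination <italic>X</italic><sub><italic>l</italic></sub> of Eq. 19 inherits the motion invariance of the <italic>&#x3c7;</italic><sub><italic>&#x3ba;</italic></sub>, as remarked above. A small sketch of this fact (shapes and values are illustrative only):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_chi = 6, 3
a = rng.normal(size=(n_chi, n))   # learnable coefficients a_{l,kappa} of Eq. 19

# Local derivatives of each chi_kappa at one point; every field is assumed
# to satisfy Eq. 18, i.e. d_t chi_kappa + v_j * d_x chi_kappa = 0 (1-D):
v_j = 0.8
d_x = rng.normal(size=n)
d_t = -v_j * d_x                  # enforced by the transport equation

# By linearity, X_l = sum_kappa a_{l,kappa} chi_kappa obeys the same equation:
d_t_X = a @ d_t
d_x_X = a @ d_x
print(np.allclose(d_t_X + v_j * d_x_X, 0.0))   # True
```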
</sec>
<sec id="s3-3">
<title>3.3 Regularization Issues</title>
<p>We have already discussed the ill-posedness of defining features conjugated with their corresponding optical flow. Interestingly, we have also shown that a feature group <inline-formula id="inf46">
<mml:math id="m65">
<mml:mi>&#x3d5;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> with conjugate velocity <italic>v</italic> exhibits an inherent regularization that, however, does not prevent ill-posedness, especially when one is interested in developing abstract features that are likely constant over large regions of the frame&#x2019;s spatial support.</p>
<p>Let us assume that we are given <italic>n</italic> feature groups <italic>&#x3d5;</italic>
<sub>
<italic>i</italic>
</sub>, <italic>i</italic>&#x20;&#x3d; 1, &#x2026;, <italic>n</italic>, each composed of <italic>m</italic>
<sub>
<italic>i</italic>
</sub> single features (<italic>m</italic>
<sub>
<italic>i</italic>
</sub>-dimensional feature vector) <inline-formula id="inf47">
<mml:math id="m66">
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>. Furthermore, let <italic>v</italic>
<sub>
<italic>i</italic>
</sub> be the velocity field shared by each component of feature group <italic>&#x3d5;</italic>
<sub>
<italic>i</italic>
</sub> and let us also denote with <bold>
<italic>&#x3d5;</italic>
</bold> &#x3d; (<italic>&#x3d5;</italic>
<sub>1</sub>, <italic>&#x2026;</italic>, <italic>&#x3d5;</italic>
<sub>
<italic>n</italic>
</sub>) and with <bold>
<italic>v</italic>
</bold> &#x3d; (<italic>v</italic>
<sub>1</sub>, <italic>&#x2026;</italic>, <italic>v</italic>
<sub>
<italic>n</italic>
</sub>). We can then impose the following generalization of the smoothness term (6), used for the classical optical flow estimation, to the velocities and the corresponding visual features:<disp-formula id="e20">
<mml:math id="m67">
<mml:mi>E</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mo>&#x222b;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x393;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mfenced open="&#x2016;" close="&#x2016;">
<mml:mrow>
<mml:mo>&#x2207;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mfenced open="|" close="|">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2207;</mml:mo>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mfenced open="&#x2016;" close="&#x2016;">
<mml:mrow>
<mml:mo>&#x2207;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="0.17em"/>
<mml:mi>d</mml:mi>
<mml:mi>&#x3bc;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<label>(20)</label>
</disp-formula>
</p>
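<p>A discrete counterpart of Eq. 20 can clarify how the three terms interact. The sketch below evaluates the regularizer on one feature group over a single frame, with image gradients approximated by finite differences (array shapes, weights, and values are illustrative assumptions, not part of the formulation above):</p>

```python
import numpy as np

def smoothness_energy(phi, v, lam_phi=0.1, lam_grad=0.1, dx=1.0):
    """Discrete sketch of the regularizer E of Eq. 20 for one feature group
    on a single frame: phi has shape (m_i, H, W), v has shape (2, H, W).
    lam_phi and lam_grad play the roles of lambda_phi and lambda_nabla."""
    grad_v = np.stack(np.gradient(v, dx, axis=(1, 2)))      # (2, 2, H, W)
    grad_phi = np.stack(np.gradient(phi, dx, axis=(1, 2)))  # (2, m_i, H, W)
    density = ((grad_v ** 2).sum(axis=(0, 1))
               + lam_phi * (phi ** 2).sum(axis=0)
               + lam_grad * (grad_phi ** 2).sum(axis=(0, 1)))
    return 0.5 * density.sum() * dx * dx   # integral over the retina

phi = np.random.default_rng(2).normal(size=(4, 16, 16))
v = np.zeros((2, 16, 16))   # still flow: only the feature terms contribute
print(smoothness_energy(phi, v) > 0)   # True for non-null features
```

Note that the term weighted by `lam_phi` vanishes only for null features, which is consistent with the "default value" argument made below.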
<p>Here, the notation <inline-formula id="inf48">
<mml:math id="m68">
<mml:msup>
<mml:mrow>
<mml:mfenced open="&#x2016;" close="&#x2016;">
<mml:mrow>
<mml:mi>Z</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> (generic argument matrix <italic>Z</italic>) means <inline-formula id="inf49">
<mml:math id="m69">
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false" form="prefix">&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msubsup>
<mml:mrow>
<mml:mi>Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula>, while <italic>&#x3bb;</italic>
<sub>
<italic>&#x3c6;</italic>
</sub>, <italic>&#x3bb;</italic>
<sub>&#x2207;</sub> are positive constants that express the relative weight of the regularization terms. First of all, notice that <italic>E</italic> is a functional of the pairs {(<italic>&#x3d5;</italic>
<sub>
<italic>i</italic>
</sub>, <italic>v</italic>
<sub>
<italic>i</italic>
</sub>)}, that is, once they are given, we can compute <italic>E</italic>(<bold>
<italic>&#x3d5;</italic>
</bold>, <bold>
<italic>v</italic>
</bold>). On the contrary, the index used to regularize the classical Horn&#x2013;Schunck optical flow <xref ref-type="disp-formula" rid="e6">Eq. 6</xref> only depends on <italic>v</italic>. The dependence on visual features and their temporal dynamics in <xref ref-type="disp-formula" rid="e20">Eq. 20</xref> is explained by considering that, while the brightness is given, the features are learned as time goes by, which is just another facet of the feature&#x2013;velocity conjugation. Moreover, it is worth mentioning that <italic>E</italic> only involves spatial smoothness and does not contain any time regularization term. A further difference with respect to the classic optical flow regularization <xref ref-type="disp-formula" rid="e6">Eq. 6</xref> is the penalizing term <inline-formula id="inf50">
<mml:math id="m70">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> which favors the development of <italic>&#x3d5;</italic>
<sub>
<italic>i</italic>
</sub> &#x3d; 0. Of course, there is no such requirement in classic optical flow since, as already stated, the brightness is given. On the contrary, the discovery of visual features is expected to be driven by motion information, but their &#x201c;default value&#x201d; is expected to be null. It is readily seen that the introduction of the regularization term <xref ref-type="disp-formula" rid="e20">Eq. 20</xref> does not suffice to achieve a well-posed learning problem: the motion invariance condition <xref ref-type="disp-formula" rid="e9">Eq. 9</xref> is indeed still satisfied by the trivial constant solution <italic>&#x3c6;</italic> &#x3d; <italic>c</italic>
<sub>
<italic>&#x3c6;</italic>
</sub>.</p>
<p>Important additional information comes from the need to exhibit the human visual skill of reconstructing pictures from our symbolic representation. At a certain level of abstraction, the features gained by motion invariance possess a degree of semantics that is needed to interpret the scene. However, visual agents are also expected to deal with actions and react accordingly. As such, a general cognitive task that visual agents are expected to carry out is that of predicting what will happen next, which translates into the capability of guessing the next few incoming frames of the scene. We can think of a predictive computational scheme based on the <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> codes <inline-formula id="inf51">
<mml:math id="m71">
<mml:mi>y</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3d5;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf52">
<mml:math id="m72">
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>:</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#xd7;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mo>&#x2192;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> and the prediction <italic>y</italic> needs to satisfy the condition established by the index<disp-formula id="e21">
<mml:math id="m73">
<mml:mi>R</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:msub>
<mml:mrow>
<mml:mo>&#x222b;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x393;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mspace width="0.17em"/>
<mml:mi>d</mml:mi>
<mml:mi>&#x3bc;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<label>(21)</label>
</disp-formula>
</p>
<p>Of course, if the visual agent gains the capability of predicting what will come next, then the internal representation developed on top of the features <bold>
<italic>&#x3d5;</italic>
</bold> cannot correspond to the aforementioned trivial solution. Interestingly, it looks like visual perception does not come alone: the paired skill of visual prediction that animals typically exhibit helps regularize the problem of developing invariant features. Clearly, the solution <italic>&#x3c6;</italic> &#x3d; <italic>c</italic>
<sub>
<italic>&#x3c6;</italic>
</sub>, which satisfies motion invariance, is no longer acceptable, since it does not reconstruct the input. This motivates the involvement of prediction skills, typical of action, which, again, appear to be intertwined with perception.</p>
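<p>The role of the reconstruction index Eq. 21 in ruling out the constant solution can be made concrete in a toy discretization (frame sizes and values are illustrative assumptions):</p>

```python
import numpy as np

def reconstruction_index(y, I):
    """Discrete sketch of R in Eq. 21: squared error between the
    prediction y built from the features and the brightness I."""
    return 0.5 * ((y - I) ** 2).sum()

I = np.random.default_rng(3).uniform(size=(16, 16))   # a non-uniform frame
y_const = np.full_like(I, I.mean())   # the best constant prediction
# A constant internal representation can only emit a constant prediction,
# which cannot drive R to zero on a non-uniform frame:
print(reconstruction_index(y_const, I) > 0)   # True
print(reconstruction_index(I, I) == 0)        # exact reconstruction
```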
<p>Having described all the regularization terms necessary for the well-posedness of the learning problem, we can introduce the following functional:<disp-formula id="e22">
<mml:math id="m74">
<mml:mi>S</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3d5;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold-italic">v</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>A</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3d5;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold-italic">v</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>E</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>E</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3d5;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold-italic">v</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>R</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3d5;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold-italic">v</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(22)</label>
</disp-formula>where <italic>A</italic>(<bold>
<italic>&#x3d5;</italic>
</bold>, <bold>
<italic>v</italic>
</bold>) is the direct generalization to feature groups of <xref ref-type="disp-formula" rid="e10">Eq. 10</xref>, that is <inline-formula id="inf53">
<mml:math id="m75">
<mml:mi>A</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3d5;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold-italic">v</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:msubsup>
<mml:mrow>
<mml:mo movablelimits="false" form="prefix">&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mo>&#x222b;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msup>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mi>d</mml:mi>
<mml:mi>&#x3bc;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>. Here, <italic>&#x3bb;</italic>
<sub>
<italic>R</italic>
</sub> and <italic>&#x3bb;</italic>
<sub>
<italic>E</italic>
</sub> &#x3e; 0 are the regularization parameters. Learning to see means to discover the indissoluble pair (<bold>
<italic>&#x3d5;</italic>
</bold>
<sup>&#x22c6;</sup>, <bold>
<italic>v</italic>
</bold>
<sup>&#x22c6;</sup>) such that<disp-formula id="e23">
<mml:math id="m76">
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x22c6;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x03BD;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x22c6;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mi>arg</mml:mi>
<mml:mspace width="0.17em"/>
<mml:mi>min</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3d5;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold-italic">&#x03BD;</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:munder>
<mml:mi>S</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3d5;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold-italic">&#x03BD;</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<label>(23)</label>
</disp-formula>
</p>
<p>Basically, the minimization is expected to return the pair (<bold>
<italic>&#x3d5;</italic>
</bold>, <bold>
<italic>v</italic>
</bold>), whose terms should nominally be conjugated. The case in which we restrict ourselves to the brightness alone, that is when the only <bold>
<italic>&#x3d5;</italic>
</bold> is <italic>I</italic>, corresponds to the classic problem of optical flow estimation. Of course, in this case the term <italic>R</italic> is absent and the problem has a classic solution. Another special case is when there is no motion, so that the integrand in the definition <xref ref-type="disp-formula" rid="e10">Eq. 10</xref> of <italic>A</italic> is simply null <italic>&#x2200;i</italic> &#x3d; 1, <italic>&#x2026;</italic>, <italic>n</italic>. In this case, the learning problem reduces to the unsupervised extraction of the features <bold>
<italic>&#x3d5;</italic>
</bold>.</p>
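<p>The structure of Eq. 22 can be sketched in a discrete setting. The code below assembles <italic>S</italic> from its three terms and checks, on a feature map rigidly shifted by one pixel, that the invariance term <italic>A</italic> scores the conjugate velocity lower than a wrong one (grid, velocities, and weights are illustrative assumptions):</p>

```python
import numpy as np

def consistency_A(phi0, phi1, v):
    """Discrete sketch of the invariance term A for a single feature map:
    squared residual of d_t phi + v . grad phi over one frame pair
    (unit grid spacing and unit time step assumed)."""
    d_t = phi1 - phi0
    gy, gx = np.gradient(phi0)
    return 0.5 * ((d_t + v[0] * gy + v[1] * gx) ** 2).sum()

def total_S(A, E, R, lam_E=0.1, lam_R=1.0):
    """Eq. 22: S(phi, v) = A + lambda_E * E + lambda_R * R."""
    return A + lam_E * E + lam_R * R

# A smooth feature map rigidly shifted by one pixel along the x axis:
j = np.arange(64)
phi0 = np.sin(2 * np.pi * j / 64)[None, :] * np.ones((64, 1))
phi1 = np.roll(phi0, 1, axis=1)
A_true = consistency_A(phi0, phi1, v=(0.0, 1.0))   # conjugate velocity
A_zero = consistency_A(phi0, phi1, v=(0.0, 0.0))   # wrong velocity
print(A_true < A_zero)   # True: the conjugate flow scores lower
```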
<p>The learning of <italic>&#x3c8;</italic>
<sub>
<italic>i</italic>
</sub> can be based on a formulation that closely resembles what has been done for <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub>, for which we have already considered the regularization issues. In the case of <italic>&#x3c8;</italic>
<sub>
<italic>i</italic>
</sub> we can get rid of the trivial constant solution by minimizing<disp-formula id="e24">
<mml:math id="m77">
<mml:msub>
<mml:mrow>
<mml:mi>I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mo>&#x222b;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x393;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="0.17em"/>
<mml:mi>d</mml:mi>
<mml:mi>&#x3bc;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:math>
<label>(24)</label>
</disp-formula>which comes from the p-norm translation (<xref ref-type="bibr" rid="B24">Gori and Melacci, 2013</xref>; <xref ref-type="bibr" rid="B23">Gnecco et&#x20;al., 2015</xref>) of the logic implication <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> &#x21d2; <italic>&#x3c8;</italic>
<sub>
<italic>i</italic>
</sub>. Here we are assuming that <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub>, <italic>&#x3c8;</italic>
<sub>
<italic>i</italic>
</sub> range in [0, 1], so that whenever <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> gets close to 1, it forces the same for <italic>&#x3c8;</italic>
<sub>
<italic>i</italic>
</sub>. This yields a well-posed formulation thus avoiding the trivial solution.</p>
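<p>The p-norm translation of the implication in Eq. 24 behaves as described: the penalty vanishes when the consequent holds and grows wherever the antecedent fires without the consequent. A toy discretization (field values are illustrative):</p>

```python
import numpy as np

def implication_loss(phi, psi):
    """Discrete sketch of Eq. 24: the p-norm translation of the logic
    implication phi_i => psi_i, i.e. the integral of (1 - psi_i) * phi_i,
    with both fields ranging in [0, 1]."""
    return ((1.0 - psi) * phi).sum()

phi = np.random.default_rng(4).uniform(size=(8, 8))
print(implication_loss(phi, np.ones_like(phi)) == 0)   # psi = 1: implication satisfied
print(implication_loss(phi, np.zeros_like(phi)) > 0)   # psi = 0 penalized where phi fires
```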
<p>As far as the <italic>&#x3c7;</italic> are concerned, like for <italic>&#x3c8;</italic>, we ask for the minimization of<disp-formula id="e25">
<mml:math id="m78">
<mml:msub>
<mml:mrow>
<mml:mi>I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c7;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mo>&#x222b;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x393;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c7;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3ba;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3ba;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="0.17em"/>
<mml:mi>d</mml:mi>
<mml:mi>&#x3bc;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:math>
<label>(25)</label>
</disp-formula>that comes from the p-norm translation of the logic implication <italic>&#x3c8;</italic>
<sub>
<italic>&#x3ba;</italic>
</sub> &#x21d2; <italic>&#x3c7;</italic>
<sub>
<italic>&#x3ba;</italic>
</sub>. While this regularization term ties the value of <italic>&#x3c7;</italic>
<sub>
<italic>&#x3ba;</italic>
</sub> to the corresponding <italic>&#x3c8;</italic>
<sub>
<italic>&#x3ba;</italic>
</sub>, notice that the motion invariance condition (18) does not grant any privileged role to the <italic>firing feature &#x3c8;</italic>
<sub>
<italic>&#x3ba;</italic>
</sub>.</p>
</sec>
<sec id="s3-4">
<title>3.4 Deep Networks-Based Realization of Vision Fields</title>
<p>In the previous sections we discussed invariance properties of visual features that lead to modeling the processes of computational vision as transport equations on the visual fields, see <xref ref-type="disp-formula" rid="e9">Eqs 9</xref>, <xref ref-type="disp-formula" rid="e15">15</xref>, <xref ref-type="disp-formula" rid="e17">17</xref>, <xref ref-type="disp-formula" rid="e18">18</xref>. Some of those properties are based on the concept of consistency under motion, while others lead to a generalization of the concept of affordance. In this section we discuss how the features <italic>&#x3c6;</italic>, <italic>&#x3c8;</italic> and <italic>&#x3c7;</italic>, along with the velocity fields <italic>v</italic>
<sub>
<italic>&#x3c6;</italic>
</sub>, can be represented in terms of neural networks that operate on a visual stream and how the above theory can be interpreted in a classical framework of machine learning.</p>
<p>The first step consists in moving to a discretized frame spatial support <inline-formula id="inf54">
<mml:math id="m79">
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x22c4;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>:</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>&#x3c;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>,</mml:mo>
<mml:mspace width="1em"/>
<mml:mn>0</mml:mn>
<mml:mo>&#x3c;</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>h</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>. As a consequence, the fields <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> and the velocities can be viewed as vector-valued functions of time <inline-formula id="inf55">
<mml:math id="m80">
<mml:mi>t</mml:mi>
<mml:mo>&#x21a6;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x22c4;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> and <inline-formula id="inf56">
<mml:math id="m81">
<mml:mi>t</mml:mi>
<mml:mo>&#x21a6;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x22c4;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>; similarly the discretized brightness can be seen as a map <inline-formula id="inf57">
<mml:math id="m82">
<mml:mi>t</mml:mi>
<mml:mo>&#x21a6;</mml:mo>
<mml:mi>I</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x22c4;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>.<xref ref-type="fn" rid="fn3">
<sup>3</sup>
</xref> Then, features <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub> (and similarly the fields <italic>&#x3c8;</italic> and <italic>&#x3c7;</italic>) can be modelled as neural networks <inline-formula id="inf58">
<mml:math id="m83">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>:</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x22c4;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2192;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x22c4;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> that, given the brightness <italic>I</italic> at a certain instant and a set of <italic>N</italic> weights <inline-formula id="inf59">
<mml:math id="m84">
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> yield the value <inline-formula id="inf60">
<mml:math id="m85">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> of <italic>&#x3c6;</italic>
<sub>
<italic>i</italic>
</sub>. Similarly, the velocities <italic>v</italic>
<sub>
<italic>&#x3c6;</italic>
</sub> can be estimated by a neural network <inline-formula id="inf61">
<mml:math id="m86">
<mml:msub>
<mml:mrow>
<mml:mi>V</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>:</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x22c4;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msup>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x22c4;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2192;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a9;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x22c4;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> that takes as inputs the temporal partial derivative <inline-formula id="inf62">
<mml:math id="m87">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>I</mml:mi>
</mml:mrow>
<mml:mo>&#x307;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> (that is, the discrete version of the term <italic>&#x2202;</italic>
<sub>
<italic>t</italic>
</sub>
<italic>I</italic>), the discrete spatial gradient &#x2207;<italic>I</italic> of <italic>I</italic> and a given set of <italic>M</italic> weights <inline-formula id="inf63">
<mml:math id="m88">
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> in order to predict <inline-formula id="inf64">
<mml:math id="m89">
<mml:msub>
<mml:mrow>
<mml:mi>V</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>I</mml:mi>
</mml:mrow>
<mml:mo>&#x307;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mo>&#x2207;</mml:mo>
<mml:mi>I</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> as the velocity field <inline-formula id="inf65">
<mml:math id="m90">
<mml:msub>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>.</p>
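As an illustration of the inputs that the velocity network consumes, the following sketch (our own, not the authors' implementation; the synthetic frames and all names are assumptions) replaces the neural network with a least-squares estimate: given the discrete temporal derivative and spatial gradient of a translating pattern, it recovers a single global velocity from the brightness-constancy constraint.

```python
import numpy as np

h, w = 32, 32
ys, xs = np.mgrid[0:h, 0:w].astype(float)

def frame(t):
    # periodic brightness pattern translating with true velocity (vx, vy) = (1, 0)
    return np.sin(2 * np.pi * (xs - t) / w) * np.cos(2 * np.pi * ys / h)

I0, I1 = frame(0.0), frame(1.0)
I_dot = I1 - I0                                                 # discrete temporal derivative
gx = (np.roll(I0, -1, axis=1) - np.roll(I0, 1, axis=1)) / 2.0   # central-difference d I / dx
gy = (np.roll(I0, -1, axis=0) - np.roll(I0, 1, axis=0)) / 2.0   # central-difference d I / dy

# least-squares stand-in for the network V: solve I_dot + gx*vx + gy*vy = 0
A = np.stack([gx.ravel(), gy.ravel()], axis=1)
vx, vy = np.linalg.lstsq(A, -I_dot.ravel(), rcond=None)[0]
# for this periodic translating pattern the estimate recovers v = (1, 0)
```

A learned model such as FlowNet plays the same role, mapping the pair (İ, ∇I) to a per-pixel velocity field instead of a single global one.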
<p>It should be noted that, within this framework, the learning problem for the fields <italic>&#x3c6;</italic>, <italic>&#x3c8;</italic>, <italic>&#x3c7;</italic> and <italic>v</italic>
<sub>
<italic>&#x3c6;</italic>
</sub>, which is based on the principles described in this paper and is defined by an optimization problem of the form of <xref ref-type="disp-formula" rid="e23">Eq. 23</xref>, becomes a finite-dimensional learning problem on the weights of the neural models.</p>
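A toy instance of this reduction (entirely our own sketch; the tiny filter, the target field, and the numeric gradients are illustrative assumptions) treats a feature as a sigmoid of a 3&#xd7;3 circular correlation, so that the functional over fields becomes an ordinary objective J(w) over nine weights, amenable to gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(I, w):
    # toy feature field: sigmoid of a 3x3 circular correlation with weights w
    out = np.zeros_like(I)
    for a in range(3):
        for b in range(3):
            out += w[a, b] * np.roll(I, (-a, -b), axis=(0, 1))
    return 1.0 / (1.0 + np.exp(-out))

I = rng.random((12, 12))
w_true = rng.standard_normal((3, 3)) * 0.5
target = phi(I, w_true)          # stand-in for the field the feature should produce

def J(w):
    # the functional over fields collapses to an objective on 9 numbers
    return float(np.mean((phi(I, w) - target) ** 2))

def num_grad(w, eps=1e-5):
    # central finite differences; a real implementation would backpropagate
    g = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        wp, wm = w.copy(), w.copy()
        wp[idx] += eps
        wm[idx] -= eps
        g[idx] = (J(wp) - J(wm)) / (2 * eps)
    return g

w = np.zeros((3, 3))
J0 = J(w)
for _ in range(200):
    w -= 0.5 * num_grad(w)       # plain gradient descent on the weights
J1 = J(w)                        # the loss decreases: learning acts on weights, not fields
```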
<p>Thus, learning is affected by the structure (i.e., the architecture) of the network that we choose. Recent successes of deep learning in computer vision suggest that natural choices for &#x3a6; are Deep Convolutional Neural Networks (DCNNs). More precisely, the features extracted at level <italic>&#x2113;</italic> of a DCNN can be identified with a group of &#x3a6;<sub>
<italic>i</italic>
</sub>; this establishes a hierarchy among features that, in turn, suggests a natural way to perform the grouping operation discussed in <xref ref-type="sec" rid="s3-1">Section 3.1</xref>: features at the same level of a CNN share the same velocity (see <xref ref-type="fig" rid="F6">Figure&#x20;6</xref>). As for the velocities, CNN-based architectures like the one employed by FlowNet (see <xref ref-type="bibr" rid="B18">Fischer et&#x20;al. (2015)</xref>) have already proven suitable for modelling velocity fields.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Different visual features along with the corresponding velocity fields.</p>
</caption>
<graphic xlink:href="frai-04-768516-g006.tif"/>
</fig>
<p>It is also important to bear in mind that the choice of a specific neural architecture has strong repercussions on the way the invariance conditions are satisfied. For instance, let us consider the case of convolutional networks together with the fundamental condition expressed by <xref ref-type="disp-formula" rid="e9">Eq. 9</xref>. In this case, since CNNs are equivariant under translations, any feature that tracks a uniformly translating motion of the brightness will automatically satisfy <xref ref-type="disp-formula" rid="e9">Eq. 9</xref> with the same velocity as the translating input.</p>
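This equivariance claim is easy to verify numerically. In the sketch below (our own check, with a hand-rolled circular cross-correlation standing in for a convolutional layer, and wrap-around boundaries as a simplifying assumption), filtering a translated image gives exactly the translated filtered image:

```python
import numpy as np

def circ_corr(I, K):
    # circular cross-correlation of image I with a small kernel K
    out = np.zeros_like(I)
    for a in range(K.shape[0]):
        for b in range(K.shape[1]):
            out += K[a, b] * np.roll(I, (-a, -b), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
I = rng.random((16, 16))
K = rng.random((3, 3))
shift = (5, 3)

translated_then_filtered = circ_corr(np.roll(I, shift, axis=(0, 1)), K)
filtered_then_translated = np.roll(circ_corr(I, K), shift, axis=(0, 1))
# the two feature maps coincide: the feature follows the translating
# input with the same velocity, as required by Eq. 9
```

With realistic (non-circular) padding the identity holds away from the image border, which is why equivariance is usually stated up to boundary effects.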
</sec>
</sec>
<sec id="s4">
<title>4 Discussion</title>
<p>In this paper we have proposed motion invariance principles that lead to the discovery of identification features and of more abstract features that are somewhat inspired by the notion of affordance. Those principles are expressed by motion invariance equations that characterize the relevant visual field interactions. The conjunction with the features <italic>&#x3c6;</italic> leads us to believe that those features and their own velocities represent an indissoluble pair. Basically, the presence of a visual feature in the frame spatial support corresponds with its own optical flow, so that the two must be jointly detectable. At first sight, in the case of a visual agent with no relative movement with respect to an object, this sounds odd: what if the object is fixed? Interestingly, foveate animals always experience movement in the frame of reference of their eyes, and something similar can be reproduced in computers by simulating focus of attention <xref ref-type="bibr" rid="B45">Zanca et&#x20;al. (2019)</xref>, <xref ref-type="bibr" rid="B15">Faggi et&#x20;al. (2020)</xref>. Hence, apart from the case of saccadic movements, foveate animals, like haplorhine primates, are constantly exposed to motion, and the conjugation of features with the corresponding optical flow does not result in trivial conditions.</p>
<p>The overall field interaction of features and velocities composes a more abstract picture since, in the extreme case of features that represent objects, as already pointed out, we see the emergence of the classic notion of affordance. Interestingly, the described mechanisms of field interaction go well beyond the connection with such a high-level cognitive notion. We can promptly realize that it is impossible to tell whether the discussed field interactions come from different objects or are in fact generated within the same object. Overall, the discussed field interactions represent a natural mechanism for transmitting information from the video through local mechanisms.</p>
<p>It has been shown that, in order to achieve well-posedness of the motion invariance problems of <xref ref-type="disp-formula" rid="e9">Eqs 9</xref>, <xref ref-type="disp-formula" rid="e15">15</xref>, <xref ref-type="disp-formula" rid="e17">17</xref>, <xref ref-type="disp-formula" rid="e18">18</xref>, we need to involve appropriate regularization. In particular, the development of the visual features <italic>&#x3c6;</italic> requires the corresponding minimization of <xref ref-type="disp-formula" rid="e21">Eq. 21</xref>, which indicates the need to involve action together with perception. Indeed, visual perception coupled with gaze shifts should be considered the <italic>Drosophila</italic> of perception-action loops. Among the variety of active behaviors that an organism can fluently engage in to purposively act upon and perceive the world (e.g., moving the body, turning the head, manipulating objects), oculomotor behavior is the minimal, least-energy unit. The history of these ideas has been recently reviewed in <xref ref-type="bibr" rid="B6">Bajcsy et&#x20;al. (2018)</xref>. In their early days, such computational approaches were pervaded by the work of <xref ref-type="bibr" rid="B48">Gibson (1950)</xref>, who proposed that perception results from the combination of the environment in which an agent exists and the way that agent interacts with it. He was primarily interested in the optic flow generated on the frame spatial support when moving through the environment (as when flying), realizing that it was the path of motion itself that enabled the perception of specific elements while disabling the perception of others. That path of motion was under the control of the agent, and thus the agent chooses how it perceives its world and what is perceived within it <xref ref-type="bibr" rid="B6">Bajcsy et&#x20;al. (2018)</xref>. The basic idea of Gibson&#x2019;s view was that of the exploratory behaviour of the agent. 
It is worth noting that, despite the pioneering work of <xref ref-type="bibr" rid="B1">Aloimonos et&#x20;al. (1988)</xref>, <xref ref-type="bibr" rid="B7">Ballard (1991)</xref>, and <xref ref-type="bibr" rid="B5">Bajcsy and Campos (1992)</xref>, gaze dynamics has been by and large overlooked in computer vision. The current state of affairs is that most effort is spent on salience modelling <xref ref-type="bibr" rid="B11">Borji (2021)</xref>, <xref ref-type="bibr" rid="B10">Borji and Itti (2013)</xref> as a tool for predicting where/what to look at (the tacit, though questionable, assumption being that, once suitably computed, salience would be predictive of gaze). Interestingly enough, and rooted in the animate vision approach, Ballard set out the idea of predictive coding <xref ref-type="bibr" rid="B37">Rao and Ballard (1999)</xref>:</p>
<p>We describe a model of visual processing in which feedback connections from a higher- to a lower-order visual cortical area carry predictions of lower-level neural activities, whereas the feedforward connections carry the residual errors between the predictions and the actual lower-level activities. When exposed to natural images, a hierarchical network of model neurons implementing such a model developed simple cell-like receptive fields. A subset of neurons responsible for carrying the residual errors showed endstopping and other extra-classical receptive field effects. These results suggest that rather than being exclusively feedforward phenomena, nonclassical surround effects in the visual cortex may also result from cortico-cortical feedback as a consequence of the visual system using an efficient hierarchical strategy for encoding natural images.</p>
<p>This idea has gained currency in recent research covering many fields from theoretical cognitive neuroscience (e.g., <xref ref-type="bibr" rid="B29">Knill and Pouget, 2004</xref>; <xref ref-type="bibr" rid="B32">Ma et&#x20;al., 2006</xref>) to philosophy <xref ref-type="bibr" rid="B12">Clark (2013)</xref>. Currently, the most influential approach in this perspective has been proposed by Friston (e.g., <xref ref-type="bibr" rid="B17">Feldman and Friston, 2010</xref>; <xref ref-type="bibr" rid="B19">Friston, 2010</xref>) who considered a variational approximation to Bayesian inference and prediction (free energy minimization, minimization of action functionals,&#x20;etc).</p>
<p>The principles of visual feature flow introduced in this paper might also have an impact on computer vision, since one can reasonably expect the proposed invariances to overcome one of the major current limitations of supervised learning paradigms, namely the need for a huge amount of labeled examples. This being said, deep neural networks, with their powerful approximation capabilities, could provide the ideal computational structure to complete the theoretical framework proposed here.</p>
</sec>
</body>
<back>
<sec id="s5">
<title>Data Availability Statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.</p>
</sec>
<sec id="s6">
<title>Author Contributions</title>
<p>All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.</p>
</sec>
<sec sec-type="COI-statement" id="s7">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s8">
<title>Publisher&#x2019;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ack>
<p>This work was partially supported by the PRIN 2017 project RexLearn (Reliable and Explainable Adversarial Machine Learning), funded by the Italian Ministry of Education, University and Research (grant no. 2017TWNMH2).</p>
</ack>
<fn-group>
<fn id="fn1">
<label>1</label>
<p>Under the assumption <inline-formula id="inf66">
<mml:math id="m91">
<mml:mtext>rank</mml:mtext>
<mml:mo>&#x2207;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>G</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>rank</mml:mtext>
<mml:mfenced open="(" close="">
<mml:mrow>
<mml:mo>&#x2207;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>G</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mfenced open="|" close=")">
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x2202;</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>G</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>/</mml:mo>
<mml:mi>&#x2202;</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula>.</p>
</fn>
<fn id="fn2">
<label>2</label>
<p>For example, we can understand that a person is sitting on or standing up from a chair just considering a still&#x20;image.</p>
</fn>
<fn id="fn3">
<label>3</label>
<p>Here we are overloading the symbols <italic>&#x3c6;</italic>, <italic>v</italic>
<sub>
<italic>&#x3c6;</italic>
</sub> and <italic>I</italic> in order to avoid a cumbersome notation. In the previous sections <italic>&#x3c6;</italic>, <italic>v</italic>
<sub>
<italic>&#x3c6;</italic>
</sub> and <italic>I</italic> are functions defined over the spatio-temporal cylinder &#x3a9; &#xd7; [0, <italic>T</italic>]; here they are instead regarded as vector-valued functions of time&#x20;only.</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Aloimonos</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Weiss</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Bandyopadhyay</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>1988</year>). <article-title>Active Vision</article-title>. <source>Int. J.&#x20;Comput. Vis.</source> <volume>1</volume>, <fpage>333</fpage>&#x2013;<lpage>356</lpage>. <pub-id pub-id-type="doi">10.1007/bf00133571</pub-id> </citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Anselmi</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Leibo</surname>
<given-names>J.&#x20;Z.</given-names>
</name>
<name>
<surname>Rosasco</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Mutch</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Tacchetti</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Poggio</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Unsupervised Learning of Invariant Representations</article-title>. <source>Theor. Comput. Sci.</source> <volume>633</volume>, <fpage>112</fpage>&#x2013;<lpage>121</lpage>. <pub-id pub-id-type="doi">10.1016/j.tcs.2015.06.048</pub-id> </citation>
</ref>
<ref id="B3">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ard&#xf3;n</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Pairet</surname>
<given-names>&#xc8;.</given-names>
</name>
<name>
<surname>Lohan</surname>
<given-names>K. S.</given-names>
</name>
<name>
<surname>Ramamoorthy</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Petrick</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2020</year>). <source>Affordances in Robotic Tasks&#x2013;A Survey</source>. <comment>arXiv preprint arXiv:2004.07400</comment>. </citation>
</ref>
<ref id="B4">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Aubert</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Kornprobst</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2006</year>). <source>Mathematical Problems in Image Processing: Partial Differential Equations and the Calculus of Variations</source>, <volume>Vol. 147</volume>. <publisher-loc>New York City, NY</publisher-loc>: <publisher-name>Springer Science &#x26; Business Media</publisher-name>. </citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bajcsy</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Campos</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>1992</year>). <article-title>Active and Exploratory Perception</article-title>. <source>CVGIP: Image Understanding</source> <volume>56</volume>, <fpage>31</fpage>&#x2013;<lpage>40</lpage>. <pub-id pub-id-type="doi">10.1016/1049-9660(92)90083-f</pub-id> </citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bajcsy</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Aloimonos</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Tsotsos</surname>
<given-names>J.&#x20;K.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Revisiting Active Perception</article-title>. <source>Auton. Robot</source> <volume>42</volume>, <fpage>177</fpage>&#x2013;<lpage>196</lpage>. <pub-id pub-id-type="doi">10.1007/s10514-017-9615-3</pub-id> </citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ballard</surname>
<given-names>D. H.</given-names>
</name>
</person-group> (<year>1991</year>). <article-title>Animate Vision</article-title>. <source>Artif. Intell.</source> <volume>48</volume>, <fpage>57</fpage>&#x2013;<lpage>86</lpage>. <pub-id pub-id-type="doi">10.1016/0004-3702(91)90080-4</pub-id> </citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Betti</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Gori</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Melacci</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Learning Visual Features under Motion Invariance</article-title>. <source>Neural Networks</source> <volume>126</volume>, <fpage>275</fpage>&#x2013;<lpage>299</lpage>. <pub-id pub-id-type="doi">10.1016/j.neunet.2020.03.013</pub-id> </citation>
</ref>
<ref id="B9">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bhaskaran</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Konstantinides</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>1997</year>). <source>Image and Video Compression Standards: Algorithms and Architectures</source>. <publisher-loc>Boston, MA</publisher-loc>: <publisher-name>Springer</publisher-name>. </citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Borji</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Itti</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>State-of-the-art in Visual Attention Modeling</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>35</volume>, <fpage>185</fpage>&#x2013;<lpage>207</lpage>. <pub-id pub-id-type="doi">10.1109/tpami.2012.89</pub-id> </citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Borji</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Saliency Prediction in the Deep Learning Era: Successes and Limitations</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>43</volume>, <fpage>679</fpage>&#x2013;<lpage>700</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2019.2935715</pub-id> </citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Clark</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Whatever Next? Predictive Brains, Situated Agents, and the Future of Cognitive Science</article-title>. <source>Behav. Brain Sci.</source> <volume>36</volume>, <fpage>181</fpage>&#x2013;<lpage>204</lpage>. <pub-id pub-id-type="doi">10.1017/s0140525x12000477</pub-id> </citation>
</ref>
<ref id="B13">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Cohen</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Welling</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Group Equivariant Convolutional Networks</article-title>,&#x201d; in <conf-name>International conference on machine learning (PMLR)</conf-name>, <conf-loc>New York City, NY</conf-loc>, <conf-date>June 20&#x2013;22, 2016</conf-date>, <fpage>2990</fpage>&#x2013;<lpage>2999</lpage>. </citation>
</ref>
<ref id="B14">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Denton</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Birodkar</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Unsupervised Learning of Disentangled Representations from Video</article-title>. <source>Advances in Neural Information Processing Systems</source>. Editors <person-group person-group-type="editor">
<name>
<surname>Guyon</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Luxburg</surname>
<given-names>U. V.</given-names>
</name>
<name>
<surname>Bengio</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Wallach</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Fergus</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Vishwanathan</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Garnett</surname>
<given-names>R.</given-names>
</name>
</person-group> (<publisher-loc>Red Hook, NY</publisher-loc>: <publisher-name>Curran Associates, Inc.</publisher-name>) <volume>30</volume>. <comment>Available at <ext-link ext-link-type="uri" xlink:href="https://proceedings.neurips.cc/paper/2017/file/2d2ca7eedf739ef4c3800713ec482e1a-Paper.pdf">https://proceedings.neurips.cc/paper/2017/file/2d2ca7eedf739ef4c3800713ec482e1a-Paper.pdf</ext-link>
</comment> </citation>
</ref>
<ref id="B15">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Faggi</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Betti</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Zanca</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Melacci</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Gori</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Wave Propagation of Visual Stimuli in Focus of Attention</article-title>. <source>CoRR</source>. <comment>arXiv preprint arXiv:2006.11035</comment>. </citation>
</ref>
<ref id="B16">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Farneb&#xe4;ck</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2003</year>). &#x201c;<article-title>Two-frame Motion Estimation Based on Polynomial Expansion</article-title>,&#x201d; in <conf-name>Scandinavian Conference on Image Analysis</conf-name>, <conf-loc>Halmstad, Sweden</conf-loc>, <conf-date>June 29&#x2013;July 2, 2003</conf-date> (<publisher-name>Springer</publisher-name>), <fpage>363</fpage>&#x2013;<lpage>370</lpage>. <pub-id pub-id-type="doi">10.1007/3-540-45103-x_50</pub-id> </citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Feldman</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Friston</surname>
<given-names>K. J.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>Attention, Uncertainty, and Free-Energy</article-title>. <source>Front. Hum. Neurosci.</source> <volume>4</volume>, <fpage>215</fpage>. <pub-id pub-id-type="doi">10.3389/fnhum.2010.00215</pub-id> </citation>
</ref>
<ref id="B18">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Fischer</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Dosovitskiy</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Ilg</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>H&#xe4;usser</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Haz&#x131;rba&#x15f;</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Golkov</surname>
<given-names>V.</given-names>
</name>
<etal/>
</person-group> (<year>2015</year>). <article-title>FlowNet: Learning Optical Flow with Convolutional Networks</article-title>. <source>Proceedings of the IEEE International Conference on Computer Vision</source>, <fpage>2758</fpage>&#x2013;<lpage>2766</lpage>. </citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Friston</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>The Free-Energy Principle: a Unified Brain Theory?</article-title> <source>Nat. Rev. Neurosci.</source> <volume>11</volume>, <fpage>127</fpage>&#x2013;<lpage>138</lpage>. <pub-id pub-id-type="doi">10.1038/nrn2787</pub-id> </citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gens</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Domingos</surname>
<given-names>P. M.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Deep Symmetry Networks</article-title>. <source>Adv. Neural Inf. Process. Syst.</source> <volume>27</volume>, <fpage>2537</fpage>&#x2013;<lpage>2545</lpage>. <pub-id pub-id-type="doi">10.5555/2969033.2969110</pub-id> </citation>
</ref>
<ref id="B48">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Gibson</surname>
<given-names>J. J.</given-names>
</name>
</person-group> (<year>1950</year>). <source>The Perception of the Visual World</source>. <publisher-loc>Boston, MA</publisher-loc>: <publisher-name>Houghton Mifflin</publisher-name>. </citation>
</ref>
<ref id="B21">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Gibson</surname>
<given-names>J.&#x20;J.</given-names>
</name>
</person-group> (<year>1966</year>). <source>The Senses Considered as Perceptual Systems</source>, <volume>Vol. 2</volume>. <publisher-loc>Boston, MA</publisher-loc>: <publisher-name>Houghton Mifflin</publisher-name>. </citation>
</ref>
<ref id="B22">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Gibson</surname>
<given-names>J.&#x20;J.</given-names>
</name>
</person-group> (<year>1979</year>). <source>The Ecological Approach to Visual Perception</source>. <publisher-loc>Boston, MA</publisher-loc>: <publisher-name>Houghton Mifflin</publisher-name>. </citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gnecco</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Gori</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Melacci</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Sanguineti</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Foundations of Support Constraint Machines</article-title>. <source>Neural Comput.</source> <volume>27</volume>, <fpage>388</fpage>&#x2013;<lpage>480</lpage>. <pub-id pub-id-type="doi">10.1162/neco_a_00686</pub-id> </citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gori</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Melacci</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Constraint Verification with Kernel Machines</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst.</source> <volume>24</volume>, <fpage>825</fpage>&#x2013;<lpage>831</lpage>. <pub-id pub-id-type="doi">10.1109/tnnls.2013.2241787</pub-id> </citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hassanin</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Khan</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Tahtali</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Visual Affordance and Function Understanding</article-title>. <source>ACM Comput. Surv.</source> <volume>54</volume>, <fpage>1</fpage>&#x2013;<lpage>35</lpage>. <pub-id pub-id-type="doi">10.1145/3446370</pub-id> </citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Horn</surname>
<given-names>B. K. P.</given-names>
</name>
<name>
<surname>Schunck</surname>
<given-names>B. G.</given-names>
</name>
</person-group> (<year>1981</year>). <article-title>Determining Optical Flow</article-title>. <source>Artif. Intell.</source> <volume>17</volume>, <fpage>185</fpage>&#x2013;<lpage>203</lpage>. <pub-id pub-id-type="doi">10.1016/0004-3702(81)90024-2</pub-id> </citation>
</ref>
<ref id="B27">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Hsieh</surname>
<given-names>J.-T.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>D.-A.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>F.-F.</given-names>
</name>
<name>
<surname>Niebles</surname>
<given-names>J.&#x20;C.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Learning to Decompose and Disentangle Representations for Video Prediction</article-title>. <source>NeurIPS</source>, <fpage>514</fpage>&#x2013;<lpage>524</lpage>. <comment>Available at <ext-link ext-link-type="uri" xlink:href="http://papers.nips.cc/paper/7333-learning-to-decompose-and-disentangle-representations-for-video-prediction">http://papers.nips.cc/paper/7333-learning-to-decompose-and-disentangle-representations-for-video-prediction</ext-link>
</comment>. </citation>
</ref>
<ref id="B28">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Ilg</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Mayer</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Saikia</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Keuper</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Dosovitskiy</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Brox</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Honolulu, HI</conf-loc>, <conf-date>July 21&#x2013;26, 2017</conf-date>, <fpage>2462</fpage>&#x2013;<lpage>2470</lpage>. <pub-id pub-id-type="doi">10.1109/cvpr.2017.179</pub-id> </citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Knill</surname>
<given-names>D. C.</given-names>
</name>
<name>
<surname>Pouget</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2004</year>). <article-title>The Bayesian Brain: the Role of Uncertainty in Neural Coding and Computation</article-title>. <source>Trends Neurosci.</source> <volume>27</volume>, <fpage>712</fpage>&#x2013;<lpage>719</lpage>. <pub-id pub-id-type="doi">10.1016/j.tins.2004.10.007</pub-id> </citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krizhevsky</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Sutskever</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Hinton</surname>
<given-names>G. E.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>ImageNet Classification with Deep Convolutional Neural Networks</article-title>. <source>Adv. Neural Inf. Process. Syst.</source> <volume>25</volume>, <fpage>1097</fpage>&#x2013;<lpage>1105</lpage>. </citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Soatto</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Video-based Descriptors for Object Recognition</article-title>. <source>Image Vis. Comput.</source> <volume>29</volume>, <fpage>639</fpage>&#x2013;<lpage>652</lpage>. <pub-id pub-id-type="doi">10.1016/j.imavis.2011.08.003</pub-id> </citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ma</surname>
<given-names>W. J.</given-names>
</name>
<name>
<surname>Beck</surname>
<given-names>J.&#x20;M.</given-names>
</name>
<name>
<surname>Latham</surname>
<given-names>P. E.</given-names>
</name>
<name>
<surname>Pouget</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2006</year>). <article-title>Bayesian Inference with Probabilistic Population Codes</article-title>. <source>Nat. Neurosci.</source> <volume>9</volume>, <fpage>1432</fpage>&#x2013;<lpage>1438</lpage>. <pub-id pub-id-type="doi">10.1038/nn1790</pub-id> </citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Milner</surname>
<given-names>A. D.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>How Do the Two Visual Streams Interact with Each Other?</article-title> <source>Exp. Brain Res.</source> <volume>235</volume>, <fpage>1297</fpage>&#x2013;<lpage>1308</lpage>. <pub-id pub-id-type="doi">10.1007/s00221-017-4917-4</pub-id> </citation>
</ref>
<ref id="B34">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Mobahi</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Collobert</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Weston</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2009</year>). &#x201c;<article-title>Deep Learning from Temporal Coherence in Video</article-title>,&#x201d; in <conf-name>Proceedings of the 26th Annual International Conference on Machine Learning</conf-name>, <conf-loc>Montreal, QC</conf-loc>, <conf-date>June 14&#x2013;18, 2009</conf-date>, <fpage>737</fpage>&#x2013;<lpage>744</lpage>. <pub-id pub-id-type="doi">10.1145/1553374.1553469</pub-id> </citation>
</ref>
<ref id="B35">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Pan</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Mei</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Rui</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Learning Deep Intrinsic Video Representation by Exploring Temporal Coherence and Graph Structure</article-title>,&#x201d; in <conf-name>IJCAI</conf-name>, <conf-loc>New York City, NY</conf-loc>, <conf-date>July 9&#x2013;15, 2016</conf-date>, (<comment>Citeseer</comment>), <fpage>3832</fpage>&#x2013;<lpage>3838</lpage>. </citation>
</ref>
<ref id="B36">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Pathak</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Girshick</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Doll&#xe1;r</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Darrell</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Hariharan</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Learning Features by Watching Objects Move</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Honolulu, HI</conf-loc>, <conf-date>July 21&#x2013;26, 2017</conf-date>, <fpage>2701</fpage>&#x2013;<lpage>2710</lpage>. <pub-id pub-id-type="doi">10.1109/cvpr.2017.638</pub-id> </citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rao</surname>
<given-names>R. P. N.</given-names>
</name>
<name>
<surname>Ballard</surname>
<given-names>D. H.</given-names>
</name>
</person-group> (<year>1999</year>). <article-title>Predictive Coding in the Visual Cortex: a Functional Interpretation of Some Extra-classical Receptive-Field Effects</article-title>. <source>Nat. Neurosci.</source> <volume>2</volume>, <fpage>79</fpage>&#x2013;<lpage>87</lpage>. <pub-id pub-id-type="doi">10.1038/4580</pub-id> </citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Redondo-Cabrera</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Lopez-Sastre</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Unsupervised Learning from Videos Using Temporal Coherency Deep Networks</article-title>. <source>Comput. Vis. Image Understanding</source> <volume>179</volume>, <fpage>79</fpage>&#x2013;<lpage>89</lpage>. <pub-id pub-id-type="doi">10.1016/j.cviu.2018.08.003</pub-id> </citation>
</ref>
<ref id="B39">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Tulyakov</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>M.-Y.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Kautz</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>MoCoGAN: Decomposing Motion and Content for Video Generation</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Salt Lake City, UT</conf-loc>, <conf-date>June 18&#x2013;23, 2018</conf-date>, <fpage>1526</fpage>&#x2013;<lpage>1535</lpage>. <pub-id pub-id-type="doi">10.1109/cvpr.2018.00165</pub-id> </citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Verri</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Poggio</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>1989</year>). <article-title>Motion Field and Optical Flow: Qualitative Properties</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>11</volume>, <fpage>490</fpage>&#x2013;<lpage>498</lpage>. <pub-id pub-id-type="doi">10.1109/34.24781</pub-id> </citation>
</ref>
<ref id="B41">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Verri</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>1987</year>). &#x201c;<article-title>Against Quantitative Optical Flow</article-title>,&#x201d; in <conf-name>Proc. First Int&#x2019;l Conf. Computer Vision</conf-name>, <conf-loc>London, UK</conf-loc>, <conf-date>June 8&#x2013;11, 1987</conf-date> (<publisher-loc>London</publisher-loc>), <fpage>171</fpage>&#x2013;<lpage>180</lpage>. </citation>
</ref>
<ref id="B42">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Villegas</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Hong</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Decomposing Motion and Content for Natural Video Sequence Prediction</article-title>. <source>ICLR</source>. <comment>arXiv preprint arXiv:1706.08033</comment>. </citation>
</ref>
<ref id="B43">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Gupta</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Unsupervised Learning of Visual Representations Using Videos</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE International Conference on Computer Vision</conf-name>, <conf-loc>Santiago, Chile</conf-loc>, <conf-date>December 7&#x2013;13, 2015</conf-date>, <fpage>2794</fpage>&#x2013;<lpage>2802</lpage>. <pub-id pub-id-type="doi">10.1109/iccv.2015.320</pub-id> </citation>
</ref>
<ref id="B44">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Bilinski</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Bremond</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Dantcheva</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>G3AN: Disentangling Appearance and Motion for Video Generation</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Seattle, WA</conf-loc>, <conf-date>June 13&#x2013;19, 2020</conf-date>, <fpage>5264</fpage>&#x2013;<lpage>5273</lpage>. <pub-id pub-id-type="doi">10.1109/cvpr42600.2020.00531</pub-id> </citation>
</ref>
<ref id="B45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zanca</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Melacci</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Gori</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Gravitational Laws of Focus of Attention</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>42</volume>, <fpage>2983</fpage>&#x2013;<lpage>2995</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2019.2920636</pub-id> </citation>
</ref>
<ref id="B46">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhai</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Xiang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Lv</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Kong</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Optical Flow and Scene Flow Estimation: A Survey</article-title>. <source>Pattern Recognit.</source> <volume>114</volume>, <fpage>107861</fpage>. <pub-id pub-id-type="doi">10.1016/j.patcog.2021.107861</pub-id> </citation>
</ref>
<ref id="B47">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zou</surname>
<given-names>W. Y.</given-names>
</name>
<name>
<surname>Ng</surname>
<given-names>A. Y.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2011</year>). &#x201c;<article-title>Unsupervised Learning of Visual Invariance with Temporal Coherence</article-title>,&#x201d; in <conf-name>NIPS 2011 Workshop on Deep Learning and Unsupervised Feature Learning</conf-name>, <conf-loc>Granada, Spain</conf-loc>, <conf-date>December 12&#x2013;17, 2011</conf-date>. </citation>
</ref>
</ref-list>
</back>
</article>