Visual Features and Their Own Optical Flow

Symmetries, invariances and conservation equations have always been an invaluable guide in science for modelling natural phenomena through simple yet effective relations. For instance, in computer vision, translation equivariance is typically a built-in property of the neural architectures used to solve visual tasks; networks whose computational layers implement this property are known as Convolutional Neural Networks (CNNs). This kind of mathematical symmetry, like many others that have been studied recently, is typically generated by some underlying group of transformations (translations in the case of CNNs, rotations, etc.) and is particularly suitable for processing highly structured data, such as molecules or chemical compounds, that are known to possess those specific symmetries. When dealing with video streams, common built-in equivariances can handle only a small fraction of the broad spectrum of transformations encoded in the visual stimulus and, therefore, the corresponding neural architectures have to resort to a huge amount of supervision in order to achieve good generalization capabilities. In this paper we formulate a theory of the development of visual features based on the idea that movement itself provides trajectories on which to impose consistency. We introduce the principle of Material Point Invariance, which states that each visual feature is invariant with respect to the associated optical flow, so that features and corresponding velocities are an indissoluble pair. Then, we discuss the interaction of features and velocities and show that certain motion invariance traits can be regarded as a generalization of the classical concept of affordance.
This analysis of feature-velocity interactions and their invariance properties leads to a visual field theory that expresses the dynamical constraints of motion coherence and might lead to the discovery of the joint evolution of the visual features along with the associated optical flows.


INTRODUCTION
Deep learning has revolutionized computer vision and visual perception. Among other factors, the great representational power of convolutional neural networks and the elegance and efficiency of backpropagation have played a crucial role (Krizhevsky et al., 2012). By and large, there is strong scientific recognition of their capabilities, which is very well deserved. However, an important but often overlooked aspect is that natural images are swamped by nuisance factors such as lighting, viewpoint, part deformation and background. This makes the overall recognition problem much more difficult (Lee and Soatto, 2011; Anselmi et al., 2016). Typical CNN architectures, which do not structurally model these possible variations, require a large amount of data with high variability to attain satisfactory generalization. Some recent works have addressed this aspect by focusing on the construction of features that are invariant (Gens and Domingos, 2014; Anselmi et al., 2016) or equivariant (Cohen and Welling, 2016) with respect to a priori specified symmetry groups of transformations. We argue that, when relying on massively supervised learning, we have been working on a problem that is, from a computational point of view, remarkably different from, and likely more difficult than, the one offered by Nature, where motion is in fact in charge of generating visual information. Motion is what offers us an object in all its poses. Classic translation, scale, and rotation invariances can clearly be obtained by appropriate movements of a given object. However, the experience of visual interaction due to motion goes well beyond the need for these invariances, and it includes object deformation as well as obstruction. Could it not be the case that motion is in fact nearly all we need for learning to see?
Current deep learning approaches based on supervised images mostly neglect the crucial role of temporal coherence, ending up with problems in which the extraction of visual concepts can only be based on spatial regularities. Temporal coherence plays a fundamental role in extracting meaningful visual features (Mobahi et al., 2009; Zou et al., 2011; Wang and Gupta, 2015; Pan et al., 2016; Redondo-Cabrera and Lopez-Sastre, 2019) and, more specifically, in video-based tasks such as video compression (Bhaskaran and Konstantinides, 1997). Some of these video-oriented works are specifically focused on disentangling content features (constant within the selected video clip) from pose and motion features (which encode information varying over time) (Denton and Birodkar, 2017; Villegas et al., 2017; Hsieh et al., 2018; Tulyakov et al., 2018; Wang et al., 2020). The problem of learning high-level features consistent with the way objects move was also addressed by Pathak et al. (2017) in the context of unsupervised foreground versus background object segmentation.
In this vein, we claim that feature learning arises mostly from motion invariance principles, which turn out to be fundamental for detecting object identity as well as for characterizing the interactions between features themselves. To see this, let us start by considering a moving object in a given visual scene. The object can be thought of as made up of different material points, each with its own identity that does not change during the object's motion. Consequently, the identity of the corresponding pixels must also remain constant along their apparent motion on the retina. We will express this idea in terms of feature fields (i.e. functions of the given pixel and the specific time instant) that are invariant along the trajectories defined by their conjugate velocity fields, thereby extending the classical brightness invariance principle for optical flow estimation (Horn and Schunck, 1981). Visual features and the corresponding optical flow fields make up an indissoluble pair linked by the motion invariance condition that drives the entire learning process. Each change in the visual features affects the associated velocity fields and vice versa. From a biological standpoint, recent studies have suggested that the ventral and dorsal pathways may not be as independent as originally thought (Milner, 2017). Following this insight, we endorse the joint discovery of visual features and the related optical flows, pairing their learning through a motion invariance constraint. Motion information confers not only object identity, but also affordance. As defined by Gibson in his seminal work (Gibson, 1966, 1979), affordances essentially characterize the relation between an agent and its environment and, given a certain object, correspond to the possible actions that can be executed upon it. A chair, for example, offers the affordance of seating a human being, but it can have other potential uses.
In other words, the way an agent interacts with a particular object is what defines its affordance, and this is strictly related to their relative motion. Extending and generalizing this classic notion of affordance to visual features, we will define the notion of affordance field, which describes the interaction between pairs of visual features. Essentially, these interactions are defined by the relative motion of the features themselves, so that the corresponding affordance fields will be required to be invariant with respect to such relative motion. Hence, in the rest of the paper, we will use the term affordance in this broader sense.
This paper is organized as follows. Section 2 focuses on classical methods for optical flow estimation. In this case, the brightness is given by the input video and the goal is to determine the corresponding optical flow through the brightness invariance condition. Typical regularization issues, which must be addressed in order to specify a unique velocity field, are also discussed. Section 3 is devoted to extending the previous approach to visual features. This time, features are not given in advance but are learnt jointly with the corresponding velocity fields. Features and velocities are tied by the motion invariance principle. After that, the classical notion of affordance by Gibson (1966, 1979) is introduced and extended to the case of visual features. Here too, motion invariance (with respect to relative velocities) plays a pivotal role in defining the corresponding affordance fields. At the end of Section 3, regularization issues are also considered and a formulation of the learning of the visual fields is sketched out, together with the description of a possible practical implementation of the proposed ideas through deep neural networks. Finally, Section 4 draws some conclusions.

OPTICAL FLOW
The fundamental problem of optical flow estimation has received a lot of attention in computer vision. In spite of the growing evidence of performance improvement (Fischer et al., 2015; Ilg et al., 2017; Zhai et al., 2021), the precise definition of the velocity to be attributed to each pixel is still open to question (Verri, 1987; Verri and Poggio, 1989; Aubert and Kornprobst, 2006). While a simple visual inspection of some recent top-level optical flow estimation systems clearly indicates remarkable performance, the definition of "optical flow" is difficult and quite slippery. Basically, we need to associate each pixel with its velocity. When temporal quantization is taken into account, any sound definition of such a velocity requires knowing where each pixel moves in the next frame. How can we trace any single pixel? Clearly, any such pixel corresponds to a "point" of an "object" in the visual environment, and the fundamental requirement to fulfill is that of tracking the point of the object.
An enlightening answer to this question was given by Horn and Schunck in a seminal paper published at the beginning of the eighties (Horn and Schunck, 1981). The basic assumption is the local-in-time constancy of the brightness intensity function $I$ along the trajectory $c_{x_0}$ of the point that occupies position $x_0$ at time $t_0$:

$$I(c_{x_0}(t), t) = I(x_0, t_0). \qquad (1)$$

Assuming smoothness, we can approximate this condition to the first order, taking into account only infinitesimal temporal distances, and obtain at $t = t_0$:

$$\partial_t I(x_0, t_0) + u(x_0, t_0) \cdot \nabla I(x_0, t_0) = 0, \qquad (2)$$

where $u(x_0, t_0) := (dc_{x_0}/dt)(t_0)$ is the optical flow and $\cdot$ is the standard scalar product in $\mathbb{R}^2$. Assumption Eq. 1 is reasonable when there are no occlusions and changes of the light source are assumed to be "small." Of course, in real-world applications of computer vision these conditions are not always met. On the other hand, it is clear that the optical flow $u$ could be derived from an invariance condition of the type Eq. 1 applied to different and possibly more "stable" visual features rather than to the brightness itself. As shown in Figure 1, this would give a different optical flow with respect to the one defined through the brightness invariance condition (Figure 1C). For example, a feature responding to the entire barber's pole, which is standing still, would have an associated optical flow that is null everywhere (Figure 1D). Still, we have to keep in mind that in both cases the resulting optical flow is different from the 2-D motion field (defined as the projection on the image plane of the 3-D velocity of the visual scene, see e.g. Aubert and Kornprobst (2006)) shown in Figure 1B.
This is indeed the main motivation for coupling the problem of feature extraction with motion invariance constraints and for deriving robust and meaningful optical flows associated with those visual features.
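The first-order brightness invariance condition can be checked numerically on a synthetic example. The sketch below is only illustrative: the function name `brightness_residual`, the sinusoidal test pattern, and the finite-difference scheme are assumptions introduced here, not part of the original formulation. It evaluates the residual of the temporal derivative plus the flow-weighted spatial gradient for a pattern translating with a known velocity, and compares it with the residual obtained for a null flow.

```python
import numpy as np

def brightness_residual(I0, I1, u, dt=1.0):
    """Finite-difference estimate of dI/dt + u . grad(I) between two frames.

    I0, I1: 2-D arrays (frames at t and t + dt).
    u: (ux, uy), a constant candidate flow in pixels per unit time.
    """
    It = (I1 - I0) / dt                      # temporal derivative
    Iy, Ix = np.gradient(0.5 * (I0 + I1))    # spatial gradient (rows = y, cols = x)
    return It + u[0] * Ix + u[1] * Iy

# Hypothetical smooth pattern translating by u_true pixels per frame.
yy, xx = np.mgrid[0:64, 0:64].astype(float)

def pattern(x, y):
    return np.sin(2 * np.pi * x / 32) + np.cos(2 * np.pi * y / 32)

u_true = (0.3, -0.2)
I0 = pattern(xx, yy)
I1 = pattern(xx - u_true[0], yy - u_true[1])

inner = (slice(2, -2), slice(2, -2))         # skip one-sided border differences
err_true = np.abs(brightness_residual(I0, I1, u_true)[inner]).mean()
err_zero = np.abs(brightness_residual(I0, I1, (0.0, 0.0))[inner]).mean()
print(err_true, err_zero)
```

With the true velocity the residual is limited by discretization error only, whereas a wrong candidate flow leaves a residual of the order of the temporal derivative itself.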

Regularization of the Optical Flow
Before going on to lay out the theory for the extraction of motion invariant visual features, we need to recall some facts about the optical flow condition Eq. 2. Given a video stream described by its brightness intensity $I$, defined as above, the problem of finding, for each pixel of the frame spatial support and at each time instant, the velocity field $u(x, t)$ satisfying

$$\partial_t I(x, t) + u(x, t) \cdot \nabla I(x, t) = 0 \qquad (3)$$

is clearly ill-posed, since a scalar equation is not sufficient to properly constrain the two components of $u$. Locally, we can unequivocally determine only the component of $u$ along $\nabla I$. This issue is usually referred to as the aperture problem. Although many methods have been proposed to overcome it (see for example the work of Aubert and Kornprobst (2006)), here we are interested in the class of approaches that regularize the optical flow through the variational problem of Eq. 4, where $A_I$ is a functional that enforces constraint Eq. 3 (a standard choice is given in Eq. 5), $S$ imposes smoothness, and $H_I$ may be used to condition the extraction of the flow over spatially homogeneous regions. Depending on the regularity assumptions on $I$ and on the properties that we want to impose on the solution of this regularized problem (namely, whether or not we want to admit solutions that preserve discontinuities), the exact form of $S$ and the form and presence of $H_I$ may vary. For example, the original approach proposed in Horn and Schunck (1981) corresponds to the Horn-Schunck regularization of Eq. 6. Note that, since we are interested in extracting the optical flow for any frame of a video, namely the field $u(x, t)$, it is useful, for any function $f: \Omega \times [0, T] \to \mathbb{R}^p$, to define $f_t: \Omega \to \mathbb{R}^p$ as $f_t(x) := f(x, t)$. With this notation, when the infimum in Eq. 4 is attained, we can pose the frame-wise relation of Eq. 7.
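As an illustration of how a smoothness regularizer resolves the aperture problem, the following is a minimal sketch of the classical Horn-Schunck iteration, written under several assumptions that are not taken from the text: a quadratic data term, a quadratic smoothness term weighted by a hypothetical parameter `alpha`, a plain 4-neighbour average in place of the weighted kernel of the original paper, and Jacobi-style updates on two frames.

```python
import numpy as np

def local_mean(f):
    """4-neighbour average with replicated borders (a simplifying assumption)."""
    p = np.pad(f, 1, mode="edge")
    return 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:])

def horn_schunck(I0, I1, alpha=1.0, n_iter=200):
    """Estimate a dense flow (ux, uy) between two frames I0 and I1."""
    Iy, Ix = np.gradient(0.5 * (I0 + I1))    # spatial derivatives
    It = I1 - I0                              # temporal derivative (dt = 1)
    ux = np.zeros_like(I0)
    uy = np.zeros_like(I0)
    denom = alpha ** 2 + Ix ** 2 + Iy ** 2
    for _ in range(n_iter):
        ux_bar, uy_bar = local_mean(ux), local_mean(uy)
        # Residual of the linearized brightness constraint at the averaged flow.
        r = (Ix * ux_bar + Iy * uy_bar + It) / denom
        ux = ux_bar - Ix * r
        uy = uy_bar - Iy * r
    return ux, uy

# Hypothetical test pattern translating by (0.3, -0.2) pixels per frame.
yy, xx = np.mgrid[0:64, 0:64].astype(float)

def pattern(x, y):
    return 10.0 * (np.sin(2 * np.pi * x / 32) + np.cos(2 * np.pi * y / 32))

I0 = pattern(xx, yy)
I1 = pattern(xx - 0.3, yy + 0.2)
ux, uy = horn_schunck(I0, I1, alpha=1.0, n_iter=1000)
print(ux[16:48, 16:48].mean(), uy[16:48, 16:48].mean())
```

The data term alone fixes only the flow component along the image gradient at each pixel; the averaging step propagates information from neighbouring pixels with different gradient directions, which is what selects a single field.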
Notice that the brightness might not necessarily be the ideal signal to track. Since the brightness can be expressed as a weighted average of the red $R$, green $G$, and blue $B$ components, one could think of tracking each single color component of the video signal by using the same invariance principle stated by Eq. 3. It could in fact be the case that one or more of the components $R$, $G$, $B$ are more invariant, in the sense of Eq. 3, during the motion of the corresponding material point; see Figure 2. In that case, in general, each color can be associated with corresponding optical flows $v_R$, $v_G$, $v_B$ that might differ. In doing so, instead of tracking the brightness, one can track the single colors. Instead, under the assumption that each color component has the same optical flow $v$, we obtain

$$\partial_t (R, G, B)' + v \cdot \nabla (R, G, B)' = 0, \qquad (8)$$

where $v \cdot \nabla (R, G, B)' := (v \cdot \nabla R, v \cdot \nabla G, v \cdot \nabla B)'$. It is worth mentioning that the simultaneous tracking of different channels might contribute to a better posedness of the problem since, in general, $\operatorname{rank} \nabla (R, G, B)' = 2$ and the system Eq. 8 admits a unique solution. One can think of the color components as features that, unlike classical convolutional spatial features, are temporal features. Before proceeding further, let us underline that some other optical flow methods try to solve the brightness invariance condition Eq. 1 directly, without differentiating it. This is the case, for example, of Gunnar Farnebäck's algorithm (Farnebäck, 2003): the basic idea here is to approximate the brightness of the input images through polynomial expansions with variable coefficients, and the brightness invariance condition Eq. 1 is then solved under this assumption. Figure 3 shows the optical flows extracted by the Horn-Schunck and Farnebäck methods in the barber's pole case.
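The effect of tracking several channels with a shared velocity can be sketched at a single pixel as a small linear system, one row per channel. The helper `shared_flow` and the numerical gradients below are hypothetical values chosen only for illustration; the sketch solves the system in the least-squares sense and reports its rank.

```python
import numpy as np

def shared_flow(grads, dts):
    """Least-squares velocity v solving grads @ v = -dts at a single pixel.

    grads: (m, 2) array, row k is the spatial gradient of channel k.
    dts:   (m,) array of temporal derivatives of the channels.
    Returns (v, rank); v is uniquely determined only when rank == 2.
    """
    v, _, rank, _ = np.linalg.lstsq(grads, -dts, rcond=None)
    return v, int(rank)

v_true = np.array([0.5, -0.25])

# Three channels with non-parallel gradients: the system has rank 2.
grads_rgb = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
v_est, rank_full = shared_flow(grads_rgb, -grads_rgb @ v_true)

# Channels with parallel gradients: rank 1, so only the component of v
# along the common gradient direction is recovered (minimum-norm solution).
grads_dep = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]])
v_min, rank_dep = shared_flow(grads_dep, -grads_dep @ v_true)

print(v_est, rank_full)
print(v_min, rank_dep)
```

When the channel gradients are linearly dependent the aperture problem reappears, which is why stacking channels improves posedness only in the generic case.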
In the next section, we will discuss how to use a very similar approach, based on the consistency of features along apparent motion trajectories on the frame spatial support, to derive visual features $\varphi_i$ along with the corresponding conjugate optical flows $v_{\varphi_i}$. We anticipate that this motion consistency condition will also play a prominent role in defining affordance features, as described in Section 3.2.

FEATURE EXTRACTION AND CONJUGATE VELOCITIES
As we have already anticipated in the previous sections, the optical flow extracted by imposing an invariance condition like the one in Eq. 3 strongly depends on the features on which we are imposing that invariance; hence, it should not be surprising that different sets of features could give rise to different optical flows. This can be easily understood by considering the barber's pole example in Figure 1. The related classical optical flow is depicted in Figure 1C (see also Figure 3), and it is different from the projection of the 3-D velocities on the frame spatial support shown in Figure 1B (the resulting optical flow is indeed an optical illusion). Let us now assume the existence of a visual feature $\varphi_r$ characterizing the red stripes, that is, $\varphi_r(x, t) = 1$ iff $(x, t)$ is inside a stripe. As the barber's pole rotates, the conjugate velocity $v_{\varphi_r}$ is, in this simplified case, the same as the one obtained from the brightness invariance condition Eq. 3. An additional level of abstraction can be gained when looking at the whole object. Again, we are assuming the existence of a higher-level visual feature $\varphi_{object}$ characterizing it. Then, considering that the barber's pole is standing still, the velocity field associated with that feature is null everywhere, as shown in Figure 1D.
FIGURE 2 | Tracking of different color components in a synthetic example. In this case, each color component is associated with a specific velocity field.
This example clearly explains how different velocity fields can be associated with different visual features, but we still have to go one step further. Until now, mimicking the case of the classical optical flow estimation given the corresponding input brightness, we have described the construction of velocity fields starting from visual features whose existence was a priori assumed.
Recent studies have suggested that the ventral and dorsal pathways may not be as independent as originally thought. Evidence for contributions from ventral stream systems to the dorsal stream indicates a crucial role in mediating complex and flexible visuomotor skills. Meanwhile, complementary evidence points to a role for posterior dorsal-stream visual analysis in certain aspects of 3-D perceptual function in the ventral stream (but see Milner, 2017 for a review). As pointed out by Milner (2017), potential cross-stream interactions might take three forms:
1) Independent processing: computations along the separate pathways proceed independently and in parallel and reintegrate at some final stage of processing within a shared target brain region; this might be achieved via common projections to the lateral prefrontal cortex or superior temporal sulcus (STS).
2) Feedback: processing along the two pathways is modulated by the existence of feedback loops which transmit information from downstream brain regions, including information processed along the complementary stream; feedback is likely to involve projections to early retinotopic cortical areas.
3) Continuous cross-talk: information is transferred at multiple stages and locations along the two pathways.
The three forms need not be mutually exclusive and a resolution of the problems of visual integration might involve a combination of such possibilities (Milner, 2017).
Yet, from a learning standpoint, the cross-talk mode is intriguing for setting some minimal conditions for an agent (either biological or artificial) in order to develop visual capabilities. Following this biological insight, we endorse the indissoluble conjunction of features and velocities and, consequently, their joint discovery based on the motion invariance condition

$$\partial_t \varphi_i(x, t) + v_{\varphi_i}(x, t) \cdot \nabla \varphi_i(x, t) = 0, \qquad i = 1, \dots, d, \qquad (9)$$

where we are considering $d$ different visual features. Locally, this equation means that, at each pixel $x$ of the frame spatial support and at each time instant $t$, the features $\varphi_i$ are preserved along the trajectories defined by the corresponding velocity fields $v_{\varphi_i}$ and starting at $(x, t)$.
An object clearly does not change its identity while it is moving. Consequently, the identity of the corresponding pixels on the frame spatial support has to remain invariant along the apparent motion defined by the associated optical flows. Thinking of the brightness as the simplest visual feature based on single pixels, Eq. 9 correctly reduces to the brightness invariance condition Eq. 3. Notice that if there is no optical flow at a given pixel $\bar{x}$, that is, $v_\varphi(\bar{x}, t) = 0$ for all $t \in [0, T]$, then Eq. 9 reduces to $\partial_t \varphi(\bar{x}, t) = 0$. This means that the absence of optical flow at $\bar{x}$ results in $\varphi(\bar{x}, t) = c_\varphi$ for all $t \in [0, T]$, which is the obvious consistency condition that one expects in this case. Likewise, a field $\varphi(x, t)$ that is constant in a subregion $C \subset \Omega \times [0, T]$ satisfies Eq. 9 on $C$ independently of $v_\varphi$.
As for the brightness, the invariance condition Eq. 9 generates, in general, an ill-posed problem. In particular, when the moving object has a uniform color, brightness invariance holds along infinitely many trajectories. Likewise, any of the features $\varphi$ is expected to be spatially smooth and nearly constant in small portions of the frame spatial support, and this brings back the ill-posedness of the classical problem of determining the optical flow that has been addressed in the previous section. Unlike brightness invariance, in the case of visual features the ill-posedness of the problem is twofold. Just as in the classic case of estimating the optical flow, $v_\varphi$ is not uniquely defined (the aperture problem). On top of that, the corresponding feature $\varphi$ is not uniquely defined either. We will address regularization issues in Section 3.3 where, by including additional information other than coherence on motion trajectories, we will make the learning process well-posed. Of course, the regularization process will also involve a term similar to the one invoked for the optical flow $v$ (see Eq. 6), which will be imposed on $v_\varphi$. Given what we have discussed so far, we can also expect the presence of some regularization term concerning the features themselves and their regularity. Finally, these terms will be complemented with an additional "prediction" index that is necessary to avoid trivial solutions for the features (we postpone its description to Section 3.3).
The basic notion at the core of this section is that $\{(\varphi_i, v_{\varphi_i})\}$ can be treated as indissoluble pairs bound by the motion invariance condition that steers the entire learning process. The structure of each $\varphi_i$ affects the associated velocity $v_{\varphi_i}$ and vice versa; it is therefore natural to pair their learning. Leaving aside for the moment the regularization issues of Section 3.3, learning is based on a functional generalizing Eq. 5 (see Eq. 10), where $\Gamma = \Omega \times [0, T]$ and $\mu$ is an appropriately weighted Lebesgue measure on $\mathbb{R}^2_x \times \mathbb{R}_t$; its exact form defines the dynamics of the learning process itself. The minimization of such a functional (plus the additional regularization terms) is expected to return the pairs $(\varphi_i, v_{\varphi_i})$ satisfying the motion consistency condition Eq. 9. Sometimes, in what follows and when the notation is clear from the context, we will drop the subscript $\varphi$ of $v_\varphi$, so that $v_{\varphi_i}$ will be denoted as $v_i$.
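A discretized version of a motion-consistency functional can be sketched as follows. This is only a plausible instance under stated assumptions: the penalty is taken to be the squared residual of the motion invariance condition, the measure is taken to be uniform over the video volume, and the name `invariance_loss` and the Gaussian test feature are hypothetical.

```python
import numpy as np

def invariance_loss(phi, v, dt=1.0, dx=1.0):
    """Mean squared residual of d(phi)/dt + v . grad(phi) over a video volume.

    phi: (T, H, W) feature field; v: (T, H, W, 2) with v[..., 0] = vx, v[..., 1] = vy.
    A uniform measure over the volume is assumed here.
    """
    dphi_dt, dphi_dy, dphi_dx = np.gradient(phi, dt, dx, dx)
    residual = dphi_dt + v[..., 0] * dphi_dx + v[..., 1] * dphi_dy
    return float(np.mean(residual ** 2))

# Hypothetical feature: a Gaussian blob translating along x at 0.5 pixels/frame.
T, H, W = 16, 64, 64
t, y, x = np.mgrid[0:T, 0:H, 0:W].astype(float)
phi = np.exp(-((x - 20 - 0.5 * t) ** 2 + (y - 32) ** 2) / (2 * 6.0 ** 2))

v_true = np.zeros(phi.shape + (2,))
v_true[..., 0] = 0.5
v_zero = np.zeros_like(v_true)

loss_true = invariance_loss(phi, v_true)                 # conjugate velocity
loss_zero = invariance_loss(phi, v_zero)                 # wrong velocity
loss_const = invariance_loss(np.ones_like(phi), v_true)  # constant feature
print(loss_true, loss_zero, loss_const)
```

The constant feature attains zero loss for any velocity field, which illustrates why coherence along motion trajectories alone admits trivial solutions and has to be complemented by further terms.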

Feature Grouping
As already noticed, when we consider color images, what is done in the case of brightness invariance can be applied to the separate components $R$, $G$, $B$. Interestingly, for a material point of a certain color, given by a mixture of the three components, we can establish the same brightness invariance principle, since those components move with the same velocity. In other words, there could be a group of different visual features $\varphi_i$, $i = 1, \dots, m$, that share the same velocity ($v_i = v$) and are consistent with it, that is, $\partial_t \varphi_i + v \cdot \nabla \varphi_i = 0$ for all $i = 1, \dots, m$. Thus, we can promptly see that any feature $\varphi \in \operatorname{span}(\varphi_1, \dots, \varphi_m)$ is still conjugated with $v$; we can think of $\operatorname{span}(\varphi_1, \dots, \varphi_m)$ as a functional space conjugated with $v$.
Collecting the features of the group in the vector $\phi := (\varphi_1, \dots, \varphi_m)'$, the conjugation conditions can be written compactly as

$$\partial_t \phi + v \cdot \nabla \phi = 0, \qquad (11)$$

where $\nabla \phi \in \mathbb{R}^{m \times 2}$ is the matrix with elements $(\nabla \phi)_{ij} = (\nabla \varphi_i)_j$ and $v \cdot \nabla \phi := (v \cdot \nabla \varphi_1, \dots, v \cdot \nabla \varphi_m)'$. An important observation, closely related to the discussion about color tracking of Eq. 8, is the following. If the only scalar feature we are dealing with is the brightness, then Eq. 11 boils down to a single equation with two unknowns (the velocity components). In the case of the feature group $\phi$, instead, we have $m$ equations and still two unknowns. Increasing the number $m$ of rows of the matrix $\nabla \phi$ can raise its rank, which leads to a better posedness of the problem of estimating the optical flow $v$. Because of the two-dimensional structure of the frame spatial support, which leads to $v \in \mathbb{R}^2$, and since $\nabla \phi \in \mathbb{R}^{m \times 2}$ with $m \geq 2$, it turns out that feature grouping regularizes the velocity discovery. In order to understand the effect of feature grouping, we can simply notice that, under the assumption $\operatorname{rank} \nabla \phi = \operatorname{rank}(\nabla \phi \mid -\partial_t \phi)$, a random choice of the features yields $\operatorname{rank} \nabla \phi = 2$. As a consequence, by the Rouché-Capelli theorem, the linear system Eq. 11 admits a unique solution in $v$. However, this regularization effect of feature grouping does not prevent ill-posedness, since $\phi$ is far from being a random map. On the contrary, it is supposed to extract a uniform value in portions of the frame spatial support that are characterized by the same feature. Hence, $\operatorname{rank} \nabla \phi = 1$ is still possible whenever the features of the group are somewhat dependent. Feature groups, which are characterized by their common velocity, can give rise to more structured features belonging to the same group. This can promptly be understood when we go beyond linear spaces and consider, for a set of indices $F$, the feature $\eta$ defined by Eq. 12.
Evaluating $\partial_t \eta + v \cdot \nabla \eta$, we indeed obtain Eq. 13, from which we conclude that if $\partial_t \varphi_j + v \cdot \nabla \varphi_j = 0$ for all $j \in F$, then the feature $\eta$ defined by Eq. 12 is also conjugated with $v$, that is, $\partial_t \eta + v \cdot \nabla \eta = 0$. However, the converse does not hold. Basically, the inheritance of conjugation with $v$ holds in the direction towards more abstract features. Of course, the feedforward-like recursive application of the derivation stated by Eq. 12 yields a feature that is still conjugated with $v$.
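The inheritance of conjugation can be made explicit with a short worked step. The form of the composite feature used here is an assumption introduced for illustration and is not taken from the text: a differentiable pointwise nonlinearity $\sigma$ applied to a weighted sum of the grouped features, with hypothetical weights $w_j$.

```latex
\begin{align*}
a &:= \sum_{j \in F} w_j \varphi_j, \qquad \eta := \sigma(a), \\
\partial_t \eta + v \cdot \nabla \eta
  &= \sigma'(a)\,\bigl(\partial_t a + v \cdot \nabla a\bigr) \\
  &= \sigma'(a) \sum_{j \in F} w_j \bigl(\partial_t \varphi_j + v \cdot \nabla \varphi_j\bigr).
\end{align*}
```

Under this assumption, if every term of the sum vanishes then so does the left-hand side. The converse fails because the weighted sum can cancel, or $\sigma'(a)$ can vanish, without each $\varphi_j$ being individually conjugated with $v$.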

Affordance-Related Features
Any learning process that relies on the motion of a given object can only aspire to discover the identity of that object, along with its characterizing visual features such as pose and shape. The motion invariance process is in fact centered around the object itself and, as such, it reveals the object's own features in all the possible expositions that are gained during motion. Humans, and likely most animals, also acquire a truly different understanding of visual scenes that goes beyond the conceptualization in terms of single object identities. In the early Sixties, James J. Gibson coined the notion of affordance in (Gibson, 1966), even though a more refined analysis came later in (Gibson, 1979). In his own words: "The affordances of the environment are what it offers the animal, what it provides or furnishes, either for good or ill. The verb to afford is found in the dictionary, the noun affordance is not. I have made it up. I mean by it something that refers to both the environment and the animal in a way that no existing term does. It implies the complementarity of the animal and the environment." Considering this animal-centric view, we gain the understanding that affordance can be interpreted as what characterizes the "interaction" between animals and their surrounding environment. In more general terms, the way an agent interacts with a particular object is what defines its affordance, and this is strictly related to their relative motion. In the last decades, computer scientists have also been working on this general idea, trying to implement it quantitatively in the fields of computer vision and robotics (Ardón et al., 2020; Hassanin et al., 2021). As far as visual affordance is concerned, that is, extracting affordance information from still images and videos, different cognitive tasks have been considered so far, for example affordance recognition and affordance segmentation; see (Hassanin et al., 2021) for a recent review.
In the spirit of the previous section, we will consider a more abstract notion of affordance, characterizing the interaction between different visual features along with their corresponding conjugate velocity fields. We will focus our attention on actions that are perceivable from single pictures (see footnote 2) and on the related local notion of affordance, which will be defined by some function characterizing the interaction between feature $\varphi_j$ and feature $\varphi_i$ when considering the pixel $x$ at the specific time instant $t$. As we will see, the principle of motion invariance can be extended to naturally define (explicitly or implicitly) this generalized notion of affordance. A natural choice is to consider what we will denote as the affordance field $\psi_{ij}$ as a function of space and time. To implicitly codify the interaction between features $i$ and $j$, $\psi_{ij}(x, t)$ has to be constrained by some relation of the form $g(\psi_{ij}, \partial_t \psi_{ij}, \nabla \psi_{ij}, (\varphi_i, v_i), (\varphi_j, v_j)) = 0$, where we are considering only first-order derivatives of the affordance field and $g$ is a scalar function. In the lowest-order approximation (we also need quadratic terms to build scalars from vectors), $g$ takes the form of Eq. 14, where $a_1, \dots, a_7$ are scalars. Considering the case in which the motion field associated with feature $\varphi_j$ is null everywhere, the affordance field $\psi_{ij}$ will codify a property related only to $\varphi_i$ itself and strictly related to its identity, which has to be invariant with respect to $v_i$. Thus, from this observation, we can infer $a_1 = a_2 = a_3 = a_4 = 0$ and $a_5 = a_6$, so that the above constraint becomes:

$$\partial_t \psi_{ij} + (v_i + b_j v_j) \cdot \nabla \psi_{ij} = 0, \qquad (15)$$

where $b_j := a_7 / a_5$. Requiring $b_j = -1$, this constraint assumes a very reasonable physical meaning, that is, the motion invariance of the affordance field $\psi_{ij}$ in the reference frame of feature $\varphi_j$.
2 For example, we can understand that a person is sitting on or standing up from a chair just by considering a still image.
With this choice, the affordance field is indeed conjugated with the velocity $v_i - v_j$, which is the relative velocity of feature $\varphi_i$ in the reference frame of feature $\varphi_j$. Considering points at the border of $\varphi_i$, this can lead $\psi_{ij}$ to expand slightly outside the region defined by $\varphi_i$ itself, as shown in Figure 4. In the case $v_i = 0$, the motion consistency is forced "backward" along the pixels' trajectories defined by $-v_j$. In the case $b_j = 1$, instead, Eq. 15 becomes symmetric under the permutation of $i$ and $j$, so that $\psi_{ij}$ and $\psi_{ji}$ are developed by exploiting the same constraint. This will likely result in the same affordance feature, unless some other factor (think, for example, of different initializations in neural architectures) breaks that symmetry. From the classic affordance perspective this is not a desirable property, as we can easily understand by considering, for example, a knife that is used to slice bread: the affordance transmitted by the knife to the bread would be strictly related to the possibility of being cut or sliced, which is clearly a property that could not be attached to the knife.
Another viable alternative for codifying the interaction between features is to evaluate the affordance directly as a function of the feature fields and their respective velocities: $\tilde{\psi}_{ij}(x, t, (\varphi_i, v_i), (\varphi_j, v_j))$. Here, we are giving up the previous field theory approach, since we explicitly codify the interaction between features in the computational scheme of the $\tilde{\psi}_{ij}$ function. On the other hand, since we have already distinguished $\tilde{\psi}_{ij}$ and $\varphi_i$ by their different computational structure, requiring for the affordance function $\tilde{\psi}_{ij}$ the same motion invariance property that $\varphi_i$ has with respect to $v_i$ appears to be a very natural choice (see also Figure 5):

$$\partial_t \tilde{\psi}_{ij} + v_i \cdot \nabla \tilde{\psi}_{ij} = 0. \qquad (16)$$

Given the possibly great variability of velocity fields in a visual scene, let us underline that within this second approach some problems in the learning of the affordance function $\tilde{\psi}_{ij}(x, t, (\varphi_i, v_i), (\varphi_j, v_j))$ may emerge. Moreover, to pursue the fascinating idea of describing all the visual processes entirely through visual fields defined on the frame spatial support, in the following we will only consider the affordance field $\psi_{ij}(x, t)$ and the related motion invariance property Eq. 15 with $b_j = -1$.
Given a certain visual environment, we can easily realize that, as time goes by, object interactions begin to obey statistical regularities and the interactions of feature $\varphi_i$ with the others become very well defined. Hence, the notion of $\psi_{ij}$ can be evolved towards the inherent affordance $\psi_i$ of feature $\varphi_i$, which is in fact a property associated with $\varphi_i$ while living in a certain visual environment. For example, thinking in terms of the classic notion of affordance, the inherent affordance property of a knife is gained by its being manipulated, in a certain way, by a virtually unbounded number of different people. Based on Eq. 15 (with $b_j = -1$), we define the inherent feature affordance as the function $\psi_i(x, t)$ that satisfies Eq. 17.
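The reading of the inherent affordance condition as invariance with respect to a mean relative velocity can be spelled out in one step. The sketch below assumes, as a hypothesis consistent with the surrounding text but not confirmed by it, that the defining condition is the sum over the $n$ features of the pairwise constraint of Eq. 15 with $b_j = -1$, applied to a single field $\psi_i$.

```latex
\begin{align*}
0 &= \sum_{j=1}^{n} \bigl[\partial_t \psi_i + (v_i - v_j) \cdot \nabla \psi_i\bigr] \\
  &= n\,\partial_t \psi_i + \Bigl(n\,v_i - \sum_{j=1}^{n} v_j\Bigr) \cdot \nabla \psi_i \\
\Longleftrightarrow\quad
0 &= \partial_t \psi_i + \Bigl(v_i - \frac{1}{n}\sum_{j=1}^{n} v_j\Bigr) \cdot \nabla \psi_i .
\end{align*}
```

Under this assumption, $\psi_i$ is transported by the velocity of $\varphi_i$ measured relative to the average velocity of all the features.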
FIGURE 4 | Illustration of Eq. 15 with $b_j = -1$. The two features considered, $\varphi_i$ (diagonal lines) and $\varphi_j$ (wavy lines), translate over the frame spatial support with uniform velocities $v_i$, $v_j$. The green area represents where $\psi_{ij}$ is on, while the red border identifies the region where $\varphi_i$ and $\varphi_j$ overlap. On the overlapping region the velocity fields of the two features are both present, and here the affordance field $\psi_{ij}(x, t)$ is constrained to be consistent along the direction $v_i - v_j$ (red arrows). Outside and to the left of the red border, the consistency term in Eq. 15 essentially collapses to the feature identity constraint Eq. 9, defined by the motion invariance property with respect to $v_i$ (blue arrow). Finally, in those regions where $v_i = 0$, motion consistency of $\psi_{ij}$ is required along $-v_j$ (orange arrow).
FIGURE 5 | Illustration of Eq. 16. The two features considered, $\varphi_i$ (diagonal lines) and $\varphi_j$ (wavy lines), translate over the frame spatial support with uniform velocities $v_i$, $v_j$. The green area represents where $\tilde\psi_{ij}$ is on, while the red border identifies the region where $\varphi_i$ and $\varphi_j$ overlap. In this case, the motion invariance property of the affordance feature $\tilde\psi_{ij}$ is the same as that of the original feature field $\varphi_i(x, t)$. Blue arrows identify the direction along which motion coherence of $\tilde\psi_{ij}$ is required.
Let us note that Eq. 17 can also be interpreted as the motion invariance property of $\psi_i$ with respect to the velocity $v_i - \frac{1}{n}\sum_{j=1}^{n} v_j$. The identification feature $\varphi_i$ pairs with the corresponding affordance feature $\psi_i$, and the visual scene turns out to be effectively described by the collection of visual fields $V = \{(\varphi_i, \psi_i, v_i)\}$. In a sense, $\psi_i$ can be thought of as the abstraction of $\varphi_i$, as it arises from its environmental interactions. A few comments are in order concerning these visual fields.
• The pairing of $\varphi_i$ and $\psi_i$ relies on the same optical flow, which comes from $\varphi_i$. This makes sense, since the inherent affordance is a feature that is expected to gain abstraction from the interactions with other features, whereas the actual optical flow can only come from identifiable entities that are naturally defined by $\varphi_i$.
• The inherent affordance features still carry a significant amount of redundant information. This can be understood especially when considering high-level features that closely resemble objects. For example, while we may have many different chairs in a certain environment, one would expect to have only a single concept of chair. On the contrary, $\psi$ assigns many different affordance variables, each stimulated by a specific identifiable feature. This corresponds to thinking of these affordance features as entities that are generated by a corresponding identity feature.
• The collection of visual fields $V$ is the support for high-level decisions. Of course, the recognition of specific objects involves only the fields $\varphi_i$, whereas abstract affordance-based semantic labeling is supported by the features $\psi_i$.
In order to abstract the notion of affordance even further we can, for instance, proceed as follows: for each $\kappa = 1, \dots, n$ we consider another set of fields $\chi_\kappa : \Gamma \to \mathbb{R}$, each of which satisfies the condition
$$\partial_t \chi_\kappa + v_j \cdot \nabla \chi_\kappa = 0, \qquad j = 1, \dots, n. \tag{18}$$
In this way the variables $\chi_\kappa$ do not depend, unlike $\psi$, on a particular $v_i$, which contributes to losing the link with the firing feature. Moreover, during their development they need to take into account multiple motion fields, which results in a motion invariance property with respect to the average velocity $\frac{1}{n}\sum_{j=1}^{n} v_j$ and in a greater level of abstraction.
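The averaging claim follows in one line from the condition on $\chi_\kappa$ stated above (Eq. 18). The derivation below is a sketch that uses only that condition; the shorthand $\bar v$ for the mean velocity is introduced here for convenience. Summing the $n$ conditions and dividing by $n$ gives invariance along the mean velocity, while subtracting any two of them shows that $\chi_\kappa$ is also constant along every relative velocity:

```latex
\frac{1}{n}\sum_{j=1}^{n}\bigl(\partial_t \chi_\kappa + v_j\cdot\nabla\chi_\kappa\bigr)
  = \partial_t \chi_\kappa + \bar v\cdot\nabla\chi_\kappa = 0,
  \qquad \bar v := \frac{1}{n}\sum_{j=1}^{n} v_j ,
\\
(v_j - v_k)\cdot\nabla\chi_\kappa = 0 \qquad \text{for all } j, k = 1, \dots, n .
```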
Once the set of the $\chi_\kappa$ is given, the most relevant affordances can be selected simply through linear combinations. In other words, a subselection of $\chi_1, \dots, \chi_n$ can be performed by considering, for each $l = 1, \dots, n_\chi$ with $n_\chi < n$, the linear combinations
$$X_l = \sum_{\kappa=1}^{n} a_{l\kappa}\, \chi_\kappa, \tag{19}$$
where $(a_{l\kappa}) \in \mathbb{R}^{n_\chi \times n}$ is a matrix of learnable parameters. Notice that since $X_l \in \operatorname{span}(\chi_1, \dots, \chi_n)$, as we remarked in Section 3.4, $\partial_t X_l + v_j \cdot \nabla X_l = 0$ for all $j = 1, \dots, n$. It is worth mentioning that the learning of the coefficients $a_{l\kappa}$ does not involve motion invariance principles. Interestingly, they can be used for additional developmental steps such as object recognition. For example, they can be learned under the classic supervised framework along with the corresponding regularization.
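The closure of the motion invariance condition under linear combinations can be checked numerically. The following is a minimal sketch under stated assumptions: the toy fields `chi_fields`, the velocities $v_j = (1, c_j)$, the random matrix `A` and the helper `transport_residual` are all hypothetical choices made for illustration, and the grid is periodic with unit spacing and unit time step. The toy fields depend only on $x - t$, so they satisfy the transport condition for every velocity whose $x$-component equals 1.

```python
import numpy as np

def transport_residual(f_prev, f_now, f_next, v, dt=1.0, dx=1.0):
    """Central-difference estimate of d_t f + v . grad f on a periodic grid.

    Fields have shape (..., H, W); v = (v_x, v_y) is a uniform velocity.
    """
    f_t = (f_next - f_prev) / (2.0 * dt)
    f_x = (np.roll(f_now, -1, axis=-1) - np.roll(f_now, 1, axis=-1)) / (2.0 * dx)
    f_y = (np.roll(f_now, -1, axis=-2) - np.roll(f_now, 1, axis=-2)) / (2.0 * dx)
    return f_t + v[0] * f_x + v[1] * f_y

def chi_fields(t, n=4, height=8, width=32):
    """Toy fields chi_k(x, y, t) = sin(2*pi*k*(x - t)/width), constant along y."""
    x = np.arange(width)
    rows = [np.sin(2.0 * np.pi * k * (x - t) / width) for k in range(1, n + 1)]
    return np.stack([np.tile(r, (height, 1)) for r in rows])  # shape (n, H, W)

# Hypothetical velocities v_j = (1, c_j): same x-component, arbitrary y-component.
velocities = [(1.0, 0.0), (1.0, 0.5), (1.0, -2.0)]

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 4))  # stands for the matrix (a_lk), with n_chi = 2 < n = 4

chi = [chi_fields(t) for t in (0, 1, 2)]        # frames at t - 1, t, t + 1
X = [np.tensordot(A, c, axes=1) for c in chi]   # X_l = sum_k a_lk * chi_k

# Largest absolute residual over all velocities, for chi and for the combinations X.
max_chi = max(np.abs(transport_residual(*chi, v)).max() for v in velocities)
max_X = max(np.abs(transport_residual(*X, v)).max() for v in velocities)
print(max_chi, max_X)
```

Both printed residuals stay at the level of floating-point round-off, since the residual operator is linear in the field: whatever holds for each $\chi_\kappa$ holds for any combination $X_l$, independently of how the coefficients are learned.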

Regularization Issues
We have already discussed the ill-posed definition of features conjugated with their corresponding optical flow. Interestingly, we have also shown that a feature group $\phi = (\varphi_1, \dots, \varphi_m)'$ with conjugate velocity $v$ exhibits an inherent regularization that, however, does not prevent ill-posedness, especially when one is interested in developing abstract features that are likely to be constant over large regions of the frame spatial support. Let us assume that we are given $n$ feature groups $\phi_i$, $i = 1, \dots, n$, each composed of $m_i$ single features (an $m_i$-dimensional feature vector), $\phi_i \in \mathbb{R}^{m_i}$. Furthermore, let $v_i$ be the velocity field shared by each component of feature group $\phi_i$, and let us denote $\phi = (\phi_1, \dots, \phi_n)$ and $v = (v_1, \dots, v_n)$. We can then impose on the velocities and the corresponding visual features the regularization index $E$ of Eq. 20, which generalizes the smoothness term Eq. 6 used for classical optical flow estimation. In Eq. 20, the notation $|Z|^2$ (for a generic matrix argument $Z$) means $\sum_{i,j} Z_{ij}^2$, while $\lambda_\varphi$, $\lambda_\nabla$ are positive constants that express the relative weight of the regularization terms. First of all, notice that $E$ is a functional of the pairs $\{(\phi_i, v_i)\}$; that is, once they are given, we can compute $E(\phi, v)$. On the contrary, the index used to regularize the classical Horn-Schunck optical flow, Eq. 6, only depends on $v$. The dependence on visual features and their temporal dynamics in Eq. 20 is explained by considering that, while the brightness is given, the features are learned as time goes by, which is just another facet of the feature-velocity conjugation. Moreover, it is worth mentioning that $E$ only involves spatial smoothness and does not contain any time regularization term. There is one more difference with respect to the classic optical flow regularization Eq. 6: the penalizing term $\frac{1}{2}|\phi_i|^2$, which favors the development of $\phi_i = 0$.
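To make the ingredients of the index $E$ concrete, the following sketch assembles them on a pixel grid. Its exact form is an assumption: it combines squared spatial gradients of velocities and features (weighted by a hypothetical `lam_grad`, standing for $\lambda_\nabla$) with the quadratic penalty on the features (weighted by `lam_phi`, standing for $\lambda_\varphi$), and it is not meant to reproduce Eq. 20 term by term. The helper names `spatial_grad` and `smoothness_energy`, the forward differences and the periodic boundaries are likewise illustrative choices.

```python
import numpy as np

def spatial_grad(field):
    """Forward-difference spatial gradient on a periodic grid.

    field: (..., H, W) -> gradient stacked on a new leading axis: (2, ..., H, W).
    """
    g_x = np.roll(field, -1, axis=-1) - field
    g_y = np.roll(field, -1, axis=-2) - field
    return np.stack([g_x, g_y])

def smoothness_energy(phis, vs, lam_phi=1.0, lam_grad=1.0):
    """Assumed discrete stand-in for the index E(phi, v); not the paper's Eq. 20.

    phis: list of feature groups, each of shape (m_i, H, W)
    vs:   list of velocity fields, each of shape (2, H, W)
    """
    energy = 0.0
    for phi, v in zip(phis, vs):
        energy += 0.5 * np.sum(spatial_grad(v) ** 2)               # velocity smoothness
        energy += 0.5 * lam_grad * np.sum(spatial_grad(phi) ** 2)  # feature smoothness
        energy += 0.5 * lam_phi * np.sum(phi ** 2)                 # pull towards phi = 0
    return float(energy)

# A spatially constant feature group is perfectly smooth, so only the
# quadratic penalty contributes to the energy.
phi_const = np.full((3, 4, 5), 2.0)
v_zero = np.zeros((2, 4, 5))
print(smoothness_energy([phi_const], [v_zero]))
```

Note that, like the index described in the text, this sketch depends on both the features and the velocities and contains no temporal term: it is evaluated frame by frame.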
Classic optical flow has no such requirement since, as already stated, the brightness is given. On the contrary, the discovery of visual features is expected to be driven by motion information, but their "default value" is expected to be null. We can promptly see that the introduction of the regularization term Eq. 20 does not suffice to achieve a well-posed learning problem: the motion invariance condition Eq. 9 is still satisfied by the trivial constant solution $\varphi = c_\varphi$. Important additional information comes from the need to exhibit the human visual skill of reconstructing pictures from our symbolic representation. At a certain level of abstraction, the features that are gained by motion invariance possess a certain degree of semantics that is needed to interpret the scene. However, visual agents are also expected to deal with actions and react accordingly. As such, a uniform cognitive task that visual agents are expected to carry out is that of predicting what will happen next, which translates into the capability of guessing the next few incoming frames in the scene. We can think of a predictive computational scheme based on the $\varphi_i$ codes, $y(x, t) = \alpha_y(\phi(\cdot, t), t)(x)$, where $\alpha_y : (\mathbb{R}^{\Omega})^n \times [0, T] \to \mathbb{R}^{\Omega}$ and the prediction $y$ needs to satisfy the condition established by the index of Eq. 21. Of course, as the visual agent gains the capability of predicting what will come next, the developed internal representation based on features $\phi$ cannot correspond to the mentioned trivial solution. Interestingly, it looks like visual perception does not come alone: the typical paired skill of visual prediction that animals exhibit helps regularize the problem of developing invariant features. Clearly, the solution $\varphi = c_\varphi$, which satisfies motion invariance, is no longer acceptable since it does not reconstruct the input. This motivates the involvement of prediction skills typical of action that, again, seem to be interwoven with perception.
Frontiers in Artificial Intelligence | www.frontiersin.org December 2021 | Volume 4 | Article 768516
Having described all the regularization terms necessary for the well-posedness of the learning problem, we can introduce the overall functional, whose minimization is the optimization problem of Eq. 23.
Basically, the minimization is expected to return the pair $(\phi, v)$, whose terms should nominally be conjugated. The case in which we consider only the brightness, that is, when the only $\phi$ is $I$, corresponds to the classic problem of optical flow estimation. Of course, in this case the term $R$ is absent and the problem has a classic solution. Another special case is when there is no motion, so that the integrand in the definition Eq. 10 of $A$ is simply null for all $i = 1, \dots, n$. In this case, the learning problem reduces to the unsupervised extraction of features $\phi$.
The learning of $\psi_i$ can be based on a formulation that closely resembles what has been done for $\varphi_i$, for which we have already considered the regularization issues. In the case of $\psi_i$ we can get rid of the trivial constant solution by minimizing a penalty term that comes from the p-norm translation (Gori and Melacci, 2013; Gnecco et al., 2015) of the logic implication $\varphi_i \Rightarrow \psi_i$. Here we are assuming that $\varphi_i$, $\psi_i$ range in $[0, 1]$, so that whenever $\varphi_i$ gets close to 1, it forces the same for $\psi_i$. This yields a well-posed formulation, thus avoiding the trivial solution.
As far as the $\chi_\kappa$ are concerned, as for $\psi$, we ask for the minimization of a penalty term that comes from the p-norm translation of the logic implication $\psi_\kappa \Rightarrow \chi_\kappa$. While this regularization term settles the value of $\chi_\kappa$ on the corresponding $\psi_\kappa$, notice that the motion invariance condition Eq. 18 does not give any privilege to the firing feature $\psi_\kappa$.
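The effect of such implication penalties can be illustrated with a small sketch. We assume here that "p-norm" is read as the product t-norm and that the implication $a \Rightarrow b$, with $a, b \in [0, 1]$, is penalized by $a(1 - b)$; this is a common choice in this setting but it is an assumption, not the formula of the paper. The function name `implication_penalty` and the toy fields are hypothetical.

```python
import numpy as np

def implication_penalty(premise, conclusion):
    """Mean of premise * (1 - conclusion) over the grid, for fields in [0, 1].

    Assumed product-style penalty for `premise => conclusion`; it vanishes where
    the premise is off or the conclusion is fully on.
    """
    premise = np.clip(premise, 0.0, 1.0)
    conclusion = np.clip(conclusion, 0.0, 1.0)
    return float(np.mean(premise * (1.0 - conclusion)))

# Toy identification feature: on in a 2x2 block of a 4x4 frame.
phi = np.zeros((4, 4))
phi[1:3, 1:3] = 1.0

psi_null = np.zeros_like(phi)   # trivial null affordance field
psi_follow = phi.copy()         # affordance that fires wherever phi fires

print(implication_penalty(phi, psi_null), implication_penalty(phi, psi_follow))
```

In this sketch the null constant field is penalized wherever the premise is active, while a field that follows its firing feature incurs no penalty. The same function applies unchanged to the pair $(\psi_\kappa, \chi_\kappa)$.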

Deep Networks-Based Realization of Vision Fields
In the previous sections we discussed invariance properties of visual features that lead us to model the processes of computational vision as transport equations on the visual fields (see Eqs 9, 15, 17, 18). Some of those properties are based on the concept of consistency under motion; others lead to a generalization of the concept of affordance. In this section we discuss how the features $\varphi$, $\psi$ and $\chi$, along with the velocity fields $v_\varphi$, can be represented in terms of neural networks that operate on a visual stream, and how the above theory can be interpreted in a classical machine learning framework. The first step consists in moving to a discretized frame spatial support $\Omega^\diamond$. As a consequence, the fields $\varphi_i$ and the velocities can be viewed as vector-valued functions of time, $t \mapsto \varphi_i(t) \in \mathbb{R}^{\Omega^\diamond}$ and $t \mapsto v_{\varphi_i}(t) \in (\mathbb{R}^{\Omega^\diamond})^2$; similarly, the discretized brightness can be seen as a map $t \mapsto I(t) \in \mathbb{R}^{\Omega^\diamond}$. Then, the features $\varphi_i$ (and similarly the fields $\psi$ and $\chi$) can be modelled as neural networks $\Phi_i : \mathbb{R}^N \times \mathbb{R}^{\Omega^\diamond} \to \mathbb{R}^{\Omega^\diamond}$ that, given a set of $N$ weights $w_{\Phi_i} \in \mathbb{R}^N$ and the brightness $I$ at a certain instant, yield the value $\Phi_i(w_{\Phi_i}, I)$ of $\varphi_i$. Similarly, the velocities $v_\varphi$ can be estimated by a neural network $V_i : \mathbb{R}^M \times \mathbb{R}^{\Omega^\diamond} \times (\mathbb{R}^{\Omega^\diamond})^2 \to (\mathbb{R}^{\Omega^\diamond})^2$ that takes as inputs a set of $M$ weights $w_{v_i}$, the temporal partial derivative $\dot I$ (the discrete version of the term $\partial_t I$) and the discrete spatial gradient $\nabla I$ of $I$, and predicts $V_i(w_{v_i}, \dot I, \nabla I)$ as the velocity field $v_{\varphi_i}$. It should be noted that, within this framework, the learning problem for the fields $\varphi$, $\psi$, $\chi$ and $v_\varphi$, which is based on the principles described in this paper and takes the form of the optimization problem described in Eq. 23, becomes a finite-dimensional learning problem on the weights of the neural models.
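The discretized setting can be sketched in a few lines. The example below is only illustrative: `feature_net` and `velocity_net` are hypothetical toy stand-ins for $\Phi_i$ and $V_i$ (a single periodic $3 \times 3$ convolution with a sigmoid, and a pointwise linear map, rather than deep networks), and `motion_invariance_loss` assumes that the motion invariance condition has the transport form $\partial_t \varphi + v \cdot \nabla \varphi = 0$, discretized with central differences on three consecutive frames.

```python
import numpy as np

def conv3x3_periodic(img, kernel):
    """3x3 cross-correlation with periodic boundary conditions."""
    out = np.zeros_like(img, dtype=float)
    for a in range(3):
        for b in range(3):
            out += kernel[a, b] * np.roll(img, (1 - a, 1 - b), axis=(0, 1))
    return out

def feature_net(w_phi, brightness):
    """Toy Phi(w, I): one convolution followed by a sigmoid (N = 9 weights)."""
    return 1.0 / (1.0 + np.exp(-conv3x3_periodic(brightness, w_phi.reshape(3, 3))))

def central_grad(field):
    """Central-difference spatial gradient (d/dx, d/dy) on a periodic grid."""
    g_x = (np.roll(field, -1, axis=1) - np.roll(field, 1, axis=1)) / 2.0
    g_y = (np.roll(field, -1, axis=0) - np.roll(field, 1, axis=0)) / 2.0
    return np.stack([g_x, g_y])

def velocity_net(w_v, i_dot, grad_i):
    """Toy V(w, I_dot, grad I): pointwise linear map to a velocity field (M = 6)."""
    inputs = np.stack([i_dot, grad_i[0], grad_i[1]])         # shape (3, H, W)
    return np.tensordot(w_v.reshape(2, 3), inputs, axes=1)   # shape (2, H, W)

def motion_invariance_loss(w_phi, w_v, i_prev, i_now, i_next, dt=1.0):
    """Mean squared residual of d_t phi + v . grad phi over the discrete support."""
    i_dot = (i_next - i_prev) / (2.0 * dt)
    v = velocity_net(w_v, i_dot, central_grad(i_now))
    phi_t = (feature_net(w_phi, i_next) - feature_net(w_phi, i_prev)) / (2.0 * dt)
    grad_phi = central_grad(feature_net(w_phi, i_now))
    residual = phi_t + v[0] * grad_phi[0] + v[1] * grad_phi[1]
    return float(np.mean(residual ** 2))

rng = np.random.default_rng(0)
w_phi, w_v = rng.normal(size=9), rng.normal(size=6)
frames = rng.normal(size=(3, 6, 7))   # three consecutive toy frames of size 6x7
print(motion_invariance_loss(w_phi, w_v, *frames))
```

The loss is an ordinary function of the finitely many weights `w_phi` and `w_v`, which is the sense in which the learning problem becomes finite-dimensional; in practice it would be combined with the regularization terms discussed above and minimized by gradient-based methods.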
Thus, learning will be affected by the structure (i.e., the architecture) of the network that we choose. Recent successes of deep learning in computer vision suggest that natural choices for $\Phi$ are Deep Convolutional Neural Networks (DCNNs). More precisely, the features extracted at level $\ell$ of a DCNN can be identified with a group of $\Phi_i$; in this way we establish a hierarchy between features that, in turn, suggests a natural way to perform the grouping operation discussed in Section 3.1. Accordingly, features that are at the same level of a CNN share the same velocity (see Figure 6). As for velocities, CNN-based architectures like the one employed by FlowNet (Fischer et al., 2015) have already proven suitable for modelling velocity fields.
It is also important to bear in mind that the choice of a specific neural architecture has strong repercussions on the way the invariance conditions are satisfied. For instance, let us consider Convolutional Networks together with the fundamental condition expressed by Eq. 9. Since CNNs are equivariant under translations, any feature that tracks a uniformly translating motion of the brightness automatically satisfies Eq. 9 with the velocity of the input translation.
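This observation can be verified on a toy example. The sketch below assumes an integer pixel shift per frame and periodic boundaries, and uses a hypothetical one-layer feature (a $3 \times 3$ convolution followed by `tanh`); the kernel, the frame and the shift are arbitrary choices. It checks the discrete counterpart of the transport condition, namely that the feature at time $t + 1$ is the feature at time $t$ displaced by the same shift as the brightness.

```python
import numpy as np

def conv3x3_periodic(img, kernel):
    """3x3 cross-correlation with periodic boundary conditions."""
    out = np.zeros_like(img, dtype=float)
    for a in range(3):
        for b in range(3):
            out += kernel[a, b] * np.roll(img, (1 - a, 1 - b), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))
frame0 = rng.normal(size=(16, 16))
shift = (0, 2)  # hypothetical velocity: 2 pixels per frame along x

frame1 = np.roll(frame0, shift, axis=(0, 1))         # uniformly translated brightness
feat0 = np.tanh(conv3x3_periodic(frame0, kernel))    # feature at time t
feat1 = np.tanh(conv3x3_periodic(frame1, kernel))    # feature at time t + 1

# Equivariance: the feature is transported by the same shift as the brightness.
transported = np.roll(feat0, shift, axis=(0, 1))
print(np.allclose(feat1, transported))
```

The pointwise nonlinearity does not affect the argument, so the same check goes through for a stack of such layers; pooling or strided layers would only preserve the property for shifts compatible with the stride.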

DISCUSSION
In this paper we have proposed motion invariance principles that lead to the discovery of identification features and of more abstract features that are somewhat inspired by the notion of affordance. Those principles are expressed by motion invariance equations that characterize the interesting visual field interactions. The conjugation of velocities with features $\varphi$ leads us to believe that those features and their own velocities represent an indissoluble pair. Basically, the presence of a visual feature in the frame spatial support corresponds to its own optical flow, so that they must be jointly detectable. At first sight, this sounds odd for a visual agent with no movement relative to an object: what if the object is fixed? Interestingly, foveate animals always experience movement in the frame of reference of their eyes, and something similar can be reproduced in computers by simulating focus of attention (Zanca et al., 2019; Faggi et al., 2020). Hence, apart from the case of saccadic movements, foveate animals, like haplorhine primates, are always in the presence of motion, and the conjugation of features with the corresponding optical flow does not result in trivial conditions.
The overall field interaction of features and velocities composes a more abstract picture since, in the extreme case of features that represent objects, as already pointed out, we see the emergence of the classic notion of affordance. Interestingly, the described mechanisms of field interaction go well beyond the connection with such a high-level cognitive notion. We can promptly realize that it is impossible to tell whether the discussed field interactions come from different objects or are in fact generated within the same object. Overall, the discussed field interactions represent a natural mechanism for transmitting information from the video by local mechanisms.
It has been shown that, in order to obtain well-posedness of the motion invariance problems of Eqs 9, 15, 17, 18, we need to involve appropriate regularization. In particular, the development of visual features $\varphi$ requires the corresponding minimization of Eq. 21, which indicates the need to involve action together with perception. Indeed, visual perception coupled with gaze shifts should be considered the Drosophila of perception-action loops. Among the variety of active behaviors the organism can fluently engage in to purposively act upon and perceive the world (e.g., moving the body, turning the head, manipulating objects), oculomotor behavior is the minimal, least-energy unit. The history of these ideas has been recently reviewed in Bajcsy et al. (2018). Early on, such computational approaches were pervaded by the work of Gibson (1950), who proposed that perception is due to the combination of the environment in which an agent exists and how that agent interacts with it. He was primarily interested in the optic flow that is generated on the frame spatial support when moving through the environment (as when flying), realizing that it was the path of motion itself that enabled the perception of specific elements, while disenabling others. That path of motion was under the control of the agent, and thus the agent chooses how it perceives its world and what is perceived within it (Bajcsy et al., 2018). The basic idea of Gibson's view was that of the exploratory behaviour of the agent. It is worth noting that, despite the pioneering work of Aloimonos et al. (1988), Ballard (1991), and Bajcsy and Campos (1992), gaze dynamics has been by and large overlooked in computer vision. The current state of affairs is that most effort is spent on salience modelling (Borji and Itti, 2013; Borji, 2021) as a tool for predicting where/what to look at (the tacit though questionable assumption is that, once suitably computed, salience would be predictive of gaze).
Interestingly enough, and rooted in the animate vision approach, Ballard set out the idea of predictive coding (Rao and Ballard, 1999): "We describe a model of visual processing in which feedback connections from a higher- to a lower-order visual cortical area carry predictions of lower-level neural activities, whereas the feedforward connections carry the residual errors between the predictions and the actual lower-level activities. When exposed to natural images, a hierarchical network of model neurons implementing such a model developed simple cell-like receptive fields. A subset of neurons responsible for carrying the residual errors showed endstopping and other extra-classical receptive field effects. These results suggest that rather than being exclusively feedforward phenomena, nonclassical surround effects in the visual cortex may also result from cortico-cortical feedback as a consequence of the visual system using an efficient hierarchical strategy for encoding natural images." This idea has gained currency in recent research covering many fields, from theoretical cognitive neuroscience (e.g., Knill and Pouget, 2004; Ma et al., 2006) to philosophy (Clark, 2013). Currently, the most influential approach in this perspective has been proposed by Friston (e.g., Feldman and Friston, 2010; Friston, 2010), who considered a variational approximation to Bayesian inference and prediction (free energy minimization, minimization of action functionals, etc.).
The principles on visual feature flow introduced in this paper might also have an impact in computer vision, since one can reasonably believe that the proposed invariances might overcome one of the major current limitations of supervised learning paradigms, namely the need for a huge amount of labeled examples. That said, deep neural networks, with their powerful approximation capabilities, could provide the ideal computational structure to complete the theoretical framework proposed here.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.